Commit Graph

332 Commits

Katy Voor
fe7d1d7d0e Add LeakyReLU operator to static runtime (#47798)
Summary:
- Add LeakyReLU operator to static runtime
- Add LeakyReLU benchmark
- Add LeakyReLU correctness test case

Static Runtime
```
------------------------------------------------------------------------------
Benchmark                                       Time           CPU Iterations
------------------------------------------------------------------------------
BM_leaky_relu/1                              4092 ns       4092 ns     172331
BM_leaky_relu/8                              4425 ns       4425 ns     158434
BM_leaky_relu/20                             4830 ns       4830 ns     145335
BM_leaky_relu_const/1                        3545 ns       3545 ns     198054
BM_leaky_relu_const/8                        3825 ns       3825 ns     183074
BM_leaky_relu_const/20                       4222 ns       4222 ns     165999
```

Interpreter
```
------------------------------------------------------------------------------
Benchmark                                       Time           CPU Iterations
------------------------------------------------------------------------------
BM_leaky_relu/1                              7183 ns       7182 ns      96377
BM_leaky_relu/8                              7580 ns       7580 ns      91588
BM_leaky_relu/20                             8066 ns       8066 ns      87183
BM_leaky_relu_const/1                        6466 ns       6466 ns     107925
BM_leaky_relu_const/8                        7063 ns       7063 ns      98768
BM_leaky_relu_const/20                       7380 ns       7380 ns      94564
```
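
For context, a minimal correctness check in the spirit of the one added here can be expressed in plain TorchScript; this sketch only compares eager and scripted outputs and does not exercise the static runtime entry point itself (module name and shapes are illustrative):
```python
import torch
import torch.nn.functional as F

class LeakyReluModule(torch.nn.Module):
    def forward(self, x, negative_slope: float):
        return F.leaky_relu(x, negative_slope)

def check_leaky_relu(batch_size: int = 8) -> None:
    # Compare the scripted graph (what the static runtime consumes) against eager.
    module = torch.jit.script(LeakyReluModule())
    x = torch.randn(batch_size, 256)
    expected = F.leaky_relu(x, 0.01)
    actual = module(x, 0.01)
    assert torch.allclose(expected, actual)

check_leaky_relu()
```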

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47798

Reviewed By: ezyang

Differential Revision: D24927043

Pulled By: kavoor

fbshipit-source-id: 69b12cc57f725f1dc8d68635788813710a74dc2b
2020-11-13 22:05:52 -08:00
Yang Wang
0125e14c9a [OpBench] change relu entry point after D24747035
Summary: D24747035 (1478e5ec2a) removes the `nnq.functional.relu` entry point. Adjust the op benchmark to use `torch.nn.ReLU` accordingly.
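
A rough sketch of the adjusted usage, assuming a quantized input like the one the qactivation benchmark feeds to the activation (not the actual benchmark code):
```python
import torch

# nnq.functional.relu is gone after D24747035; the benchmark now constructs
# the module form, which also accepts quantized tensors.
op = torch.nn.ReLU()

q = torch.quantize_per_tensor(torch.randn(16), scale=0.1, zero_point=0, dtype=torch.quint8)
out = op(q)
print(out.dtype)  # torch.quint8
```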

Test Plan: buck run caffe2/benchmarks/operator_benchmark/pt:qactivation_test -- --use_jit  --iterations 1 --warmup_iterations 1

Reviewed By: mingzhe09088

Differential Revision: D24961625

fbshipit-source-id: 5ed0ec7fa6d8cfefc8e7fc8324cf9a2a3e59de90
2020-11-13 15:38:27 -08:00
Richard Zou
d4db4718fa Revert D24873991: Profiler benchmark fix
Test Plan: revert-hammer

Differential Revision:
D24873991 (a97c7e2ef0)

Original commit changeset: 1c3950d7d289

fbshipit-source-id: 6f3b8a49caf90aaa3e16707005b6b7cf6e61d89f
2020-11-13 08:37:14 -08:00
Yang Wang
9ee4f499f0 [OpBench] add _consume_op.list for processing input with type of List[Tensor] (#47890)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47890

As titled. This fixes issues when running `chunk_test`, `split_test`, `qobserver`, and `sort` (in `qunary`) in JIT mode, where the output of `chunk_op` is a list of tensors that cannot be handled by the current `_consume_op`.
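
A minimal illustration of the problem: ops such as `torch.chunk` return several tensors, so a consumer typed for a single `Tensor` cannot accept the result. The list-typed consumer below is a stand-in for the internal `_consume_op.list`, not the real harness code.
```python
import torch
from typing import List

x = torch.randn(6, 4)

# chunk produces multiple tensors; the existing _consume_op expects a single Tensor.
chunks: List[torch.Tensor] = list(torch.chunk(x, 3, dim=0))

def consume_tensor_list(tensors: List[torch.Tensor]) -> int:
    # Stand-in for the new list-typed consume op.
    return len(tensors)

print(consume_tensor_list(chunks))  # 3
```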

Test Plan:
OSS:
python3 -m benchmark_all_test --iterations 1 --warmup_iterations 1 --use_jit

Reviewed By: mingzhe09088

Differential Revision: D24774105

fbshipit-source-id: 210a0345b8526ebf3c24f4d0794e20b2ff6cef3d
2020-11-12 23:29:40 -08:00
Ilia Cherniavskii
a97c7e2ef0 Profiler benchmark fix (#47713)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47713

Fix the import and always use the internal Timer.
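
"Internal Timer" presumably refers to `torch.utils.benchmark.Timer`; a minimal usage sketch (the measured statement is a placeholder, not the actual profiler_bench workload):
```python
import torch
from torch.utils.benchmark import Timer

x = torch.randn(1024, 1024)

# Time a statement with the benchmark Timer; it handles warmup and reports
# per-run statistics, unlike a bare timeit loop.
timer = Timer(stmt="torch.mm(x, x)", globals={"torch": torch, "x": x})
print(timer.timeit(100))
```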

Test Plan: python benchmarks/profiler_benchmark/profiler_bench.py

Reviewed By: dzhulgakov

Differential Revision: D24873991

Pulled By: ilia-cher

fbshipit-source-id: 1c3950d7d289a4fb5bd7043ba2d842a35c263eaa
2020-11-12 21:47:30 -08:00
Yang Wang
8ff0b6fef8 [OpBenchMobile] Enable operator_benchmark to run the benchmark on mobile through AiBench (#47767)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47767

This diff implements running benchmarks on mobile on top of the operator_benchmark framework. It does so through a few steps:

1. create a scripted module from existing benchmark case.
2. run mobile specific optimization pass on the scripted module
3. run the scripted module on AiBench by calling its Python API

A small change to the way a benchmark case is written is introduced so that local and mobile runs share the same interface: inputs become arguments of the `forward` function, so that the mobile optimization pass can run successfully (otherwise everything would be optimized away by constant propagation).
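
A condensed sketch of the three steps above, assuming a benchmark whose `forward` takes its inputs as arguments (the class, shapes, and local run stand in for the real AiBench submission):
```python
import torch
from torch.utils.mobile_optimizer import optimize_for_mobile

class AddBenchmarkModule(torch.nn.Module):
    # Inputs arrive as forward() arguments, so constant propagation cannot
    # fold the whole benchmark away during the mobile optimization pass.
    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        return a + b

scripted = torch.jit.script(AddBenchmarkModule().eval())        # step 1: script the case
mobile_module = optimize_for_mobile(scripted)                   # step 2: mobile passes
out = mobile_module(torch.randn(1, 1, 1), torch.randn(1, 1, 1)) # step 3: run (on AiBench when on-device)
```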

Test Plan:
## local op_bench run

buck run caffe2/benchmarks/operator_benchmark:benchmark_all_test --  --iterations 1 --warmup_iterations 1

buck run caffe2/benchmarks/operator_benchmark:benchmark_all_test --  --iterations 1 --warmup_iterations 1 --use_jit

Exceptions: the `py_module` op in `FakeQuantizePerTensorBaseOpBenchmark` and `FakeQuantizePerChannelBaseOpBenchmark` under JIT mode. These tests also failed in the base version:

```
RuntimeError:
Module 'FakeQuantizePerChannelOpBenchmark' has no attribute 'op_func' (This function exists as an attribute on the Python module, but we failed to compile it to a TorchScript function.
The error stack is reproduced here:

Python builtin <built-in method apply of FunctionMeta object at 0x619000c652a0> is currently not supported in Torchscript:
  File "/data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/pt/quantization_test#link-tree/quantization_test.py", line 260
    quant_min: int, quant_max: int
):
    return _LearnableFakeQuantizePerChannelOp.apply(input, scale, zero_point, axis, quant_min, quant_max, 1.0)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
:
  File "/data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/pt/quantization_test#link-tree/quantization_test.py", line 313
        axis: int, quant_min: int, quant_max: int
    ):
        return self.op_func(input, scale, zero_point, axis, quant_min, quant_max)
               ~~~~~~~~~~~~ <--- HERE
```

`_consume_op` typing mismatch: chunk, split, qobserver, sort in qunary. These will be fixed in D24774105

## OSS test

python3 -m benchmark_all_test --iterations 1 --warmup_iterations 1 --use_jit
python3 -m benchmark_all_test --iterations 1 --warmup_iterations 1

## saved module graph
```
module __torch__.mobile_benchmark_utils.OpBenchmarkMobile {
  parameters {
  }
  attributes {
    training = True
    num_iters = 1
    benchmark = <__torch__.pt.add_test.___torch_mangle_4.AddBenchmark object at 0x6070001b8b50>
  }
  methods {
    method forward {
      graph(%self : __torch__.mobile_benchmark_utils.OpBenchmarkMobile):
        %12 : None = prim::Constant() # /data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/fb/pt/mobile/benchmark_all_test_fbcode#link-tree/mobile_benchmark_utils.py:9:4
        %4 : bool = prim::Constant[value=1]() # /data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/fb/pt/mobile/benchmark_all_test_fbcode#link-tree/mobile_benchmark_utils.py:10:8
        %1 : int = prim::GetAttr[name="num_iters"](%self)
         = prim::Loop(%1, %4) # /data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/fb/pt/mobile/benchmark_all_test_fbcode#link-tree/mobile_benchmark_utils.py:10:8
          block0(%i : int):
            %6 : __torch__.pt.add_test.___torch_mangle_4.AddBenchmark = prim::GetAttr[name="benchmark"](%self)
            %7 : __torch__.pt.add_test.___torch_mangle_4.AddBenchmark = prim::GetAttr[name="benchmark"](%self)
            %self.inputs_tuple : (Float(1, 1, 1, strides=[1, 1, 1], requires_grad=0, device=cpu), Float(1, 1, 1, strides=[1, 1, 1], requires_grad=0, device=cpu)) = prim::Constant[value=({0.48884}, {0.809042})]()
            %9 : Tensor, %10 : Tensor = prim::TupleUnpack(%self.inputs_tuple)
            %23 : int = prim::Constant[value=1]()
            %24 : Tensor = aten::add(%9, %10, %23) # /data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/fb/pt/mobile/benchmark_all_test_fbcode#link-tree/pt/add_test.py:39:15
            -> (%4)
        return (%12)

    }
  }
  submodules {
    module __torch__.pt.add_test.___torch_mangle_4.AddBenchmark {
      parameters {
      }
      attributes {
        mobile_optimized = True
      }
      methods {
        method forward {
          graph(%self : __torch__.pt.add_test.___torch_mangle_4.AddBenchmark,
                %input_one.1 : Tensor,
                %input_two.1 : Tensor):
            %3 : int = prim::Constant[value=1]()
            %4 : Tensor = aten::add(%input_one.1, %input_two.1, %3) # /data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/fb/pt/mobile/benchmark_all_test_fbcode#link-tree/pt/add_test.py:39:15
            return (%4)

        }
        method get_inputs {
          graph(%self : __torch__.pt.add_test.___torch_mangle_4.AddBenchmark):
            %self.inputs_tuple : (Float(1, 1, 1, strides=[1, 1, 1], requires_grad=0, device=cpu), Float(1, 1, 1, strides=[1, 1, 1], requires_grad=0, device=cpu)) = prim::Constant[value=({0.48884}, {0.809042})]()
            return (%self.inputs_tuple)

        }
      }
      submodules {
      }
    }
  }
}

```

Reviewed By: kimishpatel

Differential Revision: D24322214

fbshipit-source-id: 335317eca4f40c4083883eb41dc47caf25cbdfd1
2020-11-12 17:15:05 -08:00
Meng Wang
f692af209d add unittest for operator benchmark (#47678)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47678

add unittest for operator benchmark.
Covers below cases:
```
generate_c2_test
generate_c2_gradient_test
generate_pt_test
generate_pt_gradient_test
generate_pt_tests_from_op_list
```
Also fixed two issues (incorrect function signatures) found by the unittest in `benchmark_caffe2.py`.

Test Plan:
arc lint
buck run caffe2/benchmarks/operator_benchmark:operator_benchmark_unittest
```
test_c2_single_op (operator_benchmark_unittest.BenchmarkTest) ... # ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking Caffe2: add
WARNING: Logging before InitGoogleLogging() is written to STDERR
W1109 23:08:39.932207 639464 init.h:137] Caffe2 GlobalInit should be run before any other API calls.
# Name: add_M8
# Input: M: 8
Forward Execution Time (us) : 36.474

# Benchmarking Caffe2: add
# Name: add_M8
# Input: M: 8
Backward Execution Time (us) : 42.281

ok
test_pt_list_of_ops (operator_benchmark_unittest.BenchmarkTest) ... # ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking Caffe2: add
# Name: add_M8
# Input: M: 8
Forward Execution Time (us) : 36.579

# Benchmarking Caffe2: add
# Name: add_M8
# Input: M: 8
Backward Execution Time (us) : 42.734

# Benchmarking PyTorch: abs
# Mode: Eager
# Name: abs_M8
# Input: M: 8
Forward Execution Time (us) : 148.929

# Benchmarking PyTorch: abs_
# Mode: Eager
# Name: abs__M8
# Input: M: 8
Forward Execution Time (us) : 71.909

ok
test_pt_single_op (operator_benchmark_unittest.BenchmarkTest) ... # ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking Caffe2: add
# Name: add_M8
# Input: M: 8
Forward Execution Time (us) : 36.860

# Benchmarking Caffe2: add
# Name: add_M8
# Input: M: 8
Backward Execution Time (us) : 42.293

# Benchmarking PyTorch: abs
# Mode: Eager
# Name: abs_M8
# Input: M: 8
Forward Execution Time (us) : 148.999

# Benchmarking PyTorch: abs_
# Mode: Eager
# Name: abs__M8
# Input: M: 8
Forward Execution Time (us) : 71.941

# Benchmarking PyTorch: add
# Mode: Eager
# Name: add_M8
# Input: M: 8
Forward Execution Time (us) : 179.108

# Benchmarking PyTorch: add
# Mode: Eager
# Name: add_M8
# Input: M: 8
Backward Execution Time (us) : 1205.902

ok
```
buck run caffe2/benchmarks/operator_benchmark/c2:add_test
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking Caffe2: add
WARNING: Logging before InitGoogleLogging() is written to STDERR
W1109 23:20:11.551795 654290 init.h:137] Caffe2 GlobalInit should be run before any other API calls.
# Name: add_M8_N16_K32_dtypeint
# Input: M: 8, N: 16, K: 32, dtype: int
Forward Execution Time (us) : 984.510

# Benchmarking Caffe2: add
# Name: add_M16_N16_K64_dtypefloat
# Input: M: 16, N: 16, K: 64, dtype: float
Forward Execution Time (us) : 68.526

# Benchmarking Caffe2: add
# Name: add_M64_N64_K128_dtypeint
# Input: M: 64, N: 64, K: 128, dtype: int
Forward Execution Time (us) : 101617.076
```

Reviewed By: mingzhe09088

Differential Revision: D24854414

fbshipit-source-id: 6676549909da6700b42f322c4ad6e8e2ef5b86b5
2020-11-10 15:45:36 -08:00
Radhakrishnan Venkataramani
163adb9fa7 Add HalfToFloat + FloatToHalf operators to PyTorch (#45092)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45092

Adding two operators
1. at::float_to_half -> Converts FP32 tensor to FP16 tensor
2. at::half_to_float -> Converts FP16 tensor to FP32 tensor.

These operators internally use the kernel provided by FBGeMM. Both C2 and PT will use the same FBGeMM kernel underneath.
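
From Python, the conversions being measured correspond roughly to the generic casts below; the dedicated `at::float_to_half`/`at::half_to_float` ops route through the FBGeMM kernel, so this is an approximation for illustration only.
```python
import torch

x_fp32 = torch.randn(512, 512)

x_fp16 = x_fp32.half()   # FP32 -> FP16 (generic cast; the new op uses FBGeMM)
x_back = x_fp16.float()  # FP16 -> FP32

assert x_back.dtype == torch.float32
```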

Test Plan:
buck test //caffe2/test:torch -- .*test_half_tensor.*

Run benchmark locally using

```
buck run //caffe2/benchmarks/operator_benchmark/pt:tensor_to_test
```

AI Bench results are pending. I don't expect them to finish soon, as we have a large queue with jobs pending for 2+ days.

Benchmark for a 512x512 tensor with the FBGeMM implementation:

```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: FloatToHalfTensorConversionBenchmark
# Mode: Eager
# Name: FloatToHalfTensorConversionBenchmark_M512_N512_cpu
# Input: M: 512, N: 512, device: cpu
Forward Execution Time (us) : 1246.332

# Benchmarking PyTorch: HalfToFloatTensorConversionBenchmark
# Mode: Eager
# Name: HalfToFloatTensorConversionBenchmark_M512_N512_cpu
# Input: M: 512, N: 512, device: cpu
Forward Execution Time (us) : 1734.304
```

Benchmark for a 512x512 tensor on trunk with no FBGeMM integration:

```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: FloatToHalfTensorConversionBenchmark
# Mode: Eager
# Name: FloatToHalfTensorConversionBenchmark_M512_N512_cpu
# Input: M: 512, N: 512, device: cpu
Forward Execution Time (us) : 169045.724

# Benchmarking PyTorch: HalfToFloatTensorConversionBenchmark
# Mode: Eager
# Name: HalfToFloatTensorConversionBenchmark_M512_N512_cpu
# Input: M: 512, N: 512, device: cpu
Forward Execution Time (us) : 152382.494
```

Reviewed By: ngimel

Differential Revision: D23824869

fbshipit-source-id: ef044459b6c8c6e5ddded72080204c6a0ab4582c
2020-11-10 12:00:53 -08:00
Shijun Kong
220b3bd667 Add op benchmark for batch box cox as baseline (#47275)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47275

```
# Benchmarking Caffe2: batch_box_cox
# Name: batch_box_cox_M64_N64_dtypedouble
# Input: M: 64, N: 64, dtype: double
Forward Execution Time (us) : 49.005
```
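
For reference, the per-column Box-Cox transform that this baseline exercises can be sketched in PyTorch as below; the formula and parameter names are assumptions, since the benchmark drives the Caffe2 operator directly.
```python
import torch

def batch_box_cox(data: torch.Tensor, lambda1: torch.Tensor, lambda2: torch.Tensor) -> torch.Tensor:
    # Assumed semantics: ((x + lambda2) ** lambda1 - 1) / lambda1 when
    # lambda1 != 0, otherwise log(x + lambda2), applied per column.
    shifted = data + lambda2
    powered = ((shifted ** lambda1) - 1.0) / lambda1
    logged = torch.log(shifted)
    return torch.where(lambda1 == 0, logged, powered)

x = torch.rand(64, 64, dtype=torch.float64) + 0.5
lam1 = torch.ones(64, dtype=torch.float64)
lam2 = torch.zeros(64, dtype=torch.float64)
print(batch_box_cox(x, lam1, lam2).shape)  # torch.Size([64, 64])
```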

Test Plan: `buck run mode/opt caffe2/benchmarks/operator_benchmark/c2:batch_box_cox_test -- --iterations=1000  --warmup 100`

Reviewed By: houseroad

Differential Revision: D24675426

fbshipit-source-id: 8bb1f3076dc6b01e7b63468136ddf3d9b6d7e5d2
2020-11-05 07:16:32 -08:00
Supriya Rao
d8c3b2b10c [quant][pyper] Add support for pruned weights in embedding_bag_byte lookup (#47329)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47329

Supports pruned weights along with mapping for the compressed indices

Test Plan:
python test/test_quantization.py TestQuantizedEmbeddingOps

Imported from OSS

Reviewed By: qizzzh

Differential Revision: D24719909

fbshipit-source-id: f998f4039e84bbe1886e492a3bff6aa5f56b6b0f
2020-11-04 22:33:33 -08:00
Hao Lu
996f444c00 [pt][static_runtime] Memory model (#46896)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46896

The idea of the memory model is quite similar to that of BlackBoxPredictor; however, it's more complicated in PT due to 1) tensor views that share storage (with storage refcount bumps) but have different TensorImpls, 2) tensors sharing the same TensorImpl and the same storage, but with no refcount bump of the StorageImpl, 3) data types such as TensorList and Tuple that contain Tensors, and 4) the need to support a mix of non-out/out variants while we move the aten ops to out variants. (A small Python illustration of the view case follows the adjustment list below.)

As a result, I have to make the following adjustments:
1) remove tensors in output Tuples from the internal blob list;
2) for memory allocation/deallocation, get candidate Tensors from the outputs of ops with out variants, extract StorageImpls from the Tensors, dedup, remove output tensor StorageImpls, and get the final list of blobs for memory planning;
3) during the clean_up_memory pass, clean up memory held by the StorageImpls, as well as by Tensors/Lists/Tuples in IValues that don't participate in memory planning, to reduce overall memory usage.
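
A small Python illustration of why views complicate the planning (point 1 in the list of complications above): deduplicating by storage groups tensors that share memory even though their TensorImpls differ.
```python
import torch

base = torch.randn(4, 4)
view = base.view(16)   # different TensorImpl, same StorageImpl
alias = base           # same TensorImpl, same StorageImpl

# Grouping by the underlying storage pointer collapses all three, which is
# the property the memory planner relies on when pooling activations.
storages = {t.storage().data_ptr() for t in (base, view, alias)}
print(len(storages))  # 1
```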

Risk:
The PyTorch team is planning to deprecate the current resize_output API, which we rely on. This is a pretty big risk.

https://www.internalfb.com/intern/diffusion/FBS/browsefile/master/fbcode/caffe2/aten/src/ATen/native/Resize.cpp?commit=6457b329847607553d34e788a3a7092f41f38895&lines=9-23

Test Plan:
```
buck test //caffe2/test:static_runtime
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test //caffe2/caffe2/fb/predictor:pytorch_predictor_test
```
Benchmarks:
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 13 \
buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench \
--scripted_model=/home/hlu/ads/adindexer/adindexer_ctr_mobilefeed/pt/merge/traced_precomputation.pt \
--pt_inputs=/home/hlu/ads/adindexer/adindexer_ctr_mobilefeed/pt/merge/container_precomputation_bs1.pt \
--iters=1000 --warmup_iters=10000 --num_threads=1 --pt_enable_static_runtime=true \
--pt_cleanup_activations=true --pt_enable_out_variant=false
```

| pt_cleanup_activations | pt_enable_out_variant | old ms/iter | new ms/iter |
| --- | --- | --- | --- |
| 0 | 0 | 0.31873 | 0.30228 |
| 0 | 1 | 0.30018 | 0.29184 |
| 1 | 0 | 0.35246 | 0.31895 |
| 1 | 1 | 0.35742 | 0.30417 |

Reviewed By: bwasti, raziel

Differential Revision: D24471854

fbshipit-source-id: 4ac37dca7d2a0c362120a7f02fd3995460c9a55c
2020-11-03 23:47:59 -08:00
Sheng Qin
c9222b7471 Implement clip_ranges operator for PyTorch
Test Plan:
unit test for correctness
```
buck test caffe2/torch/fb/sparsenn:test -- test_clip_ranges
Parsing buck files: finished in 1.6 sec
Creating action graph: finished in 18.9 sec
Building: finished in 15.0 sec (100%) 9442/9442 jobs, 1 updated
  Total time: 35.6 sec
More details at https://www.internalfb.com/intern/buck/build/66fb17de-859e-4d01-89bf-5c5de2950693
Tpx test run coordinator for Facebook. See https://fburl.com/tpx for details.
Running with tpx session id: 80f5e0c2-7db2-48a4-b148-25dd34651682
Trace available for this run at /tmp/tpx-20201026-123217.050766/trace.log
Started reporting to test run: https://our.intern.facebook.com/intern/testinfra/testrun/4503599665041422
    ✓ ListingSuccess: caffe2/torch/fb/sparsenn:test - main (14.912)
    ✓ Pass: caffe2/torch/fb/sparsenn:test - test_clip_ranges (caffe2.torch.fb.sparsenn.tests.sparsenn_operators_test.SparseNNOperatorsTest) (14.098)
Summary
  Pass: 1
  ListingSuccess: 1
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/4503599665041422
```

New benchmark perf test:
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH6_M1_N2_MAX_LENGTH1_dtypetorch.int32_cpu
# Input: LENGTH: 6, M: 1, N: 2, MAX_LENGTH: 1, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 155.765

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH7_M1_N2_MAX_LENGTH2_dtypetorch.int32_cpu
# Input: LENGTH: 7, M: 1, N: 2, MAX_LENGTH: 2, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 156.248

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH8_M1_N2_MAX_LENGTH3_dtypetorch.int32_cpu
# Input: LENGTH: 8, M: 1, N: 2, MAX_LENGTH: 3, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 156.634

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH9_M1_N2_MAX_LENGTH4_dtypetorch.int32_cpu
# Input: LENGTH: 9, M: 1, N: 2, MAX_LENGTH: 4, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 155.408

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH10_M1_N2_MAX_LENGTH5_dtypetorch.int32_cpu
# Input: LENGTH: 10, M: 1, N: 2, MAX_LENGTH: 5, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 165.168
```

Compared with the old implementation, there is **around a 300us gain**:
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH6_M1_N2_MAX_LENGTH1_dtypetorch.int32_cpu
# Input: LENGTH: 6, M: 1, N: 2, MAX_LENGTH: 1, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 443.012

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH7_M1_N2_MAX_LENGTH2_dtypetorch.int32_cpu
# Input: LENGTH: 7, M: 1, N: 2, MAX_LENGTH: 2, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 446.480

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH8_M1_N2_MAX_LENGTH3_dtypetorch.int32_cpu
# Input: LENGTH: 8, M: 1, N: 2, MAX_LENGTH: 3, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 444.064

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH9_M1_N2_MAX_LENGTH4_dtypetorch.int32_cpu
# Input: LENGTH: 9, M: 1, N: 2, MAX_LENGTH: 4, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 445.511

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH10_M1_N2_MAX_LENGTH5_dtypetorch.int32_cpu
# Input: LENGTH: 10, M: 1, N: 2, MAX_LENGTH: 5, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 450.468
```

Reviewed By: MarcioPorto

Differential Revision: D24546110

fbshipit-source-id: e6c9b38e911f177f97961ede5bf375107f240363
2020-10-28 09:46:37 -07:00
Sheng Qin
c6858fd71a Set up benchmarks for ClipRanges operator for Caffe2 and PyTorch
Summary: As titled, adding benchmark tests for the ClipRanges operator.

Test Plan:
benchmark test for Caffe2
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking Caffe2: clip_ranges
WARNING: Logging before InitGoogleLogging() is written to STDERR
W1026 12:30:33.938997 2658759 init.h:137] Caffe2 GlobalInit should be run before any other API calls.
# Name: clip_ranges_LENGTH6_M1_N2_MAX_LENGTH1_dtypeint32
# Input: LENGTH: 6, M: 1, N: 2, MAX_LENGTH: 1, dtype: int32
Forward Execution Time (us) : 5.805

# Benchmarking Caffe2: clip_ranges
# Name: clip_ranges_LENGTH7_M1_N2_MAX_LENGTH2_dtypeint32
# Input: LENGTH: 7, M: 1, N: 2, MAX_LENGTH: 2, dtype: int32
Forward Execution Time (us) : 5.913

# Benchmarking Caffe2: clip_ranges
# Name: clip_ranges_LENGTH8_M1_N2_MAX_LENGTH3_dtypeint32
# Input: LENGTH: 8, M: 1, N: 2, MAX_LENGTH: 3, dtype: int32
Forward Execution Time (us) : 5.941

# Benchmarking Caffe2: clip_ranges
# Name: clip_ranges_LENGTH9_M1_N2_MAX_LENGTH4_dtypeint32
# Input: LENGTH: 9, M: 1, N: 2, MAX_LENGTH: 4, dtype: int32
Forward Execution Time (us) : 5.868

# Benchmarking Caffe2: clip_ranges
# Name: clip_ranges_LENGTH10_M1_N2_MAX_LENGTH5_dtypeint32
# Input: LENGTH: 10, M: 1, N: 2, MAX_LENGTH: 5, dtype: int32
Forward Execution Time (us) : 6.408
```

benchmark test for PyTorch
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH6_M1_N2_MAX_LENGTH1_dtypetorch.int32_cpu
# Input: LENGTH: 6, M: 1, N: 2, MAX_LENGTH: 1, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 443.012

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH7_M1_N2_MAX_LENGTH2_dtypetorch.int32_cpu
# Input: LENGTH: 7, M: 1, N: 2, MAX_LENGTH: 2, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 446.480

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH8_M1_N2_MAX_LENGTH3_dtypetorch.int32_cpu
# Input: LENGTH: 8, M: 1, N: 2, MAX_LENGTH: 3, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 444.064

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH9_M1_N2_MAX_LENGTH4_dtypetorch.int32_cpu
# Input: LENGTH: 9, M: 1, N: 2, MAX_LENGTH: 4, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 445.511

# Benchmarking PyTorch: clip_ranges
# Mode: JIT
# Name: clip_ranges_LENGTH10_M1_N2_MAX_LENGTH5_dtypetorch.int32_cpu
# Input: LENGTH: 10, M: 1, N: 2, MAX_LENGTH: 5, dtype: torch.int32, device: cpu
Forward Execution Time (us) : 450.468
```

Reviewed By: MarcioPorto

Differential Revision: D24500468

fbshipit-source-id: a582090a3982005af272cb10cdd257b2b2e787c4
2020-10-28 09:42:10 -07:00
Shijun Kong
d5cd781cd3 Update dper3 to use torch.nan_to_num and nan_to_num_ (#46873)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46873

OSS:
Add op benchmark for torch.nan_to_num and torch.nan_to_num_
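
A minimal usage sketch of the two variants covered by the new benchmark:
```python
import torch

x = torch.tensor([float("nan"), float("inf"), -float("inf"), 1.5])

y = torch.nan_to_num(x, nan=0.0, posinf=1e4, neginf=-1e4)  # out-of-place
x.nan_to_num_(nan=0.0, posinf=1e4, neginf=-1e4)            # in-place

print(y)
print(x)
```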

Test Plan:
OSS:
`buck run mode/opt caffe2/benchmarks/operator_benchmark/pt:nan_to_num_test`

Reviewed By: qizzzh, houseroad

Differential Revision: D24521835

fbshipit-source-id: 1fd50a99e5329ffec2d470525ce6976d39424958
2020-10-27 06:41:48 -07:00
shmsong
56a3831bc6 [NVFuser]Benchmark minor update (#46778)
Summary:
This is a tiny PR for two minor fixes:

1. Added `torch._C._jit_set_texpr_fuser_enabled(False)` to enable shape inference on NV fuser runs (see the sketch after this list).
2. Renamed dynamic benchmark modules to avoid multiple matches, e.g. `simple_element` vs. `dynamic_simple_element`. It would be much easier if the pattern matching were based on `startswith`; happy to update that if agreed.
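
The toggle combination for an NV fuser run looks roughly like this; only the TE-fuser call is taken from this PR, the NV fuser switch name is an assumption.
```python
import torch

# Point 1 above: disable the TE fuser so shape inference runs for the NV fuser.
torch._C._jit_set_texpr_fuser_enabled(False)

# Assumed companion toggle for routing fusion to the NV fuser (CUDA builds only).
torch._C._jit_set_nvfuser_enabled(True)
```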

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46778

Reviewed By: zhangguanheng66

Differential Revision: D24516911

Pulled By: bertmaher

fbshipit-source-id: 839f9a3e058f9d7aca17b2e6eb8b558e0e48e8f4
2020-10-26 12:22:36 -07:00
Shijun Kong
6ae0a7c919 Add ReplaceNaN benchmark as baseline (#46685)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46685

as title

Test Plan:
caffe2

```
./buck-out/gen/caffe2/benchmarks/operator_benchmark/c2/replace_nan_test.par

# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking Caffe2: replace_nan
WARNING: Logging before InitGoogleLogging() is written to STDERR
W1022 10:09:48.508246 1887813 init.h:137] Caffe2 GlobalInit should be run before any other API calls.
# Name: replace_nan_M16_N16_dtypefloat
# Input: M: 16, N: 16, dtype: float
Forward Execution Time (us) : 30.742

# Benchmarking Caffe2: replace_nan
# Name: replace_nan_M16_N16_dtypedouble
# Input: M: 16, N: 16, dtype: double
Forward Execution Time (us) : 29.135

# Benchmarking Caffe2: replace_nan
# Name: replace_nan_M64_N64_dtypefloat
# Input: M: 64, N: 64, dtype: float
Forward Execution Time (us) : 94.059

# Benchmarking Caffe2: replace_nan
# Name: replace_nan_M64_N64_dtypedouble
# Input: M: 64, N: 64, dtype: double
Forward Execution Time (us) : 93.569
```

Reviewed By: qizzzh, houseroad

Differential Revision: D24448483

fbshipit-source-id: 51574ca0eca6dba5828dfdc754193dba5a62954f
2020-10-22 19:12:14 -07:00
Yang Wang
920ec6651f [OpBench] fix jit mode run of operator benchmark for ops with parameters (#46694)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46694

For ops with parameters (e.g. conv), the JIT-mode run currently raises
`RuntimeError: Cannot insert a Tensor that requires grad as a constant. Consider making it a parameter or input, or detaching the gradient`. After consulting https://www.fburl.com/vtkys6ug, we decided to turn off gradients for the parameters in the forward run. If we want ops with parameters to work in backward with JIT mode, we probably need to make `TorchBenchmarkBase` a subclass of `nn.Module`.
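
A sketch of the failure mode and the workaround, using a conv whose weights are captured by a scripted function (names and shapes are illustrative):
```python
import torch

conv = torch.nn.Conv2d(16, 33, kernel_size=3)

# Without this, scripting a function that closes over the parameters raises
# "Cannot insert a Tensor that requires grad as a constant".
for p in conv.parameters():
    p.requires_grad_(False)

weight, bias = conv.weight, conv.bias

@torch.jit.script
def conv_forward(x: torch.Tensor) -> torch.Tensor:
    return torch.conv2d(x, weight, bias)

out = conv_forward(torch.randn(1, 16, 32, 32))
print(out.shape)  # torch.Size([1, 33, 30, 30])
```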

Test Plan: ./buck-out/gen/caffe2/benchmarks/operator_benchmark/pt/conv_test.par  --use_jit

Reviewed By: mingzhe09088

Differential Revision: D24451206

fbshipit-source-id: 784eb60ca155b0152d745c92f6d0ce6b2c9014c6
2020-10-22 11:10:28 -07:00
Shijun Kong
e5a2ba2ea1 Fix benchmark_caffe2
Summary: benchmark_caffe2 is broken due to refactoring that changed from eager test generation to registration only.

Test Plan:
`buck run caffe2/benchmarks/operator_benchmark/c2:add_test`

```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking Caffe2: add
WARNING: Logging before InitGoogleLogging() is written to STDERR
W1021 08:07:06.350742 390665 init.h:137] Caffe2 GlobalInit should be run before any other API calls.
# Name: add_M8_N16_K32_dtypeint
# Input: M: 8, N: 16, K: 32, dtype: int
Forward Execution Time (us) : 652.748

# Benchmarking Caffe2: add
# Name: add_M16_N16_K64_dtypefloat
# Input: M: 16, N: 16, K: 64, dtype: float
Forward Execution Time (us) : 63.570

# Benchmarking Caffe2: add
# Name: add_M64_N64_K128_dtypeint
# Input: M: 64, N: 64, K: 128, dtype: in
```

Reviewed By: qizzzh

Differential Revision: D24448374

fbshipit-source-id: 850fd375d194c20c385ea4433aea13066c7476e6
2020-10-22 08:09:06 -07:00
Mingzhe Li
8908f6ad8e [op-bench] modify import path of configs (#46679)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46679

The current way of importing configs causes a runtime error when a single benchmark is launched directly with buck (e.g. `/buck-out/gen/caffe2/benchmarks/operator_benchmark/pt/conv_test.par`). This diff fixes that issue.
ghstack-source-id: 114857978

Test Plan: waitforsandcastle

Reviewed By: vkuzo

Differential Revision: D24459631

fbshipit-source-id: 29df17e66962a8604dbb7b8b9106713c3c19bed5
2020-10-21 16:15:11 -07:00
Hao Lu
1a3ea46dbf [StaticRuntime] Threading model (#46219)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46219

- Refactor StaticRuntime and group common data structures, the jit graph, and the script module into a separate struct `InferenceModule`:
```
struct InferenceModule {
  explicit InferenceModule(const torch::jit::Module& m);
  explicit InferenceModule(std::shared_ptr<torch::jit::Graph> g);
  torch::jit::Module module;
  std::shared_ptr<torch::jit::Graph> graph;
  std::unique_ptr<c10::FunctionSchema> schema;

  std::unordered_map<Value*, size_t> value_to_reg;
  std::vector<size_t> input_regs; // inputs to the graph
  std::vector<size_t> output_regs; // outputs of the graph
  std::vector<size_t> internals;
};
```
which is stored in the PyTorchPredictor, as well as the static runtime, and shared across threads. Then this is what's left inside the Static Runtime:
```
  mutable std::vector<IValue> reg_;
  // The nodes we need to run
  std::vector<ProcessedNode> nodes_;
```
`reg_` holds all the weights and activations, which differ across threads at runtime. `nodes_` holds the op nodes and input/output registers, and is the same across threads for now. We could potentially put other stateful data structures in it, so I kept it inside the static runtime. It could easily be moved into the `InferenceModule` if we decide not to put anything else into `ProcessedNode`.

- Added StaticRuntimeOptions so we can toggle certain optimizations on/off, for testing and benchmarking. `cleanup_activations` is an example.

- Integration with PyTorchPredictor. Added a lockfree stack in the PyTorchPredictor to hold all the static runtime instances. Benchmark shows that the `push` and `pop` combo takes about 80 ns, which is quite acceptable.

This diff focuses on threading model only. Benchmarks will be separate.

Reviewed By: bwasti

Differential Revision: D24237078

fbshipit-source-id: fd0d6347f02b4526ac17dec1f731db48424bade1
2020-10-20 14:37:30 -07:00
Mikhail Zolotukhin
e5ed037529 [StaticRuntime] Add a 'speed of light' benchmark. (#46308)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46308

This PR adds a hand-optimized version of the DeepAndWide model with the goal
of estimating the overhead of the static runtime. While the static runtime is
currently much faster than the existing JIT interpreter, it would be useful to
understand how close we are to an absolutely zero-overhead system. Currently,
this "ideal" implementation is 2x faster than the static runtime at batch size 1.

Full benchmark results:
```
Running build/bin/static_runtime_bench
Run on (24 X 2394.71 MHz CPU s)
CPU Caches:
  L1 Data 32K (x24)
  L1 Instruction 32K (x24)
  L2 Unified 4096K (x24)
  L3 Unified 16384K (x24)
------------------------------------------------------------------------------
Benchmark                                       Time           CPU Iterations
------------------------------------------------------------------------------
BM_deep_wide_base/1                         59518 ns      59500 ns      10909
BM_deep_wide_base/8                         74635 ns      74632 ns       9317
BM_deep_wide_base/20                        82186 ns      82147 ns       9119
BM_deep_wide_fast/1                         13851 ns      13851 ns      49825 << new
BM_deep_wide_fast/8                         22497 ns      22497 ns      32089 << new
BM_deep_wide_fast/20                        23868 ns      23841 ns      31184 << new
BM_deep_wide_jit_graph_executor/1           62786 ns      62786 ns      10835
BM_deep_wide_jit_graph_executor/8           76730 ns      76718 ns       7529
BM_deep_wide_jit_graph_executor/20          78886 ns      78883 ns       8769
BM_deep_wide_jit_profiling_executor/1       69504 ns      69490 ns      10309
BM_deep_wide_jit_profiling_executor/8       75718 ns      75715 ns       9199
BM_deep_wide_jit_profiling_executor/20      75364 ns      75364 ns       9010
BM_deep_wide_static/1                       40324 ns      40318 ns      17232
BM_deep_wide_static/8                       50327 ns      50319 ns      13335
BM_deep_wide_static/20                      53075 ns      53071 ns      12855
BM_deep_wide_static_threaded/threads:8       6258 ns      49873 ns      14008
```

PS: The implementation could probably be optimized even more.

Differential Revision: D24300702

Test Plan: Imported from OSS

Reviewed By: dzhulgakov

Pulled By: ZolotukhinM

fbshipit-source-id: 7870bdef127c39d11bcaa4f03a60eb80a46be58e
2020-10-19 23:35:55 -07:00
Bugra Akyildiz
03c7d5be6b Add operator benchmark for 4bit/8bit embedding lookups
Summary: Add operator benchmark for 4bit/8bit embedding lookups in `aibench`.

Test Plan:
```
buck build //caffe2/benchmarks/operator_benchmark/pt:qembedding_bag_lookups_test
aibench-cli adhoc -c 'buck run //caffe2/benchmarks/operator_benchmark/pt:qembedding_bag_lookups_test'
```

The run was successful in aibench: https://www.internalfb.com/intern/aibench/details/738300474
https://www.internalfb.com/intern/aibench/details/346463246

Reviewed By: radkris-git

Differential Revision: D24268413

fbshipit-source-id: 7fb4ff75da47f8f327edab562c5d29bb69e00b8d
2020-10-15 13:51:32 -07:00
Bert Maher
b7261de0df [pytorch][te] Add compilation time benchmark (#46124)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46124

We want to make sure we can actually fuse kernels within a fairly
tight time budget.  So here's a quick benchmark of codegen for a simple
pointwise activation function (swish).  I kept all the intermediate tensors
separate to force TE to actually do inlining.
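
A simplified sketch of the kind of pointwise graph being compiled, with the sigmoid kept as a separate intermediate (the benchmark's actual swish expression has more terms):
```python
import torch

def swish(x: torch.Tensor) -> torch.Tensor:
    # Separate intermediate so the fuser has to inline it rather than
    # receiving a single fused expression up front.
    sig = torch.sigmoid(x)
    return x * sig

scripted = torch.jit.script(swish)
print(scripted.graph)  # the graph handed to the TE fuser for codegen
```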

Test Plan:
```
buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench
```

I've only run in debug mode so results aren't super meaningful, but even in
that mode it's 18ms for compilation, 15ms of which are in LLVM.

Update, opt build mode:
```
----------------------------------------------------------------------------
Benchmark                                     Time           CPU Iterations
----------------------------------------------------------------------------
BM_CompileSwish                         5123276 ns    5119846 ns        148
BM_CompileSwishLLVMOnly                 4754361 ns    4753701 ns        160
```

Reviewed By: asuhan

Differential Revision: D24232801

fbshipit-source-id: d58a8b7f79bcd9244c49366af7a693e09f24bf76
2020-10-09 23:11:37 -07:00
shmsong
43fe45ab0f [JIT] Add dynamic shape benchmark for NV Fuser (#46107)
Summary:
This PR modifies `benchmarks/tensorexpr`. It follows up [https://github.com/pytorch/pytorch/pull/44101](https://github.com/pytorch/pytorch/pull/44101) and further supports characterizing fusers with dynamic-shape benchmarks. The dynamic-shape condition models the use case where the input tensor shape changes on each call to the graph.

Changes include:

Added an auxiliary class `DynamicShape` that provides a simple API for enabling dynamic shapes in existing test cases; an example can be found in `DynamicSimpleElementBench`.

Created new bench classes: `DynamicSimpleElementBench`, `DynamicReduce2DInnerBench`, `DynamicReduce2DOuterBench`, and `DynamicLSTM`. They are all dynamic-shaped versions of existing benchmarks and serve as examples of enabling dynamic shapes with `DynamicShape`.
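
The pattern these classes enable boils down to regenerating input shapes on every call; a minimal sketch under assumed names (the real helper lives in `benchmarks/tensorexpr`):
```python
import random
import torch

def make_dynamic_input(max_dim: int = 1024) -> torch.Tensor:
    # A fresh shape per call models the dynamic-shape condition.
    return torch.randn(random.randint(1, max_dim), random.randint(1, max_dim))

@torch.jit.script
def simple_element(x: torch.Tensor) -> torch.Tensor:
    return torch.relu(x * 2.0 + 1.0)

for _ in range(3):
    simple_element(make_dynamic_input())  # each call sees a new input shape
```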

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46107

Reviewed By: glaringlee

Differential Revision: D24229400

Pulled By: bertmaher

fbshipit-source-id: 889fece5ea87d0f6f6374d31dbe11b1cd1380683
2020-10-09 22:09:21 -07:00
Supriya Rao
31888b2e77 [quant][pyper] Rename the sparse argument for embedding_bag ops (#46003)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46003

sparse is confusing because it is used in training for sparse gradients.

Test Plan: Imported from OSS

Reviewed By: radkris-git, qizzzh

Differential Revision: D24178248

fbshipit-source-id: 0a2b595f3873d33b2ce25839b6eee31d2bfd3b0d
2020-10-08 16:15:28 -07:00
Shijun Kong
7d4f5060ad Fix doc about operator benchmark (#45853)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45853

The method name in the README is not consistent with the actual implementation.

Reviewed By: qizzzh

Differential Revision: D24114849

fbshipit-source-id: d979e324c768708e99b8cc5b87e261f17c22a883
2020-10-08 09:13:53 -07:00
Bert Maher
f2e569461b [te] Tiled (m=32 x n=32) gemm benchmark (#45905)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45905

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D24142402

Pulled By: bertmaher

fbshipit-source-id: b39e18b6985ee1c1f654fba4498ed91ff14d8d5f
2020-10-06 16:57:31 -07:00
Bert Maher
50f89578dd [te] Add a benchmark harness (#45875)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45875

Adds a googlebenchmark harness for perf testing programs generated by
tensorexpr, sans any pytorch wrappings (for python-level benchmarks of
tensorexpr, see benchmarks/tensorexpr).

Currently there's a harness for gemm that sets up the problem using torch (and
also measures the perf of a torch::mm to give a baseline).

Right now there's just an unoptimized implementation that is expected to be not
very fast.  More optimized versions are coming.

Sample output from my dev box:
```
Run on (48 X 2501 MHz CPU s)
CPU Caches:
  L1 Data 32K (x24)
  L1 Instruction 32K (x24)
  L2 Unified 256K (x24)
  L3 Unified 30720K (x2)
--------------------------------------------------------------------------------------------
Benchmark                                     Time           CPU Iterations UserCounters...
--------------------------------------------------------------------------------------------
Gemm/Torch/128/128/128                    73405 ns      73403 ns       8614 GFLOPS=57.1411G/s
Gemm/TensorExprNoopt/128/128/128        3073003 ns    3072808 ns        229 GFLOPS=1.36497G/s
```

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D24142403

Pulled By: bertmaher

fbshipit-source-id: 3354aaa56868a43a553acd1ad9a192f28d8e3597
2020-10-06 16:57:27 -07:00
Mingzhe Li
e829d4fba9 [op-bench] fix jit mode (#45774)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45774

Fix RuntimeError: No such operator operator_benchmark::_consume

Test Plan: waitforsandcastle

Reviewed By: ngimel

Differential Revision: D24064982

fbshipit-source-id: 13160b6d18569e659ca1ab0ca1d444ed9947260c
2020-10-05 09:29:41 -07:00
Hao Lu
2b48dd168d [StaticRuntime] Integrate Static Runtime into PyTorchPredictor (#45640)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45640

Reviewed By: dzhulgakov

Differential Revision: D23996656

fbshipit-source-id: 63d88c89d1df61a04deadc472319607ed83867e5
2020-10-02 23:03:05 -07:00
Ilia Cherniavskii
f5c95d5cf1 Source code level attribution in profiler (#43898)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43898

Adding a with_source parameter to enable tracking source code
(filename and line) in the profiler for eager, TorchScript, and autograd
modes.
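
A usage sketch of the flag described above; the keyword name is taken from this diff's summary (released builds expose similar functionality via `with_stack`), so treat it as illustrative:
```python
import torch

x = torch.randn(10, 10)

# Attach filename:line information to each profiled op, as shown in the
# "Source Location" column of the table below.
with torch.autograd.profiler.profile(with_source=True) as prof:
    y = x + x
    z = y.sum()

print(prof.key_averages().table(sort_by="self_cpu_time_total"))
```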

Test Plan:
python test/test_profiler.py
```
Name                                 Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  Source Location
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  --------------------------------------------
ts_method_1                          10.43%           235.364us        36.46%           822.920us        822.920us        1                test/test_profiler.py(70): test_source
aten::add                            7.52%            169.833us        8.88%            200.439us        200.439us        1                test/test_profiler.py(69): test_source
aten::normal_                        6.26%            141.380us        6.26%            141.380us        141.380us        1                test/test_profiler.py(67): test_source
aten::add                            5.80%            130.830us        8.41%            189.800us        63.267us         3                test/test_profiler.py(72): test_source
aten::sum                            5.02%            113.340us        8.39%            189.475us        189.475us        1                test/test_profiler.py(64): ts_method_1
aten::add                            4.58%            103.346us        6.33%            142.847us        142.847us        1                test/test_profiler.py(62): ts_method_1
aten::mul                            4.05%            91.498us         9.62%            217.113us        217.113us        1                test/test_profiler.py(71): test_source
aten::add                            4.03%            90.880us         5.60%            126.405us        126.405us        1                test/test_profiler.py(58): ts_method_2
aten::empty                          3.49%            78.735us         3.49%            78.735us         19.684us         4                test/test_profiler.py(72): test_source
```

Reviewed By: ngimel

Differential Revision: D23432664

Pulled By: ilia-cher

fbshipit-source-id: 83ad7ebe0c2502494d3b48c4e687802db9c77615
2020-09-30 00:57:35 -07:00
Taylor Robie
ccad73ab41 Fix D23995953 import.
Summary: https://github.com/pytorch/pytorch/pull/45511 could not be properly imported

Test Plan: See https://github.com/pytorch/pytorch/pull/45511

Reviewed By: zhangguanheng66

Differential Revision: D23995953

fbshipit-source-id: a6224a67d54617ddf34c2392e65f2142c4e78ea4
2020-09-29 19:30:23 -07:00
Bram Wasti
87b356d093 [static runtime] Split out graph preparation from runtime (#44131)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44131

Test Plan: Imported from OSS

Reviewed By: hlu1

Differential Revision: D23604305

Pulled By: bwasti

fbshipit-source-id: 7b47da4961d99074199417ef1407a788c7d80ee6
2020-09-28 13:01:23 -07:00
Mikhail Zolotukhin
bc5710f2f7 Benchmarks: tweak PE config settings. (#45349)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45349

Test Plan: Imported from OSS

Reviewed By: Krovatkin

Differential Revision: D23935518

Pulled By: ZolotukhinM

fbshipit-source-id: 5a7c508c6fc84eafbc23399f095d732b903510dc
2020-09-26 23:13:29 -07:00
Mikhail Zolotukhin
8cef7326f4 Benchmarks: add 'default' options for fuser and executor. (#45347)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45347

Test Plan: Imported from OSS

Reviewed By: Krovatkin

Differential Revision: D23935519

Pulled By: ZolotukhinM

fbshipit-source-id: 8323fafe7828683c4d29c12a1e5722adb6f945ff
2020-09-26 23:09:02 -07:00
Bram Wasti
e5f6e5af13 Add Deep and wide to test and flatten/tranpose for good measure (#44129)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44129

Test Plan: Imported from OSS

Reviewed By: hlu1

Differential Revision: D23604302

Pulled By: bwasti

fbshipit-source-id: 5787f6f32a80b22b1b712c4116f70370dad98f12
2020-09-25 11:05:41 -07:00
Bram Wasti
d1a11618f5 [static runtime] Add _out variants and reuse memory (#44128)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44128

Test Plan: Imported from OSS

Reviewed By: hlu1

Differential Revision: D23604304

Pulled By: bwasti

fbshipit-source-id: 06a23cb75700a0fc733069071843b7b498e7b9e9
2020-09-25 11:03:06 -07:00
anjali411
58b6ab69e5 torch.sgn for complex tensors (#39955)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39955

resolves https://github.com/pytorch/pytorch/issues/36323 by adding `torch.sgn` for complex tensors.
`torch.sgn` returns `x/abs(x)` for `x != 0` and returns `0 + 0j` for `x==0`

This PR doesn't test the correctness of the gradients. That will be done as part of auditing all the ops in the future, once we decide the autograd behavior (JAX vs TF) and add gradcheck.
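
The defining identity can be checked directly with a quick sketch:
```python
import torch

x = torch.tensor([3 + 4j, -2 + 0j])
assert torch.allclose(torch.sgn(x), x / x.abs())  # x / |x| for nonzero entries

zero = torch.tensor([0 + 0j])
print(torch.sgn(zero))  # tensor([0.+0.j])
```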

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23460526

Pulled By: anjali411

fbshipit-source-id: 70fc4e14e4d66196e27cf188e0422a335fc42f92
2020-09-22 08:24:53 -07:00
Xiang Gao
20ac736200 Remove py2 compatible future imports (#44735)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44735

Reviewed By: mruberry

Differential Revision: D23731306

Pulled By: ezyang

fbshipit-source-id: 0ba009a99e475ddbe22981be8ac636f8a1c8b02f
2020-09-16 12:55:57 -07:00
Kevin Stephano
26a91a9f04 [WIP][JIT] Add benchmarking support of NV Fuser with FP16 dtype support (#44101)
Summary:
Modified files in `benchmarks/tensorexpr` to add support for NVIDIA's Fuser for the jit compiler.

This support has some modifications besides adding an option to support the NVIDIA fuser:

* Adds FP16 Datatype support
* Fixes SOL/Algo calculations to generally use the data type instead of being fixed to 4 bytes
* Adds IR printing and kernel printing knobs
* Adds a knob `input_iter` to create ranges of inputs currently only for reductions
* Adds further reduction support for Inner and Outer dimension reductions that are compatible with the `input_iter` knob.
* Adds `simple_element`, `reduce2d_inner`, and `reduce2d_outer` to isolate performance on elementwise and reduction operations in the most minimal fashion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44101

Reviewed By: ngimel

Differential Revision: D23713658

Pulled By: bertmaher

fbshipit-source-id: d6b83cfab559aefe107c23b3c0f2df9923b3adc1
2020-09-15 15:10:49 -07:00
Mikhail Zolotukhin
37093f4d99 Benchmarks: make fuser and executor configurable from command line. (#44291)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44291

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D23569089

Pulled By: ZolotukhinM

fbshipit-source-id: ec25b2f0bba303adaa46c3e85b1a9ce4fa3cf076
2020-09-09 11:59:35 -07:00
Mikhail Zolotukhin
6134ac17ba Revert D23561500: Benchmarks: re-enable profiling-te configuration (try 2).
Test Plan: revert-hammer

Differential Revision:
D23561500 (589a2024c8)

Original commit changeset: 7fe86d34afa4

fbshipit-source-id: 10e48f230402572fcece56662ad4413ac0bd3cb5
2020-09-07 19:10:30 -07:00
Mikhail Zolotukhin
589a2024c8 Benchmarks: re-enable profiling-te configuration (try 2). (#44270)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44270

The previous PR (#44212) was reverted because I didn't update the
`upload_scribe.py` script: it was looking for the 'executor_and_fuser'
field in the JSON, which is now replaced with two separate fields,
'executor' and 'fuser'.

Differential Revision: D23561500

Test Plan: Imported from OSS

Reviewed By: ngimel

Pulled By: ZolotukhinM

fbshipit-source-id: 7fe86d34afa488a0e43d5ea2aaa7bc382337f470
2020-09-07 15:50:39 -07:00
Natalia Gimelshein
626e410e1d Revert D23544563: Benchmarks: re-enable profiling-te configuration.
Test Plan: revert-hammer

Differential Revision:
D23544563 (ac1f471fe2)

Original commit changeset: 98659e8860fa

fbshipit-source-id: 5dab7044699f59c709e64d178758f5f462ebb788
2020-09-06 21:01:19 -07:00
Mikhail Zolotukhin
ac1f471fe2 Benchmarks: re-enable profiling-te configuration. (#44212)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44212

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D23544563

Pulled By: ZolotukhinM

fbshipit-source-id: 98659e8860fa951d142e0f393731c4a769463c6c
2020-09-06 10:22:16 -07:00
Mikhail Zolotukhin
d0421ff1cc Benchmarks: add scripts for FastRNNs results comparison. (#44134)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44134

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D23505810

Pulled By: ZolotukhinM

fbshipit-source-id: d0b3d70d4c2a44a8c3773631d09a25a98ec59370
2020-09-03 13:44:42 -07:00
Mikhail Zolotukhin
d11603de38 [TensorExpr] Benchmarks: set number of profiling runs to 2 for PE. (#44112)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44112

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D23500904

Pulled By: ZolotukhinM

fbshipit-source-id: d0dd54752b7ea5ae11f33e865c96d2d61e98d573
2020-09-03 11:29:35 -07:00
Bert Maher
33d51a9b32 Respect canFuseOn{CPU,GPU} in TE fuser (#43967)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43967

Test Plan: Imported from OSS

Reviewed By: asuhan

Differential Revision: D23469048

Pulled By: bertmaher

fbshipit-source-id: 1005a7ae08974059ff9d467492caa3a388070eeb
2020-09-02 18:00:25 -07:00
taivu
8722952dbd Add benchmark for channel_shuffle operator (#43509)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43509

Test Plan: Imported from OSS

Reviewed By: kimishpatel

Differential Revision: D23299972

Pulled By: kimishpatel

fbshipit-source-id: 6189d209859da5a41067eb9e8317e3bf7a0fc754
2020-09-02 08:15:19 -07:00
Bram Wasti
6512032699 [Static Runtime] Add OSS build for static runtime benchmarks (#43881)
Summary:
Adds a CMake option. Build with:

```
BUILD_STATIC_RUNTIME_BENCHMARK=ON python setup.py install
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43881

Reviewed By: hlu1

Differential Revision: D23430708

Pulled By: bwasti

fbshipit-source-id: a39bf54e8d4d044a4a3e4273a5b9a887daa033ec
2020-09-02 08:00:18 -07:00