Summary: D24747035 (1478e5ec2a) removes the entry point of `nnq.functional.relu`. Adjust the op benchmark to use `torch.nn.ReLU` accordingly.
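A minimal sketch of the adjusted call path (hedged; the benchmark's config plumbing is elided):
```
import torch

# nn.ReLU's forward also accepts quantized tensors, so the benchmark can use
# the module form in place of the removed nnq.functional.relu entry point.
relu = torch.nn.ReLU()
q = torch.quantize_per_tensor(torch.rand(4), scale=0.1, zero_point=0, dtype=torch.quint8)
out = relu(q)
```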
Test Plan: buck run caffe2/benchmarks/operator_benchmark/pt:qactivation_test -- --use_jit --iterations 1 --warmup_iterations 1
Reviewed By: mingzhe09088
Differential Revision: D24961625
fbshipit-source-id: 5ed0ec7fa6d8cfefc8e7fc8324cf9a2a3e59de90
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47890
As titled. This fixes issues when running `chunk_test`, `split_test`, `qobserver`, and `sort` in `qunary` in JIT mode: the output of `chunk_op` is a list of tensors, which cannot be handled by the current `_consume_op`.
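A minimal sketch of the kind of fix involved, assuming a type-specific consume helper (names are illustrative, not necessarily the ones in the diff):
```
import torch
from typing import List

@torch.jit.script
def _consume(v: torch.Tensor) -> torch.Tensor:
    # Keeps the result alive so the JIT cannot dead-code-eliminate the op.
    return v

@torch.jit.script
def _consume_tensor_list(v: List[torch.Tensor]) -> List[torch.Tensor]:
    # Separate entry point for ops like chunk/split that return a tensor list;
    # TorchScript's static typing rejects passing a list to _consume above.
    return v
```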
Test Plan:
OSS:
python3 -m benchmark_all_test --iterations 1 --warmup_iterations 1 --use_jit
Reviewed By: mingzhe09088
Differential Revision: D24774105
fbshipit-source-id: 210a0345b8526ebf3c24f4d0794e20b2ff6cef3d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47767
This diff implements the functionality of running benchmarks on mobile on top of the operator_benchmark framework. It does so in a few steps:
1. create a scripted module from an existing benchmark case.
2. run the mobile-specific optimization pass on the scripted module.
3. run the scripted module on AiBench by calling its Python API.
A small change in the way a benchmark case is written is introduced so that local and mobile runs can share the same interface: inputs become arguments of the `forward` function, so that the mobile optimization pass can run successfully (otherwise everything would be optimized away by constant propagation).
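For example, the add benchmark (`pt/add_test.py`, visible in the graph dump below) now takes roughly this shape (config plumbing elided):
```
import operator_benchmark as op_bench
import torch

class AddBenchmark(op_bench.TorchBenchmarkBase):
    def init(self, M, N, K, device):
        # Inputs are registered here in init() ...
        self.inputs = {
            "input_one": torch.rand(M, N, K, device=device),
            "input_two": torch.rand(M, N, K, device=device),
        }
        self.set_module_name("add")

    # ... and passed in as forward() arguments, so the mobile optimization
    # pass cannot fold them away via constant propagation.
    def forward(self, input_one, input_two):
        return torch.add(input_one, input_two)
```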
Test Plan:
## local op_bench run
buck run caffe2/benchmarks/operator_benchmark:benchmark_all_test -- --iterations 1 --warmup_iterations 1
buck run caffe2/benchmarks/operator_benchmark:benchmark_all_test -- --iterations 1 --warmup_iterations 1 --use_jit
Exceptions: the `py_module` op in `FakeQuantizePerTensorBaseOpBenchmark` and `FakeQuantizePerChannelBaseOpBenchmark` under JIT mode. These tests also failed in the base version:
```
RuntimeError:
Module 'FakeQuantizePerChannelOpBenchmark' has no attribute 'op_func' (This function exists as an attribute on the Python module, but we failed to compile it to a TorchScript function.
The error stack is reproduced here:
Python builtin <built-in method apply of FunctionMeta object at 0x619000c652a0> is currently not supported in Torchscript:
File "/data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/pt/quantization_test#link-tree/quantization_test.py", line 260
quant_min: int, quant_max: int
):
return _LearnableFakeQuantizePerChannelOp.apply(input, scale, zero_point, axis, quant_min, quant_max, 1.0)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
:
File "/data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/pt/quantization_test#link-tree/quantization_test.py", line 313
axis: int, quant_min: int, quant_max: int
):
return self.op_func(input, scale, zero_point, axis, quant_min, quant_max)
~~~~~~~~~~~~ <--- HERE
```
`_consume_op` typing mismatch: chunk, split, qobserver, and sort in qunary. These will be fixed in D24774105.
## OSS test
python3 -m benchmark_all_test --iterations 1 --warmup_iterations 1 --use_jit
python3 -m benchmark_all_test --iterations 1 --warmup_iterations 1
## saved module graph
```
module __torch__.mobile_benchmark_utils.OpBenchmarkMobile {
parameters {
}
attributes {
training = True
num_iters = 1
benchmark = <__torch__.pt.add_test.___torch_mangle_4.AddBenchmark object at 0x6070001b8b50>
}
methods {
method forward {
graph(%self : __torch__.mobile_benchmark_utils.OpBenchmarkMobile):
%12 : None = prim::Constant() # /data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/fb/pt/mobile/benchmark_all_test_fbcode#link-tree/mobile_benchmark_utils.py:9:4
%4 : bool = prim::Constant[value=1]() # /data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/fb/pt/mobile/benchmark_all_test_fbcode#link-tree/mobile_benchmark_utils.py:10:8
%1 : int = prim::GetAttr[name="num_iters"](%self)
= prim::Loop(%1, %4) # /data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/fb/pt/mobile/benchmark_all_test_fbcode#link-tree/mobile_benchmark_utils.py:10:8
block0(%i : int):
%6 : __torch__.pt.add_test.___torch_mangle_4.AddBenchmark = prim::GetAttr[name="benchmark"](%self)
%7 : __torch__.pt.add_test.___torch_mangle_4.AddBenchmark = prim::GetAttr[name="benchmark"](%self)
%self.inputs_tuple : (Float(1, 1, 1, strides=[1, 1, 1], requires_grad=0, device=cpu), Float(1, 1, 1, strides=[1, 1, 1], requires_grad=0, device=cpu)) = prim::Constant[value=({0.48884}, {0.809042})]()
%9 : Tensor, %10 : Tensor = prim::TupleUnpack(%self.inputs_tuple)
%23 : int = prim::Constant[value=1]()
%24 : Tensor = aten::add(%9, %10, %23) # /data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/fb/pt/mobile/benchmark_all_test_fbcode#link-tree/pt/add_test.py:39:15
-> (%4)
return (%12)
}
}
submodules {
module __torch__.pt.add_test.___torch_mangle_4.AddBenchmark {
parameters {
}
attributes {
mobile_optimized = True
}
methods {
method forward {
graph(%self : __torch__.pt.add_test.___torch_mangle_4.AddBenchmark,
%input_one.1 : Tensor,
%input_two.1 : Tensor):
%3 : int = prim::Constant[value=1]()
%4 : Tensor = aten::add(%input_one.1, %input_two.1, %3) # /data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/fb/pt/mobile/benchmark_all_test_fbcode#link-tree/pt/add_test.py:39:15
return (%4)
}
method get_inputs {
graph(%self : __torch__.pt.add_test.___torch_mangle_4.AddBenchmark):
%self.inputs_tuple : (Float(1, 1, 1, strides=[1, 1, 1], requires_grad=0, device=cpu), Float(1, 1, 1, strides=[1, 1, 1], requires_grad=0, device=cpu)) = prim::Constant[value=({0.48884}, {0.809042})]()
return (%self.inputs_tuple)
}
}
submodules {
}
}
}
}
```
Reviewed By: kimishpatel
Differential Revision: D24322214
fbshipit-source-id: 335317eca4f40c4083883eb41dc47caf25cbdfd1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45092
Adding two operators:
1. `at::float_to_half` -> converts an FP32 tensor to an FP16 tensor.
2. `at::half_to_float` -> converts an FP16 tensor to an FP32 tensor.
These operators internally use the kernel provided by FBGEMM. Both C2 and PT will use the same FBGEMM kernel underneath.
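A rough round-trip sketch of the conversion being benchmarked (hedged: shown via the generic `.half()`/`.float()` path; the exact entry point the benchmark exercises may differ):
```
import torch

x = torch.rand(512, 512)                 # FP32, matching the benchmark shape
h = x.half()                             # FP32 -> FP16 conversion
y = h.float()                            # FP16 -> FP32 conversion
assert torch.allclose(x, y, atol=1e-3)   # FP16 round-trip loses some precision
```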
Test Plan:
buck test //caffe2/test:torch -- .*test_half_tensor.*
Run benchmark locally using
```
buck run //caffe2/benchmarks/operator_benchmark/pt:tensor_to_test
```
AI Bench results are pending; I don't expect them to finish soon, as we have a large queue with jobs pending for 2+ days.
Benchmark for a 512x512 tensor with the FBGEMM implementation:
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short
# Benchmarking PyTorch: FloatToHalfTensorConversionBenchmark
# Mode: Eager
# Name: FloatToHalfTensorConversionBenchmark_M512_N512_cpu
# Input: M: 512, N: 512, device: cpu
Forward Execution Time (us) : 1246.332
# Benchmarking PyTorch: HalfToFloatTensorConversionBenchmark
# Mode: Eager
# Name: HalfToFloatTensorConversionBenchmark_M512_N512_cpu
# Input: M: 512, N: 512, device: cpu
Forward Execution Time (us) : 1734.304
```
Benchmark for a 512x512 tensor on trunk, with no FBGEMM integration:
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short
# Benchmarking PyTorch: FloatToHalfTensorConversionBenchmark
# Mode: Eager
# Name: FloatToHalfTensorConversionBenchmark_M512_N512_cpu
# Input: M: 512, N: 512, device: cpu
Forward Execution Time (us) : 169045.724
# Benchmarking PyTorch: HalfToFloatTensorConversionBenchmark
# Mode: Eager
# Name: HalfToFloatTensorConversionBenchmark_M512_N512_cpu
# Input: M: 512, N: 512, device: cpu
Forward Execution Time (us) : 152382.494
```
Reviewed By: ngimel
Differential Revision: D23824869
fbshipit-source-id: ef044459b6c8c6e5ddded72080204c6a0ab4582c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47329
Supports pruned weights, along with a mapping for the compressed indices.
Test Plan:
python test/test_quantization.py TestQuantizedEmbeddingOps
Imported from OSS
Reviewed By: qizzzh
Differential Revision: D24719909
fbshipit-source-id: f998f4039e84bbe1886e492a3bff6aa5f56b6b0f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46896
The idea of the memory model is quite similar to that of BlackBoxPredictor; however, it's more complicated in PT due to:
1) tensor views that share storage (with storage refcount bumps) but have different TensorImpls;
2) tensors that share the same TensorImpl and the same storage, with no refcount bump of the StorageImpl;
3) data types such as TensorList and Tuple that have Tensors in them;
4) the need to support a mix of non-out/out variants while we move the aten ops to out variants.
As a result, I have to make the following adjustments (a small illustration of the dedup in step 2 follows the list):
1) remove tensors in output Tuples from internal blob list;
2) for memory allocation/deallocation, get candidate Tensors from the outputs of ops with out variant, extract StorageImpls from the Tensors, dedup, and remove output tensor StorageImpls, and get the final list of blobs for memory planning;
3) during the clean_up_memory pass, clean up memory held by the StorageImpls, as well as Tensors/Lists/Tuples in IValues that don't participate in memory planning, to reduce overall memory usage.
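A small runnable Python illustration of the dedup in step 2 (the actual planning code is C++ inside Static Runtime; `dedup_storages` is a hypothetical helper):
```
import torch

def dedup_storages(tensors):
    # Views and aliasing tensors share a StorageImpl; keep one entry per storage.
    seen = {}
    for t in tensors:
        seen[t.storage().data_ptr()] = t.storage()
    return list(seen.values())

a = torch.rand(4)
b = a.view(2, 2)   # same StorageImpl as `a`, different TensorImpl
c = torch.rand(4)
assert len(dedup_storages([a, b, c])) == 2
```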
Risk:
The PyTorch team is planning to deprecate the current resize_output API, which we rely on. This is a pretty big risk.
https://www.internalfb.com/intern/diffusion/FBS/browsefile/master/fbcode/caffe2/aten/src/ATen/native/Resize.cpp?commit=6457b329847607553d34e788a3a7092f41f38895&lines=9-23
Test Plan:
```
buck test //caffe2/test:static_runtime
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test //caffe2/caffe2/fb/predictor:pytorch_predictor_test
```
Benchmarks:
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 13 \
buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench \
--scripted_model=/home/hlu/ads/adindexer/adindexer_ctr_mobilefeed/pt/merge/traced_precomputation.pt \
--pt_inputs=/home/hlu/ads/adindexer/adindexer_ctr_mobilefeed/pt/merge/container_precomputation_bs1.pt \
--iters=1000 --warmup_iters=10000 --num_threads=1 --pt_enable_static_runtime=true \
--pt_cleanup_activations=true --pt_enable_out_variant=false
```
|pt_cleanup_activations |pt_enable_out_variant |old ms/iter |new ms/iter |
|--- |--- |--- |--- |
|0 |0 |0.31873 |0.30228 |
|0 |1 |0.30018 |0.29184 |
|1 |0 |0.35246 |0.31895 |
|1 |1 |0.35742 |0.30417 |
Reviewed By: bwasti, raziel
Differential Revision: D24471854
fbshipit-source-id: 4ac37dca7d2a0c362120a7f02fd3995460c9a55c
Summary:
This is a tiny PR for two minor fixes:
1. Added `torch._C._jit_set_texpr_fuser_enabled(False)` to enable shape inference on nv fuser runs.
2. Renamed the dynamic benchmark modules to avoid multiple matches, e.g. the pattern `simple_element` also selecting `dynamic_simple_element` (a toy illustration follows below). I guess it'd be much easier if the pattern matching were based on `startswith`; I'd be happy to update that if agreed.
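A toy illustration of the matching ambiguity (the harness's actual filter logic may differ):
```
names = ["simple_element", "dynamic_simple_element"]

# Substring-style matching selects both benchmarks:
print([n for n in names if "simple_element" in n])            # both names

# startswith-based matching would disambiguate:
print([n for n in names if n.startswith("simple_element")])   # only "simple_element"
```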
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46778
Reviewed By: zhangguanheng66
Differential Revision: D24516911
Pulled By: bertmaher
fbshipit-source-id: 839f9a3e058f9d7aca17b2e6eb8b558e0e48e8f4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46694
For ops with parameters (e.g. conv), the JIT mode run currently raises
`RuntimeError: Cannot insert a Tensor that requires grad as a constant. Consider making it a parameter or input, or detaching the gradient`. After consulting https://www.fburl.com/vtkys6ug, I decided to turn off gradients for the parameters in the forward run. If we want ops with parameters to work in backward under JIT mode, we'd probably need to make `TorchBenchmarkBase` a subclass of `nn.Module`.
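A minimal reproduction of the error and the workaround (a sketch, not the benchmark's actual code):
```
import torch

w = torch.rand(3, 3, requires_grad=True)  # e.g. a conv/linear weight
w.requires_grad_(False)  # the workaround: drop grad before scripting

@torch.jit.script
def f(x: torch.Tensor) -> torch.Tensor:
    # w is captured as a constant; with requires_grad=True, scripting raises
    # "Cannot insert a Tensor that requires grad as a constant".
    return x.matmul(w)

print(f(torch.rand(2, 3)))
```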
Test Plan: ./buck-out/gen/caffe2/benchmarks/operator_benchmark/pt/conv_test.par --use_jit
Reviewed By: mingzhe09088
Differential Revision: D24451206
fbshipit-source-id: 784eb60ca155b0152d745c92f6d0ce6b2c9014c6
Summary: benchmark_caffe2 is broken due to a refactoring that changed test generation from eager to registration-only.
Test Plan:
`buck run caffe2/benchmarks/operator_benchmark/c2:add_test`
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short
# Benchmarking Caffe2: add
WARNING: Logging before InitGoogleLogging() is written to STDERR
W1021 08:07:06.350742 390665 init.h:137] Caffe2 GlobalInit should be run before any other API calls.
# Name: add_M8_N16_K32_dtypeint
# Input: M: 8, N: 16, K: 32, dtype: int
Forward Execution Time (us) : 652.748
# Benchmarking Caffe2: add
# Name: add_M16_N16_K64_dtypefloat
# Input: M: 16, N: 16, K: 64, dtype: float
Forward Execution Time (us) : 63.570
# Benchmarking Caffe2: add
# Name: add_M64_N64_K128_dtypeint
# Input: M: 64, N: 64, K: 128, dtype: int
```
Reviewed By: qizzzh
Differential Revision: D24448374
fbshipit-source-id: 850fd375d194c20c385ea4433aea13066c7476e6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46679
The current way of importing configs causes a runtime error when a single benchmark is launched directly with buck (e.g. `/buck-out/gen/caffe2/benchmarks/operator_benchmark/pt/conv_test.par`). This diff fixes that issue.
ghstack-source-id: 114857978
Test Plan: waitforsandcastle
Reviewed By: vkuzo
Differential Revision: D24459631
fbshipit-source-id: 29df17e66962a8604dbb7b8b9106713c3c19bed5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46219
- Refactor StaticRuntime and group common data structures, the jit graph, and the script module into a separate struct `InferenceModule`:
```
struct InferenceModule {
explicit InferenceModule(const torch::jit::Module& m);
explicit InferenceModule(std::shared_ptr<torch::jit::Graph> g);
torch::jit::Module module;
std::shared_ptr<torch::jit::Graph> graph;
std::unique_ptr<c10::FunctionSchema> schema;
std::unordered_map<Value*, size_t> value_to_reg;
std::vector<size_t> input_regs; // inputs to the graph
std::vector<size_t> output_regs; // outputs of the graph
std::vector<size_t> internals;
};
```
which is stored in the PyTorchPredictor, as well as the static runtime, and shared across threads. Then this is what's left inside the Static Runtime:
```
mutable std::vector<IValue> reg_;
// The nodes we need to run
std::vector<ProcessedNode> nodes_;
```
`reg_` holds all the weights and activations and differs across threads during a run. `nodes_` holds the op nodes and input/output registers, and is the same across threads for now. We could potentially put other stateful data structures in it, so I kept it inside the static runtime. It could easily be moved into `InferenceModule` if we decide not to put anything else into `ProcessedNode`.
- Added StaticRuntimeOptions so we can toggle certain optimizations on/off, for testing and benchmarking. `cleanup_activations` is an example.
- Integration with PyTorchPredictor. Added a lockfree stack in the PyTorchPredictor to hold all the static runtime instances. Benchmark shows that the `push` and `pop` combo takes about 80 ns, which is quite acceptable.
This diff focuses on the threading model only; benchmarks will come separately.
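A hedged Python analogue of the instance-pooling pattern described above (the real implementation is a lockfree C++ stack inside PyTorchPredictor; `RuntimePool` is illustrative):
```
import queue

class RuntimePool:
    """Each request pops a StaticRuntime instance, runs it, and pushes it back."""
    def __init__(self, make_runtime, n: int):
        self._stack = queue.LifoQueue()
        for _ in range(n):
            self._stack.put(make_runtime())

    def run(self, inputs):
        rt = self._stack.get()      # "pop": ~80 ns for the real lockfree stack
        try:
            return rt(inputs)
        finally:
            self._stack.put(rt)     # "push": return the instance for reuse
```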
Reviewed By: bwasti
Differential Revision: D24237078
fbshipit-source-id: fd0d6347f02b4526ac17dec1f731db48424bade1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46308
This PR adds a hand-optimized version of the DeepAndWide model with the goal of estimating the overheads of static runtime. While static runtime is currently much faster than the existing JIT interpreter, it would be useful to understand how close we are to an absolutely zero-overhead system. Currently, this "ideal" implementation is 2x faster than the static runtime at batch size 1.
Full benchmark results:
```
Running build/bin/static_runtime_bench
Run on (24 X 2394.71 MHz CPU s)
CPU Caches:
L1 Data 32K (x24)
L1 Instruction 32K (x24)
L2 Unified 4096K (x24)
L3 Unified 16384K (x24)
------------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------------
BM_deep_wide_base/1 59518 ns 59500 ns 10909
BM_deep_wide_base/8 74635 ns 74632 ns 9317
BM_deep_wide_base/20 82186 ns 82147 ns 9119
BM_deep_wide_fast/1 13851 ns 13851 ns 49825 << new
BM_deep_wide_fast/8 22497 ns 22497 ns 32089 << new
BM_deep_wide_fast/20 23868 ns 23841 ns 31184 << new
BM_deep_wide_jit_graph_executor/1 62786 ns 62786 ns 10835
BM_deep_wide_jit_graph_executor/8 76730 ns 76718 ns 7529
BM_deep_wide_jit_graph_executor/20 78886 ns 78883 ns 8769
BM_deep_wide_jit_profiling_executor/1 69504 ns 69490 ns 10309
BM_deep_wide_jit_profiling_executor/8 75718 ns 75715 ns 9199
BM_deep_wide_jit_profiling_executor/20 75364 ns 75364 ns 9010
BM_deep_wide_static/1 40324 ns 40318 ns 17232
BM_deep_wide_static/8 50327 ns 50319 ns 13335
BM_deep_wide_static/20 53075 ns 53071 ns 12855
BM_deep_wide_static_threaded/threads:8 6258 ns 49873 ns 14008
```
PS: The implementation could probably be optimized even more.
Differential Revision: D24300702
Test Plan: Imported from OSS
Reviewed By: dzhulgakov
Pulled By: ZolotukhinM
fbshipit-source-id: 7870bdef127c39d11bcaa4f03a60eb80a46be58e
Summary: Add operator benchmark for 4bit/8bit embedding lookups in `aibench`.
Test Plan:
```
buck build //caffe2/benchmarks/operator_benchmark/pt:qembedding_bag_lookups_test
aibench-cli adhoc -c 'buck run //caffe2/benchmarks/operator_benchmark/pt:qembedding_bag_lookups_test'
```
The run was successful in aibench: https://www.internalfb.com/intern/aibench/details/738300474 and https://www.internalfb.com/intern/aibench/details/346463246
Reviewed By: radkris-git
Differential Revision: D24268413
fbshipit-source-id: 7fb4ff75da47f8f327edab562c5d29bb69e00b8d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46124
We want to make sure we can actually fuse kernels within a fairly
tight time budget. So here's a quick benchmark of codegen for a simple
pointwise activation function (swish). I kept all the intermediate tensors
separate to force TE to actually do inlining.
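The expression being compiled, sketched in Python with each intermediate kept separate (the benchmark itself builds this with the C++ tensorexpr API, and its exact decomposition may differ):
```
import torch

def swish(x: torch.Tensor) -> torch.Tensor:
    # sigmoid built from explicit intermediates, so the fuser must inline them
    neg = -x
    e = torch.exp(neg)
    one = torch.ones_like(x)
    denom = one + e
    sig = one / denom
    return x * sig
```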
Test Plan:
```
buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench
```
I've only run in debug mode so the results aren't super meaningful, but even in
that mode it's 18 ms for compilation, 15 ms of which are in LLVM.
Update, opt build mode:
```
----------------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------------
BM_CompileSwish 5123276 ns 5119846 ns 148
BM_CompileSwishLLVMOnly 4754361 ns 4753701 ns 160
```
Reviewed By: asuhan
Differential Revision: D24232801
fbshipit-source-id: d58a8b7f79bcd9244c49366af7a693e09f24bf76
Summary:
This PR modifies `benchmarks/tensorexpr`. It follows up https://github.com/pytorch/pytorch/pull/44101 and further supports characterizing fusers with dynamic-shape benchmarks. The dynamic-shape condition models the use case where the input tensor shape changes on each call to the graph.
Changes include:
* Added an auxiliary class `DynamicShape` that provides a simple API for enabling dynamic shapes in existing test cases; an example can be found in `DynamicSimpleElementBench` (see the sketch below).
* Created new bench_cls: `DynamicSimpleElementBench`, `DynamicReduce2DInnerBench`, `DynamicReduce2DOuterBench`, and `DynamicLSTM`. They are all dynamic-shaped versions of existing benchmarks and examples of enabling dynamic shape with `DynamicShape`.
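A hedged sketch of the mix-in pattern (the actual class lives in `benchmarks/tensorexpr`; method names here are illustrative):
```
import random
import torch

class DynamicShape:
    """Mix-in that re-randomizes input shapes on every instantiation."""
    def rand_shape(self, shape):
        # Shrink each dim to a random size so successive calls see new shapes.
        return tuple(random.randint(1, s) for s in shape)

class DynamicSimpleElementBench(DynamicShape):
    def __init__(self, M: int, N: int):
        self.max_shape = (M, N)

    def make_inputs(self):
        shape = self.rand_shape(self.max_shape)
        return torch.rand(shape), torch.rand(shape)
```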
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46107
Reviewed By: glaringlee
Differential Revision: D24229400
Pulled By: bertmaher
fbshipit-source-id: 889fece5ea87d0f6f6374d31dbe11b1cd1380683
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46003
`sparse` is a confusing name because it is used in training for sparse gradients.
Test Plan: Imported from OSS
Reviewed By: radkris-git, qizzzh
Differential Revision: D24178248
fbshipit-source-id: 0a2b595f3873d33b2ce25839b6eee31d2bfd3b0d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45853
The method name in the README is not consistent with the actual implementation.
Reviewed By: qizzzh
Differential Revision: D24114849
fbshipit-source-id: d979e324c768708e99b8cc5b87e261f17c22a883
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45875
Adds a googlebenchmark harness for perf testing programs generated by
tensorexpr, sans any pytorch wrappings (for python-level benchmarks of
tensorexpr, see benchmarks/tensorexpr).
Currently there's a harness for gemm that sets up the problem using torch (and
also measures the perf of a torch::mm to give a baseline).
Right now there's just an unoptimized implementation that is expected not to be
very fast. More optimized versions are coming.
Sample output from my dev box:
```
Run on (48 X 2501 MHz CPU s)
CPU Caches:
L1 Data 32K (x24)
L1 Instruction 32K (x24)
L2 Unified 256K (x24)
L3 Unified 30720K (x2)
--------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
--------------------------------------------------------------------------------------------
Gemm/Torch/128/128/128 73405 ns 73403 ns 8614 GFLOPS=57.1411G/s
Gemm/TensorExprNoopt/128/128/128 3073003 ns 3072808 ns 229 GFLOPS=1.36497G/s
```
Test Plan: Imported from OSS
Reviewed By: SplitInfinity
Differential Revision: D24142403
Pulled By: bertmaher
fbshipit-source-id: 3354aaa56868a43a553acd1ad9a192f28d8e3597
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39955
resolves https://github.com/pytorch/pytorch/issues/36323 by adding `torch.sgn` for complex tensors.
`torch.sgn` returns `x/abs(x)` for `x != 0` and returns `0 + 0j` for `x == 0`.
This PR doesn't test the correctness of the gradients. That will be done as part of auditing all the ops in the future, once we decide on the autograd behavior (JAX vs TF) and add gradcheck.
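A quick numeric check of the definition:
```
import torch

z = torch.tensor([3 + 4j, 0 + 0j])
print(torch.sgn(z))
# tensor([0.6000+0.8000j, 0.0000+0.0000j]) -> z/|z| elementwise, and 0 at z == 0
```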
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D23460526
Pulled By: anjali411
fbshipit-source-id: 70fc4e14e4d66196e27cf188e0422a335fc42f92
Summary:
Modified files in `benchmarks/tensorexpr` to add support for NVIDIA's fuser for the JIT compiler.
Besides adding an option to select the NVIDIA fuser, this support includes several modifications:
* Adds FP16 datatype support
* Fixes SOL/Algo calculations to use the size of the actual data type instead of a fixed 4 bytes
* Adds IR printing and kernel printing knobs
* Adds a knob `input_iter` to create ranges of inputs, currently only for reductions
* Adds further reduction support for inner- and outer-dimension reductions compatible with the `input_iter` knob
* Added `simple_element`, `reduce2d_inner`, and `reduce2d_outer` to isolate performance on elementwise and reduction operations in the most minimal fashion
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44101
Reviewed By: ngimel
Differential Revision: D23713658
Pulled By: bertmaher
fbshipit-source-id: d6b83cfab559aefe107c23b3c0f2df9923b3adc1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44270
The previous PR (#44212) was reverted because I didn't update the
`upload_scribe.py` script: it was still looking for the 'executor_and_fuser'
field in the JSON, which is now replaced with two separate fields,
'executor' and 'fuser'.
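For clarity, the schema change `upload_scribe.py` now has to handle (field values are illustrative):
```
# Before: one combined field in the benchmark JSON
old_entry = {"executor_and_fuser": "profiling+nnc"}

# After: two separate fields
new_entry = {"executor": "profiling", "fuser": "nnc"}
```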
Differential Revision: D23561500
Test Plan: Imported from OSS
Reviewed By: ngimel
Pulled By: ZolotukhinM
fbshipit-source-id: 7fe86d34afa488a0e43d5ea2aaa7bc382337f470