Commit Graph

17 Commits

Author SHA1 Message Date
Xuehai Pan
c0ed38e644 [BE][Easy][3/19] enforce style for empty lines in import segments in benchmarks/ (#129754)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by the linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129754
Approved by: https://github.com/ezyang
2024-07-17 14:34:42 +00:00
Xuehai Pan
26f4f10ac8 [5/N][Easy] fix typo for usort config in pyproject.toml (kown -> known): sort torch (#127126)
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126
Approved by: https://github.com/kit1980
2024-05-27 14:49:57 +00:00
PyTorch MergeBot
55c0ab2887 Revert "[5/N][Easy] fix typo for usort config in pyproject.toml (kown -> known): sort torch (#127126)"
This reverts commit 7763c83af6.

Reverted https://github.com/pytorch/pytorch/pull/127126 on behalf of https://github.com/XuehaiPan due to Broken CI ([comment](https://github.com/pytorch/pytorch/pull/127126#issuecomment-2133044286))
2024-05-27 09:22:08 +00:00
Xuehai Pan
7763c83af6 [5/N][Easy] fix typo for usort config in pyproject.toml (kown -> known): sort torch (#127126)
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126
Approved by: https://github.com/kit1980
ghstack dependencies: #127122, #127123, #127124, #127125
2024-05-27 04:22:18 +00:00
sanchitintel
8852bb561c More efficient multi-threading in Softmax & LogSoftmax CPU kernels (#116367)
### Summary
In #85398, while fixing a bug in `_vec_logsoftmax_lastdim` (which was _not caused by, but was exposed by_, the AVX512 implementation), I had made some revisions to use more threads in some cases, but was asked to roll back [those changes](https://github.com/pytorch/pytorch/pull/85398#discussion_r1087680237) during the PR's review.
At the time, landing that PR as soon as possible seemed essential, so I agreed to roll back that change.

In some cases, more threads can be used than the current approach uses.
<strike>In this PR, I'm reintroducing those changes, which are geared towards more efficient multi-threading.</strike>
On second thought, even for the other softmax kernels besides `_vec_log_softmax_lastdim` and `_vec_softmax_lastdim`, we could simply use a `grain_size` of 0 or 1 instead of complicating the code, since the `CHUNK_SIZE` for each thread is already computed via a heuristic. With a `grain_size` of `0`, work would be distributed equitably among the OpenMP threads (which, by the way, stay constant in number unless explicitly changed, since we don't use the OpenMP `num_threads` clause in PyTorch), yielding a speedup similar to the approach in the first commit of this PR.
I've also added op-level benchmarks for the example input shapes in this PR.
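
As a rough illustration of the `grain_size` point above (a hypothetical Python model, not the actual ATen code): with a `grain_size` of 0 or 1, the fixed pool of OpenMP threads simply splits the outer rows almost evenly among themselves.

```python
def rows_per_thread(outer_size: int, num_threads: int) -> list[int]:
    """Hypothetical model: with grain_size 0/1, every OpenMP thread gets an
    almost-equal share of the outer rows, instead of the thread count being
    capped by a per-thread chunk-size heuristic."""
    base, extra = divmod(outer_size, num_threads)
    return [base + (1 if t < extra else 0) for t in range(num_threads)]


# e.g. 700 rows over 48 threads: 28 threads process 15 rows, 20 threads process 14
print(rows_per_thread(700, 48))
```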

### Benchmarks

Machine - Intel(R) Xeon(R) Platinum 8468H (Xeon 4th gen, formerly codenamed Sapphire Rapids)
One socket of 48 physical cores was used, with & without HyperThreading.
Intel OpenMP & tcmalloc were preloaded.

Softmax benchmarks can be run with the following command; the relevant benchmarks are the last-dim ones:
`KMP_AFFINITY=granularity=fine,compact,1,0 KMP_BLOCKTIME=1 KMP_SETTINGS=1 OMP_NUM_THREADS=48 MKL_NUM_THREADS=48 numactl --membind=0 --cpunodebind=0 python -m pt.softmax_test --tag-filter all`

#### Already existing benchmarks
|Benchmark name (dim is 1, by default) | Previous implementation's latency (in ms) | This implementation's latency (in ms)|Speedup Percentage = (old-new)*100/old | Speedup ratio (old/new)|
|-------------|--------|-------|----------------------------|----------|
|Softmax_N1_C3_H256_W256_cpu|31.364|11.594|63.03%  |2.705|
|Softmax_N4_C3_H256_W256_cpu|34.475|24.966| 27.58%|1.380|
|Softmax_N8_C3_H512_W256_cpu|94.044|78.372|16.66%|1.199|
|Softmax2d_N8_C3_H512_W256_cpu|100.195|79.529|20.62%|1.259|

#### Some of the following benchmarks are being added in this PR
|Benchmark name| Previous implementation's latency (in ms) | This implementation's latency (in ms)|Speedup percentage = (old-new)*100/old| Speedup ratio  (old/new) |
|-------------|--------|-------|----------------------------|--------------------|
|LogSoftmax_M128_N128_dim1_cpu|7.629|6.475|15.12%| 1.178|
|LogSoftmax_M48_N128_dim1_cpu|6.848|5.969|12.83%| 1.147|
|LogSoftmax_M16_N1024_dim1_cpu|7.004|6.322|9.73%| 1.107|
|LogSoftmax_M32_N1024_dim1_cpu|7.037|6.558|6.80%| 1.073|
|LogSoftmax_M48_N1024_dim1_cpu|7.155|6.773|5.33%|1.056|
|LogSoftmax_M16_N512_dim1_cpu|6.797|5.862|13.75%|1.159|
|LogSoftmax_M32_N512_dim1_cpu|7.223|6.202|14.13%|1.164|
|LogSoftmax_M48_N512_dim1_cpu|7.159|6.301|11.98%|1.136|
|LogSoftmax_M16_N256_dim1_cpu|6.842|5.682|16.95%|1.204|
|LogSoftmax_M32_N256_dim1_cpu|6.840|6.086|11.02%|1.123|
|LogSoftmax_M48_N256_dim1_cpu|7.005|6.031|13.94%|1.161|

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116367
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-01-17 02:26:29 +00:00
Edward Z. Yang
dd3a77bc96 Apply UFMT to all files in benchmarks/ (#105928)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105928
Approved by: https://github.com/albanD
2023-07-26 01:18:48 +00:00
sanchitintel
c4544bc169 Fix thread-allocation in _vec_log_softmax_lastdim (#85398)
## Problem history

There seems to have always been a bug in `_vec_log_softmax_lastdim`.
In particular, there were two issues with it:

#### Bug 1
Before AVX512 support was added, `CHUNK_SIZE` had been heuristically chosen in `_vec_log_softmax_lastdim`:
`CHUNK_SIZE = (128 / sizeof(scalar_t)) * Vec::size();`

It was `256` for float32, bfloat16, and float16.
When AVX512 support was added, `CHUNK_SIZE` became `512`.

The rationale behind determining `CHUNK_SIZE` has not been described and seems flawed, since the number of OpenMP threads currently used depends upon it.
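
For concreteness, a quick sketch of that arithmetic (the vector lane counts below are the usual AVX2/AVX512 widths for float32 and are assumptions for illustration, not taken from the kernel source):

```python
def chunk_size(scalar_bytes: int, vec_lanes: int) -> int:
    # CHUNK_SIZE = (128 / sizeof(scalar_t)) * Vec::size()
    return (128 // scalar_bytes) * vec_lanes


print(chunk_size(4, 8))   # float32 with AVX2 (8 lanes)    -> 256
print(chunk_size(4, 16))  # float32 with AVX512 (16 lanes) -> 512
```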

#### Bug 2
`grain_size` had been defined as `internal::GRAIN_SIZE / (16 * dim_size * CHUNK_SIZE)`.
So `grain_size` was usually 0, as it worked out to `8 / dim_size`; it was therefore always replaced by `CHUNK_SIZE`, viz. 256.
Since `256` was always the `grain_size` for `at::parallel_for`, few threads were used in certain cases.

#### Problem caused by bugs
With an `outer_size` of, say, 700, only 3 threads would have been used with AVX2, irrespective of the value of `dim_size`!
When AVX512 support was added, since `CHUNK_SIZE` became `512`, only 2 threads were used if `outer_size` was 700.
In the Transformers training example, `log_softmax` was computed on the last dim of a tensor of shape `(700, 23258)`.
AVX512 thus appeared to be considerably slower, masking the actual issue: even the AVX2 kernel's performance was quite poor due to inefficient work distribution among OpenMP threads.
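
Putting the two bugs together in a small hypothetical model (the `GRAIN_SIZE` value is inferred from the `8 / dim_size` figure above, and the `parallel_for` behavior is only roughly modeled, not the actual ATen code):

```python
import math

GRAIN_SIZE = 32768  # inferred: 32768 / (16 * 256) == 8, matching the 8 / dim_size figure


def old_threads_used(outer_size: int, dim_size: int, chunk_size: int, max_threads: int) -> int:
    """Rough model of the buggy thread allocation described above."""
    grain = GRAIN_SIZE // (16 * dim_size * chunk_size)  # Bug 2: usually 0
    if grain == 0:
        grain = chunk_size  # grain_size falls back to CHUNK_SIZE
    # each thread needs at least `grain` rows of work, which caps the thread count
    return min(max_threads, max(1, math.ceil(outer_size / grain)))


print(old_threads_used(700, 23258, 256, 26))  # AVX2 (CHUNK_SIZE=256):   3 threads
print(old_threads_used(700, 23258, 512, 26))  # AVX512 (CHUNK_SIZE=512): 2 threads
```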

## Solution
Distribute work more efficiently, resulting in higher performance for both AVX2 & AVX512 than before,
and fixing the regression observed with AVX512 (the AVX512 kernel is now faster than its AVX2 counterpart).

## Benchmarks

##### Machine-config:
Intel(R) Xeon(R) Platinum 8371HC CPU (Cooper Lake)
One socket of 26 physical cores was used.
Intel OpenMP & tcmalloc were preloaded.

Example of a command to run benchmark:
`ATEN_CPU_CAPABILITY=avx512 KMP_AFFINITY=granularity=fine,verbose,compact,1,0 KMP_BLOCKTIME=1 KMP_SETTINGS=1 MKL_NUM_THREADS=26 OMP_NUM_THREADS=26 numactl --membind=0 --cpunodebind=0 python3.8 -m pt.softmax_test --test_name LogSoftmax_N1024_seq_len23258_dim1_cpu`

| Benchmark | Old implementation time (us) | New implementation time (us) | Speedup ratio (old/new) |
| -- | -- | -- | -- |
| LogSoftmax_N1024_seq_len23258_dim1_cpu AVX2 | 11069.281 | 2651.186 | 4.17x |
| LogSoftmax_N1024_seq_len23258_dim1_cpu AVX512 | 18292.928 | 2586.550 | 7.07x |
| LogSoftmax_N700_seq_len23258_dim1_cpu AVX2 | 9611.902 | 1762.833 | 5.452x |
| LogSoftmax_N700_seq_len23258_dim1_cpu AVX512 | 12168.371 | 1717.824 | 7.08x |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85398
Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/peterbell10, https://github.com/lezcano
2023-02-07 15:09:05 +00:00
Yang Wang
8ff0b6fef8 [OpBenchMobile] Enable operator_benchmark to run the benchmark on mobile through AiBench (#47767)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47767

This diff implements the functionality of running benchmarks on mobile on top of the operator_benchmark framework. It does so in a few steps:

1. Create a scripted module from an existing benchmark case.
2. Run the mobile-specific optimization pass on the scripted module.
3. Run the scripted module on AiBench by calling its Python API.

A small change in how a benchmark case is written is introduced so that both local and mobile runs can share the same interface. The change is to pass inputs as arguments of the `forward` function, so that the mobile optimization pass can run successfully (otherwise everything would be optimized away by constant propagation).
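
A minimal sketch of that interface change (a hypothetical module, not the actual operator_benchmark code): because the inputs arrive as `forward` arguments rather than baked-in attributes, the mobile optimization pass cannot constant-fold the whole computation away.

```python
import torch
from torch.utils.mobile_optimizer import optimize_for_mobile


class AddBenchmarkSketch(torch.nn.Module):
    # Inputs are forward() arguments, not module attributes, so optimization
    # passes cannot replace the add with a precomputed constant.
    def forward(self, input_one: torch.Tensor, input_two: torch.Tensor) -> torch.Tensor:
        return torch.add(input_one, input_two)


scripted = torch.jit.script(AddBenchmarkSketch())  # step 1: scripted module
mobile_ready = optimize_for_mobile(scripted)       # step 2: mobile-specific optimization pass
# step 3 would hand `mobile_ready` to AiBench; here we just run it locally:
out = mobile_ready(torch.rand(1, 1, 1), torch.rand(1, 1, 1))
```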

Test Plan:
## local op_bench run

buck run caffe2/benchmarks/operator_benchmark:benchmark_all_test --  --iterations 1 --warmup_iterations 1

buck run caffe2/benchmarks/operator_benchmark:benchmark_all_test --  --iterations 1 --warmup_iterations 1 --use_jit

Exceptions: `py_module` op in `FakeQuantizePerTensorBaseOpBenchmark` and `FakeQuantizePerChannelBaseOpBenchmark` under JIT mode. These tests also failed in the base version

```
RuntimeError:
Module 'FakeQuantizePerChannelOpBenchmark' has no attribute 'op_func' (This function exists as an attribute on the Python module, but we failed to compile it to a TorchScript function.
The error stack is reproduced here:

Python builtin <built-in method apply of FunctionMeta object at 0x619000c652a0> is currently not supported in Torchscript:
  File "/data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/pt/quantization_test#link-tree/quantization_test.py", line 260
    quant_min: int, quant_max: int
):
    return _LearnableFakeQuantizePerChannelOp.apply(input, scale, zero_point, axis, quant_min, quant_max, 1.0)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
:
  File "/data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/pt/quantization_test#link-tree/quantization_test.py", line 313
        axis: int, quant_min: int, quant_max: int
    ):
        return self.op_func(input, scale, zero_point, axis, quant_min, quant_max)
               ~~~~~~~~~~~~ <--- HERE
```

`_consume_op` typing mismatch: chunk, split, qobserver, sort in qunary. These will be fixed in D24774105

## OSS test

python3 -m benchmark_all_test --iterations 1 --warmup_iterations 1 --use_jit
python3 -m benchmark_all_test --iterations 1 --warmup_iterations 1

## saved module graph
```
module __torch__.mobile_benchmark_utils.OpBenchmarkMobile {
  parameters {
  }
  attributes {
    training = True
    num_iters = 1
    benchmark = <__torch__.pt.add_test.___torch_mangle_4.AddBenchmark object at 0x6070001b8b50>
  }
  methods {
    method forward {
      graph(%self : __torch__.mobile_benchmark_utils.OpBenchmarkMobile):
        %12 : None = prim::Constant() # /data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/fb/pt/mobile/benchmark_all_test_fbcode#link-tree/mobile_benchmark_utils.py:9:4
        %4 : bool = prim::Constant[value=1]() # /data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/fb/pt/mobile/benchmark_all_test_fbcode#link-tree/mobile_benchmark_utils.py:10:8
        %1 : int = prim::GetAttr[name="num_iters"](%self)
         = prim::Loop(%1, %4) # /data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/fb/pt/mobile/benchmark_all_test_fbcode#link-tree/mobile_benchmark_utils.py:10:8
          block0(%i : int):
            %6 : __torch__.pt.add_test.___torch_mangle_4.AddBenchmark = prim::GetAttr[name="benchmark"](%self)
            %7 : __torch__.pt.add_test.___torch_mangle_4.AddBenchmark = prim::GetAttr[name="benchmark"](%self)
            %self.inputs_tuple : (Float(1, 1, 1, strides=[1, 1, 1], requires_grad=0, device=cpu), Float(1, 1, 1, strides=[1, 1, 1], requires_grad=0, device=cpu)) = prim::Constant[value=({0.48884}, {0.809042})]()
            %9 : Tensor, %10 : Tensor = prim::TupleUnpack(%self.inputs_tuple)
            %23 : int = prim::Constant[value=1]()
            %24 : Tensor = aten::add(%9, %10, %23) # /data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/fb/pt/mobile/benchmark_all_test_fbcode#link-tree/pt/add_test.py:39:15
            -> (%4)
        return (%12)

    }
  }
  submodules {
    module __torch__.pt.add_test.___torch_mangle_4.AddBenchmark {
      parameters {
      }
      attributes {
        mobile_optimized = True
      }
      methods {
        method forward {
          graph(%self : __torch__.pt.add_test.___torch_mangle_4.AddBenchmark,
                %input_one.1 : Tensor,
                %input_two.1 : Tensor):
            %3 : int = prim::Constant[value=1]()
            %4 : Tensor = aten::add(%input_one.1, %input_two.1, %3) # /data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/fb/pt/mobile/benchmark_all_test_fbcode#link-tree/pt/add_test.py:39:15
            return (%4)

        }
        method get_inputs {
          graph(%self : __torch__.pt.add_test.___torch_mangle_4.AddBenchmark):
            %self.inputs_tuple : (Float(1, 1, 1, strides=[1, 1, 1], requires_grad=0, device=cpu), Float(1, 1, 1, strides=[1, 1, 1], requires_grad=0, device=cpu)) = prim::Constant[value=({0.48884}, {0.809042})]()
            return (%self.inputs_tuple)

        }
      }
      submodules {
      }
    }
  }
}

```

Reviewed By: kimishpatel

Differential Revision: D24322214

fbshipit-source-id: 335317eca4f40c4083883eb41dc47caf25cbdfd1
2020-11-12 17:15:05 -08:00
Xiang Gao
20ac736200 Remove py2 compatible future imports (#44735)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44735

Reviewed By: mruberry

Differential Revision: D23731306

Pulled By: ezyang

fbshipit-source-id: 0ba009a99e475ddbe22981be8ac636f8a1c8b02f
2020-09-16 12:55:57 -07:00
Mingzhe Li
af3468a1c7 change op bench input shape to reduce execution time (#29616)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29616

1. Reduce `predefined_min_time`, which is the minimum time each test needs to run. Based on the test results, the average time across different epochs is pretty stable before exiting, so we can safely reduce the predefined time here.
2. Change the input shapes of several ops.

Test Plan:
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: add
200 256.044864655
400 165.850520134
800 163.579881191
1600 162.871927023
3200 160.3128016
# Mode: Eager
# Name: add_cpu_M64_K64_bwd1_N64
# Input: device: cpu, K: 64, M: 64, N: 64
Backward Execution Time (us) : 164.715

# Benchmarking PyTorch: add
200 170.650482178
400 168.895125389
800 169.867575169
1600 163.400024176
3200 168.658420444
# Mode: Eager
# Name: add_cpu_M64_K64_bwd2_N64
# Input: device: cpu, K: 64, M: 64, N: 64
Backward Execution Time (us) : 168.777
```

Reviewed By: hl475

Differential Revision: D18438540

fbshipit-source-id: 1fd27cf4bbc34e46e74393af912ee2fcb75c33b2
2019-11-11 16:58:27 -08:00
Mingzhe Li
e86450620d add cuda to all op benchmark (#29285)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29285

as title

Test Plan:
```
buck run mode/dev-nosan //caffe2/benchmarks/operator_benchmark:benchmark_all_test -- --iterations 1

# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: ConvTranspose2d
# Mode: Eager
# Name: ConvTranspose2d_kernel3_out_c256_H16_in_c256_N1_stride1_W16_cpu
# Input: kernel: 3, out_c: 256, H: 16, in_c: 256, N: 1, stride: 1, W: 16, device: cpu
Forward Execution Time (us) : 10434.151
```

Reviewed By: hl475

Differential Revision: D18338258

fbshipit-source-id: 944e87d1ec70daadb205faaf2825d4a2202086c5
2019-11-06 09:37:00 -08:00
Mingzhe Li
94d2599d77 unify softmax benchmark (#28911)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28911

as title

Test Plan:
```
buck run mode/opt //caffe2/benchmarks/operator_benchmark/pt:softmax_test
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short
# Benchmarking PyTorch: Softmax
# Mode: Eager
# Name: Softmax_N4_C3_H256_W256_cpu
# Input: N: 4, C: 3, H: 256, W: 256, device: cpu
Forward Execution Time (us) : 17929.381
...
```

Reviewed By: hl475

Differential Revision: D18231517

fbshipit-source-id: 61f35849e1f4cf44cf09e60a7b618f8e9fc67b9c
2019-10-30 17:46:05 -07:00
Mingzhe Li
4703854321 change softmax input shape (#28836)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28836

as title

Test Plan:
```
buck run mode/opt //caffe2/benchmarks/operator_benchmark/pt:softmax_test
Invalidating internal cached state: Buck configuration options changed between invocations. This may cause slower builds.
  Changed value project.buck_out='buck-out/opt' (was 'buck-out/dev')
  ... and 56 more. See logs for all changes
Parsing buck files: finished in 6.2 sec
Creating action graph: finished in 8.8 sec
Building: finished in 05:42.6 min (100%) 28336/28336 jobs, 23707 updated
  Total time: 05:57.7 min
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: Softmax
/proc/self/fd/4/softmax_test.py:57: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
  """
# Mode: Eager
# Name: Softmax_N4_C3_H256_W256
# Input: N: 4, C: 3, H: 256, W: 256
Forward Execution Time (us) : 18422.487
```

Reviewed By: hl475

Differential Revision: D18202335

fbshipit-source-id: 0bb376cb465d998a49196e148d48d436126ae334
2019-10-29 12:05:25 -07:00
Huamin Li
1c81d9006a increase input shape to reduce variance (#25812)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25812

as title

Test Plan:
```
[huaminli@devvm2388.ftw3 ~/fbsource/fbcode] buck run mode/dev-nosan caffe2/benchmarks/operator_benchmark:benchmark_all_test -- --operators None --iterations 3
```
last few lines of the output P109238440

Reviewed By: mingzhe09088

Differential Revision: D17246792

fbshipit-source-id: d93ee5f404164d32210968997c6ea63b82058d2a
2019-09-07 06:25:26 -07:00
Mingzhe Li
b453fd9916 separate input shapes to reduce default execution time (#24136)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24136

This diff aims to reduce the execution time of benchmark_all_test, which runs all the supported operator benchmarks. In the default run, only one shape of each operator is benchmarked. The rest of the benchmarks can be triggered with the tag_filter flag.
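
As a hedged sketch of how the tagging works (written against the current-style `operator_benchmark` interface, so the exact helper usage is an assumption rather than this diff's code): configs tagged `short` run by default, while larger configs tagged `long` only run when selected via the tag filter flag.

```python
import operator_benchmark as op_bench
import torch

# Default run: only the "short"-tagged shape is benchmarked.
add_short_configs = op_bench.config_list(
    attr_names=["M", "N", "K"],
    attrs=[[64, 64, 64]],
    cross_product_configs={"device": ["cpu"]},
    tags=["short"],
)

# Larger shapes only run when explicitly requested via the tag filter flag.
add_long_configs = op_bench.cross_product_configs(
    M=[256, 1024], N=[256, 1024], K=[64], device=["cpu"], tags=["long"]
)


class AddBenchmark(op_bench.TorchBenchmarkBase):
    def init(self, M, N, K, device):
        self.inputs = {
            "input_one": torch.rand(M, N, K, device=device),
            "input_two": torch.rand(M, N, K, device=device),
        }

    def forward(self, input_one, input_two):
        return torch.add(input_one, input_two)


op_bench.generate_pt_test(add_short_configs + add_long_configs, AddBenchmark)

if __name__ == "__main__":
    op_bench.benchmark_runner.main()
```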

Reviewed By: hl475

Differential Revision: D16736448

fbshipit-source-id: 33bd86f6fc2610f87f24240ad559fb11d3e35e89
2019-08-09 17:09:21 -07:00
Mingzhe Li
45aad2e680 change unary, pool, max ops to use new interface (#22661)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22661

as title

Reviewed By: hl475

Differential Revision: D16170825

fbshipit-source-id: d80944224b8717e7aa35980907ff48e587b85217
2019-07-09 16:41:32 -07:00
Mingzhe Li
6cf4df5d06 add PT softmax ops to the benchmark suite (#21208)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21208

The diff adds softmax, softmax2d, and logsoftmax to the benchmark suite.
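
For reference, a minimal sketch of the three operators being added to the suite (input shape chosen arbitrarily for illustration):

```python
import torch

x = torch.rand(4, 3, 256, 256)  # an arbitrary N x C x H x W input

softmax_out = torch.nn.Softmax(dim=1)(x)        # Softmax
softmax2d_out = torch.nn.Softmax2d()(x)         # Softmax2d (softmax over the channel dim)
logsoftmax_out = torch.nn.LogSoftmax(dim=1)(x)  # LogSoftmax
```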

Reviewed By: zheng-xq

Differential Revision: D15526265

fbshipit-source-id: b7ba63032dba7146765513c8cb1ac5a6a7bd1a68
2019-06-28 13:58:20 -07:00