This upgrade fixes the known issues introduced by oneDNN v3.3.2, including https://github.com/pytorch/pytorch/issues/115346, https://github.com/pytorch/pytorch/issues/120211, and https://github.com/pytorch/pytorch/issues/120406, as well as those listed in PR #112700. Issue https://github.com/pytorch/pytorch/issues/115346 (a perf regression) was fixed by oneDNN v3.3.4, and no new regression was found with v3.3.5. Detailed results for v3.3.4 are given below, compared against v3.1.1 (the oneDNN version in PyTorch before it was updated to v3.3.2).

1. A performance regression with a 5.8% perf drop in `pytorch_stargan-train` (see https://github.com/pytorch/benchmark/issues/2076#issuecomment-1847545843)
Validation results with this patch: latency increased by 0.60%
```
Tested on an Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz instance (IceLake)

oneDNN v3.1.1
metrics-1484287.json
{
  "name": "cpu",
  "environ": {
    "pytorch_git_version": "6c8c5ad5eaf47a62fafbb4a2747198cbffbf1ff0"
  },
  "metrics": {
    "latency": 418.851717
  }
}

oneDNN v3.3.4
{
  "name": "cpu",
  "environ": {
    "pytorch_git_version": "6c8c5ad5eaf47a62fafbb4a2747198cbffbf1ff0"
  },
  "metrics": {
    "latency": 421.381313
  }
}
```

2. Performance regression of FP32 rexnet_100 with Inductor, dynamic shape, multi-threads (see https://github.com/pytorch/pytorch/issues/115346#issue-2030859592)
Validation results with this patch: latency reduced by 3.23%
```
Tested on an Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz instance (IceLake)

oneDNN v3.1.1 (inductor speedup over eager mode)
2.876x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,rexnet_100,128,2.875904,113.314765,18.455283,0.990437,1302.636134,1315.212902,351,1,0,0

oneDNN v3.3.4 (inductor speedup over eager mode)
3.003x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,rexnet_100,128,3.003012,109.653012,91.547260,0.990048,1302.532506,1315.625370,351,1,0,0
```

3. Performance regression of AMP hf_T5_generate and tinynet_a with Inductor, static shape, multi-threads (see https://github.com/pytorch/pytorch/issues/115346#issuecomment-1856029962)
Validation results with this patch: latency reduced by 0.85%
```
Tested on an AWS spr metal instance

oneDNN v3.1.1 (inductor speedup over eager mode)
1.120x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,hf_T5_generate,1,1.120018,1197.807729,205.905466,0.442803,125.179904,282.698957,10550,48,8,4

oneDNN v3.3.4 (inductor speedup over eager mode)
1.134x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,hf_T5_generate,1,1.133594,1187.701514,205.855527,0.422012,128.405094,304.268493,10550,48,8,4
```

The percentage changes above are simple relative deltas over the absolute latencies in the raw outputs; a short sketch of the calculation follows the issue list below.

The following functionality issues are fixed by this upgrade, and test cases are added for them:

- https://github.com/pytorch/pytorch/issues/120211
- https://github.com/pytorch/pytorch/issues/120406
- https://github.com/pytorch/pytorch/issues/120547
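As a cross-check, here is a minimal sketch (not part of the PR) of how the latency deltas quoted in items 1 and 2 follow from the absolute latencies (in ms) in the raw outputs above; the `pct_change` helper is illustrative only.

```
# Minimal sketch: recomputing the latency deltas quoted above from the raw
# numbers in this description. The pct_change helper is illustrative only.
def pct_change(old_ms, new_ms):
    """Relative latency change from old_ms to new_ms, in percent (positive = slower)."""
    return (new_ms - old_ms) / old_ms * 100.0

# 1. pytorch_stargan-train userbenchmark latency, oneDNN v3.1.1 -> v3.3.4
print(f"{pct_change(418.851717, 421.381313):+.2f}%")  # +0.60%

# 2. rexnet_100 Inductor abs_latency, oneDNN v3.1.1 -> v3.3.4
print(f"{pct_change(113.314765, 109.653012):+.2f}%")  # -3.23%
```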
-----

Below are detailed data from the torchbench CPU userbenchmark test and the Inductor FP32/AMP inference tests. No regression of perf or functionality was found.

I. *torchbench CPU userbenchmark test*

| Suite | Speedup |
| -- | -- |
| eager_throughtput_bf16_infer | 1.001848 |
| eager_throughtput_fp32_infer | 1.000257 |
| eager_throughtput_fx_int8 | 1.003069 |
| jit_llga_throughtput_amp_bf16 | 1.000682 |
| jit_llga_throughtput_fp32 | 1.000313 |
| eager_throughtput_bf16_train | 0.998222 |
| eager_throughtput_fp32_train | 1.003384 |

II. *Inductor FP32/AMP inference tests*

i. FP32 static default

| suite | name | thread | batch size | Speedup ratio (new/old) |
| -- | -- | -- | -- | -- |
| torchbench | timm_efficientnet | multiple | 64 | 1.09 |
| timm_models | tinynet_a | multiple | 128 | 1.14 |

ii. FP32 dynamic default

| suite | name | thread | batch size | Speedup ratio (new/old) |
| -- | -- | -- | -- | -- |
| torchbench | alexnet | multiple | 128 | 1.08 |
| torchbench | basic_gnn_edgecnn | multiple | 1 | 0.98 |
| torchbench | timm_efficientnet | multiple | 64 | 1.08 |

iii. AMP static default

| suite | name | thread | batch size | Speedup ratio (new/old) |
| -- | -- | -- | -- | -- |
| torchbench | hf_distil_whisper | multiple | 1 | 1.18 |
| torchbench | timm_efficientnet | multiple | 64 | 1.32 |
| huggingface | BartForConditionalGeneration | multiple | 2 | 1.19 |
| timm_models | eca_halonext26ts | multiple | 128 | 1.13 |
| timm_models | nfnet_l0 | multiple | 128 | 1.13 |
| timm_models | rexnet_100 | multiple | 128 | 1.45 |
| timm_models | spnasnet_100 | multiple | 128 | 1.15 |
| timm_models | tf_efficientnet_b0 | multiple | 128 | 1.22 |
| timm_models | tinynet_a | multiple | 128 | 1.49 |
| torchbench | hf_Bert_large | single | 1 | 1.16 |
| huggingface | XLNetLMHeadModel | single | 1 | 1.07 |

iv. AMP dynamic default

| suite | name | thread | batch size | Speedup ratio (new/old) |
| -- | -- | -- | -- | -- |
| torchbench | timm_efficientnet | multiple | 64 | 1.32 |
| huggingface | PLBartForConditionalGeneration | multiple | 4 | 1.14 |
| timm_models | nfnet_l0 | multiple | 128 | 1.15 |
| timm_models | rexnet_100 | multiple | 128 | 1.45 |
| timm_models | tinynet_a | multiple | 128 | 1.34 |
| huggingface | XLNetLMHeadModel | single | 1 | 1.09 |

A sketch of how these new/old ratios can be derived from the raw benchmark CSV outputs is given at the end of this description.

-----

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120767
Approved by: https://github.com/chuanqi129, https://github.com/jgong5, https://github.com/atalman
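For completeness, a minimal sketch of how the new/old speedup ratios in the tables above can be computed from two Inductor benchmark CSV outputs (one per oneDNN version) in the `dev,name,batch_size,speedup,...` format shown earlier. The file names below are placeholders, not artifacts of this PR.

```
# Minimal sketch: deriving new/old speedup ratios from two Inductor benchmark
# CSV files in the dev,name,batch_size,speedup,... format shown above.
# The file names are placeholders.
import csv

def load_speedups(path):
    """Map model name -> Inductor speedup over eager mode from one CSV run."""
    with open(path, newline="") as f:
        return {row["name"]: float(row["speedup"]) for row in csv.DictReader(f)}

old = load_speedups("inductor_onednn_v3.1.1.csv")  # placeholder path
new = load_speedups("inductor_onednn_v3.3.4.csv")  # placeholder path

for name in sorted(old.keys() & new.keys()):
    print(f"{name}: {new[name] / old[name]:.2f} (new/old)")
```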