Aaron Gokaslan
dbad6d71c7
[BE][Ez]: Unskip conv1d MPS test ( #154795 )
...
Fixes an issue I noticed where the conv1d test is unconditionally skipped for complex types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154795
Approved by: https://github.com/jansel
2025-05-31 23:01:19 +00:00
eqy
823a35807c
[CUDA][CUDNN] Dispatch to cuDNN for non-batch-splittable 64-bit NCHW convolutions ( #153101 )
...
For #152816
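A hedged illustration (not from the PR) of the kind of case the title describes: with batch size 1, an input with at least 2**31 elements cannot be split along N to fit 32-bit indexing, so it is now dispatched to cuDNN. Shapes are assumed and require a very large GPU.
```python
import torch
import torch.nn.functional as F

# Illustrative only: a single-sample NCHW input with 2**31 elements
# (8 * 16384 * 16384) cannot be batch-split to stay under 32-bit indexing.
x = torch.randn(1, 8, 16384, 16384, device="cuda", dtype=torch.half)
w = torch.randn(8, 8, 3, 3, device="cuda", dtype=torch.half)
y = F.conv2d(x, w, padding=1)
```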
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153101
Approved by: https://github.com/Skylion007
2025-05-20 20:19:03 +00:00
PyTorch MergeBot
bf0fe4f828
Revert "[CUDA][CUDNN] Dispatch to cuDNN for non-batch-splittable 64-bit NCHW convolutions ( #153101 )"
...
This reverts commit ced90d23d3 .
Reverted https://github.com/pytorch/pytorch/pull/153101 on behalf of https://github.com/jeanschmidt due to Seems to have introduced breakages on main, tentative revert: https://github.com/pytorch/pytorch/actions/runs/15024667248/job/42224521705 ([comment](https://github.com/pytorch/pytorch/pull/153101#issuecomment-2881208171 ))
2025-05-14 18:52:07 +00:00
eqy
ced90d23d3
[CUDA][CUDNN] Dispatch to cuDNN for non-batch-splittable 64-bit NCHW convolutions ( #153101 )
...
For #152816
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153101
Approved by: https://github.com/Skylion007
2025-05-14 15:22:47 +00:00
Eddie Yan
ec68d082a1
[CUDA][TF32] Account for TF32 in test_conv2d_same_padding ( #152618 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152618
Approved by: https://github.com/msaroufim , https://github.com/Skylion007
2025-05-02 20:19:00 +00:00
Jagadish Krishnamoorthy
0d99b4e9e2
ROCm: Enable tf32 testing on test_nn ( #148945 )
...
Add tf32 support for ROCm tests.
Test command: `python test/test_nn.py -v`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148945
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-04-28 23:01:04 +00:00
Alvaro-Kothe
8ce3d4a541
test(Conv3d): use correct class for test_Conv3d_module_same_padding ( #152187 )
...
The test for the class `Conv3d` is calling `Conv2d`. This PR just ensures that we are testing the correct module.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152187
Approved by: https://github.com/Skylion007
2025-04-28 16:59:12 +00:00
cyy
970fefcc53
Remove outdated skipCUDAIfCudnnVersionLessThan decoration ( #148940 )
...
Test conditions for CUDNN 7 and 8 were removed because we have moved to CUDNN 9.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148940
Approved by: https://github.com/mikaylagawarecki
2025-03-13 18:02:50 +00:00
cyy
a5f6b24d87
Remove outdated skipIfRocmVersionLessThan decorations ( #148941 )
...
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148941
Approved by: https://github.com/jeffdaily
2025-03-11 18:37:40 +00:00
Jeff Daily
44248c44eb
[ROCm] miopen benchmark behavior now better aligns with cudnn ( #145294 )
...
The default benchmark setting is now false. With the new MIOpen behavior, when benchmarking is disabled, any shape without a find hit triggers a quick search (the prior default behavior) and uses that result. When benchmarking is enabled, MIOpen performs an exhaustive search and updates any DBs. MIOpen immediate mode is still available and is used when deterministic is true and benchmark is false.
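A minimal sketch of the PyTorch-side switches involved (the search behavior itself lives in MIOpen):
```python
import torch

# On ROCm, these cudnn flags control MIOpen.
torch.backends.cudnn.benchmark = False      # new default: quick search on a find miss
torch.backends.cudnn.deterministic = True   # with benchmark=False, MIOpen uses immediate mode
torch.backends.cudnn.benchmark = True       # exhaustive search; results update the find DBs
```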
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145294
Approved by: https://github.com/BrianHarrisonAMD , https://github.com/malfet
2025-02-05 17:19:53 +00:00
Benjamin Glass
5aa5a5763e
[inductor triton] Disable incorrect TF32 usage on CUDA capability < 8 ( #145684 )
...
Triton 2.2 and greater have a bug where allowing TF32 generation for a GPU that does not support TF32 will cause code generation errors. Patch around this problem by:
1. Adding a function to `torch.cuda` that determines whether CUDA hardware is capable of using the TF32 format.
2. Using that function to explicitly disable TF32 generation when calling Triton, where needed.
To demonstrate that this fix works, try running `test/inductor/test_max_autotune.py` without it on a GPU with CUDA compute capability < 8 (e.g., a pre-Ampere NVIDIA consumer GPU).
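A minimal sketch of such a capability check, assuming the usual compute-capability rule; the actual helper added to `torch.cuda` may differ in name and details.
```python
import torch

def cuda_supports_tf32() -> bool:
    # TF32 tensor cores arrived with Ampere (compute capability 8.0).
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability()
    return major >= 8
```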
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145684
Approved by: https://github.com/eqy
2025-01-28 22:01:08 +00:00
PyTorch MergeBot
6a4fb4b615
Revert "Align CPU behavior with CUDA for ConvTranspose when out_channels=0 ( #142859 )"
...
This reverts commit cb814c0b96 .
Reverted https://github.com/pytorch/pytorch/pull/142859 on behalf of https://github.com/malfet due to It broke ROCM tests again, see 5cd2b34e82/1 ([comment](https://github.com/pytorch/pytorch/pull/142859#issuecomment-2614523822 ))
2025-01-26 17:49:05 +00:00
Wu, Chunyuan
cb814c0b96
Align CPU behavior with CUDA for ConvTranspose when out_channels=0 ( #142859 )
...
Fixes https://github.com/pytorch/pytorch/issues/142466 .
Remove the `weight.numel() != 0` check to align the behavior with CUDA for `ConvTranspose` when `out_channels=0`. After removing this check, the existing code is already able to give an empty output in such a case.
Test plan:
```
python -u test/nn/test_convolution.py -k test_ConvTranspose_output_channels_0_cpu_float32
python -u test/nn/test_convolution.py -k test_ConvTranspose_output_channels_0_cuda_float32
```
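A hedged sketch of the aligned behavior (shapes assumed, not taken from the test):
```python
import torch
import torch.nn as nn

# With out_channels=0, CPU should now return an empty output like CUDA
# instead of rejecting the empty weight.
m = nn.ConvTranspose2d(in_channels=4, out_channels=0, kernel_size=3)
y = m(torch.randn(2, 4, 8, 8))
print(y.shape)  # expected: torch.Size([2, 0, 10, 10])
```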
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142859
Approved by: https://github.com/mingfeima , https://github.com/malfet
2025-01-26 01:56:40 +00:00
PyTorch MergeBot
d95a6babcc
Revert "Align CPU behavior with CUDA for ConvTranspose when out_channels=0 ( #142859 )"
...
This reverts commit 0bff377880 .
Reverted https://github.com/pytorch/pytorch/pull/142859 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the XLA failures look legit ([comment](https://github.com/pytorch/pytorch/pull/142859#issuecomment-2608631019 ))
2025-01-23 01:10:31 +00:00
Wu, Chunyuan
0bff377880
Align CPU behavior with CUDA for ConvTranspose when out_channels=0 ( #142859 )
...
Fixes https://github.com/pytorch/pytorch/issues/142466 .
Remove the `weight.numel() != 0` check to align the behavior with CUDA for `ConvTranspose` when `out_channels=0`. After removing this check, the existing code is already able to give an empty output in such a case.
Test plan:
```
python -u test/nn/test_convolution.py -k test_ConvTranspose_output_channels_0_cpu_float32
python -u test/nn/test_convolution.py -k test_ConvTranspose_output_channels_0_cuda_float32
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142859
Approved by: https://github.com/mingfeima , https://github.com/malfet
2025-01-22 17:52:53 +00:00
Tom Ritchford
eaef613688
Fix issue with test/nn/test_convolution:TestConvolutionNNDeviceTypeCUDA.test_conv_large_batch_1_cuda ( #145067 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145067
Approved by: https://github.com/Skylion007 , https://github.com/nWEIdia
Co-authored-by: Wei Wang <143543872+nWEIdia@users.noreply.github.com>
2025-01-17 20:31:25 +00:00
Tom Ritchford
c947a7d38e
Fix unused Python variables in test/nn ( #143396 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143396
Approved by: https://github.com/mikaylagawarecki
2024-12-18 03:30:54 +00:00
Nikita Shulga
9c88b08ac9
[BE] Replace skipIfMPS with expectedFailureMPS ( #139940 )
...
Functionally, the two decorators are very similar, but one should rely on expectedFailure as much as possible to get a signal when something is fixed (see the sketch after this list).
- Move the `product_version` variable from `test_mps` to common_utils, but call it `MACOS_VERSION`
- Introduce `skipIfMPSOnMacOS13` to decorate the hard crashes that happen only on MacOS 13 (which at this point will not get any fixes and will be deprecated soon)
- Add `device_type='mps'` to all `skipIfMPS` per https://github.com/pytorch/pytorch/issues/140560
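A small illustration of the rationale, using plain `unittest` rather than the PyTorch decorators themselves:
```python
import unittest

class Demo(unittest.TestCase):
    # A skipped test stays silent forever; an expected failure flips to
    # "unexpected success" the moment the bug is fixed, which is the
    # signal this change is after.
    @unittest.expectedFailure
    def test_known_bug(self):
        self.assertEqual(1 + 1, 3)
```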
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139940
Approved by: https://github.com/janeyx99 , https://github.com/huydhn
2024-11-15 03:48:37 +00:00
Eddie Yan
846b4e614b
[TF32][cuDNN][Convolution] Add some missing TF32 decorators ( #138768 )
...
Newer cuDNN versions seem to be able to dispatch these cases to TF32 cuDNN kernels, hence the missing decorators.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138768
Approved by: https://github.com/Skylion007
2024-10-25 19:03:42 +00:00
Siddharth Kotapati
e27c0048db
Enable additional tests for MPS CI runs ( #134356 )
...
As part of the follow-up for https://github.com/pytorch/pytorch/issues/133520, this adapts existing unused tests for use in MPS CI runs, focusing on NHWC and other memory-format tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134356
Approved by: https://github.com/malfet , https://github.com/eqy , https://github.com/huydhn
2024-10-04 21:52:38 +00:00
Mikayla Gawarecki
d9576c9440
Fix failures when default is flipped for weights_only ( #127627 )
...
Tests on the XLA shard are not fixed yet, but there is an issue tracking them: https://github.com/pytorch/xla/issues/7799
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127627
Approved by: https://github.com/albanD
ghstack dependencies: #132349
2024-08-16 00:22:43 +00:00
Xuehai Pan
fbe6f42dcf
[BE][Easy][8/19] enforce style for empty lines in import segments in test/[k-p]*/ ( #129759 )
...
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by the linter.
You can review these PRs via:
```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129759
Approved by: https://github.com/justinchuby , https://github.com/ezyang
2024-07-31 02:09:20 +00:00
eellison
28f29e074b
Dont mutate tensor stride in place in cudnn conv ( #126786 )
...
Fix for https://github.com/pytorch/pytorch/issues/126241.
Within the cuDNN convolution, we were updating the tensor's strides in place to disambiguate between contiguous and channels-last layouts for size-1 dims. Instead of mutating the tensor's strides, just use a temporary. Inside cuDNN it is then copied: d7ccb5b3c4/include/cudnn_frontend_Tensor.h (L201-L203).
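A hedged Python sketch of the idea (the real change is in the C++ cuDNN path): compute disambiguated strides into a temporary instead of writing them back into the tensor.
```python
import torch

def strides_for_backend(t: torch.Tensor) -> tuple[int, ...]:
    # For size-1 dims a tensor can look both contiguous and channels-last;
    # return canonical strides in a temporary, leaving t itself untouched.
    n, c, h, w = t.shape
    if t.is_contiguous(memory_format=torch.channels_last):
        return (h * w * c, 1, w * c, c)  # canonical NHWC strides
    return (c * h * w, h * w, w, 1)      # canonical NCHW strides
```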
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126786
Approved by: https://github.com/ezyang , https://github.com/shunting314 , https://github.com/eqy
2024-05-22 01:53:44 +00:00
eqy
973d724e21
[CUDA] Fix 64-bit indexing in vol2col in conv3d ( #124650 )
...
Similar to #118005, this fixes sometimes-silent illegal memory accesses (IMAs) that can occur.
CC @atalman @malfet
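A back-of-the-envelope illustration (sizes assumed) of why 32-bit indexing overflows here:
```python
# The unfolded vol2col column buffer can exceed 2**31 - 1 elements even
# for a single sample, so int32 offsets silently wrap.
INT32_MAX = 2**31 - 1
c, kd, kh, kw = 64, 3, 3, 3          # channels and 3D kernel
d_out, h_out, w_out = 128, 256, 256  # output volume
col_elems = c * kd * kh * kw * d_out * h_out * w_out
print(col_elems, col_elems > INT32_MAX)  # 14495514624 True
```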
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124650
Approved by: https://github.com/soulitzer
2024-04-25 23:21:43 +00:00
PyTorch MergeBot
24ed909934
Revert "[CUDA] Fix 64-bit indexing in vol2col in conv3d ( #124650 )"
...
This reverts commit 71d92bace2 .
Reverted https://github.com/pytorch/pytorch/pull/124650 on behalf of https://github.com/jeanschmidt due to Reverting to check if it introduced regressions for linux-focal-rocm6.0-py3.8 tests ([comment](https://github.com/pytorch/pytorch/pull/124650#issuecomment-2076786795 ))
2024-04-25 09:46:21 +00:00
Eddie Yan
71d92bace2
[CUDA] Fix 64-bit indexing in vol2col in conv3d ( #124650 )
...
Similar to #118005, this fixes sometimes-silent illegal memory accesses (IMAs) that can occur.
CC @atalman @malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124650
Approved by: https://github.com/soulitzer
2024-04-24 19:47:18 +00:00
Yuanhao Ji
a625705290
Enable UFMT on all of test/nn ( #123809 )
...
Part of: #123062
Ran lintrunner on:
- `test/nn`
with command:
```bash
lintrunner -a --take UFMT --all-files
```
Co-authored-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123809
Approved by: https://github.com/mikaylagawarecki
2024-04-12 18:32:25 +00:00
eqy
624e58f2c6
[CUDA] Update size_1 conv tests with TF32 thresholds ( #118022 )
...
Seeing some numerical mismatches on A100
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118022
Approved by: https://github.com/atalman
2024-04-09 23:49:40 +00:00
Eddie Yan
3db618d656
[CUDA] Use 64-bit indexing in CUDA_KERNEL_LOOP in im2col ( #118005 )
...
#117736
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118005
Approved by: https://github.com/atalman
2024-04-09 21:04:20 +00:00
Xia Weiwen
d1510e01fa
Upgrade submodule onednn to v3.3.5 ( #120767 )
...
This upgrade contains fixes for the known issues introduced by oneDNN v3.3.2, including https://github.com/pytorch/pytorch/issues/115346, https://github.com/pytorch/pytorch/issues/120211 and https://github.com/pytorch/pytorch/issues/120406, as well as those listed in PR #112700.
Issue https://github.com/pytorch/pytorch/issues/115346 (perf regression) was fixed by oneDNN v3.3.4. No new regression was found with v3.3.5. The detailed results of v3.3.4 are given below and compared with v3.1.1 (the oneDNN version in PyTorch before it was updated to v3.3.2).
1. A performance regression with 5.8% perf drop from `pytorch_stargan-train` (see https://github.com/pytorch/benchmark/issues/2076#issuecomment-1847545843 )
Validation results with this patch: Latency increased by 0.60%
```
Tested on an Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz instance (IceLake)
oneDNN v3.1.1
metrics-1484287.json
{
"name": "cpu",
"environ": {
"pytorch_git_version": "6c8c5ad5eaf47a62fafbb4a2747198cbffbf1ff0"
},
"metrics": {
"latency": 418.851717
}
}
oneDNN v3.3.4
{
"name": "cpu",
"environ": {
"pytorch_git_version": "6c8c5ad5eaf47a62fafbb4a2747198cbffbf1ff0"
},
"metrics": {
"latency": 421.381313
}
}
```
2. Performance regression of FP32 rexnet_100 with Inductor, dynamic shape, multi-threads (see https://github.com/pytorch/pytorch/issues/115346#issue-2030859592 )
Validation results with this patch: Latency reduced by 3.23%
```
Tested on an Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz instance (IceLake)
oneDNN v3.1.1
(inductor speedup over eager mode) 2.876x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,rexnet_100,128,2.875904,113.314765,18.455283,0.990437,1302.636134,1315.212902,351,1,0,0
oneDNN v3.3.4
(inductor speedup over eager mode) 3.003x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,rexnet_100,128,3.003012,109.653012,91.547260,0.990048,1302.532506,1315.625370,351,1,0,0
```
3. Performance regression of AMP hf_T5_generate and tinynet_a with Inductor, static shape, multi-threads (see https://github.com/pytorch/pytorch/issues/115346#issuecomment-1856029962 )
Validation results with this patch: Latency reduced by 0.85%
```
Tested on an AWS spr metal instance
oneDNN v3.1.1
(inductor speedup over eager mode) 1.120x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,hf_T5_generate,1,1.120018,1197.807729,205.905466,0.442803,125.179904,282.698957,10550,48,8,4
oneDNN v3.3.4
(inductor speedup over eager mode) 1.134x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,hf_T5_generate,1,1.133594,1187.701514,205.855527,0.422012,128.405094,304.268493,10550,48,8,4
```
The following issues about functionality are fixed by this upgrade. Test cases are also added for these issues.
- https://github.com/pytorch/pytorch/issues/120211
- https://github.com/pytorch/pytorch/issues/120406
- https://github.com/pytorch/pytorch/issues/120547
-----
Below are detailed data of torchbench CPU userbenchmark test and Inductor FP32/AMP inference tests. No regression of perf or functionality was found.
I. *torchbench CPU userbenchmark test*
Suite | Speedup
-- | --
eager_throughtput_bf16_infer | 1.001848
eager_throughtput_fp32_infer | 1.000257
eager_throughtput_fx_int8 | 1.003069
jit_llga_throughtput_amp_bf16 | 1.000682
jit_llga_throughtput_fp32 | 1.000313
eager_throughtput_bf16_train | 0.998222
eager_throughtput_fp32_train | 1.003384
II. *Inductor FP32/AMP inference tests*
i. FP32 static default
suite | name | thread | batch size | Ratio Speedup(New/old)
-- | -- | -- | -- | --
torchbench | timm_efficientnet | multiple | 64 | 1.09
timm_models | tinynet_a | multiple | 128 | 1.14
ii. FP32 dynamic default
suite | name | thread | batch size | Ratio Speedup(New/old)
-- | -- | -- | -- | --
torchbench | alexnet | multiple | 128 | 1.08
torchbench | basic_gnn_edgecnn | multiple | 1 | 0.98
torchbench | timm_efficientnet | multiple | 64 | 1.08
iii. AMP static default
suite | name | thread | batch size | Ratio Speedup(New/old)
-- | -- | -- | -- | --
torchbench | hf_distil_whisper | multiple | 1 | 1.18
torchbench | timm_efficientnet | multiple | 64 | 1.32
huggingface | BartForConditionalGeneration | multiple | 2 | 1.19
timm_models | eca_halonext26ts | multiple | 128 | 1.13
timm_models | nfnet_l0 | multiple | 128 | 1.13
timm_models | rexnet_100 | multiple | 128 | 1.45
timm_models | spnasnet_100 | multiple | 128 | 1.15
timm_models | tf_efficientnet_b0 | multiple | 128 | 1.22
timm_models | tinynet_a | multiple | 128 | 1.49
torchbench | hf_Bert_large | single | 1 | 1.16
huggingface | XLNetLMHeadModel | single | 1 | 1.07
iv. AMP dynamic default
suite | name | thread | batch size | Ratio Speedup(New/old)
-- | -- | -- | -- | --
torchbench | timm_efficientnet | multiple | 64 | 1.32
huggingface | PLBartForConditionalGeneration | multiple | 4 | 1.14
timm_models | nfnet_l0 | multiple | 128 | 1.15
timm_models | rexnet_100 | multiple | 128 | 1.45
timm_models | tinynet_a | multiple | 128 | 1.34
huggingface | XLNetLMHeadModel | single | 1 | 1.09
-----
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120767
Approved by: https://github.com/chuanqi129 , https://github.com/jgong5 , https://github.com/atalman
2024-03-11 12:56:59 +00:00
Eddie Yan
d790c1dca6
[CUDA][cuDNN][TF32] Misc TF32 updates ( #118781 )
...
Twiddle some thresholds that don't seem to play nice with sm90.
CC @tinglvv @nWEIdia @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118781
Approved by: https://github.com/ezyang
2024-02-01 15:32:50 +00:00
Damien
2d2016fdf8
WIP Add compatibility with channels_last_3d for conv3d ( #114790 )
...
Part of a multi-PR work to fix #59168
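A hedged sketch of the compatibility being added (shapes assumed; this is WIP, so behavior may still differ):
```python
import torch
import torch.nn as nn

# channels_last_3d should propagate through Conv3d once supported.
m = nn.Conv3d(8, 16, kernel_size=3).to(memory_format=torch.channels_last_3d)
x = torch.randn(2, 8, 16, 16, 16).to(memory_format=torch.channels_last_3d)
print(m(x).is_contiguous(memory_format=torch.channels_last_3d))
```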
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114790
Approved by: https://github.com/albanD
2023-12-20 19:28:25 +00:00
PyTorch MergeBot
a7bfa04da6
Revert "More markDynamoStrictTest ( #115870 )"
...
This reverts commit 7f686c8fe1 .
Reverted https://github.com/pytorch/pytorch/pull/115870 on behalf of https://github.com/jeanschmidt due to Breaking internal tests and builds, please check diff ([comment](https://github.com/pytorch/pytorch/pull/115870#issuecomment-1862997125 ))
2023-12-19 15:40:57 +00:00
rzou
7f686c8fe1
More markDynamoStrictTest ( #115870 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115870
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845 , #115855 , #115856 , #115857 , #115858
2023-12-15 05:26:54 +00:00
Jithun Nair
2ea2421b44
Skip unit tests that fail on MI210 runners ( #114613 )
...
Taken from https://github.com/pytorch/pytorch/pull/105980
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114613
Approved by: https://github.com/malfet
2023-11-27 22:25:35 +00:00
rraminen
44367c59b2
Update skip reason for failing unit tests on ROCm 5.7 ( #113286 )
...
Follow-up to https://github.com/pytorch/pytorch/pull/110465: updated the skip reason for failing unit tests on ROCm 5.7.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113286
Approved by: https://github.com/malfet
2023-11-13 19:29:04 +00:00
rraminen
3a429423fc
Upgrade CI to ROCm5.7 ( #110465 )
...
This PR upgrades CI to ROCm 5.7.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110465
Approved by: https://github.com/pruthvistony , https://github.com/malfet
2023-11-08 06:11:10 +00:00
Pruthvi Madugundu
9ce2e02fd6
Revert "[ROCm] Remove PYTORCH_MIOPEN_SUGGEST_NHWC flag ( #90725 )" ( #110319 )
...
This reverts commit 66bfcd32fd .
NHWC has a perf regression on MIOpen, so reverting until the performance issue is fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110319
Approved by: https://github.com/jeffdaily , https://github.com/jithunnair-amd , https://github.com/kit1980
2023-10-03 19:14:47 +00:00
CaoE
7c9052165a
add fp16 support for native conv and deconv on CPU ( #99497 )
...
### Testing
Native conv vs. mkldnn conv on SPR (with avx512_fp16 support)
Single core:
Input | Naïve impl / us | oneDNN / us | Speed up
-- | -- | -- | --
IC: 64, OC: 256, kernel: 1, stride: 1, N: 256, H: 56, W: 56, G: 1, pad: 0 | 34676789 | 524199.8 | 66.15185
IC: 128, OC: 512, kernel: 1, stride: 1, N: 256, H: 28, W: 28, G: 1, pad: 0 | 33454125 | 349844.4 | 95.62573
IC: 256, OC: 256, kernel: 3, stride: 1, N: 1, H: 16, W: 16, G: 1, pad: 0 | 317650.1 | 2317.677 | 137.0554
IC: 128, OC: 256, kernel: 3, stride: 1, N: 1, L: 64 | 15334.68 | 167.264 | 91.67952
56 cores:
Input | Naïve impl / us | oneDNN / us | Speed up
-- | -- | -- | --
IC: 64, OC: 256, kernel: 1, stride: 1, N: 256, H: 56, W: 56, G: 1, pad: 0 | 1032064 | 11073.58 | 93.20061
IC: 128, OC: 512, kernel: 1, stride: 1, N: 256, H: 28, W: 28, G: 1, pad: 0 | 1000097 | 16371.19 | 61.08883
IC: 256, OC: 1024, kernel: 1, stride: 1, N: 256, H: 14, W: 14, G: 1, pad: 0 | 981813.4 | 9008.908 | 108.9825
IC: 1024, OC: 256, kernel: 1, stride: 1, N: 256, H: 14, W: 14, G: 1, pad: 0 | 1082606 | 10150.47 | 106.6558
IC: 256, OC: 256, kernel: 3, stride: 1, N: 1, H: 16, W: 16, G: 1, pad: 0 | 319980.6 | 181.598 | 1762.027
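A hedged sketch of what the support enables (shapes loosely follow the first benchmark row):
```python
import torch
import torch.nn as nn

# Convolution can now run natively in float16 on CPU.
m = nn.Conv2d(64, 256, kernel_size=1).to(torch.float16)
x = torch.randn(256, 64, 56, 56, dtype=torch.float16)
print(m(x).dtype)  # torch.float16
```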
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99497
Approved by: https://github.com/jgong5 , https://github.com/cpuhrsch
2023-09-25 01:31:26 +00:00
Justin Chu
79c5e33349
[BE] Enable ruff's UP rules and autoformat nn/ mps/ and torch/ ( #105436 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105436
Approved by: https://github.com/malfet , https://github.com/albanD
2023-07-21 07:38:46 +00:00
Fuzzkatt
6d570ccd59
tf32 context fixes for various tests ( #103137 )
...
Addresses tf32 context related failures from NVIDIA internal testing for following unit tests:
H100:
- functorch/test_vmap.py: test_op_has_batch_rule
A100:
- test_expanded_weights.py: test_cnn_model_sum
- nn/test_convolution.py: test_conv2d_same_padding_backward
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103137
Approved by: https://github.com/zou3519
2023-06-15 02:33:12 +00:00
Fuzzkatt
f8896b7b0e
update tf32 thresholds in nn/test_convolution.py ( #102015 )
...
Updated tf32 thresholds for test_cudnn_convolution_relu and test_cudnn_convolution_add_relu.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102015
Approved by: https://github.com/ngimel
2023-05-24 22:42:25 +00:00
Fuzzkatt
47e9dba765
move tf32_on_and_off fix for test_convolution.py ( #102007 )
...
Move tf32_on_and_off after `@torch.backends.cudnn.flags(enabled=True, benchmark=False)`, because the flags decorator overwrites tf32_on_and_off's settings when it is applied after it (see the sketch below).
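A hedged sketch of the fixed ordering (threshold value assumed): decorators apply bottom-up, so the lower tf32_on_and_off takes effect inside the flags context rather than being clobbered by it.
```python
import torch
from torch.testing._internal.common_cuda import tf32_on_and_off

@torch.backends.cudnn.flags(enabled=True, benchmark=False)
@tf32_on_and_off(0.001)  # inner decorator: its TF32 setting is applied last
def test_conv_sketch(self, device):
    ...
```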
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102007
Approved by: https://github.com/ngimel
2023-05-24 02:23:06 +00:00
kshitij12345
3b966a6ce3
[autograd] disable backward/grad for complex scalar output ( #92753 )
...
Fixes https://github.com/pytorch/pytorch/issues/92750
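A hedged illustration of the new behavior:
```python
import torch

z = torch.randn(3, dtype=torch.cfloat, requires_grad=True)
try:
    z.sum().backward()  # complex scalar output: now rejected
except RuntimeError as e:
    print("raises:", e)
z.sum().abs().backward()   # real scalar output still works
print(z.grad is not None)  # True
```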
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92753
Approved by: https://github.com/ezyang
2023-02-23 11:38:27 +00:00
Jeff Daily
66bfcd32fd
[ROCm] Remove PYTORCH_MIOPEN_SUGGEST_NHWC flag ( #90725 )
...
Fixes #64427. MIOpen supports ChannelsLast, so opting in via the env var is no longer needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90725
Approved by: https://github.com/malfet
2023-02-09 22:26:24 +00:00
mingfeima
26cba842ad
Optimize ConvTransposed2D with mkldnn float32 and bfloat16 on CPU ( #92530 )
...
This PR optimizes `ConvTranspose2d` with oneDNN and adds channels-last support for it. The fallback path `slow_conv_transpose2d` also gains channels-last support, so memory-format propagation behaves the same with or without oneDNN.
Replacement of https://github.com/pytorch/pytorch/pull/77060, https://github.com/pytorch/pytorch/pull/70897 and https://github.com/pytorch/pytorch/pull/74023, which enable oneDNN for `ConvTranspose2d` and `ConvTranspose3d`.
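A hedged sketch of the propagation guarantee described above (shapes follow the first benchmark row):
```python
import torch
import torch.nn as nn

# Output memory format should follow the input, with or without oneDNN.
m = nn.ConvTranspose2d(32, 32, kernel_size=3)
x = torch.randn(32, 32, 100, 100).to(memory_format=torch.channels_last)
print(m(x).is_contiguous(memory_format=torch.channels_last))  # expected True
```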
The following results were collected on a Skylake Xeon 8180, dual socket, 28 cores per socket.
### single core channels last
configs | forward before/ms | forward after/ms | ratio | backward before/ms | backward after/ms | ratio
-- | -- | -- | -- | -- | -- | --
input size: (32, 32, 100, 100), weight size: (32, 32, 3, 3) | 181.36 | 91.16 | 1.99 | 531.38 | 124.08 | 4.28
input size: (32, 16, 200, 200), weight size: (16, 16, 3, 3) | 324.35 | 153.50 | 2.11 | 973.16 | 185.97 | 5.23
input size: (32, 128, 100, 100), weight size: (128, 128, 3, 3) | 1086.82 | 671.52 | 1.62 | 3008.94 | 1453.33 | 2.07
### single core channels first
configs | forward before/ms | forward after/ms | ratio | backward before/ms | backward after/ms | ratio
-- | -- | -- | -- | -- | -- | --
input size: (32, 32, 100, 100), weight size: (32, 32, 3, 3) | 138.10 | 5.94 | 23.23 | 37.97 | 11.25 | 3.38
input size: (32, 16, 200, 200), weight size: (16, 16, 3, 3) | 236.43 | 8.75 | 27.03 | 87.77 | 18.58 | 4.72
input size: (32, 128, 100, 100), weight size: (128, 128, 3, 3) | 484.39 | 37.69 | 12.85 | 185.40 | 90.57 | 2.05
### single socket channels last
configs | forward before/ms | forward after/ms | ratio | backward before/ms | backward after/ms | ratio
-- | -- | -- | -- | -- | -- | --
input size: (32, 32, 100, 100), weight size: (32, 32, 3, 3) | 138.10 | 5.94 | 23.23 | 37.97 | 11.25 | 3.38
input size: (32, 16, 200, 200), weight size: (16, 16, 3, 3) | 236.43 | 8.75 | 27.03 | 87.77 | 18.58 | 4.72
input size: (32, 128, 100, 100), weight size: (128, 128, 3, 3) | 484.39 | 37.69 | 12.85 | 185.40 | 90.57 | 2.0
### single socket channels first
configs | forward before/ms | forward after/ms | ratio | backward before/ms | backward after/ms | ratio
-- | -- | -- | -- | -- | -- | --
input size: (32, 32, 100, 100), weight size: (32, 32, 3, 3) | 132.56 | 7.19 | 18.43 | 31.43 | 11.20 | 2.81
input size: (32, 16, 200, 200), weight size: (16, 16, 3, 3) | 227.94 | 13.33 | 17.11 | 63.00 | 23.41 | 2.69
input size: (32, 128, 100, 100), weight size: (128, 128, 3, 3) | 473.68 | 52.79 | 8.97 | 150.40 | 87.33 | 1.72
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92530
Approved by: https://github.com/jgong5 , https://github.com/ezyang
2023-02-06 10:11:25 +00:00
Jeff Daily
72502b94f3
correct use of torch.backends.cudnn.flags() ( #93182 )
...
Fixes #77467 .
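A hedged guess at the corrected pattern (the #77467 details are assumed): `flags()` is a context manager, so its settings apply only inside a `with` block and are restored afterwards.
```python
import torch
import torch.nn.functional as F

with torch.backends.cudnn.flags(enabled=True, benchmark=False, deterministic=False):
    y = F.conv2d(torch.randn(1, 3, 8, 8), torch.randn(4, 3, 3, 3), padding=1)
print(torch.backends.cudnn.benchmark)  # original value restored on exit
```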
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93182
Approved by: https://github.com/ngimel
2023-01-28 06:50:06 +00:00
Eddie Yan
dabf515c18
[cuDNN][cuDNN V8 API] (re-re-re-open) cuDNN V8 API on by default ( #91117 )
...
Re-opening following #91025
CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91117
Approved by: https://github.com/ngimel
2022-12-20 18:52:29 +00:00
PyTorch MergeBot
ba7aeac37b
Revert "[cuDNN][cuDNN V8 API] (re-re-open) cuDNN V8 API on by default ( #89022 )"
...
This reverts commit eecd621f06 .
Reverted https://github.com/pytorch/pytorch/pull/89022 on behalf of https://github.com/ngimel due to breaks some convolution configurations #91025
2022-12-16 23:06:35 +00:00
Eddie Yan
eecd621f06
[cuDNN][cuDNN V8 API] (re-re-open) cuDNN V8 API on by default ( #89022 )
...
Testing V8 on by default again after fixes have been merged for, e.g., https://github.com/pytorch/torchdynamo/issues/1833.
One new failure that seems to surface with V8 on appears in halonext + AMP:
```
RuntimeError: Internal Triton PTX codegen error:
Segmentation fault (core dumped)
```
But I'm not sure whether this points to a V8 issue or a Triton issue. CC @ngimel @ptrblck
Current dynamo benchmarks on A100:
v7 vs. v8
|dev |name |batch_size|abs_latency_v7|abs_latency_v8|
|----|-------------------------------|----------|--------------|--------------|
|cuda|adv_inception_v3 |128 |166.0240 |165.5798 |
|cuda|beit_base_patch16_224 |64 |123.5912 |123.0797 |
|cuda|botnet26t_256 |128 |107.7343 |107.5948 |
|cuda|cait_m36_384 |4 |184.5038 |184.0271 |
|cuda|coat_lite_mini |128 |142.3061 |140.5814 |
|cuda|convit_base |64 |165.2499 |161.0743 |
|cuda|convmixer_768_32 |32 |325.6984 |325.7094 |
|cuda|convnext_base |64 |237.4632 |238.0142 |
|cuda|crossvit_9_240 |128 |72.2980 |72.4367 |
|cuda|cspdarknet53 |64 |96.6862 |96.8308 |
|cuda|deit_base_distilled_patch16_224|64 |117.6045 |117.9616 |
|cuda|dla102 |128 |182.3073 |182.2304 |
|cuda|dm_nfnet_f0 |128 |133.6011 |133.6298 |
|cuda|dpn107 |32 |148.5080 |148.5885 |
|cuda|eca_botnext26ts_256 |128 |113.8676 |113.1514 |
|cuda|eca_halonext26ts |128 |119.2242 |119.1845 |
|cuda|ese_vovnet19b_dw |128 |80.0217 |79.9438 |
|cuda|fbnetc_100 |128 |91.4548 |91.4009 |
|cuda|fbnetv3_b |128 |115.4496 |115.5058 |
|cuda|gernet_l |128 |114.8365 |114.7870 |
|cuda|ghostnet_100 |128 |58.5766 |58.5766 |
|cuda|gluon_inception_v3 |128 |165.5222 |165.7167 |
|cuda|gluon_xception65 |32 |165.8779 |165.7818 |
|cuda|gmixer_24_224 |128 |116.3611 |113.4925 |
|cuda|gmlp_s16_224 |128 |121.2607 |121.2534 |
|cuda|hrnet_w18 |128 |246.5706 |246.7599 |
|cuda|inception_v3 |128 |166.1096 |166.2034 |
|cuda|jx_nest_base |32 |93.6064 |93.4088 |
|cuda|lcnet_050 |128 |21.4156 |21.4207 |
|cuda|levit_128 |128 |27.2901 |27.2543 |
|cuda|mixer_b16_224 |128 |157.8992 |158.2878 |
|cuda|mixnet_l |128 |197.3443 |197.2125 |
|cuda|mnasnet_100 |128 |71.4604 |71.2997 |
|cuda|mobilenetv2_100 |128 |67.6080 |67.7515 |
|cuda|mobilenetv3_large_100 |128 |57.7224 |57.6591 |
|cuda|mobilevit_s |64 |93.0372 |93.0530 |
|cuda|nfnet_l0 |128 |113.1664 |113.2853 |
|cuda|pit_b_224 |64 |133.3333 |133.4153 |
|cuda|pnasnet5large |16 |238.9545 |238.8122 |
|cuda|poolformer_m36 |64 |144.2353 |144.2375 |
|cuda|regnety_002 |128 |32.8534 |32.9069 |
|cuda|repvgg_a2 |128 |102.4150 |102.3827 |
|cuda|res2net101_26w_4s |64 |120.8127 |120.8322 |
|cuda|res2net50_14w_8s |128 |149.7052 |149.8969 |
|cuda|res2next50 |128 |153.7439 |153.8215 |
|cuda|resmlp_12_224 |128 |89.1918 |86.9226 |
|cuda|resnest101e |64 |159.4706 |159.3133 |
|cuda|rexnet_100 |128 |88.0032 |88.0397 |
|cuda|sebotnet33ts_256 |64 |80.4635 |80.0120 |
|cuda|selecsls42b |128 |70.4430 |70.3663 |
|cuda|spnasnet_100 |128 |78.0537 |78.1991 |
|cuda|swin_base_patch4_window7_224 |64 |212.9073 |213.0824 |
|cuda|swsl_resnext101_32x16d |32 |193.0229 |193.0404 |
|cuda|tf_efficientnet_b0 |128 |97.1316 |97.0410 |
|cuda|tf_mixnet_l |128 |203.4956 |203.5340 |
|cuda|tinynet_a |128 |82.4038 |82.8733 |
|cuda|tnt_s_patch16_224 |128 |284.8576 |284.8867 |
|cuda|twins_pcpvt_base |64 |118.3893 |119.2329 |
|cuda|visformer_small |128 |126.0533 |126.0390 |
|cuda|vit_base_patch16_224 |64 |118.2873 |118.0573 |
|cuda|volo_d1_224 |64 |108.7764 |108.2063 |
|cuda|xcit_large_24_p8_224 |5 |100.4656 |100.5209 |
v7 vs. v8 amp
|dev |name |batch_size|abs_latency_v7|abs_latency_v8|
|----|-------------------------------|----------|--------------|--------------|
|cuda|adv_inception_v3 |128 |104.9729 |105.1237 |
|cuda|beit_base_patch16_224 |64 |75.4330 |75.2039 |
|cuda|botnet26t_256 |128 |74.5149 |74.8071 |
|cuda|cait_m36_384 |4 |110.9788 |111.5170 |
|cuda|coat_lite_mini |128 |62.3618 |64.4965 |
|cuda|convit_base |64 |116.4054 |117.9129 |
|cuda|convmixer_768_32 |32 |264.4401 |264.4491 |
|cuda|convnext_base |64 |182.9009 |179.2136 |
|cuda|crossvit_9_240 |128 |48.8586 |48.8359 |
|cuda|cspdarknet53 |64 |80.0245 |80.0160 |
|cuda|deit_base_distilled_patch16_224|64 |66.5921 |66.7448 |
|cuda|dla102 |128 |116.7780 |117.1683 |
|cuda|dm_nfnet_f0 |128 |78.9322 |79.1135 |
|cuda|dpn107 |32 |85.5206 |85.7514 |
|cuda|eca_botnext26ts_256 |128 |76.3672 |77.0050 |
|cuda|eca_halonext26ts |128 |86.2458 | |
|cuda|ese_vovnet19b_dw |128 |43.2943 |43.3379 |
|cuda|fbnetc_100 |128 |54.8479 |54.9251 |
|cuda|fbnetv3_b |128 |70.7504 |71.0188 |
|cuda|gernet_l |128 |66.1607 |66.0379 |
|cuda|ghostnet_100 |128 |43.8882 |43.9336 |
|cuda|gluon_inception_v3 |128 |104.9297 |105.0204 |
|cuda|gluon_xception65 |32 |85.7118 |85.8370 |
|cuda|gmixer_24_224 |128 |75.1214 |76.1170 |
|cuda|gmlp_s16_224 |128 |76.4207 |76.6641 |
|cuda|hrnet_w18 |128 |186.1326 |186.2435 |
|cuda|inception_v3 |128 |105.0561 |105.0783 |
|cuda|jx_nest_base |32 |65.3066 |65.3245 |
|cuda|lcnet_050 |128 |14.7991 |14.8687 |
|cuda|levit_128 |128 |19.2893 |19.4772 |
|cuda|mixer_b16_224 |128 |93.9826 |94.2056 |
|cuda|mixnet_l |128 |147.1245 |147.0435 |
|cuda|mnasnet_100 |128 |39.1781 |39.2565 |
|cuda|mobilenetv2_100 |128 |42.3704 |42.3114 |
|cuda|mobilenetv3_large_100 |128 |37.2946 |37.2816 |
|cuda|mobilevit_s |64 |55.8930 |55.8934 |
|cuda|nfnet_l0 |128 |64.0448 |64.4438 |
|cuda|pit_b_224 |64 |80.6342 |80.2933 |
|cuda|pnasnet5large |16 |154.9611 |154.8654 |
|cuda|poolformer_m36 |64 |101.7489 |101.8138 |
|cuda|regnety_002 |128 |27.0939 |27.0309 |
|cuda|repvgg_a2 |128 |60.9651 |61.2533 |
|cuda|res2net101_26w_4s |64 |77.3291 |77.4739 |
|cuda|res2net50_14w_8s |128 |93.6572 |93.7221 |
|cuda|res2next50 |128 |112.4975 |112.3248 |
|cuda|resmlp_12_224 |128 |59.5422 |60.7644 |
|cuda|resnest101e |64 |97.9894 |98.3358 |
|cuda|rexnet_100 |128 |55.2218 |55.0718 |
|cuda|sebotnet33ts_256 |64 |60.4880 |60.8113 |
|cuda|selecsls42b |128 |41.4294 |41.5341 |
|cuda|spnasnet_100 |128 |45.0037 |45.0304 |
|cuda|swin_base_patch4_window7_224 |64 |98.2561 |98.6925 |
|cuda|swsl_resnext101_32x16d |32 |100.6179 |100.9195 |
|cuda|tf_efficientnet_b0 |128 |56.5344 |56.4591 |
|cuda|tf_mixnet_l |128 |153.0318 |152.9367 |
|cuda|tinynet_a |128 |54.1307 |53.9298 |
|cuda|tnt_s_patch16_224 |128 |142.4801 |142.6589 |
|cuda|twins_pcpvt_base |64 |67.9027 |67.8325 |
|cuda|visformer_small |128 |72.5589 |72.9427 |
|cuda|vit_base_patch16_224 |64 |71.4885 |71.7342 |
|cuda|volo_d1_224 |64 |69.3539 |69.5910 |
|cuda|xcit_large_24_p8_224 |5 |59.9000 |59.9699 |
v7 vs. v8 float16
|dev |name |batch_size|abs_latency_v7|abs_latency_v8|
|----|-------------------------------|----------|-----------|-----------|
|cuda|adv_inception_v3 |128 |104.2544 |104.2677 |
|cuda|beit_base_patch16_224 |64 |85.3601 |85.3786 |
|cuda|botnet26t_256 |128 |72.1476 |71.8277 |
|cuda|cait_m36_384 |4 |108.3075 |108.5941 |
|cuda|coat_lite_mini |128 |61.2382 |61.6049 |
|cuda|convmixer_768_32 |32 |263.3818 |263.3598 |
|cuda|convnext_base |64 |172.6821 |173.8520 |
|cuda|crossvit_9_240 |128 |44.6321 |44.6340 |
|cuda|cspdarknet53 |64 |79.3165 |79.2964 |
|cuda|deit_base_distilled_patch16_224|64 |61.9816 |62.2109 |
|cuda|dla102 |128 |115.7403 |115.9928 |
|cuda|dm_nfnet_f0 |128 |77.5434 |77.7440 |
|cuda|dpn107 |32 |83.6489 |83.5605 |
|cuda|eca_botnext26ts_256 |128 |73.9953 |74.1031 |
|cuda|eca_halonext26ts |128 |81.7951 |81.7103 |
|cuda|ese_vovnet19b_dw |128 |42.9618 |42.8853 |
|cuda|fbnetc_100 |128 |54.3590 |54.3575 |
|cuda|fbnetv3_b |128 |69.7977 |70.1696 |
|cuda|gernet_l |128 |64.8684 |65.1726 |
|cuda|ghostnet_100 |128 |43.2054 |43.1319 |
|cuda|gluon_inception_v3 |128 |104.1988 |104.3030 |
|cuda|gluon_xception65 |32 |84.2245 |84.5085 |
|cuda|gmixer_24_224 |128 |82.0418 |82.7252 |
|cuda|gmlp_s16_224 |128 |75.4792 |75.8374 |
|cuda|hrnet_w18 |128 |184.1450 |184.1848 |
|cuda|inception_v3 |128 |104.1203 |104.2536 |
|cuda|jx_nest_base |32 |58.2386 |58.4901 |
|cuda|lcnet_050 |128 |14.6409 |14.5616 |
|cuda|levit_128 |128 |22.3875 |22.4680 |
|cuda|mixer_b16_224 |128 |98.9534 |98.4730 |
|cuda|mixnet_l |128 |146.1623 |146.1947 |
|cuda|mnasnet_100 |128 |38.9208 |39.3463 |
|cuda|mobilenetv2_100 |128 |41.8946 |41.9847 |
|cuda|mobilenetv3_large_100 |128 |36.7810 |36.8264 |
|cuda|mobilevit_s |64 |55.3211 |55.3186 |
|cuda|nfnet_l0 |128 |63.1302 |63.5544 |
|cuda|pit_b_224 |64 |73.8752 |73.4602 |
|cuda|pnasnet5large |16 |151.6806 |151.6111 |
|cuda|poolformer_m36 |64 |86.8341 |86.8021 |
|cuda|regnety_002 |128 |26.6798 |26.5295 |
|cuda|repvgg_a2 |128 |61.6652 |62.1482 |
|cuda|res2net101_26w_4s |64 |75.8037 |75.7739 |
|cuda|res2net50_14w_8s |128 |92.6362 |92.4338 |
|cuda|res2next50 |128 |111.5371 |111.5832 |
|cuda|resmlp_12_224 |128 |58.2349 |57.9807 |
|cuda|resnest101e |64 |96.1114 |96.2742 |
|cuda|rexnet_100 |128 |54.8138 |54.7643 |
|cuda|sebotnet33ts_256 |64 |53.1524 |53.3823 |
|cuda|selecsls42b |128 |40.6070 |40.7104 |
|cuda|spnasnet_100 |128 |44.5732 |44.4318 |
|cuda|swin_base_patch4_window7_224 |64 |98.6447 |98.8445 |
|cuda|swsl_resnext101_32x16d |32 |97.0195 |97.2968 |
|cuda|tf_efficientnet_b0 |128 |56.0640 |56.0278 |
|cuda|tf_mixnet_l |128 |152.0958 |152.0874 |
|cuda|tinynet_a |128 |53.3694 |53.3762 |
|cuda|tnt_s_patch16_224 |128 |130.2981 |130.3726 |
|cuda|twins_pcpvt_base |64 |62.5459 |62.6416 |
|cuda|visformer_small |128 |68.8502 |69.1756 |
|cuda|vit_base_patch16_224 |64 |65.8587 |66.0285 |
|cuda|volo_d1_224 |64 |64.5348 |64.6057 |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89022
Approved by: https://github.com/ngimel
2022-12-15 03:24:44 +00:00