Xiang Gao
9f336bdf10
Fixes new tf32 failures in test_nn.py ( #52871 )
...
Summary:
Also modify the `tf32_on_and_off` decorator to make it support functions without a `device` argument.
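The device-argument handling described above can be sketched in plain Python (a simplified, hypothetical sketch; the real decorator lives in `torch.testing._internal.common_cuda` and toggles `torch.backends.cuda.matmul.allow_tf32` around each run):

```python
import functools
import inspect

def tf32_on_and_off(tol=1e-5):
    """Sketch: run a test twice, once with TF32 off and once with TF32 on.
    (The real decorator toggles torch.backends.cuda.matmul.allow_tf32.)"""
    def decorator(fn):
        # Inspect the test's signature so `device` is only forwarded when
        # the function actually declares it -- the change in this commit.
        accepts_device = 'device' in inspect.signature(fn).parameters

        @functools.wraps(fn)
        def wrapped(*args, **kwargs):
            if not accepts_device:
                kwargs.pop('device', None)
            results = []
            for tf32_enabled in (False, True):
                # A real implementation would flip the TF32 switch here.
                results.append(fn(*args, **kwargs))
            return results
        return wrapped
    return decorator

@tf32_on_and_off(0.005)
def test_without_device():  # note: no `device` parameter
    return 'ran'

print(test_without_device(device='cuda:0'))  # -> ['ran', 'ran']
```

Passing `device` to a test that does not declare it is now silently dropped instead of raising a `TypeError`.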
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52871
Reviewed By: ngimel
Differential Revision: D27286674
Pulled By: mruberry
fbshipit-source-id: 14f6d558271bd6a1d0bc40691c170d47e81de1ff
2021-03-24 21:53:33 -07:00
Sam Estep
8c798e0622
Forbid trailing whitespace ( #53406 )
...
Summary:
Context: https://github.com/pytorch/pytorch/pull/53299#discussion_r587882857
These are the only hand-written parts of this diff:
- the addition to `.github/workflows/lint.yml`
- the file endings changed in these four files (to appease FB-internal land-blocking lints):
- `GLOSSARY.md`
- `aten/src/ATen/core/op_registration/README.md`
- `scripts/README.md`
- `torch/csrc/jit/codegen/fuser/README.md`
The rest was generated by running this command (on macOS):
```
git grep -I -l ' $' -- . ':(exclude)**/contrib/**' ':(exclude)third_party' | xargs gsed -i 's/ *$//'
```
I looked over the auto-generated changes and didn't see anything that looked problematic.
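The condition the lint enforces can be sketched in Python (a hypothetical helper for illustration, not the actual workflow implementation):

```python
def trailing_whitespace_lines(text):
    # Return 1-based line numbers whose content ends in spaces or tabs,
    # mirroring what the `git grep -I -l ' $'` invocation above detects.
    return [i for i, line in enumerate(text.splitlines(), start=1)
            if line != line.rstrip()]

sample = "clean line\ndirty line \nanother clean\n"
print(trailing_whitespace_lines(sample))  # -> [2]
```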
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53406
Test Plan:
This run (after adding the lint but before removing existing trailing spaces) failed:
- https://github.com/pytorch/pytorch/runs/2043032377
This run (on the tip of this PR) succeeded:
- https://github.com/pytorch/pytorch/runs/2043296348
Reviewed By: walterddr, seemethere
Differential Revision: D26856620
Pulled By: samestep
fbshipit-source-id: 3f0de7f7c2e4b0f1c089eac9b5085a58dd7e0d97
2021-03-05 17:22:55 -08:00
Rong Rong (AI Infra)
b52e2e6045
[BE] _get_torch_cuda_version should return tuple ( #52409 )
...
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52409
Reviewed By: jbschlosser, glaringlee
Differential Revision: D26513924
Pulled By: walterddr
fbshipit-source-id: ee18ef357c326c5ad344d80c59821cc2b8814734
2021-02-18 09:28:38 -08:00
Xiang Gao
b822aba8ec
Enable BFloat support for gemms on arch other than ampere ( #50442 )
...
Summary:
Fixes #{issue number}
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50442
Reviewed By: bdhirsh
Differential Revision: D26044981
Pulled By: mruberry
fbshipit-source-id: 65c42f2c1de8d24e4852a1b5bd8f4b1735b2230e
2021-01-26 11:07:07 -08:00
Gao, Xiang
3f5eee666c
Adjust TF32 tests ( #44240 )
...
Summary:
- The thresholds of some tests are bumped up. Depending on the random seed, these tests sometimes fail with errors like "0.0059 is not smaller than 0.005". I ran `test_nn.py` and `test_torch.py` 10+ times to check that these are no longer flaky.
- Add `tf32_on_and_off` to new `matrix_exp` tests.
- Disable TF32 on test suites other than `test_nn.py` and `test_torch.py`
cc: ptrblck
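The threshold bumps follow from TF32's reduced precision: TF32 keeps float32's 8 exponent bits but only 10 mantissa bits. A torch-free sketch of the mantissa truncation (an approximation; real hardware rounds rather than truncates, so this slightly overstates the error):

```python
import struct

def truncate_to_tf32(x):
    # TF32 keeps float32's 8 exponent bits but only 10 mantissa bits.
    # Simulate by zeroing the low 13 of float32's 23 mantissa bits.
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    bits &= ~((1 << 13) - 1)
    return struct.unpack('<f', struct.pack('<I', bits))[0]

x = 1.001
rel_err = abs(x - truncate_to_tf32(x)) / x
# Relative error is bounded by 2**-10 for normal numbers, far looser
# than float32's 2**-23 -- hence the bumped test tolerances.
assert rel_err < 2**-10
print(rel_err)
```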
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44240
Reviewed By: mruberry
Differential Revision: D23882498
Pulled By: ngimel
fbshipit-source-id: 44a9ec08802c93a2efaf4e01d7487222478b6df8
2020-09-24 10:25:58 -07:00
Xiao Wang
d75c402755
Add cusolver to build, rewrite MAGMA inverse with cusolver ( #42403 )
...
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42265
This PR adds cusolver to the PyTorch build and enables the use of cusolver/cublas library functions for GPU `torch.inverse` on certain tensor shapes.
Specifically, when
* the tensor is two dimensional (single batch), or
* has >2 dimensions (multiple batches) and `batch_size <= 2`, or
* magma is not linked,
cusolver/cublas will be used. Otherwise, the current MAGMA implementation is still used.
8c0949ae45/aten/src/ATen/native/cuda/BatchLinearAlgebra.cu (L742-L752)
The reason for this is that for tensors with large batch_size, `cublasXgetrfBatched` and `cublasXgetriBatched` don't perform very well. For `batch_size > 1`, we launch cusolver functions in multiple streams. This lets the cusolver functions run in parallel and can greatly increase performance. When `batch_size > 2`, the parallel-launched cusolver functions are slightly slower than the current magma implementation, so we still use the magma impl there.
On CUDA 9.2, some numerical issues were detected, so the cusolver impl will not be used there. The cusolver impl is also not used on platforms other than NVIDIA CUDA.
060769feaf/aten/src/ATen/native/cuda/BatchLinearAlgebraLib.h (L10-L13)
Note that there is a new heuristic used before cusolver/cublas calls here:
8c0949ae45/aten/src/ATen/native/cuda/MiscUtils.h (L113-L121)
where `use_loop_launch = true` means launching single-batch cusolver functions in parallel, and `use_loop_launch = false` means using the cublas_X_batched functions. When magma is enabled (only `batch_size <= 2` is dispatched to cusolver/cublas), the heuristic always returns `true`, and the cusolver calls are faster than the small-batch_size magma calls. When magma is disabled, this adds `torch.inverse` functionality that was previously disabled for all shapes (though large-batch_size cublas performance may not be as good as magma's).
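The dispatch rules described above can be summarized as a small decision function (names and signature are hypothetical; the real logic lives in `aten/src/ATen/native/cuda/BatchLinearAlgebra.cu`):

```python
def inverse_backend(ndim, batch_size, magma_linked, cuda92=False):
    """Sketch of which backend computes torch.inverse on GPU,
    per the conditions described in this PR summary."""
    if cuda92:
        return 'magma'     # numerical issues observed on CUDA 9.2
    if not magma_linked:
        return 'cusolver'  # cusolver now covers all shapes
    if ndim == 2 or batch_size <= 2:
        return 'cusolver'  # single matrix or tiny batch: parallel cusolver wins
    return 'magma'         # larger batches: magma is still faster

print(inverse_backend(ndim=2, batch_size=1, magma_linked=True))   # cusolver
print(inverse_backend(ndim=3, batch_size=8, magma_linked=True))   # magma
print(inverse_backend(ndim=3, batch_size=8, magma_linked=False))  # cusolver
```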
Checklist:
- [X] Add benchmark, cpu, gpu-before (magma), gpu-after (cusolver)
- [X] Rewrite single inverse (ndim == 2) with cusolver
- [X] Rewrite batched inverse (ndim > 2) with cublas
- [X] Add cusolver to build
- [x] Clean up functions related to `USE_MAGMA` define guard
- [x] Workaround for non-cuda platform
- [x] Workaround for cuda 9.2
- [x] Add zero size check
- [x] Add tests
Next step:
If cusolver doesn't cause any problems in the PyTorch build, and no major performance regressions are reported after this PR is merged, I will start porting other cusolver/cublas linear algebra functions to improve performance.
<details>
<summary> benchmark 73499c6 </summary>
benchmark code: https://github.com/xwang233/code-snippet/blob/master/torch.inverse/inverse-cusolver.ipynb
shape meaning:
* `[] 2 torch.float32 -> torch.randn(2, 2, dtype=torch.float32)`
* `[2] 4 torch.float32 -> torch.randn(2, 4, 4, dtype=torch.float32)`
| shape | cpu_time (ms) | gpu_time_before (magma) (ms) | gpu_time_after (ms) |
| --- | --- | --- | --- |
| [] 2 torch.float32 | 0.095 | 7.534 | 0.129 |
| [] 4 torch.float32 | 0.009 | 7.522 | 0.129 |
| [] 8 torch.float32 | 0.011 | 7.647 | 0.138 |
| [] 16 torch.float32 | 0.075 | 7.582 | 0.135 |
| [] 32 torch.float32 | 0.073 | 7.573 | 0.191 |
| [] 64 torch.float32 | 0.134 | 7.694 | 0.288 |
| [] 128 torch.float32 | 0.398 | 8.073 | 0.491 |
| [] 256 torch.float32 | 1.054 | 11.860 | 1.074 |
| [] 512 torch.float32 | 5.218 | 14.130 | 2.582 |
| [] 1024 torch.float32 | 19.010 | 18.780 | 6.936 |
| [1] 2 torch.float32 | 0.009 | 0.113 | 0.128 ***regressed |
| [1] 4 torch.float32 | 0.009 | 0.113 | 0.131 ***regressed |
| [1] 8 torch.float32 | 0.011 | 0.116 | 0.129 ***regressed |
| [1] 16 torch.float32 | 0.015 | 0.122 | 0.135 ***regressed |
| [1] 32 torch.float32 | 0.032 | 0.177 | 0.178 ***regressed |
| [1] 64 torch.float32 | 0.070 | 0.420 | 0.281 |
| [1] 128 torch.float32 | 0.328 | 0.816 | 0.490 |
| [1] 256 torch.float32 | 1.125 | 1.690 | 1.084 |
| [1] 512 torch.float32 | 4.344 | 4.305 | 2.576 |
| [1] 1024 torch.float32 | 16.510 | 16.340 | 6.928 |
| [2] 2 torch.float32 | 0.009 | 0.113 | 0.186 ***regressed |
| [2] 4 torch.float32 | 0.011 | 0.115 | 0.184 ***regressed |
| [2] 8 torch.float32 | 0.012 | 0.114 | 0.184 ***regressed |
| [2] 16 torch.float32 | 0.019 | 0.119 | 0.173 ***regressed |
| [2] 32 torch.float32 | 0.050 | 0.170 | 0.240 ***regressed |
| [2] 64 torch.float32 | 0.120 | 0.429 | 0.375 |
| [2] 128 torch.float32 | 0.576 | 0.830 | 0.675 |
| [2] 256 torch.float32 | 2.021 | 1.748 | 1.451 |
| [2] 512 torch.float32 | 9.070 | 4.749 | 3.539 |
| [2] 1024 torch.float32 | 33.655 | 18.240 | 12.220 |
| [4] 2 torch.float32 | 0.009 | 0.112 | 0.318 ***regressed |
| [4] 4 torch.float32 | 0.010 | 0.115 | 0.319 ***regressed |
| [4] 8 torch.float32 | 0.013 | 0.115 | 0.320 ***regressed |
| [4] 16 torch.float32 | 0.027 | 0.120 | 0.331 ***regressed |
| [4] 32 torch.float32 | 0.085 | 0.173 | 0.385 ***regressed |
| [4] 64 torch.float32 | 0.221 | 0.431 | 0.646 ***regressed |
| [4] 128 torch.float32 | 1.102 | 0.834 | 1.055 ***regressed |
| [4] 256 torch.float32 | 4.042 | 1.811 | 2.054 ***regressed |
| [4] 512 torch.float32 | 18.390 | 4.884 | 5.087 ***regressed |
| [4] 1024 torch.float32 | 69.025 | 19.840 | 20.000 ***regressed |
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42403
Reviewed By: ailzhang, mruberry
Differential Revision: D23717984
Pulled By: ngimel
fbshipit-source-id: 54cbd9ea72a97989cff4127089938e8a8e29a72b
2020-09-18 20:43:29 -07:00
Rong Rong
b5dd6e3e61
split torch.testing._internal.* and add type checking for torch.testing._internal.common_cuda ( #44575 )
...
Summary:
First step to fix https://github.com/pytorch/pytorch/issues/42969 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44575
Reviewed By: malfet
Differential Revision: D23668740
Pulled By: walterddr
fbshipit-source-id: eeb3650b1780aaa5727b525b4e6182e1bc47a83f
2020-09-14 14:04:02 -07:00
Gao, Xiang
5e97f251a8
Enable TF32 support for cuDNN ( #40737 )
...
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40737
Reviewed By: mruberry
Differential Revision: D22801525
Pulled By: ngimel
fbshipit-source-id: ac7f7e728b4b3e01925337e8c9996f26a6433fd2
2020-09-01 15:34:24 -07:00
Xiang Gao
23174ca71b
[reland] Enable TF32 support for cuBLAS ( #41498 )
...
Summary:
Fix ROCm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41498
Reviewed By: mruberry
Differential Revision: D22560572
Pulled By: ngimel
fbshipit-source-id: 5ee79e96cb29e70d9180830d058efb53d1c6c041
2020-07-15 21:00:55 -07:00
Shen Li
3a63a939d4
Revert D22517785: [pytorch][PR] Enable TF32 support for cuBLAS
...
Test Plan: revert-hammer
Differential Revision:
D22517785 (288ece89e1)
Original commit changeset: 87334c893561
fbshipit-source-id: 0a0674f49c1bcfc98f7f88af5a8c7de93b76e458
2020-07-15 08:15:48 -07:00
Xiang Gao
288ece89e1
Enable TF32 support for cuBLAS ( #40800 )
...
Summary:
Benchmark on a fully connected network and torchvision models (time in seconds) on GA100:
| model | batch size | forward(TF32) | forward(FP32) | backward(TF32) | backward(FP32) |
|--------------------|------------|---------------|---------------|----------------|----------------|
| FC 512-128-32-8 | 512 | 0.000211 | 0.000321 | 0.000499 | 0.000532 |
| alexnet | 512 | 0.0184 | 0.0255 | 0.0486 | 0.0709 |
| densenet161 | 128 | 0.0665 | 0.204 | 0.108 | 0.437 |
| googlenet | 256 | 0.0925 | 0.110 | 0.269 | 0.326 |
| inception_v3 | 256 | 0.155 | 0.214 | 0.391 | 0.510 |
| mnasnet1_0 | 512 | 0.108 | 0.137 | 0.298 | 0.312 |
| mobilenet_v2 | 512 | 0.114 | 0.294 | 0.133 | 0.303 |
| resnet18 | 512 | 0.0722 | 0.100 | 0.182 | 0.228 |
| resnext50_32x4d | 256 | 0.170 | 0.237 | 0.373 | 0.479 |
| shufflenet_v2_x1_0 | 512 | 0.0463 | 0.0473 | 0.125 | 0.123 |
| squeezenet1_0 | 512 | 0.0870 | 0.0948 | 0.205 | 0.214 |
| vgg16 | 256 | 0.167 | 0.234 | 0.401 | 0.502 |
| wide_resnet50_2 | 512 | 0.186 | 0.310 | 0.415 | 0.638 |
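For a sense of scale, the forward-pass speedups implied by a few rows of the table above (values copied verbatim from the table):

```python
# Forward-pass times in seconds (TF32, FP32), copied from the table above.
forward_times = {
    'densenet161':        (0.0665, 0.204),
    'mobilenet_v2':       (0.114,  0.294),
    'resnet18':           (0.0722, 0.100),
    'shufflenet_v2_x1_0': (0.0463, 0.0473),
}

speedups = {m: fp32 / tf32 for m, (tf32, fp32) in forward_times.items()}
for model, s in sorted(speedups.items(), key=lambda kv: -kv[1]):
    print(f'{model}: {s:.2f}x')
```

GEMM-heavy models like densenet161 see roughly 3x, while models dominated by non-GEMM work (e.g. shufflenet_v2_x1_0) barely move.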
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40800
Reviewed By: mruberry
Differential Revision: D22517785
Pulled By: ngimel
fbshipit-source-id: 87334c8935616f72a6af5abbd3ae69f76923dc3e
2020-07-14 13:21:10 -07:00
Jithun Nair
dc1f9eee53
Avoid printing erroneous warning about "MIOpen not found" for ROCm builds ( #33837 )
...
Summary:
Older versions of MIOpen (<= 2.2) don't have the `miopenGetVersion` API, but MIOpen is always part of the ROCm builds, so do NOT set `lib` to `None` for ROCm builds. `__cudnn_version` will be `None` for older versions of MIOpen.
Setting `lib` to `None` ends up printing the following erroneous warning when running unit tests:
```
/root/.local/lib/python3.6/site-packages/torch/backends/cudnn/__init__.py:120: UserWarning: cuDNN/MIOpen library not found. Check your LD_LIBRARY_PATH
}.get(sys.platform, 'LD_LIBRARY_PATH')))
```
Eg.: https://ci.pytorch.org/jenkins/job/pytorch-builds/job/py3.6-clang7-rocmdeb-ubuntu16.04-test2/18387/consoleFull
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33837
Differential Revision: D20369285
Pulled By: xw285cornell
fbshipit-source-id: e82e6f8f5bccb486213cf868f40aece41ce11f98
2020-04-17 20:31:01 -07:00
Pritam Damania
f050b16dd9
Move pytorch distributed tests to separate folder for contbuild. ( #30445 )
...
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30445
Create distributed and rpc directories under caffe/test for better management
of unit tests.
Differential Revision: D18702786
fbshipit-source-id: e9daeed0cfb846ef68806f6decfcb57c0e0e3606
2020-01-22 21:16:59 -08:00