cyy
7c90a82970
[Reland] [5/N] Change static functions in headers to inline (#131010)
...
Reland of #130673
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131010
Approved by: https://github.com/Skylion007
2024-07-18 15:53:48 +00:00
PyTorch MergeBot
c0897919da
Revert "[5/N] Change static functions in headers to inline (#130673)"
...
This reverts commit 4410c44ae6.
Reverted https://github.com/pytorch/pytorch/pull/130673 on behalf of https://github.com/huydhn due to: Sorry for reverting your change, but it causes the CUDA 12.1/12.4 builds to time out in trunk. I am not sure what I am looking at yet, so I am reverting to see if it fixes trunk. Please keep in mind that a cancelled job is counted as a failure ([comment](https://github.com/pytorch/pytorch/pull/130673#issuecomment-2227641368))
2024-07-15 03:27:11 +00:00
cyy
4410c44ae6
[5/N] Change static functions in headers to inline (#130673)
...
Follows #128286
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130673
Approved by: https://github.com/ezyang
2024-07-14 03:15:28 +00:00
Nikita Shulga
53e32d12c4
[c10] Use nested namespace in c10/cuda (#116464)
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116464
Approved by: https://github.com/Skylion007
2023-12-27 23:14:00 +00:00
Nikita Shulga
2564c0c889
avoid CPU std::copysign segfault when compiling on arm64 (take-2) (#55608)
...
Summary:
Re-land of https://github.com/pytorch/pytorch/issues/51834
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55608
Reviewed By: ngimel
Differential Revision: D27649077
Pulled By: malfet
fbshipit-source-id: 1a21611fb12106f75fe50e8f9f14796ab6ab9464
2021-04-08 11:34:09 -07:00
Natalia Gimelshein
b39eeb07ed
Revert D27622277: [pytorch][PR] avoid CPU std::copysign segfault when compiling on arm64 with gcc 7.5 / 8 for CUDA
...
Test Plan: revert-hammer
Differential Revision: D27622277 (3bb1f59a9c)
Original commit changeset: a1dc4c3a67f9
fbshipit-source-id: 352443cec6ae0ba794e559f92578192cefbe2ab4
2021-04-07 18:25:32 -07:00
Thomas Viehmann
3bb1f59a9c
avoid CPU std::copysign segfault when compiling on arm64 with gcc 7.5 / 8 for CUDA (#51834)
...
Summary:
It seems that the std::copysign code introduced in https://github.com/pytorch/pytorch/issues/51706 is too much for gcc 7.5 / 8 when compiled on arm64 (e.g. on Jetson with latest Jetpack) and causes it to produce an internal compiler error with segfault during compilation. This works around the compiler bug by not using std::copysign.
A very kind person sent a Jetson Xavier NX 🎁 thank you ❤.
After https://github.com/pytorch/pytorch/issues/51900 fixed this for CPU-only arm64 (eg Raspberry), this fixes it for CUDA-using arm64 (e.g. Jetson). CUDA device lambdas must also be present as host functions for technical reasons but they are never used, so we just assert in the CPU variant instead of actually doing the operation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51834
Reviewed By: mrshenli
Differential Revision: D27622277
Pulled By: malfet
fbshipit-source-id: a1dc4c3a67f925019782e24b796919e17339749f
2021-04-07 09:31:13 -07:00
Erjia Guan
f1ac63d324
Implement copysign (#46396)
...
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46396
Related #38349
[numpy](https://numpy.org/doc/stable/reference/generated/numpy.copysign.html?highlight=copysign#numpy.copysign)
- No in-place function
- No method
- Optional output
- Available: byte, char, bool, int, short, long, float, double, half
- Integral promoted to float
- Not available: float/double complex
`c = np.copysign(a, b)`
| a | b | c | a.grad |
| --- | --- | --- | --- |
| -1 | -1 | -1 | 1 |
| -0 | -1 | -0 | 0 |
| 0 | -1 | -0 | 0 |
| 1 | -1 | -1 | -1 |
| -1 | -0 | -1 | 1 |
| -0 | -0 | 0 | 0 |
| 0 | -0 | 0 | 0 |
| 1 | -0 | -1 | -1 |
| -1 | 0 | 1 | -1 |
| -0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 1 |
| -1 | 1 | 1 | -1 |
| -0 | 1 | 0 | 0 |
| 0 | 1 | 0 | 0 |
| 1 | 1 | 1 | 1 |
This function becomes **non-differentiable** at `a=0` for any `b`. So, in my opinion, we may set the gradient for `a=0` to 0.
TODO:
- [x] test (cpu/gpu)
- [x] doc
- [x] ~kernel_vec~
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D24401366
Pulled By: ejguan
fbshipit-source-id: 3621c5ff74b185376a3705589983bb5197ab896d
2020-11-04 08:08:57 -08:00
Masaki Kozuki
6fcabf619d
[takeover] BTRS algorithm for fast/efficient binomial sampling (#36858)
...
Summary:
The original PR is https://github.com/pytorch/pytorch/pull/31278.
CC: ezyang jamestwebber fritzo zasdfgbnm
---
<!-- # This PR - CPU
In [1]: import torch; import torch.distributions as dist
In [2]: counts = torch.randint(10, 1000, [1000,1000])
...: p = 0.5 * torch.ones(1000, 1000)
In [3]: %timeit dist.binomial.Binomial(total_count=counts, probs=p).sample()
94.8 ms ± 911 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
-->
```
# This PR - GPU
In [1]: import torch; import torch.distributions as dist
In [2]: counts = torch.randint(10, 1000, [1000,1000]).cuda(); p = 0.5 * torch.ones(1000, 1000).cuda()
In [3]: %timeit dist.binomial.Binomial(total_count=counts, probs=p).sample()
737 µs ± 216 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# master (commit: 806f22b167) - GPU
In [5]: counts = torch.randint(10, 1000, [1000,1000]).cuda(); p = 0.5 * torch.ones(1000, 1000).cuda()
In [6]: %timeit dist.binomial.Binomial(total_count=counts, probs=p).sample()
46.3 ms ± 76.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36858
Differential Revision: D21178367
Pulled By: ezyang
fbshipit-source-id: 7e7d6f463e35b07156d69bd7452040b2f9c2eb7a
2020-04-22 15:53:41 -07:00
Xiaomeng Yang
0f3b6f3dec
Add min function to cuda math compat (#34723)
...
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34723
Add min function to cuda math compat
Test Plan: unittest
Reviewed By: houseroad
Differential Revision: D20444517
fbshipit-source-id: 1a93343cc57249ef1101eeb7ef373266f6a2873a
2020-03-13 14:31:09 -07:00
Xiaomeng Yang
6b1db202bc
Add tanh to c10::cuda::compat (#31844)
...
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31844
Add tanh to c10::cuda::compat
Test Plan: unittest
Reviewed By: bddppq
Differential Revision: D19277230
fbshipit-source-id: d2cceea58722393ecb90aacec05b692dbb92d467
2020-01-03 14:27:36 -08:00
Xiaomeng Yang
8b87f9a510
Add fused layer norm impl on CUDA in PyTorch (#27634)
...
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27634
Add fused layer norm impl on CUDA in PyTorch
Performance benchmark compared to apex.FusedLayerNorm on a V100 machine.
**************************************
Shape = (128, 2097152)
curr LayerNorm forward: 7.252584544941783ms
apex LayerNorm forward: 10.366813436849043ms
curr LayerNorm backward: 15.568048988003284ms
apex LayerNorm backward: 20.869979876093566ms
**************************************
Shape = (256, 1048576)
curr LayerNorm forward: 5.185673736967146ms
apex LayerNorm forward: 6.3868385690730065ms
curr LayerNorm backward: 13.942008479032665ms
apex LayerNorm backward: 15.469660016940907ms
**************************************
Shape = (512, 524288)
curr LayerNorm forward: 4.672068868065253ms
apex LayerNorm forward: 4.717993081081659ms
curr LayerNorm backward: 13.46354596503079ms
apex LayerNorm backward: 14.04774487693794ms
**************************************
Shape = (1024, 262144)
curr LayerNorm forward: 4.547273400006816ms
apex LayerNorm forward: 5.378365494078025ms
curr LayerNorm backward: 13.425063178874552ms
apex LayerNorm backward: 14.235145597020164ms
**************************************
Shape = (2048, 131072)
curr LayerNorm forward: 4.526399010093883ms
apex LayerNorm forward: 4.775081946980208ms
curr LayerNorm backward: 13.222738380078226ms
apex LayerNorm backward: 13.59594238596037ms
**************************************
Shape = (4096, 65536)
curr LayerNorm forward: 4.28789056581445ms
apex LayerNorm forward: 4.48913648002781ms
curr LayerNorm backward: 13.026655421825126ms
apex LayerNorm backward: 13.57052089786157ms
**************************************
Shape = (8192, 32768)
curr LayerNorm forward: 4.243518367875367ms
apex LayerNorm forward: 4.34588153520599ms
curr LayerNorm backward: 13.140627697808668ms
apex LayerNorm backward: 13.49891544203274ms
**************************************
Shape = (16384, 16384)
curr LayerNorm forward: 4.181216162163764ms
apex LayerNorm forward: 4.268723972840235ms
curr LayerNorm backward: 13.035593512002379ms
apex LayerNorm backward: 13.463351831072941ms
**************************************
Shape = (32768, 8192)
curr LayerNorm forward: 4.097899778978899ms
apex LayerNorm forward: 4.109480210812762ms
curr LayerNorm backward: 13.041268918896094ms
apex LayerNorm backward: 13.586135944118723ms
Test Plan: buck test mode/dev-nosan caffe2/test:nn -- "LayerNorm"
Reviewed By: houseroad
Differential Revision: D17462420
fbshipit-source-id: d4a67d160bb4eff73ffac64af46c56c3845cf211
2019-10-14 21:26:33 -07:00
Xiaomeng Yang
93ae040ff0
Add gelu activation in pytorch (#20665)
...
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20665
Add gelu activation forward on CPU in pytorch
Compared to the current Python implementation of gelu used in BERT-style models:
def gelu(self, x):
    return x * 0.5 * (1.0 + torch.erf(x / self.sqrt_two))
The torch.nn.functional.gelu function can reduce the forward time from 333ms to 109ms (with MKL) / 112ms (without MKL) for input size = [64, 128, 56, 56] on a devvm.
Reviewed By: zheng-xq
Differential Revision: D15400974
fbshipit-source-id: f606b43d1dd64e3c42a12c4991411d47551a8121
2019-06-02 09:08:47 -07:00
bddppq
de0784510d
Remove disabled_features in hipify
...
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/15098
Reviewed By: ezyang
Differential Revision: D13453762
Pulled By: bddppq
fbshipit-source-id: e177042c78f5bf393163d660c25b80285353853d
2018-12-13 15:43:57 -08:00
Edward Yang
fed8d8975a
Various improvements to hipify_python.py (#13973)
...
Summary:
- Speed up hipify_python.py by blacklisting useless (and quite large)
directory trees that it would otherwise recurse into
- Pass around relative paths instead of absolute paths. This makes it
easier to do filename matches based on the root of the tree.
- Redo the streaming output to contain more useful information
- Make it handle c10/cuda correctly, rewrite c10::cuda to
c10::hip, and the header name from CUDAMathCompat.h to
CUDAHIPCompat.h
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13973
Differential Revision: D13062374
Pulled By: ezyang
fbshipit-source-id: f0858dd18c94d449ff5dbadc22534c695dc0f8fb
2018-11-14 17:11:24 -08:00