Commit Graph

15 Commits

Author SHA1 Message Date
cyy
7c90a82970 [Reland] [5/N] Change static functions in headers to inline (#131010)
Reland of #130673

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131010
Approved by: https://github.com/Skylion007
2024-07-18 15:53:48 +00:00
PyTorch MergeBot
c0897919da Revert " [5/N] Change static functions in headers to inline (#130673)"
This reverts commit 4410c44ae6.

Reverted https://github.com/pytorch/pytorch/pull/130673 on behalf of https://github.com/huydhn due to: Sorry for reverting your change, but it causes the CUDA 12.1/12.4 builds to time out in trunk. I am not sure what I am looking at yet, so I am reverting to see if it fixes trunk. Please keep in mind that a cancelled job is counted as a failure ([comment](https://github.com/pytorch/pytorch/pull/130673#issuecomment-2227641368))
2024-07-15 03:27:11 +00:00
cyy
4410c44ae6 [5/N] Change static functions in headers to inline (#130673)
Follows #128286

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130673
Approved by: https://github.com/ezyang
2024-07-14 03:15:28 +00:00
Nikita Shulga
53e32d12c4 [c10] Use nested namespace in c10/cuda (#116464)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116464
Approved by: https://github.com/Skylion007
2023-12-27 23:14:00 +00:00
Nikita Shulga
2564c0c889 avoid CPU std::copysign segfault when compiling on arm64 (take-2) (#55608)
Summary:
Re-land of https://github.com/pytorch/pytorch/issues/51834

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55608

Reviewed By: ngimel

Differential Revision: D27649077

Pulled By: malfet

fbshipit-source-id: 1a21611fb12106f75fe50e8f9f14796ab6ab9464
2021-04-08 11:34:09 -07:00
Natalia Gimelshein
b39eeb07ed Revert D27622277: [pytorch][PR] avoid CPU std::copysign segfault when compiling on arm64 with gcc 7.5 / 8 for CUDA
Test Plan: revert-hammer

Differential Revision:
D27622277 (3bb1f59a9c)

Original commit changeset: a1dc4c3a67f9

fbshipit-source-id: 352443cec6ae0ba794e559f92578192cefbe2ab4
2021-04-07 18:25:32 -07:00
Thomas Viehmann
3bb1f59a9c avoid CPU std::copysign segfault when compiling on arm64 with gcc 7.5 / 8 for CUDA (#51834)
Summary:
It seems that the std::copysign code introduced in https://github.com/pytorch/pytorch/issues/51706 is too much for gcc 7.5 / 8 when compiling on arm64 (e.g. on a Jetson with the latest JetPack) and causes an internal compiler error with a segfault during compilation. This change works around the compiler bug by not using std::copysign.

A very kind person sent a Jetson Xavier NX 🎁 thank you ❤️.

After https://github.com/pytorch/pytorch/issues/51900 fixed this for CPU-only arm64 (e.g. Raspberry Pi), this fixes it for CUDA-using arm64 (e.g. Jetson). CUDA device lambdas must also be present as host functions for technical reasons, but they are never used, so we just assert in the CPU variant instead of actually performing the operation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51834

Reviewed By: mrshenli

Differential Revision: D27622277

Pulled By: malfet

fbshipit-source-id: a1dc4c3a67f925019782e24b796919e17339749f
2021-04-07 09:31:13 -07:00
Erjia Guan
f1ac63d324 Implement copysign (#46396)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46396

Related #38349

[numpy](https://numpy.org/doc/stable/reference/generated/numpy.copysign.html?highlight=copysign#numpy.copysign)
- No in-place function
- No method
- Optional output
- Available: byte, char, bool, int, short, long, float, double, half
- Integral promoted to float
- Not available: float/double complex

`c = np.copysign(a, b)`
| a  | b  | c  | a.grad |
|----|----|----|--------|
| -1 | -1 | -1 |  1 |
| -0 | -1 | -0 |  0 |
|  0 | -1 | -0 |  0 |
|  1 | -1 | -1 | -1 |
| -1 | -0 | -1 |  1 |
| -0 | -0 |  0 |  0 |
|  0 | -0 |  0 |  0 |
|  1 | -0 | -1 | -1 |
| -1 |  0 |  1 | -1 |
| -0 |  0 |  0 |  0 |
|  0 |  0 |  0 |  0 |
|  1 |  0 |  1 |  1 |
| -1 |  1 |  1 | -1 |
| -0 |  1 |  0 |  0 |
|  0 |  1 |  0 |  0 |
|  1 |  1 |  1 |  1 |

This function becomes **non-differentiable** at `a=0` for any `b`. So, in my opinion, we may set the gradient for `a=0` to 0.
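The signed-zero rows in the table above can be illustrated with Python's stdlib `math.copysign` (a sketch of the semantics only, not PyTorch's kernel):

```python
import math

# copysign(a, b) returns |a| with the sign of b; signed zeros matter.
assert math.copysign(1.0, -1.0) == -1.0      # row (a=1, b=-1) -> c=-1
assert math.copysign(-1.0, 1.0) == 1.0       # row (a=-1, b=1) -> c=1

# a = 0 with a negative b yields negative zero (-0.0):
neg_zero = math.copysign(0.0, -1.0)
assert neg_zero == 0.0                        # -0.0 compares equal to 0.0
assert math.copysign(1.0, neg_zero) == -1.0   # but its sign bit is still negative
```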

TODO:
- [x] test (cpu/gpu)
- [x] doc
- [x] ~kernel_vec~

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D24401366

Pulled By: ejguan

fbshipit-source-id: 3621c5ff74b185376a3705589983bb5197ab896d
2020-11-04 08:08:57 -08:00
Masaki Kozuki
6fcabf619d [takeover] BTRS algorithm for fast/efficient binomial sampling (#36858)
Summary:
The original PR is https://github.com/pytorch/pytorch/pull/31278.

CC: ezyang jamestwebber fritzo zasdfgbnm

 ---

<!-- # This PR - CPU
In [1]: import torch; import torch.distributions as dist

In [2]: counts = torch.randint(10, 1000, [1000,1000])
   ...: p = 0.5 * torch.ones(1000, 1000)

In [3]: %timeit dist.binomial.Binomial(total_count=counts, probs=p).sample()
94.8 ms ± 911 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
-->
```
# This PR - GPU
In [1]: import torch; import torch.distributions as dist

In [2]: counts = torch.randint(10, 1000, [1000,1000]).cuda(); p = 0.5 * torch.ones(1000, 1000).cuda()

In [3]:  %timeit dist.binomial.Binomial(total_count=counts, probs=p).sample()
737 µs ± 216 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# master (commit: 806f22b167) - GPU
In [5]: counts = torch.randint(10, 1000, [1000,1000]).cuda(); p = 0.5 * torch.ones(1000, 1000).cuda()

In [6]: %timeit dist.binomial.Binomial(total_count=counts, probs=p).sample()
46.3 ms ± 76.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
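For context on why BTRS matters at these tensor sizes: the naive sampler below (a pure-Python sketch with a hypothetical helper name, not this PR's CUDA kernel) sums `n` Bernoulli trials per draw, which is O(n); BTRS achieves O(1) expected cost per draw via rejection sampling.

```python
import random

def naive_binomial_sample(n: int, p: float, rng: random.Random) -> int:
    """Draw one Binomial(n, p) sample by summing n Bernoulli trials: O(n) per draw.
    BTRS instead uses a rejection scheme with O(1) expected cost per draw."""
    return sum(rng.random() < p for _ in range(n))

rng = random.Random(0)
samples = [naive_binomial_sample(100, 0.5, rng) for _ in range(2000)]
mean = sum(samples) / len(samples)
assert all(0 <= s <= 100 for s in samples)
assert abs(mean - 50.0) < 1.0  # expected value is n * p = 50
```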
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36858

Differential Revision: D21178367

Pulled By: ezyang

fbshipit-source-id: 7e7d6f463e35b07156d69bd7452040b2f9c2eb7a
2020-04-22 15:53:41 -07:00
Xiaomeng Yang
0f3b6f3dec Add min function to cuda math compat (#34723)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34723

Add min function to cuda math compat

Test Plan: unittest

Reviewed By: houseroad

Differential Revision: D20444517

fbshipit-source-id: 1a93343cc57249ef1101eeb7ef373266f6a2873a
2020-03-13 14:31:09 -07:00
Xiaomeng Yang
6b1db202bc Add tanh to c10::cuda::compat (#31844)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31844

Add tanh to c10::cuda::compat

Test Plan: unittest

Reviewed By: bddppq

Differential Revision: D19277230

fbshipit-source-id: d2cceea58722393ecb90aacec05b692dbb92d467
2020-01-03 14:27:36 -08:00
Xiaomeng Yang
8b87f9a510 Add fused layer norm impl on CUDA in PyTorch (#27634)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27634

Add fused layer norm impl on CUDA in PyTorch

Performance benchmark compare to apex.FusedLayerNorm on a V100 machine.

**************************************
Shape = (128, 2097152)
  curr LayerNorm forward: 7.252584544941783ms
  apex LayerNorm forward: 10.366813436849043ms
  curr LayerNorm backward: 15.568048988003284ms
  apex LayerNorm backward: 20.869979876093566ms
**************************************
Shape = (256, 1048576)
  curr LayerNorm forward: 5.185673736967146ms
  apex LayerNorm forward: 6.3868385690730065ms
  curr LayerNorm backward: 13.942008479032665ms
  apex LayerNorm backward: 15.469660016940907ms
**************************************
Shape = (512, 524288)
  curr LayerNorm forward: 4.672068868065253ms
  apex LayerNorm forward: 4.717993081081659ms
  curr LayerNorm backward: 13.46354596503079ms
  apex LayerNorm backward: 14.04774487693794ms
**************************************
Shape = (1024, 262144)
  curr LayerNorm forward: 4.547273400006816ms
  apex LayerNorm forward: 5.378365494078025ms
  curr LayerNorm backward: 13.425063178874552ms
  apex LayerNorm backward: 14.235145597020164ms
**************************************
Shape = (2048, 131072)
  curr LayerNorm forward: 4.526399010093883ms
  apex LayerNorm forward: 4.775081946980208ms
  curr LayerNorm backward: 13.222738380078226ms
  apex LayerNorm backward: 13.59594238596037ms
**************************************
Shape = (4096, 65536)
  curr LayerNorm forward: 4.28789056581445ms
  apex LayerNorm forward: 4.48913648002781ms
  curr LayerNorm backward: 13.026655421825126ms
  apex LayerNorm backward: 13.57052089786157ms
**************************************
Shape = (8192, 32768)
  curr LayerNorm forward: 4.243518367875367ms
  apex LayerNorm forward: 4.34588153520599ms
  curr LayerNorm backward: 13.140627697808668ms
  apex LayerNorm backward: 13.49891544203274ms
**************************************
Shape = (16384, 16384)
  curr LayerNorm forward: 4.181216162163764ms
  apex LayerNorm forward: 4.268723972840235ms
  curr LayerNorm backward: 13.035593512002379ms
  apex LayerNorm backward: 13.463351831072941ms
**************************************
Shape = (32768, 8192)
  curr LayerNorm forward: 4.097899778978899ms
  apex LayerNorm forward: 4.109480210812762ms
  curr LayerNorm backward: 13.041268918896094ms
  apex LayerNorm backward: 13.586135944118723ms

Test Plan: buck test mode/dev-nosan caffe2/test:nn -- "LayerNorm"

Reviewed By: houseroad

Differential Revision: D17462420

fbshipit-source-id: d4a67d160bb4eff73ffac64af46c56c3845cf211
2019-10-14 21:26:33 -07:00
Xiaomeng Yang
93ae040ff0 Add gelu activation in pytorch (#20665)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20665

Add gelu activation forward on CPU in pytorch

Compared to the current Python implementation of gelu used in BERT models, such as:

  def gelu(self, x):
      return x * 0.5 * (1.0 + torch.erf(x / self.sqrt_two))

The torch.nn.functional.gelu function can reduce the forward time from 333ms to 109ms (with MKL) / 112ms (without MKL) for input size = [64, 128, 56, 56] on a devvm.
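The erf-based formula above can be checked with a stdlib-only sketch (not the fused PyTorch kernel):

```python
import math

def gelu(x: float) -> float:
    """Exact (erf-based) GELU: x * 0.5 * (1 + erf(x / sqrt(2)))."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

assert gelu(0.0) == 0.0             # GELU passes through the origin
assert abs(gelu(3.0) - 3.0) < 0.01  # large positive inputs are ~identity
assert abs(gelu(-3.0)) < 0.01       # large negative inputs are ~0
```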

Reviewed By: zheng-xq

Differential Revision: D15400974

fbshipit-source-id: f606b43d1dd64e3c42a12c4991411d47551a8121
2019-06-02 09:08:47 -07:00
bddppq
de0784510d Remove disabled_features in hipify
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/15098

Reviewed By: ezyang

Differential Revision: D13453762

Pulled By: bddppq

fbshipit-source-id: e177042c78f5bf393163d660c25b80285353853d
2018-12-13 15:43:57 -08:00
Edward Yang
fed8d8975a Various improvements to hipify_python.py (#13973)
Summary:
- Speed up hipify_python.py by blacklisting useless (and quite large)
  directory trees that it would otherwise recurse into

- Pass around relative paths instead of absolute paths.  This makes it
  easier to do filename matches based on the root of the tree.

- Redo the streaming output to contain more useful information

- Make it handle c10/cuda correctly, rewrite c10::cuda to
  c10::hip, and the header name from CUDAMathCompat.h to
  CUDAHIPCompat.h

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13973

Differential Revision: D13062374

Pulled By: ezyang

fbshipit-source-id: f0858dd18c94d449ff5dbadc22534c695dc0f8fb
2018-11-14 17:11:24 -08:00