cyy
7c90a82970
[Reland] [5/N] Change static functions in headers to inline (#131010)
...
Reland of #130673
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131010
Approved by: https://github.com/Skylion007
2024-07-18 15:53:48 +00:00
PyTorch MergeBot
c0897919da
Revert "[5/N] Change static functions in headers to inline (#130673)"
...
This reverts commit 4410c44ae6.
Reverted https://github.com/pytorch/pytorch/pull/130673 on behalf of https://github.com/huydhn due to: Sorry for reverting your change, but it causes the CUDA 12.1/12.4 builds to time out in trunk. I am not sure what I am looking at yet, so I am reverting to see if it fixes trunk. Please keep in mind that a cancelled job is counted as a failure ([comment](https://github.com/pytorch/pytorch/pull/130673#issuecomment-2227641368))
2024-07-15 03:27:11 +00:00
cyy
4410c44ae6
[5/N] Change static functions in headers to inline (#130673)
...
Follows #128286
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130673
Approved by: https://github.com/ezyang
2024-07-14 03:15:28 +00:00
Nikita Shulga
53e32d12c4
[c10] Use nested namespace in c10/cuda (#116464)
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116464
Approved by: https://github.com/Skylion007
2023-12-27 23:14:00 +00:00
Nikita Shulga
2564c0c889
avoid CPU std::copysign segfault when compiling on arm64 (take-2) (#55608)
...
Summary:
Re-land of https://github.com/pytorch/pytorch/issues/51834
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55608
Reviewed By: ngimel
Differential Revision: D27649077
Pulled By: malfet
fbshipit-source-id: 1a21611fb12106f75fe50e8f9f14796ab6ab9464
2021-04-08 11:34:09 -07:00
Natalia Gimelshein
b39eeb07ed
Revert D27622277: [pytorch][PR] avoid CPU std::copysign segfault when compiling on arm64 with gcc 7.5 / 8 for CUDA
...
Test Plan: revert-hammer
Differential Revision: D27622277 (3bb1f59a9c)
Original commit changeset: a1dc4c3a67f9
fbshipit-source-id: 352443cec6ae0ba794e559f92578192cefbe2ab4
2021-04-07 18:25:32 -07:00
Thomas Viehmann
3bb1f59a9c
avoid CPU std::copysign segfault when compiling on arm64 with gcc 7.5 / 8 for CUDA (#51834)
...
Summary:
It seems that the std::copysign code introduced in https://github.com/pytorch/pytorch/issues/51706 is too much for gcc 7.5 / 8 when compiled on arm64 (e.g. on Jetson with latest Jetpack) and causes it to produce an internal compiler error with segfault during compilation. This works around the compiler bug by not using std::copysign.
A very kind person sent a Jetson Xavier NX 🎁 thank you ❤.
After https://github.com/pytorch/pytorch/issues/51900 fixed this for CPU-only arm64 (eg Raspberry), this fixes it for CUDA-using arm64 (e.g. Jetson). CUDA device lambdas must also be present as host functions for technical reasons but they are never used, so we just assert in the CPU variant instead of actually doing the operation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51834
Reviewed By: mrshenli
Differential Revision: D27622277
Pulled By: malfet
fbshipit-source-id: a1dc4c3a67f925019782e24b796919e17339749f
2021-04-07 09:31:13 -07:00
Erjia Guan
f1ac63d324
Implement copysign (#46396)
...
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46396
Related #38349
[numpy](https://numpy.org/doc/stable/reference/generated/numpy.copysign.html?highlight=copysign#numpy.copysign)
- No in-place function
- No method
- Optional output
- Available: byte, char, bool, int, short, long, float, double, half
- Integral promoted to float
- Not available: float/double complex
`c = np.copysign(a, b)`
| a | b | c | a.grad |
| --- | --- | --- | --- |
| -1 | -1 | -1 | 1 |
| -0 | -1 | -0 | 0 |
| 0 | -1 | -0 | 0 |
| 1 | -1 | -1 | -1 |
| -1 | -0 | -1 | 1 |
| -0 | -0 | 0 | 0 |
| 0 | -0 | 0 | 0 |
| 1 | -0 | -1 | -1 |
| -1 | 0 | 1 | -1 |
| -0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 1 |
| -1 | 1 | 1 | -1 |
| -0 | 1 | 0 | 0 |
| 0 | 1 | 0 | 0 |
| 1 | 1 | 1 | 1 |
This function becomes **non-differentiable** at `a=0` for any `b`. So, in my opinion, we may set the gradient for `a=0` to 0.
TODO:
- [x] test (cpu/gpu)
- [x] doc
- [x] ~kernel_vec~
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D24401366
Pulled By: ejguan
fbshipit-source-id: 3621c5ff74b185376a3705589983bb5197ab896d
2020-11-04 08:08:57 -08:00
Masaki Kozuki
6fcabf619d
[takeover] BTRS algorithm for fast/efficient binomial sampling (#36858)
...
Summary:
The original PR is https://github.com/pytorch/pytorch/pull/31278.
CC: ezyang jamestwebber fritzo zasdfgbnm
---
<!-- # This PR - CPU
In [1]: import torch; import torch.distributions as dist
In [2]: counts = torch.randint(10, 1000, [1000,1000])
...: p = 0.5 * torch.ones(1000, 1000)
In [3]: %timeit dist.binomial.Binomial(total_count=counts, probs=p).sample()
94.8 ms ± 911 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
-->
```
# This PR - GPU
In [1]: import torch; import torch.distributions as dist
In [2]: counts = torch.randint(10, 1000, [1000,1000]).cuda(); p = 0.5 * torch.ones(1000, 1000).cuda()
In [3]: %timeit dist.binomial.Binomial(total_count=counts, probs=p).sample()
737 µs ± 216 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# master (commit: 806f22b167) - GPU
In [5]: counts = torch.randint(10, 1000, [1000,1000]).cuda(); p = 0.5 * torch.ones(1000, 1000).cuda()
In [6]: %timeit dist.binomial.Binomial(total_count=counts, probs=p).sample()
46.3 ms ± 76.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36858
Differential Revision: D21178367
Pulled By: ezyang
fbshipit-source-id: 7e7d6f463e35b07156d69bd7452040b2f9c2eb7a
2020-04-22 15:53:41 -07:00
Xiaomeng Yang
0f3b6f3dec
Add min function to cuda math compat (#34723)
...
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34723
Add min function to cuda math compat
Test Plan: unittest
Reviewed By: houseroad
Differential Revision: D20444517
fbshipit-source-id: 1a93343cc57249ef1101eeb7ef373266f6a2873a
2020-03-13 14:31:09 -07:00
Xiaomeng Yang
6b1db202bc
Add tanh to c10::cuda::compat (#31844)
...
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31844
Add tanh to c10::cuda::compat
Test Plan: unittest
Reviewed By: bddppq
Differential Revision: D19277230
fbshipit-source-id: d2cceea58722393ecb90aacec05b692dbb92d467
2020-01-03 14:27:36 -08:00
Xiaomeng Yang
8b87f9a510
Add fused layer norm impl on CUDA in PyTorch (#27634)
...
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27634
Add fused layer norm impl on CUDA in PyTorch
Performance benchmark compared to apex.FusedLayerNorm on a V100 machine.
**************************************
Shape = (128, 2097152)
curr LayerNorm forward: 7.252584544941783ms
apex LayerNorm forward: 10.366813436849043ms
curr LayerNorm backward: 15.568048988003284ms
apex LayerNorm backward: 20.869979876093566ms
**************************************
Shape = (256, 1048576)
curr LayerNorm forward: 5.185673736967146ms
apex LayerNorm forward: 6.3868385690730065ms
curr LayerNorm backward: 13.942008479032665ms
apex LayerNorm backward: 15.469660016940907ms
**************************************
Shape = (512, 524288)
curr LayerNorm forward: 4.672068868065253ms
apex LayerNorm forward: 4.717993081081659ms
curr LayerNorm backward: 13.46354596503079ms
apex LayerNorm backward: 14.04774487693794ms
**************************************
Shape = (1024, 262144)
curr LayerNorm forward: 4.547273400006816ms
apex LayerNorm forward: 5.378365494078025ms
curr LayerNorm backward: 13.425063178874552ms
apex LayerNorm backward: 14.235145597020164ms
**************************************
Shape = (2048, 131072)
curr LayerNorm forward: 4.526399010093883ms
apex LayerNorm forward: 4.775081946980208ms
curr LayerNorm backward: 13.222738380078226ms
apex LayerNorm backward: 13.59594238596037ms
**************************************
Shape = (4096, 65536)
curr LayerNorm forward: 4.28789056581445ms
apex LayerNorm forward: 4.48913648002781ms
curr LayerNorm backward: 13.026655421825126ms
apex LayerNorm backward: 13.57052089786157ms
**************************************
Shape = (8192, 32768)
curr LayerNorm forward: 4.243518367875367ms
apex LayerNorm forward: 4.34588153520599ms
curr LayerNorm backward: 13.140627697808668ms
apex LayerNorm backward: 13.49891544203274ms
**************************************
Shape = (16384, 16384)
curr LayerNorm forward: 4.181216162163764ms
apex LayerNorm forward: 4.268723972840235ms
curr LayerNorm backward: 13.035593512002379ms
apex LayerNorm backward: 13.463351831072941ms
**************************************
Shape = (32768, 8192)
curr LayerNorm forward: 4.097899778978899ms
apex LayerNorm forward: 4.109480210812762ms
curr LayerNorm backward: 13.041268918896094ms
apex LayerNorm backward: 13.586135944118723ms
Test Plan: buck test mode/dev-nosan caffe2/test:nn -- "LayerNorm"
Reviewed By: houseroad
Differential Revision: D17462420
fbshipit-source-id: d4a67d160bb4eff73ffac64af46c56c3845cf211
2019-10-14 21:26:33 -07:00
Xiaomeng Yang
93ae040ff0
Add gelu activation in pytorch (#20665)
...
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20665
Add gelu activation forward on CPU in pytorch
Compared to the current Python implementation of gelu used in BERT-style models:
def gelu(self, x):
    return x * 0.5 * (1.0 + torch.erf(x / self.sqrt_two))
The torch.nn.functional.gelu function can reduce the forward time from 333ms to 109ms (with MKL) / 112ms (without MKL) for input size = [64, 128, 56, 56] on a devvm.
Reviewed By: zheng-xq
Differential Revision: D15400974
fbshipit-source-id: f606b43d1dd64e3c42a12c4991411d47551a8121
2019-06-02 09:08:47 -07:00
bddppq
de0784510d
Remove disabled_features in hipify
...
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/15098
Reviewed By: ezyang
Differential Revision: D13453762
Pulled By: bddppq
fbshipit-source-id: e177042c78f5bf393163d660c25b80285353853d
2018-12-13 15:43:57 -08:00
Edward Yang
fed8d8975a
Various improvements to hipify_python.py (#13973)
...
Summary:
- Speed up hipify_python.py by blacklisting useless (and quite large)
directory trees that it would otherwise recurse into
- Pass around relative paths instead of absolute paths. This makes it
easier to do filename matches based on the root of the tree.
- Redo the streaming output to contain more useful information
- Make it handle c10/cuda correctly, rewrite c10::cuda to
c10::hip, and the header name from CUDAMathCompat.h to
CUDAHIPCompat.h
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13973
Differential Revision: D13062374
Pulled By: ezyang
fbshipit-source-id: f0858dd18c94d449ff5dbadc22534c695dc0f8fb
2018-11-14 17:11:24 -08:00