pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
rzou	ea141d8134	functional compiled autograd (#144707 ) This PR squashes together the following commits: https://github.com/pytorch/pytorch/pull/144115 https://github.com/pytorch/pytorch/pull/143417 https://github.com/pytorch/pytorch/pull/143405 https://github.com/pytorch/pytorch/pull/143387 https://github.com/pytorch/pytorch/pull/143304 https://github.com/pytorch/pytorch/pull/143296 This is a refactor of compiled autograd to use "functional autograd". The end goal is that it gets compiled autograd's initial capture to stop specializing on Tensor metadata, therefore allowing compiled autograd to better handle Tensor subclasses. For more information, please read the commit messages for each PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144707 Approved by: https://github.com/bdhirsh, https://github.com/xmfan, https://github.com/jansel	2025-01-27 05:20:56 +00:00
Nikita Shulga	82e353fffc	[BE] Use nested namespaces in autograd/templates (#110618 ) As PyTorch can now use C++17 language features Pull Request resolved: https://github.com/pytorch/pytorch/pull/110618 Approved by: https://github.com/soulitzer	2023-10-05 22:05:57 +00:00
Jason Ansel	c902b84e0b	Compiled autograd (#103822 ) This branch: 1) converts the autograd tape into an FX graph 2) caches that conversion using a "shadow" graph 3) compiles and runs the generated FX graph instead of the normal autograd What works currently: 1) Caching, capture, and initial integration 2) Backwards hooks 3) Inlining AotAutograd generated subgraphs 4) torch.compiling the generated FX graph 5) Auto-detecting dynamic shapes based on changes Future work 1) Larger scale testing 1) Boxed calling convention, so memory can be freed incrementally 1) Support hooks on SavedTensor 1) Additional testing by running eager autograd tests under compiled_autograd.enable() Pull Request resolved: https://github.com/pytorch/pytorch/pull/103822 Approved by: https://github.com/ezyang, https://github.com/albanD	2023-07-24 21:12:05 +00:00
albanD	73f009a2aa	refactor manual function definitions (#43711 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43711 this makes them available in forward if needed No change to the file content, just a copy-paste. Test Plan: Imported from OSS Reviewed By: mrshenli Differential Revision: D23454146 Pulled By: albanD fbshipit-source-id: 6269a4aaf02ed53870fadf8b769ac960e49af195	2020-09-02 09:23:21 -07:00
Lucas Roberts	2bede78a05	add qr_backward functionality for wide case (#42216) Summary: Unblocks implementation of https://github.com/pytorch/pytorch/issues/27036. Note that this PR *does not* fix #{27036}. Currently QR decomposition only has support for square and tall (a.k.a. skinny) case. This PR adds functionality for wide A matrix/tensors, includes 3 unit tests for the new case and restructures the `qr_backward` method to use the same Walther method as a helper. cc albanD t-vi I don't have a gpu machine so haven't tested on cuda but everything passes on my local machine in cpu. The basic idea of the PR is noted in the comments in the `Functions.cpp` file but I'll note here too for clarity: let $A_{m,n}$ be a matrix and $m < n$ then partition $A_{m,n}$ as $A_{m,n} = [ X_{m,m} \mid Y_{m,n-m} ]$ and take QR of $X$ and call that one $X = QU$ the $Q$ here from $X$ is the same as the $Q$ from $QR$ on entire $A$ matrix.
Then transform $Y$ with the $Q$ rotation got from $X$ to get $V = Q^{T}Y$ now $R = [ U \mid V ]$ and similarly for the grads of each piece, e.g. if $\bar{A}$ is `grad_A` then $\bar{A} = [ \bar{X} \mid \bar{Y} ]$ and $\bar{R} = [ \bar{U} \mid \bar{V} ]$ and then $\bar{Y} = Q\bar{V}$ and $\bar{V}$ is the `narrow()` of `grad_R`. $\bar{X}$ is calculated very similar to the original Walther formula (exactly the same in the tall and square cases) but is slightly modified here for wide case matrices. Pull Request resolved: https://github.com/pytorch/pytorch/pull/42216 Reviewed By: glaringlee Differential Revision: D23373118 Pulled By: albanD fbshipit-source-id: 3702ba7e7e23923868c02cdb7e10a96036052344	2020-08-31 11:46:45 -07:00
Xiang Gao	a860be898e	[resubmit] Add amax/amin (#43819 ) Summary: Resubmit for landing next week. Pull Request resolved: https://github.com/pytorch/pytorch/pull/43819 Reviewed By: ngimel Differential Revision: D23421906 Pulled By: mruberry fbshipit-source-id: 23dd60d1e365bb1197d660c3bfad7ee07ba3e97f	2020-08-31 04:54:48 -07:00
Nikita Shulga	3f0120edb4	Revert D23360705: [pytorch][PR] Add amax/amin Test Plan: revert-hammer Differential Revision: D23360705 (`bcec8cc3f9`) Original commit changeset: 5bdeb08a2465 fbshipit-source-id: 76a9e199823c7585e55328bad0778bcd8cd49381	2020-08-28 18:01:25 -07:00
Gao, Xiang	bcec8cc3f9	Add amax/amin (#43092) Summary: Add a max/min operator that only return values.
## Some important decision to discuss
| Question | Current State |
|---|---|
| Expose torch.max_values to python? | No |
| Remove max_values and only keep amax? | Yes |
| Should amax support named tensors? | Not in this PR |
## Numpy compatibility
Reference: https://numpy.org/doc/stable/reference/generated/numpy.amax.html
| Parameter | PyTorch Behavior |
|---|---|
| `axis`: None or int or tuple of ints, optional. Axis or axes along which to operate. By default, flattened input is used. If this is a tuple of ints, the maximum is selected over multiple axes, instead of a single axis or all the axes as before. | Named `dim`, behavior same as `torch.sum` (https://github.com/pytorch/pytorch/issues/29137) |
| `out`: ndarray, optional. Alternative output array in which to place the result. Must be of the same shape and buffer length as the expected output. | Same |
| `keepdims`: bool, optional. If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the input array. | implemented as `keepdim` |
| `initial`: scalar, optional. The minimum value of an output element. Must be present to allow computation on empty slice. | Not implemented in this PR. Better to implement for all reductions in the future. |
| `where`: array_like of bool, optional. Elements to compare for the maximum. | Not implemented in this PR. Better to implement for all reductions in the future. |
Note from numpy:
> NaN values are propagated, that is if at least one item is NaN, the corresponding max value will be NaN as well. To ignore NaN values (MATLAB behavior), please use nanmax.
PyTorch has the same behavior Pull Request resolved: https://github.com/pytorch/pytorch/pull/43092 Reviewed By: ngimel Differential Revision: D23360705 Pulled By: mruberry fbshipit-source-id: 5bdeb08a2465836764a5a6fc1a6cc370ae1ec09d	2020-08-28 12:51:03 -07:00
Xiang Gao	348e78b086	Evenly distribute output grad into all matching inputs for min/max/median (#43519 ) Summary: cc: ngimel mruberry Pull Request resolved: https://github.com/pytorch/pytorch/pull/43519 Reviewed By: albanD Differential Revision: D23312235 Pulled By: ngimel fbshipit-source-id: 678bda54996df7f29acf96add928bb7042fc2069	2020-08-25 16:36:33 -07:00
Xiaomeng Yang	4ae832e106	Optimize SiLU (Swish) op in PyTorch (#42976 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42976 Optimize SiLU (Swish) op in PyTorch. Some benchmark result input = torch.rand(1024, 32768, dtype=torch.float, device="cpu") forward: 221ms -> 133ms backward: 600ms -> 170ms input = torch.rand(1024, 32768, dtype=torch.double, device="cpu") forward: 479ms -> 297ms backward: 1438ms -> 387ms input = torch.rand(8192, 32768, dtype=torch.float, device="cuda") forward: 24.34ms -> 9.83ms backward: 97.05ms -> 29.03ms input = torch.rand(4096, 32768, dtype=torch.double, device="cuda") forward: 44.24ms -> 30.15ms backward: 126.21ms -> 49.68ms Test Plan: buck test mode/dev-nosan //caffe2/test:nn -- "SiLU" Reviewed By: houseroad Differential Revision: D23093593 fbshipit-source-id: 1ba7b95d5926c4527216ed211a5ff1cefa3d3bfd	2020-08-16 13:21:57 -07:00
kshitij12345	ab0a04dc9c	Add `torch.nansum` (#38628 ) Summary: Reference: https://github.com/pytorch/pytorch/issues/38349 Pull Request resolved: https://github.com/pytorch/pytorch/pull/38628 Reviewed By: VitalyFedyunin Differential Revision: D22860549 Pulled By: mruberry fbshipit-source-id: 87fcbfd096d83fc14b3b5622f2301073729ce710	2020-08-11 22:26:04 -07:00
Sebastian Messmer	1542c41a67	Change C++ frontend to take optional<Tensor> arguments (#41947 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41947 Previously, if an op took an optional `Tensor?` argument, the C++ frontend (i.e. `at::op()` and `Tensor::op()`) were generated to take `Tensor`. A previous PR (https://github.com/pytorch/pytorch/pull/41610) changed the kernels to be written with `c10::optional<Tensor>` instead of `Tensor`, but that did not touch the C++ frontend yet. This PR changes the C++ frontend API to take `c10::optional<Tensor>` instead of `Tensor` as well. This should be mostly bc conserving. Since `Tensor` implicitly converts to `c10::optional<Tensor>`, any old code calling an op with a `Tensor` would still work. There are likely corner cases that get broken though. For example, C++ only ever does one implicit conversion. So if you call an op with a non-tensor object that gets implicitly converted to a `Tensor`, then that previously worked since the API took a `Tensor` and C++ allows one implicit conversion. Now it wouldn't work anymore because it would require two implicit conversions (to `Tensor` and then to `c10::optional<Tensor>`) and C++ doesn't do that. The main reasons for doing this are - Make the C++ API more sane. Those arguments are optional and that should be visible from the signature. - Allow easier integration for XLA and Autocast. Those backends generate code to wrap operators and forward operator arguments to calls to at::op(). After https://github.com/pytorch/pytorch/pull/41610, there was a mismatch because they had to implement operators with `optional<Tensor>` but call `at::op()` with `Tensor`, so they had to manually convert between those. After this PR, they can just forward the `optional<Tensor>` in their call to `at::op()`. 
ghstack-source-id: 108873705 Test Plan: unit tests Reviewed By: bhosmer Differential Revision: D22704832 fbshipit-source-id: f4c00d457b178fbc124be9e884a538a3653aae1f	2020-07-31 16:11:55 -07:00
Vinnam Kim	825a387ea2	Fix bug on the backpropagation of LayerNorm when create_graph=True (#41595 ) Summary: Solve an issue https://github.com/pytorch/pytorch/issues/41332 I found the bug at https://github.com/pytorch/pytorch/issues/41332 is caused by LayerNorm. Current implementations of LayerNorm have a disparity between 1. [`create_graph=False` CUDA implementation](`dde3d5f4a8/aten/src/ATen/native/cuda/layer_norm_kernel.cu (L145)`) 2. [`create_graph=True` implementation](`dde3d5f4a8/tools/autograd/templates/Functions.cpp (L2536)`) With this bug-fix, https://github.com/pytorch/pytorch/issues/41332 is solved. Ailing BIT-silence Signed-off-by: Vinnam Kim <vinnamkim@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/41595 Reviewed By: houseroad Differential Revision: D22598415 Pulled By: BIT-silence fbshipit-source-id: 63e390724bd935dc8e028b4dfb75d34a80558c3a	2020-07-22 00:19:12 -07:00
Xiaomeng Yang	80d5b3785b	Add torch.logit function (#41062 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41062 Add torch.logit function Test Plan: buck test mode/dev-nosan //caffe2/test:torch -- "logit" Reviewed By: hl475 Differential Revision: D22406912 fbshipit-source-id: b303374f4c68850eb7477eb0645546a24b844606	2020-07-13 19:33:20 -07:00
Kurt Mohler	bba30d1bd8	Add undefined tensor gradient support to all backward functions (#39400 ) Summary: Adds the ability for all backward functions to accept undefined output gradient arguments. An undefined gradient is a Tensor that was created by the argumentless constructor `at::Tensor()`, where `tensor.defined() == false`. Also adds new autograd nodes, UndefinedGrad and UndefinedGradBackward, that can be used from within Python code to inject undefined gradients into a backward function. A new test case is added to the backward function unit tests to use the UndefinedGrad node to ensure that undefined gradients do not break any backward functions. Closes https://github.com/pytorch/pytorch/issues/33138 Pull Request resolved: https://github.com/pytorch/pytorch/pull/39400 Differential Revision: D21936588 Pulled By: albanD fbshipit-source-id: eccc5f55c77babe6dadcea4249d0c68a3c64e85d	2020-06-08 14:13:53 -07:00
Xiaomeng Yang	03eca384fd	Optimize GroupNorm on CPU (#28203 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/28203 Optimize GroupNorm on CPU ghstack-source-id: 105149765 Test Plan: buck test mode/dev-nosan caffe2/test:nn -- "GroupNorm" Reviewed By: houseroad Differential Revision: D17901506 fbshipit-source-id: 5eb22ad0e8a9ab2533282b967b2818f690e48865	2020-06-03 23:52:16 -07:00
Aayush Naik	0829cadca3	Implement rad2deg, deg2rad (#38852 ) Summary: Resolves https://github.com/pytorch/pytorch/issues/38372. cc mruberry Pull Request resolved: https://github.com/pytorch/pytorch/pull/38852 Differential Revision: D21868935 Pulled By: mruberry fbshipit-source-id: ae6ded11b743c9d1cdc032984b4abe0a115290d6	2020-06-03 22:21:54 -07:00
kshitij12345	3487744821	Add `torch.logcumsumexp` (#36308 ) Summary: Creating new PR as I am unable to push to pandeykartikey 's branch as I don't have the permissions. Closes https://github.com/pytorch/pytorch/issues/26411 Based on https://github.com/pytorch/pytorch/issues/32876 Thanks pandeykartikey for starting this out. Have addressed the comments. anjali411 agadetsky albanD Pull Request resolved: https://github.com/pytorch/pytorch/pull/36308 Differential Revision: D21648573 Pulled By: albanD fbshipit-source-id: bc1a8fc4ab474a1148298117a1549b0e46f7c3ff	2020-05-21 09:12:31 -07:00
Nik Ved	a7c29dbfa2	`unfold_backward` gets its own kernel (#36612 ) Summary: `unfold_backward` uses `index_add` which causes regression on CUDA because of the underlying `atomicAdd`, and regression on CPU because of limited parallelization. This PR attempts to replace `index_add` with a custom kernel. Fixes [https://github.com/pytorch/pytorch/issues/17501](https://github.com/pytorch/pytorch/issues/17501). Pull Request resolved: https://github.com/pytorch/pytorch/pull/36612 Differential Revision: D21450349 Pulled By: albanD fbshipit-source-id: 09ec1fbd5d7290656700eca8e7fb7cf52323ec28	2020-05-08 13:18:36 -07:00
Nik Ved	8434247653	modify `select_equals_backward` to propagate only to a single value (#36316) Summary: Renames `select_equals_backward` to `select_first_equal_backward` and makes sure it propagates to a single value. Fixes [https://github.com/pytorch/pytorch/issues/35699](https://github.com/pytorch/pytorch/issues/35699). Pull Request resolved: https://github.com/pytorch/pytorch/pull/36316 Differential Revision: D21403848 Pulled By: albanD fbshipit-source-id: b260cd79289162ee5733887d2afe8203945baee6	2020-05-06 10:50:24 -07:00
Emilio Castillo	25ba802ce4	Fix `cdist` backward calculation for `p=2` (#37337) Summary: Closes https://github.com/pytorch/pytorch/issues/37154 Fixes a bug in `cdist` backward with `p=2`. Under some circumstances, if the output has 0s, the gradient calculation of `sqrt` is undefined, leading to NaNs in the input gradients. This PR defines a subgradient for this case. A test is also added to verify this behavior. I was only able to reproduce it under certain shapes, so the shape is explicitly taken from the example in https://github.com/pytorch/pytorch/issues/37154. Pull Request resolved: https://github.com/pytorch/pytorch/pull/37337 Differential Revision: D21403178 Pulled By: albanD fbshipit-source-id: deef9678c1958524b552504920f19617f9ad1da6	2020-05-05 14:13:37 -07:00
Nikita Shulga	a5af478f29	Use full include path in autogenerated Functions.cpp (#35924 ) Summary: Preliminary step to merge https://github.com/pytorch/pytorch/pull/35220 Pull Request resolved: https://github.com/pytorch/pytorch/pull/35924 Test Plan: CI Differential Revision: D20832159 Pulled By: malfet fbshipit-source-id: 29ff2e3c04c08c39c49f35414f94b76f0651859a	2020-04-02 22:46:09 -07:00
Nik Ved	35cdb78522	Make kl_div accept target in log space (#34586 ) Summary: Fixes [32520](https://github.com/pytorch/pytorch/issues/32520), implements [34536](https://github.com/pytorch/pytorch/issues/34536). Here are some benchmarks: ```python import torch import torch.nn.functional as F from IPython import get_ipython ipython = get_ipython() torch.set_num_threads(1) for d in [5, 10, 20, 50, 100, 1000]: i = torch.rand(d, d) t = torch.rand(d, d) print(f"Size: {d}x{d}") ipython.magic("timeit F.kl_div(i, t, reduction='none', log_target=False)") ipython.magic("timeit F.kl_div(i, t.log(), reduction='none', log_target=True)") ``` Output: ``` Size: 5x5 16 µs ± 33 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 8.24 µs ± 17.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) Size: 10x10 16.7 µs ± 17.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 8.7 µs ± 20.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) Size: 20x20 17.7 µs ± 47.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 9.7 µs ± 28.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) Size: 50x50 23.6 µs ± 60.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 15 µs ± 33.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) Size: 100x100 42.8 µs ± 223 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 34 µs ± 17.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) Size: 1000x1000 3.9 ms ± 1.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 3.45 ms ± 364 ns per loop (mean ± std. dev. of 7 runs, 100 loops each) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/34586 Differential Revision: D20652726 Pulled By: ezyang fbshipit-source-id: 480697b4cd01341bbeee7514a8b812705a0600ea	2020-04-01 12:26:58 -07:00
Michael Ranieri	51d969e86a	preprocessor cleanup (#33957 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33957 lots of small preprocessor warning cleanup for windows Test Plan: CI green Reviewed By: malfet, albanD Differential Revision: D20153582 fbshipit-source-id: 18fd61c466fd1f55ededdae4448b3009a9cedc04	2020-03-02 13:37:19 -08:00
mfkasim91	9d94f56ce0	Backward operation of torch.eig for real eigenvalues (#33090) Summary: Another pull request to follow up issue https://github.com/pytorch/pytorch/issues/32531. Here I implemented the backward operation for `torch.eig` under the condition that all the eigenvalues are real. This pull request is independent of my other pull request https://github.com/pytorch/pytorch/issues/32932; neither PR depends on the other. Pull Request resolved: https://github.com/pytorch/pytorch/pull/33090 Differential Revision: D19814347 Pulled By: albanD fbshipit-source-id: 2fae30964e97987abb690544df8240aedeae56e8	2020-02-10 09:52:56 -08:00
anjali411	5b815d980e	Added cummin Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32238 Differential Revision: D19416791 Pulled By: anjali411 fbshipit-source-id: 5aadc0a7a55af40d76f444ab7d7d47ec822f55a5	2020-01-17 10:51:58 -08:00
anjali411	8dc67a014f	Add cummax Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32169 Differential Revision: D19393236 Pulled By: anjali411 fbshipit-source-id: 5dac6b0a4038eb48458d4a0b253418daeccbb6bc	2020-01-14 17:19:10 -08:00
Alexander Golynski	b783a75aa3	Fix scalar^tensor derivative for scalars that are zero Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32063 Test Plan: Imported from OSS Differential Revision: D19394258 Pulled By: agolynski fbshipit-source-id: 3eed0f9cc1b8c677c6948c927d007044be67fe7f	2020-01-14 11:11:23 -08:00
Alexander Golynski	fa60e1150d	Fix tensor^tensor derivative for 0 base entries Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32062 Test Plan: Imported from OSS Differential Revision: D19394259 Pulled By: agolynski fbshipit-source-id: 836525e03573af838511ad5b4cc87ec2c1536a5e	2020-01-14 11:10:25 -08:00
leetanenbaum	5988d36f58	Fix cumprod error for tensors with zero elements (#32070 ) Summary: Currently cumprod crashes for tensors with non-empty dimensions but with zero elements, which could happen when some dimension is zero. This commit fixes the error by checking both dim() and numel() in cumprod backward Pull Request resolved: https://github.com/pytorch/pytorch/pull/32070 Differential Revision: D19373200 Pulled By: ezyang fbshipit-source-id: d8ecde33f3330b40a7c611f6faa3b1d707ef2a9a	2020-01-13 09:50:27 -08:00
leetanenbaum	0b9cd410a9	Fix cumsum error for tensors with zero elements (#31694 ) Summary: Currently `cumsum` crashes for tensors with non-empty dimensions but with zero elements, which could happen when some dimension is zero. This commit fixes the error by checking both `dim()` and `numel()` in cumsum backward Fixes https://github.com/pytorch/pytorch/issues/31515 Pull Request resolved: https://github.com/pytorch/pytorch/pull/31694 Reviewed By: mrshenli Differential Revision: D19266613 Pulled By: leedtan fbshipit-source-id: 9407e0aa55440fed911c01a3580bb6c5eab62a16	2020-01-03 10:16:46 -08:00
Vitaly Fedyunin	e46babb637	explicitly provide memory format when calling to *_like operators Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30007 Test Plan: Imported from OSS Differential Revision: D18575982 Pulled By: VitalyFedyunin fbshipit-source-id: 83be0857fe1080216cd09547a2b3d34455a0cce4	2019-11-19 16:19:24 -08:00
Vitaly Fedyunin	dc9e7b73e1	explicitly provide memory format when calling to *_like operators (Redo of `e3e06549`) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30004 Test Plan: Imported from OSS Differential Revision: D18575977 Pulled By: VitalyFedyunin fbshipit-source-id: 344e9a11c93c7e4a822f424c94fa2255592d118e	2019-11-19 16:19:11 -08:00
Vitaly Fedyunin	20b73e1805	explicitly provide memory format when calling to *_like operators (Redo of 631b22d) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30001 Test Plan: Imported from OSS Differential Revision: D18575979 Pulled By: VitalyFedyunin fbshipit-source-id: d6fe8a6e1b45673f85a0dd49bd6becfadc5091b4	2019-11-19 16:18:58 -08:00
Igor Fedan	75309b45f3	explicitly provide memory format when calling to clone() at Indexing.cpp Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/28660 Test Plan: Imported from OSS Differential Revision: D18333346 Pulled By: ifedan fbshipit-source-id: 06590205d883a5096388a4ae318389244130972d	2019-11-07 05:38:32 -08:00
Vitaly Fedyunin	e3e06549c1	Autogenerated contiguous memory format for old *_like calls Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/29225 Test Plan: Imported from OSS Differential Revision: D18330964 Pulled By: VitalyFedyunin fbshipit-source-id: f357a0cc125bd90a62575bd461722b9e36e75cbf	2019-11-06 07:24:34 -08:00
Vitaly Fedyunin	d410fc5a81	Autogenerated contiguous memory format for old *_like calls Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/29222 Test Plan: Imported from OSS Differential Revision: D18330966 Pulled By: VitalyFedyunin fbshipit-source-id: 9e8da4e826cc43fac9828737ef744606491812a4	2019-11-06 07:24:21 -08:00
Xiaomeng Yang	2460dced8f	Add torch.nn.GELU for GELU activation (#28944 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/28944 Add torch.nn.GELU for GELU activation Test Plan: buck test mode/dev-nosan //caffe2/test:nn -- "GELU" Reviewed By: hl475, houseroad Differential Revision: D18240946 fbshipit-source-id: 6284b30def9bd4c12bf7fb2ed08b1b2f0310bb78	2019-11-03 21:55:05 -08:00
titaneric	82f31e02a3	Remove the redundant calculation of derivative of power function (#28651) Summary: Hi, I noticed that PyTorch faces the same issue as HIPS/autograd#541. I tried to solve it and hope it helps. Pull Request resolved: https://github.com/pytorch/pytorch/pull/28651 Reviewed By: gchanan Differential Revision: D18137163 Pulled By: albanD fbshipit-source-id: 888bef65c72c4c15c2acdd4b13d5041008b1354e	2019-10-28 08:37:04 -07:00
Igor Fedan	bc57967e07	max_pool2d cuda should have channel last optimized kernels[Performance improvement] (#24872 ) Summary: max_pool2d_with_indices_cuda and max_pool2d_with_indices_backward_cuda should have channel last optimized kernels(https://github.com/pytorch/pytorch/issues/23815) Pull Request resolved: https://github.com/pytorch/pytorch/pull/24872 Differential Revision: D16964577 Pulled By: ifedan fbshipit-source-id: 296dfef8e511a7ae2ed423e34e902d5401b3becb	2019-10-21 11:28:12 -07:00
Lu Fang	e9a91756cd	Back out "[pytorch][PR] Migrate soft_margin_loss from the TH to Aten (CUDA+CPU)" Summary: Original commit changeset: 9ddffe4dbbfa Test Plan: ci Reviewed By: yf225 Differential Revision: D17939581 fbshipit-source-id: 44a3b843bf1e7059fec57b9e3d12ed4886816145	2019-10-15 21:12:10 -07:00
Edward Yang	2aa84d927b	Revert D17939700: Revert D17889288: [pytorch][PR] Migrate soft_margin_loss from the TH to Aten (CUDA+CPU) Test Plan: revert-hammer Differential Revision: D17939700 Original commit changeset: 4fc6156ba388 fbshipit-source-id: dded0a2140d2c14cd2f2a574987ecc164b0e5bfe	2019-10-15 15:24:36 -07:00
Edward Yang	c44e33b578	Revert D17889288: [pytorch][PR] Migrate soft_margin_loss from the TH to Aten (CUDA+CPU) Test Plan: revert-hammer Differential Revision: D17889288 Original commit changeset: 9ddffe4dbbfa fbshipit-source-id: 4fc6156ba38834512b2f735ac0d03e34e69b7286	2019-10-15 14:35:28 -07:00
Thomas Viehmann	f461184505	Use grad_out for cudnn CTC loss (#27039 ) Summary: Using grad_out for CuDNN CTC loss fixes: https://github.com/pytorch/pytorch/issues/26797, https://github.com/pytorch/pytorch/issues/25833. We also fix a cudnn incompatible change that surfaced during the testing: As of CuDNN 7.6 the semantics of the CTC loss gradients are different. This leads us to disable CuDNN CTC for CuDNN < 7.6. To mitigate the impact on users, we convert the parameters for the native implementation if CuDNN isn't applicable (previously this would give an error.) Pull Request resolved: https://github.com/pytorch/pytorch/pull/27039 Differential Revision: D17910815 Pulled By: ngimel fbshipit-source-id: 465b33612d3402f10c355aa7026a7e1ffaef3073	2019-10-15 11:36:37 -07:00
Divyansh Singhvi	3397d41b8a	Wrapping namespace Reduction in namespace at (#26606 ) (#27422 ) Summary: 1) Wrapped namespace `Reduction` in namespace `at` 2) Prefixed `at::` wherever `Reduction::` is used Pull Request resolved: https://github.com/pytorch/pytorch/pull/27422 Differential Revision: D17913759 Pulled By: yf225 fbshipit-source-id: 8f00ca01cad2e7f673d316b128abf59c026e216c	2019-10-15 11:05:40 -07:00
Andreas Koepf	9033ace9c4	Migrate soft_margin_loss from the TH to Aten (CUDA+CPU) (#27673 ) Summary: Replaces fused TH kernels with a 2-liner of regular Tensor functions. Benchmarking revealed that performance improves compared to PyTorch 1.2. Refs: https://github.com/pytorch/pytorch/issues/24631, https://github.com/pytorch/pytorch/issues/24632, https://github.com/pytorch/pytorch/issues/24764, https://github.com/pytorch/pytorch/issues/24765 VitalyFedyunin ### Benchmarking results on my laptop: ## 1.4.0a0+f63c9e8 output ``` PyTorch version: 1.4.0a0+f63c9e8 CPU Operator sanity check: tensor(0.5926, grad_fn=<MeanBackward0>) tensor([-0.0159, -0.0170, -0.0011, -0.0083, -0.0140, -0.0217, -0.0290, -0.0262, -0.0078, -0.0129]) double backward tensor(-0.1540, grad_fn=<SumBackward0>) ok GPU Operator sanity check: tensor(0.5601, device='cuda:0', grad_fn=<MeanBackward0>) tensor([-0.0393, -0.0316, -0.0233, -0.0140, -0.0141, -0.0161, -0.0322, -0.0238, -0.0054, -0.0151], device='cuda:0') double backward tensor(-0.2148, device='cuda:0', grad_fn=<SumBackward0>) ok CPU warmup 1000 took 9.025700273923576e-05 CPU warmup 10000 took 0.0009383050055475906 CPU warmup 100000 took 0.0015631120040779933 CPU warmup TOTAL time 0.0026368020044174045 CPU forward 1000 took 6.919399311300367e-05 CPU forward 10000 took 0.00014462800754699856 CPU forward 100000 took 0.0011234670091653243 CPU forward 1000000 took 0.014555767003912479 CPU forward 10000000 took 0.13409724000666756 CPU forward 100000000 took 1.246048310000333 CPU forward TOTAL time 1.3961777170043206 CPU for- & backward 1000 took 0.0003219560021534562 CPU for- & backward 10000 took 0.00037290599721018225 CPU for- & backward 100000 took 0.001975035003852099 CPU for- & backward 1000000 took 0.02621342398924753 CPU for- & backward 10000000 took 0.2944270490115741 CPU for- & backward 100000000 took 1.6856628700043075 CPU for- & backward TOTAL time 2.0091958299890393 GPU warmup 1000 took 0.0002462909906171262 GPU warmup 10000 took 
9.991199476644397e-05 GPU warmup 100000 took 0.00034347400651313365 GPU warmup TOTAL time 0.0007382350013358518 GPU forward 1000 took 9.67290106927976e-05 GPU forward 10000 took 9.349700121674687e-05 GPU forward 100000 took 9.384499571751803e-05 GPU forward 1000000 took 0.0004975290066795424 GPU forward 10000000 took 0.0017606960027478635 GPU forward 100000000 took 0.003572814996005036 GPU forward TOTAL time 0.006185991995153017 GPU for- & backward 1000 took 0.00035818999458570033 GPU for- & backward 10000 took 0.0003240450023440644 GPU for- & backward 100000 took 0.0003223370003979653 GPU for- & backward 1000000 took 0.00036740700306836516 GPU for- & backward 10000000 took 0.0003690610028570518 GPU for- & backward 100000000 took 0.0003672500024549663 GPU for- & backward TOTAL time 0.002197896988946013 ``` ## 1.2 output ``` PyTorch version: 1.2.0 CPU Operator sanity check: tensor(0.5926, grad_fn=<SoftMarginLossBackward>) tensor([-0.0159, -0.0170, -0.0011, -0.0083, -0.0140, -0.0217, -0.0290, -0.0262, -0.0078, -0.0129]) double backward tensor(-0.1540, grad_fn=<SumBackward0>) ok GPU Operator sanity check: tensor(0.5601, device='cuda:0', grad_fn=<SoftMarginLossBackward>) tensor([-0.0393, -0.0316, -0.0233, -0.0140, -0.0141, -0.0161, -0.0322, -0.0238, -0.0054, -0.0151], device='cuda:0') double backward tensor(-0.2148, device='cuda:0', grad_fn=<SumBackward0>) ok CPU warmup 1000 took 8.422900282312185e-05 CPU warmup 10000 took 0.00036992700188420713 CPU warmup 100000 took 0.003682684007799253 CPU warmup TOTAL time 0.004169487991021015 CPU forward 1000 took 5.521099956240505e-05 CPU forward 10000 took 0.00036948200431652367 CPU forward 100000 took 0.003762389998883009 CPU forward 1000000 took 0.03725024699815549 CPU forward 10000000 took 0.3614480490068672 CPU forward 100000000 took 3.6139175269927364 CPU forward TOTAL time 4.016912263003178 CPU for- & backward 1000 took 0.0002734809968387708 CPU for- & backward 10000 took 0.0006605249946005642 CPU for- & backward 100000 
took 0.005437346000690013 CPU for- & backward 1000000 took 0.051245586000732146 CPU for- & backward 10000000 took 0.5291594529990107 CPU for- & backward 100000000 took 5.23841712900321 CPU for- & backward TOTAL time 5.8253340990049765 GPU warmup 1000 took 0.0005757809994975105 GPU warmup 10000 took 0.0004058420017827302 GPU warmup 100000 took 0.0003764610009966418 GPU warmup TOTAL time 0.0013992580061312765 GPU forward 1000 took 0.0003543390048434958 GPU forward 10000 took 0.0003633670130511746 GPU forward 100000 took 0.0004807310033356771 GPU forward 1000000 took 0.0005875999922864139 GPU forward 10000000 took 0.0016903509967960417 GPU forward 100000000 took 0.014400018990272656 GPU forward TOTAL time 0.0179396449966589 GPU for- & backward 1000 took 0.0006167769897729158 GPU for- & backward 10000 took 0.0006845899915788323 GPU for- & backward 100000 took 0.000631830989732407 GPU for- & backward 1000000 took 0.0010741150035755709 GPU for- & backward 10000000 took 0.0017265130009036511 GPU for- & backward 100000000 took 0.014847910992102697 GPU for- & backward TOTAL time 0.01965981800458394 ``` ### Code used for performance test ``` import torch import torch.nn.functional as F import torch.nn as nn from timeit import default_timer torch.manual_seed(0) cpu = torch.device('cpu') gpu = torch.device('cuda') loss_fn = F.soft_margin_loss def run_benchmark(name, depth, require_grad, device, fn): total_start = default_timer() for i in range(3, 3 + depth): start = default_timer() n = 10 ** i a = torch.rand(n, requires_grad=require_grad, device=device) b = torch.rand(n, device=device) fn(a, b) end = default_timer() print('{} {} took {}'.format(name, n, end-start)) total_end = default_timer() print('{} TOTAL time {}'.format(name, total_end-total_start)) def fwd_only(a, b): out = loss_fn(a, b) def fwd_bck(a, b): out = loss_fn(a, b) out.backward() def sanity_check(name, device): print('{} Operator sanity check:'.format(name)) a = torch.rand(10, requires_grad=True, device=device) 
b = torch.rand(10, device=device) out = loss_fn(a,b) print(out) out.backward() print(a.grad) print('double backward') loss = loss_fn(a, b) loss2 = torch.autograd.grad(loss, a, create_graph=True) z = loss2[0].sum() print(z) z.backward() print('ok') print() print('PyTorch version:', torch.__version__) sanity_check('CPU', cpu) sanity_check('GPU', gpu) print() run_benchmark('CPU warmup', 3, False, cpu, fwd_only) run_benchmark('CPU forward', 6, False, cpu, fwd_only) run_benchmark('CPU for- & backward', 6, True, cpu, fwd_bck) print() run_benchmark('GPU warmup', 3, False, gpu, fwd_only) run_benchmark('GPU forward', 6, False, gpu, fwd_only) run_benchmark('GPU for- & backward', 6, True, gpu, fwd_bck) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/27673 Differential Revision: D17889288 Pulled By: ezyang fbshipit-source-id: 9ddffe4dbbfab6180847a8fec32443910f18f0a9	2019-10-15 08:44:57 -07:00
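The "2-liner of regular Tensor functions" mentioned in the soft_margin_loss commit above is not quoted in the message. As a rough, framework-free sketch of what such a formulation computes, assuming the standard definition of the loss as the mean of `log(1 + exp(-y * x))`: the helper names `soft_margin_loss` and `soft_margin_loss_grad` below are hypothetical, plain Python lists stand in for tensors, and this is not the ATen code from the PR.

```python
import math


def soft_margin_loss(inputs, targets):
    """Mean over all elements of log(1 + exp(-y * x)) (hypothetical helper)."""
    if not inputs or len(inputs) != len(targets):
        raise ValueError("inputs and targets must be non-empty and equally long")
    total = sum(math.log1p(math.exp(-y * x)) for x, y in zip(inputs, targets))
    return total / len(inputs)


def soft_margin_loss_grad(inputs, targets):
    """Gradient of the mean loss with respect to each input element.

    d/dx log(1 + exp(-y * x)) = -y / (1 + exp(y * x)); the mean adds a 1/n factor.
    """
    n = len(inputs)
    return [-y / (1.0 + math.exp(y * x)) / n for x, y in zip(inputs, targets)]


if __name__ == "__main__":
    xs = [0.2, -0.5, 1.5]
    ys = [1.0, -1.0, 1.0]
    print(soft_margin_loss(xs, ys))
    print(soft_margin_loss_grad(xs, ys))
```

The sketch is not numerically hardened: `math.exp` overflows for very large arguments, which a production kernel would have to guard against.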
Xiaomeng Yang	8b87f9a510	Add fused layer norm impl on CUDA in PyTorch (#27634 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27634 Add fused layer norm impl on CUDA in PyTorch Performance benchmark compared to apex.FusedLayerNorm on a V100 machine. ************************************ Shape = (128, 2097152) curr LayerNorm forward: 7.252584544941783ms apex LayerNorm forward: 10.366813436849043ms curr LayerNorm backward: 15.568048988003284ms apex LayerNorm backward: 20.869979876093566ms ********************************** Shape = (256, 1048576) curr LayerNorm forward: 5.185673736967146ms apex LayerNorm forward: 6.3868385690730065ms curr LayerNorm backward: 13.942008479032665ms apex LayerNorm backward: 15.469660016940907ms ********************************** Shape = (512, 524288) curr LayerNorm forward: 4.672068868065253ms apex LayerNorm forward: 4.717993081081659ms curr LayerNorm backward: 13.46354596503079ms apex LayerNorm backward: 14.04774487693794ms ********************************** Shape = (1024, 262144) curr LayerNorm forward: 4.547273400006816ms apex LayerNorm forward: 5.378365494078025ms curr LayerNorm backward: 13.425063178874552ms apex LayerNorm backward: 14.235145597020164ms ********************************** Shape = (2048, 131072) curr LayerNorm forward: 4.526399010093883ms apex LayerNorm forward: 4.775081946980208ms curr LayerNorm backward: 13.222738380078226ms apex LayerNorm backward: 13.59594238596037ms ********************************** Shape = (4096, 65536) curr LayerNorm forward: 4.28789056581445ms apex LayerNorm forward: 4.48913648002781ms curr LayerNorm backward: 13.026655421825126ms apex LayerNorm backward: 13.57052089786157ms ********************************** Shape = (8192, 32768) curr LayerNorm forward: 4.243518367875367ms apex LayerNorm forward: 4.34588153520599ms curr LayerNorm backward: 13.140627697808668ms apex LayerNorm backward: 13.49891544203274ms ********************************** Shape = (16384, 16384) curr 
LayerNorm forward: 4.181216162163764ms apex LayerNorm forward: 4.268723972840235ms curr LayerNorm backward: 13.035593512002379ms apex LayerNorm backward: 13.463351831072941ms ************************************ Shape = (32768, 8192) curr LayerNorm forward: 4.097899778978899ms apex LayerNorm forward: 4.109480210812762ms curr LayerNorm backward: 13.041268918896094ms apex LayerNorm backward: 13.586135944118723ms Test Plan: buck test mode/dev-nosan caffe2/test:nn -- "LayerNorm" Reviewed By: houseroad Differential Revision: D17462420 fbshipit-source-id: d4a67d160bb4eff73ffac64af46c56c3845cf211	2019-10-14 21:26:33 -07:00
Anjali Chourdia	da669c25ee	autograd: double backwards function for binary_cross_entropy loss Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26983 Reviewed By: albanD Differential Revision: D17714357 Pulled By: anjali411 fbshipit-source-id: cebfe09a9048c4be457b7f2718bc396c06ecabee	2019-10-04 08:29:22 -07:00
Vishwak Srinivasan	c643290982	Add derivative for cholesky_inverse (#26451 ) Summary: Changelog: - Add derivative of cholesky_inverse. The equations are derived akin to the derivative of solve methods using the technique detailed [here](https://people.maths.ox.ac.uk/gilesm/files/NA-08-01.pdf) Pull Request resolved: https://github.com/pytorch/pytorch/pull/26451 Test Plan: - Added tests for cholesky_inverse in test_autograd.py Closes https://github.com/pytorch/pytorch/issues/4669. Differential Revision: D17548526 Pulled By: ezyang fbshipit-source-id: 51aa8b900a8dc4012b01a73d432606f216f62c9d	2019-09-24 07:12:41 -07:00
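The cholesky_inverse commit above does not spell out its equations. One way to sketch the reverse-mode rule from standard matrix-calculus identities is given below. It assumes a lower-triangular factor $L$ with $A = L L^{\top}$, writes $B = A^{-1}$ for the output and $G = \partial \ell / \partial B$ for the incoming gradient, and treats $L$ as an unconstrained matrix. The symbols are introduced here for illustration and are not taken from the PR.

```latex
\begin{align*}
  dB &= -B \, dA \, B, \qquad dA = dL \, L^{\top} + L \, dL^{\top} \\
  d\ell &= \operatorname{tr}\!\left(G^{\top} dB\right)
         = \operatorname{tr}\!\left(\bar{A}^{\top} dA\right),
         \qquad \bar{A} = -B^{\top} G \, B^{\top} = -B \, G \, B \\
  \bar{L} &= \left(\bar{A} + \bar{A}^{\top}\right) L
           = -B \left(G + G^{\top}\right) B \, L
\end{align*}
```

The second equality for $\bar{A}$ uses the symmetry of $B$. The upper-triangular case would follow the same steps with $A = U^{\top} U$.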
Edward Yang	9b7011c5c2	Implement multiple dispatch (#26468 ) (#26501 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26501 Instead of considering only the TensorTypeSet of the first argument, we collect all Tensor and TensorList arguments and union them together before computing the dispatch type id. XLA companion patch at https://github.com/pytorch/xla/pull/1031 Billing of changes: * ATenDispatch fallback code (i.e., what gets run if there is no entry for a function in the table) now lives out-of-line in a function `getFallbackOp`. This gave me an opportunity to write a more detailed error message, providing information about what registrations were available. There is a TODO in the fallback code, suggesting that we could automatically redispatch in the event that there is no handler for the key. But this is a bit of a design question, because it's not clear if automatic redispatch would cover up errors in the dispatch table (i.e., there should have been something registered at some key, but there wasn't.) * Collection of Tensor/TensorList arguments is done using the trusty old IterArgs helper class. A minor bit of refactoring I had to do to get here was move the IterArgs functionality in torch/csrc/utils/variadic.h into ATen/core. There's some refactoring due on that file too (it has copies of some C++ helper pieces which already live in c10--you can't actually move the whole thing because it is literally incompatible with other code in the codebase). So instead of calling `type_set()` to get the type set of the dispatch argument, now we just call `at::detail::multi_dispatch_tensor_type_set` on all of the tensor/tensor list arguments. * The code generator is adjusted to codegen collection of arguments as needed. There is a little bit of a hack in the code generator to turn 'self' arguments into 'this'. I think this may be duplicated with some logic somewhere else but I have to double check. 
The new generated code looks like this: ``` inline Tensor & Tensor::copy_(const Tensor & src, bool non_blocking) const { static auto table = globalATenDispatch().getOpTable("aten::copy_(Tensor(a!) self, Tensor src, bool non_blocking=False) -> Tensor(a!)"); return table->getOp<Tensor & (Tensor &, const Tensor &, bool)>(at::detail::multi_dispatch_tensor_type_set(*this, src))(const_cast<Tensor&>(*this), src, non_blocking); } ``` The key difference is that previously we wrote `type_set()` as the argument to getOp; now it is a call to `multi_dispatch_tensor_type_set` which collects the type ids together. After turning on multi-dispatch, I had to refactor existing code which previously dispatched to one place, but now dispatches somewhere else. The primary component affected by this is sparse. Binary operations (add/sub/mul/div/addmm) now dispatch to sparse kernels even if you did add(dense, sparse). So I delete all the sparse handling code from dense kernels, and bulk up the sparse error handling to handle when the first argument is dense. In the case of addmm, I can eliminate the bridge code entirely (well, not quite: more on this below). I also updated the dispatch on sparse to actually point at sparse kernels. Pay special attention to the handling of `div_` by scalar: previously this logic lived in the "dense" `div_` implementation, but there is actually not any sparse kernel we dispatch to. I solved this particular problem by making a redispatch, but another valid approach would have been to add specific dispatches for sparse div on scalar. This codepath is poorly tested because it is only exercised from C++. * One minor annoyance is that because I now want separate dispatch for dense and sparse, I also need to replicate the `add`, `add_`, `add_out` trifecta on the sparse side. I opted for a compromise here: I wrote a new `add_sparse` trifecta, but reused the implementation between CPU and CUDA. This means that I have to do another dispatch once I get to `add_out`. 
The alternative would have been to do twice as many copies for CPU and CUDA (thereby eliminating the extra dispatch) but that seemed distinctly not worth it. * A lot of kernels in sparse assumed that the dispatch argument must be sparse. This is no longer true with multiple dispatch, so I converted the asserts into plain error checking. This also means that we've perturbed the error message in the case of TestSparseOneOff.test_cuda_sparse_cpu_dense_add (I just updated the saved error message). * `addmm` is even more special: the bridge code also handled broadcasting. I replicated the broadcasting logic between CPU and CUDA implementations to avoid an extra dispatch. * `_sparse_addmm` gave me a bit of trouble, because I had forgotten why we had `torch.sparse.addmm` in the first place. But in the end, its changes followed along with the structural changes I made in addmm. I opted for an extra dispatch here for simplicity. * c10d has some Variable-Tensor confusion in its sparse code. I've worked around it by judiciously inserting "no variable type" guards, but a more correct fix would be to just solve the confusion entirely. 
Benchmark: Apply the following patch to the base commit and this commit: ``` diff --git a/aten/src/ATen/native/Const.cpp b/aten/src/ATen/native/Const.cpp new file mode 100644 index 0000000000..b66f4d3ece --- /dev/null +++ b/aten/src/ATen/native/Const.cpp @@ -0,0 +1,10 @@ +#include <ATen/ATen.h> + +namespace at { +namespace native { + +Tensor _const5(const Tensor& self, const Tensor& second, const Tensor& third, const Tensor& fourth, const Tensor& fifth) { + return self; +} + +}} // namespace at::native diff --git a/aten/src/ATen/native/native_functions.yaml b/aten/src/ATen/native/native_functions.yaml index b494ed7950..fddae638bb 100644 --- a/aten/src/ATen/native/native_functions.yaml +++ b/aten/src/ATen/native/native_functions.yaml @@ -5878,3 +5878,9 @@ dispatch: CPU: im2col_backward_cpu CUDA: im2col_backward_cuda + +# For benchmarking +- func: _const5(Tensor self, Tensor second, Tensor third, Tensor fourth, Tensor fifth) -> Tensor + variants: function + dispatch: + CPU: _const5 ``` Comparisons with timeit: One-argument, representative case: Before: ``` In [6]: %timeit x.reshape(1, 1) 1.46 µs ± 1.38 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) In [7]: %timeit x.reshape(1, 1) 1.48 µs ± 29.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) In [8]: %timeit x.reshape(1, 1) 1.52 µs ± 61.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) ``` After: ``` In [3]: %timeit x.reshape(1, 1) 1.42 µs ± 1.34 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) In [4]: %timeit x.reshape(1, 1) 1.43 µs ± 1.01 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) In [5]: %timeit x.reshape(1, 1) 1.42 µs ± 0.982 ns per loop (mean ± std. dev. 
of 7 runs, 1000000 loops each) ``` Five-argument, synthetic case (we expect, with enough Tensor arguments, for there to be a slowdown, as we scale `O(n)` with number of arguments, compared to old dispatcher which is `O(1)` with number of arguments): Before: ``` In [1]: import torch In [2]: x = torch.zeros(1) In [3]: %timeit torch._const5(x, x, x, x, x) 949 ns ± 1.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) In [4]: %timeit torch._const5(x, x, x, x, x) 954 ns ± 1.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) In [5]: %timeit torch._const5(x, x, x, x, x) 947 ns ± 0.601 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) ``` After: ``` In [3]: %timeit torch._const5(x, x, x, x, x) 985 ns ± 9.11 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) In [4]: %timeit torch._const5(x, x, x, x, x) 984 ns ± 1.17 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) In [5]: %timeit torch._const5(x, x, x, x, x) 988 ns ± 0.555 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) ``` Signed-off-by: Edward Z. Yang <ezyang@fb.com> Test Plan: Imported from OSS Reviewed By: zou3519 Differential Revision: D17499154 Pulled By: ezyang fbshipit-source-id: 8ea237c2e935134b0f4f8d6cfd89c6a93037c02c	2019-09-20 10:12:04 -07:00
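The multiple-dispatch commit above describes unioning the type sets of all Tensor and TensorList arguments before picking a kernel, with a fallback that reports which registrations exist. A toy Python model of that idea follows. Every name in it (`TypeId`, `FakeTensor`, `multi_dispatch_type_set`, `OpTable`) is hypothetical, and the rule "the highest-valued id in the union wins" is an assumption made for the sketch, not a description of the real ATen dispatcher.

```python
from enum import IntEnum


class TypeId(IntEnum):
    """Hypothetical type ids; a larger value means higher dispatch priority."""
    CPU = 1
    CUDA = 2
    SPARSE_CPU = 3
    SPARSE_CUDA = 4


class FakeTensor:
    """Stand-in for a tensor: it only carries its own type set."""

    def __init__(self, type_id):
        self.type_set = frozenset({type_id})


def multi_dispatch_type_set(*args):
    """Union the type sets of every tensor and tensor-list argument."""
    combined = frozenset()
    for arg in args:
        if isinstance(arg, FakeTensor):
            combined |= arg.type_set
        elif isinstance(arg, (list, tuple)):
            for item in arg:
                if isinstance(item, FakeTensor):
                    combined |= item.type_set
    return combined


class OpTable:
    """Per-operator table mapping a type id to a kernel."""

    def __init__(self, name):
        self.name = name
        self._kernels = {}

    def register(self, type_id, kernel):
        self._kernels[type_id] = kernel

    def get_op(self, type_set):
        if not type_set:
            raise RuntimeError(f"{self.name}: no tensor arguments to dispatch on")
        key = max(type_set)  # assumption: highest-priority id in the union wins
        if key not in self._kernels:
            # Fallback path: report which registrations are available.
            available = ", ".join(k.name for k in sorted(self._kernels))
            raise RuntimeError(
                f"{self.name}: no kernel for {key.name}; available: [{available}]"
            )
        return self._kernels[key]


add_table = OpTable("add")
add_table.register(TypeId.CPU, lambda a, b: "dense CPU add")
add_table.register(TypeId.SPARSE_CPU, lambda a, b: "sparse CPU add")


def add(a, b):
    """Dispatch on the union of both arguments, not just the first one."""
    return add_table.get_op(multi_dispatch_type_set(a, b))(a, b)


if __name__ == "__main__":
    dense = FakeTensor(TypeId.CPU)
    sparse = FakeTensor(TypeId.SPARSE_CPU)
    print(add(dense, dense))
    print(add(dense, sparse))
```

In this model `add(dense, sparse)` reaches the sparse kernel even though the first argument is dense, mirroring the behavior change the commit message describes for binary operations. Collecting the union walks every argument, which is the `O(n)` cost the five-argument benchmark measures.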


225 Commits