Summary:
Fixes https://github.com/pytorch/pytorch/issues/29187
This introduces a new class `_NormBase` that `_InstanceNorm` and `_BatchNorm` inherit from separately. This means the `isinstance(module, _BatchNorm)` check won't falsely pass for `_InstanceNorm`.
The suggested fix of adding `and not isinstance(module, _InstanceNorm)` works as well, but requires introducing a cyclic dependency between `instancenorm.py` and `batchnorm.py`.
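A minimal sketch of the resulting hierarchy (illustrative only; the real classes carry the full normalization logic, and previously `_InstanceNorm` subclassed `_BatchNorm` directly):

```python
class _NormBase:
    """Shared parameter/buffer handling lives here in the real implementation."""
    pass

class _BatchNorm(_NormBase):
    pass

class _InstanceNorm(_NormBase):
    pass

# The isinstance check that previously misfired now behaves as intended:
assert isinstance(_InstanceNorm(), _NormBase)
assert not isinstance(_InstanceNorm(), _BatchNorm)
```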
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29985
Differential Revision: D18588104
Pulled By: yf225
fbshipit-source-id: f599da3b902ad9c56836db4d429bfc462ed51338
Summary:
Fix for https://github.com/pytorch/pytorch/issues/29578
The shape check is moved up as early as possible, because backends by and large don't handle empty inputs correctly, so the check needs to happen before backend selection. That also automatically takes care of backward: the forward for an empty input is trivially differentiable, so no backend-specific backward routines are ever called.
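An illustrative sketch of the behavior this enables (convolution is used here purely as an example; the summary does not name the exact operator, and the added tests may differ):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
x = torch.empty(0, 3, 16, 16, requires_grad=True)  # zero-size batch

out = conv(x)            # the shape is checked before any backend is selected
out.sum().backward()     # forward on an empty input is trivially differentiable,
                         # so no backend-specific backward is ever invoked
print(out.shape, x.grad.shape)
```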
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30035
Test Plan: tests for empty inputs are added.
Differential Revision: D18584427
Pulled By: ngimel
fbshipit-source-id: a42918f50eb1f6995921aafa92879cd42dd5e9e1
Summary:
Fixes https://github.com/pytorch/pytorch/issues/6962
The PR implements the handle pool mechanism for cublas as suggested by mcarilli in https://github.com/pytorch/pytorch/issues/6962#issuecomment-530563872.
~~I didn't add any unit test here yet because as mcarilli mentioned:~~
> ~~On my local machine, out of curiosity I also rewrote that test to use gemms instead of convolutions. The race condition seemed rarer, but the test did show that cublas use is not thread safe. I can share the script if you want.~~
~~Please share your script with me mcarilli. And if the race condition is rare, would it still be possible for the CI to detect it?~~
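For reference, a minimal Python sketch of the kind of concurrent GEMM workload the handle pool is meant to make thread-safe (this is not mcarilli's script, only an illustration):

```python
import threading
import torch

def worker():
    a = torch.randn(512, 512, device="cuda")
    b = torch.randn(512, 512, device="cuda")
    for _ in range(100):
        torch.mm(a, b)  # concurrent GEMMs from several threads are what exercised
                        # the shared cuBLAS handle before the pool was introduced

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
torch.cuda.synchronize()
```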
cc: colesbury
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29233
Differential Revision: D18372007
Pulled By: ezyang
fbshipit-source-id: 3492bf13410598e8452e89cf4e3e63e8df9c8c3d
Summary:
This reverts 9a9bb448ee and fixes the broken case that caused the previous commit to be reverted.
Details about the fix:
modified: aten/src/ATen/native/Convolution.cpp
Called contiguous() on the 3D input tensor. This keeps the code path from accidentally recognizing the input as channels_last strided, due to the unsqueezing of a permuted 3D tensor.
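The fix itself lives in Convolution.cpp; the following is only a hedged illustration of the kind of input whose strides triggered the misclassification:

```python
import torch

# A permuted 3D conv input, shape (N, C, L) = (4, 16, 8), with non-contiguous strides.
x = torch.randn(4, 8, 16).permute(0, 2, 1)
# Conv1d unsqueezes such inputs to 4D internally; the resulting strides can be
# mistaken for a channels_last layout. Making the 3D input contiguous first
# removes that ambiguity before memory-format-based backend selection.
x_safe = x.contiguous()
print(x.stride(), x_safe.stride())
```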
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29361
Differential Revision: D18371964
Pulled By: VitalyFedyunin
fbshipit-source-id: a5985f4687b37e183649fa35b8ccdb50368ebfdf
Summary:
VitalyFedyunin, this PR ports L1 loss to ATen.
**Test script:**
```
import torch
import torch.nn as nn
import time

torch.manual_seed(0)

def _time():
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.time()

device = "cpu"
loss = nn.L1Loss(reduction='sum')
if torch.cuda.is_available():
    device = "cuda"
    loss = loss.cuda()

# warm up
for n in [100, 10000]:
    input = torch.randn(128, n, requires_grad=True, device=device)
    target = torch.randn(128, n, device=device)
    for i in range(1000):
        output = loss(input, target)
        output.backward()

# get running time
for n in [100, 10000]:
    fwd_t = 0
    bwd_t = 0
    input = torch.randn(128, n, requires_grad=True, device=device)
    target = torch.randn(128, n, device=device)
    for i in range(10000):
        t1 = _time()
        output = loss(input, target)
        t2 = _time()
        output.backward()
        t3 = _time()
        fwd_t = fwd_t + (t2 - t1)
        bwd_t = bwd_t + (t3 - t2)
    fwd_avg = fwd_t / 10000 * 1000
    bwd_avg = bwd_t / 10000 * 1000
    print("input size(128, %d) forward time is %.2f (ms); backward avg time is %.2f (ms)."
          % (n, fwd_avg, bwd_avg))
```
Test Device: CPU: skx-8180, GPU: Tesla P100.
**Performance:**
Before:
```
GPU:
reduction='mean'
input size(128, 100) forward time is 0.31 (ms); backward avg time is 0.09 (ms).
input size(128, 10000) forward time is 0.33 (ms); backward avg time is 0.14 (ms).
reduction='sum'
input size(128, 100) forward time is 0.31 (ms); backward avg time is 0.10 (ms).
input size(128, 10000) forward time is 0.34 (ms); backward avg time is 0.14 (ms).
CPU:
reduction='mean'
input size(128, 100) forward time is 0.06 (ms); backward avg time is 0.10 (ms).
input size(128, 10000) forward time is 1.92 (ms); backward avg time is 2.96 (ms).
reduction='sum'
input size(128, 100) forward time is 0.04 (ms); backward avg time is 0.09 (ms).
input size(128, 10000) forward time is 1.96 (ms); backward avg time is 2.79 (ms).
num_threads = 1:
reduction='mean'
input size(128, 100) forward time is 0.03 (ms); backward avg time is 0.05 (ms).
input size(128, 10000) forward time is 1.67 (ms); backward avg time is 2.50 (ms).
reduction='sum':
input size(128, 100) forward time is 0.03 (ms); backward avg time is 0.05 (ms).
input size(128, 10000) forward time is 1.67 (ms); backward avg time is 2.51 (ms).
```
After:
```
GPU:
reduction='mean'
input size(128, 100) forward time is 0.05 (ms); backward avg time is 0.10 (ms).
input size(128, 10000) forward time is 0.11 (ms); backward avg time is 0.17 (ms).
reduction='sum'
input size(128, 100) forward time is 0.05 (ms); backward avg time is 0.08 (ms).
input size(128, 10000) forward time is 0.11 (ms); backward avg time is 0.16 (ms).
CPU:
reduction='mean'
input size(128, 100) forward time is 0.03 (ms); backward avg time is 0.05 (ms).
input size(128, 10000) forward time is 0.14 (ms); backward avg time is 0.18 (ms).
reduction='sum'
input size(128, 100) forward time is 0.03 (ms); backward avg time is 0.05 (ms).
input size(128, 10000) forward time is 0.15 (ms); backward avg time is 0.17 (ms).
num_threads = 1:
reduction='mean':
input size(128, 100) forward time is 0.04 (ms); backward avg time is 0.06 (ms).
input size(128, 10000) forward time is 1.05 (ms); backward avg time is 1.72 (ms).
reduction='sum':
input size(128, 100) forward time is 0.03 (ms); backward avg time is 0.05 (ms).
input size(128, 10000) forward time is 1.03 (ms); backward avg time is 1.71 (ms).
```
How to set the number of threads? Use the following script:
```
num_threads=$1
script=$2
last_core=`expr $num_threads - 1`
echo "using $num_threads OMP threads"
echo "bind cores to 0~$last_core"
export OMP_NUM_THREADS=$num_threads
export KMP_AFFINITY=granularity=fine,compact,1,0
numactl --physcpubind=0-$last_core --membind=0 python $script
```
and run `./run.sh 1 L1loss.py`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26795
Differential Revision: D18140434
Pulled By: VitalyFedyunin
fbshipit-source-id: d0b976ec36797f2e6b4e58fbbac89688d29e736f
Summary:
Added nhwc support for:
1. cudnn_batch_norm & cudnn_batch_norm_backward
2. cudnn_convolution_forward & cudnn_convolution_backward
3. cudnn_convolution_transpose & cudnn_convolution_transpose_backward
Patching suggest_memory_format for convolution.
suggest_memory_format has an ambiguous meaning in two cases:
1. tensor with NCHW where C == 1:
we could use the stride of C as a hint to tell the intended memory format.
2. tensor with NCHW where H == W == 1:
there's no way to identify the intended memory format from strides.
Currently we fall back to NCHW whenever we see a contiguous tensor, hence avoiding
ambiguity for some of the special cases.
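A hedged Python illustration of the two ambiguous cases (suggest_memory_format itself is a C++ helper; these tensors only show why strides alone cannot decide):

```python
import torch

# Case 1: C == 1. The channel stride is degenerate for addressing, so it can
# only serve as a hint about the intended layout.
a = torch.randn(8, 1, 32, 32).to(memory_format=torch.channels_last)
print(a.stride())   # compare with the NCHW-contiguous strides (1024, 1024, 32, 1)

# Case 2: H == W == 1. NCHW-contiguous and NHWC-contiguous strides address the
# same elements, so the intended layout cannot be recovered from strides at all.
b = torch.randn(8, 64, 1, 1)
print(b.stride())
```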
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23861
Differential Revision: D18263434
Pulled By: VitalyFedyunin
fbshipit-source-id: dd9f69576ec12fec879cd87a3d446931371360d9
Summary:
Adds the C++ API clip_grad_value_ to the torch::nn::utils module.
Also fixes the for-loop indentation level error in the original test/test_nn.py.
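The new C++ function mirrors the existing Python utility; a short usage sketch of the Python counterpart, for reference:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)
model(torch.randn(4, 10)).sum().backward()

# Clamp every parameter gradient to the range [-0.5, 0.5] in place.
nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)
print(max(p.grad.abs().max().item() for p in model.parameters()))  # <= 0.5
```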
Issue: https://github.com/pytorch/pytorch/issues/25883
Reviewer: yf225
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28736
Differential Revision: D18263807
Pulled By: yf225
fbshipit-source-id: 29282450bd2099df16925e1d0edd3d933f6eeb9b
Summary:
This is to fix https://github.com/pytorch/pytorch/issues/22526.
Adds a limit on the launch config for grid sizes as well; the previous code asked to launch more blocks than the hardware supports.
A test is added in test_cuda.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28927
Differential Revision: D18241759
Pulled By: soumith
fbshipit-source-id: 8f2535bb0bc4ea7998024b137576a38067668999
Summary:
Initial kernel support added for the optimized NHWC tensor layout.
TODO: currently the backward kernel spits out a tensor with NHWC strides.
Unfortunately autograd restores the grad to contiguous (in either copy or add). This
makes real perf tuning annoying to do, since I cannot easily measure end-to-end
time in my Python script.
My current kernel is blazing fast compared to the original NCHW kernel in fp16,
since I avoided atomicAdd. I'll finish perf tuning after we merge some future
PRs expanding NHWC support in the core.
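For context, a minimal sketch of opting a tensor into the NHWC (channels_last) layout that this kernel targets (illustrative only; the specific op covered by this PR is not spelled out above):

```python
import torch

x = torch.randn(8, 64, 32, 32, device="cuda")
x_nhwc = x.to(memory_format=torch.channels_last)   # same shape, NHWC strides
print(x_nhwc.shape, x_nhwc.stride())
print(x_nhwc.is_contiguous(memory_format=torch.channels_last))   # True
```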
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24396
Differential Revision: D18115941
Pulled By: VitalyFedyunin
fbshipit-source-id: 57b4922b7bf308430ffe1406681f68629baf8834
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28297
Splitting data parallel tests out of test_nn.py, since it's easier to
manage and track these tests separately and failures can be routed to
the appropriate POCs.
Test Plan: waitforbuildbot
Differential Revision: D18011663
fbshipit-source-id: 17ebf7c04e7dc7ff4c8d38458daab5b911bed75d
Summary:
This was referenced in the `RNN` docs but wasn't actually assigned.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28058
Pulled By: driazati
Differential Revision: D17945867
fbshipit-source-id: 0f0dc2633183a7e67a12352a2a7ac0545284666a
Summary:
Per title. Several stream fixes have gone in that may make this pass in CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28192
Differential Revision: D17974219
Pulled By: mruberry
fbshipit-source-id: 543d000789c83711a8b4bef169a87635fda7508b
Summary:
Using grad_out for CuDNN CTC loss fixes: https://github.com/pytorch/pytorch/issues/26797, https://github.com/pytorch/pytorch/issues/25833.
We also fix an incompatible CuDNN change that surfaced during testing: as of CuDNN 7.6, the semantics of the CTC loss gradients are different.
This leads us to disable CuDNN CTC loss for CuDNN < 7.6. To mitigate the impact on users, we convert the parameters for the native implementation when CuDNN isn't applicable (previously this would give an error).
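For reference, a small example of the op in question, using the documented torch.nn.CTCLoss interface (this is not the regression test added in the PR):

```python
import torch
import torch.nn as nn

T, N, C, S = 50, 4, 20, 10   # input length, batch, classes, target length
ctc = nn.CTCLoss(blank=0)

log_probs = torch.randn(T, N, C).log_softmax(2).detach().requires_grad_()
targets = torch.randint(1, C, (N, S), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
# With grad_out now honored in the CuDNN path, scaling the loss scales the
# resulting gradient accordingly instead of being silently ignored.
(2 * loss).backward()
```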
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27039
Differential Revision: D17910815
Pulled By: ngimel
fbshipit-source-id: 465b33612d3402f10c355aa7026a7e1ffaef3073
Summary:
The current embedding backwards CUDA kernel is somewhat broken. It effectively ignores padding_idx and also incorrectly drops an index from the input.
This commit fixes that bug and fixes the unit test so that this behavior won't break in the future.
This fixes https://github.com/pytorch/pytorch/issues/26302.
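A hedged sanity check of the fixed behavior (not the exact unit test from this commit):

```python
import torch
import torch.nn as nn

emb = nn.Embedding(10, 3, padding_idx=0).cuda()
idx = torch.tensor([[0, 2, 0, 5]], device="cuda")   # includes the padding index twice
emb(idx).sum().backward()

assert emb.weight.grad[0].abs().sum().item() == 0   # padding row gets no gradient
assert emb.weight.grad[2].abs().sum().item() > 0    # every real index contributes
assert emb.weight.grad[5].abs().sum().item() > 0
```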
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27731
Differential Revision: D17893803
Pulled By: ngimel
fbshipit-source-id: 4ba02a17ec0e29a7016d65480d4ff0c276550616
Summary:
One fewer legacy decorator cluttering the test suite.
Functions relying on this decorator were updated or, in the case of test_sparse, the test suite was put back on double by default.
Note: this PR is blocked on https://github.com/pytorch/pytorch/issues/27599.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27628
Differential Revision: D17896254
Pulled By: mruberry
fbshipit-source-id: 13d460301f50ef4af7a660372432108164c0de1f
Summary:
Fix issue https://github.com/pytorch/pytorch/issues/26698.
With different query/key/value dimensions, `nn.MultiheadAttention` has a DDP incompatibility issue because in that case the `in_proj_weight` attribute is created but not used. This fixes it and adds a distributed unit test.
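A short construction sketch of the affected configuration (separate key/value dimensions, so the per-projection weights are used rather than a single in_proj_weight):

```python
import torch
import torch.nn as nn

# Query embedding dim 16, but keys and values come from spaces of dim 24 and 32.
mha = nn.MultiheadAttention(embed_dim=16, num_heads=4, kdim=24, vdim=32)

q = torch.randn(5, 2, 16)   # (seq_len, batch, embed_dim)
k = torch.randn(7, 2, 24)
v = torch.randn(7, 2, 32)
out, attn_weights = mha(q, k, v)

# After the fix, every registered parameter participates in the forward pass,
# which is what DDP's unused-parameter detection requires.
print([name for name, _ in mha.named_parameters()])
```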
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26826
Differential Revision: D17583807
Pulled By: zhangguanheng66
fbshipit-source-id: c393584c331ed4f57ebaf2d4015ef04589c973f6
Summary:
This PR stops common_utils.py from setting the default tensor type when it's imported. See issue https://github.com/pytorch/pytorch/issues/27355. This is a frequent source of confusion for test writers.
Many tests relied on this setting (whether they knew it or not), and this PR also updates the test suite to pass without common_utils.py setting the default tensor type. Some larger test files now set the default floating dtype themselves, however. These test files are:
- test_autograd.py
- test_distributions.py
- test_jit.py
- test_nn.py
This is still a significant improvement over today, however. First, these files set the default floating dtype much more clearly than importing it from common_utils. Second, the rest of the test suite no longer sets this globally. Third, this PR is a springboard for updating those tests, too. In particular, as tests are made generic they can be moved away from relying on this global setting.
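A minimal sketch of what those files now do for themselves (assuming the plain torch.set_default_dtype API; the PR's actual helper may wrap it differently):

```python
import torch

# At the top of e.g. test_nn.py: opt in to double-precision defaults explicitly
# instead of inheriting them from a common_utils import side effect.
torch.set_default_dtype(torch.double)

x = torch.randn(3)
print(x.dtype)   # torch.float64
```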
Notable technical changes in this PR are:
- Significant updates to test_torch.py to make it pass without setting the default floating dtype globally.
- The default_floating_dtype decorator is now defined in common_utils, a couple versions of this operator were defined in test files previously.
- test_torch-specific parts of common_utils were refactored into test_torch.
- tensor creation methods in common_utils were updated to accept an optional dtype and device.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27444
Differential Revision: D17795235
Pulled By: mruberry
fbshipit-source-id: 7f77271c0c836e69f183ad9057a2c4b29f09d2e1
Summary:
PackedSequence.to(device) incorrectly places one of three tensors on the device and leaves the other two tensors where they are. If these devices are distinct then further operations on PackedSequence will fail. This behavior is inconsistent with Tensor.to and PackedSequence's behavior when .cuda() is called.
Additionally, PackedSequence defines multiple other conversion functions that were independently and inconsistently implemented.
This PR unifies all implementations and makes the PackedSequence.to behavior more consistent with Tensor.to. It is not completely consistent per comments. test_device_mask in test_nn.py is updated to validate the new functionality.
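A short example of the now-consistent behavior (shapes are illustrative; batch_sizes intentionally stays on the CPU):

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

seqs = torch.randn(3, 5, 8)                 # (batch, max_len, features)
lengths = torch.tensor([5, 4, 2])
packed = pack_padded_sequence(seqs, lengths, batch_first=True, enforce_sorted=False)

packed_cuda = packed.to("cuda")
# The data and index tensors now land on the same device, not just data.
print(packed_cuda.data.device, packed_cuda.sorted_indices.device)
```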
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27245
Differential Revision: D17757850
Pulled By: mruberry
fbshipit-source-id: 58f0bd40f1aa300fb0a91ee743483d645f977dc5
Summary:
test_nn.py will still require significant work to make generic, however I'm trying to break up the PRs into more manageable chunks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27137
Differential Revision: D17718488
Pulled By: mruberry
fbshipit-source-id: 4d9359414838a1d2a957d7a334f6a5df6cb00aeb
Summary:
- Creates skipCUDAIfNoCudnn, skipCUDAIfCudnnVersionLessThan decorators
- Makes several test_nn.py tests generic
Many tests in test_nn.py test cuDNN. These tests are guarded on various conditionals using TEST_CUDNN and TEST_CUDNN_VERSION imported from common_cuda.py and custom error messages like 'CUDNN not available' and 'needs cudnn.'
This PR suggests using the CUDA base test class instead of common_cuda.py to test cuDNN's availability, at least on generic tests. The CUDA base test class is preferable to common_cuda.py since it only creates a CUDA context if its tests are run. Importing from common_cuda.py, on the other hand, always creates a CUDA context. Using the CUDA base test class is also consistent with how other generic tests are guarded and provides consistent skip messages.
One quirk to this approach is that it makes use of the self argument to the test functions to check for cuDNN availability during a test. See test_rnn_retain_variables. The self argument could also be used to check the device type instead of the more verbose torch.device(device).type == 'cuda'.
An alternative approach to making test_nn.py generic would be to continue to use common_cuda.py imports, try to keep their skip messages consistent, and not worry about creating unnecessary CUDA contexts. This would preclude writing generic tests that can only run on CUDA if cuDNN is available, however, so tests like "_test_RNN_cpu_vs_cudnn" would require additional changes to make into device generic precision tests like "_test_RNN_cpu_vs_xla."
For consistency, simplicity, and ease of use, I recommend we adopt the proposed decorators and make use of the self argument when productive.
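A hedged sketch of the proposed pattern (decorator and helper names follow the summary's description; exact signatures in the final code may differ):

```python
# Hypothetical illustration only; the real generic tests live in test_nn.py.
from common_device_type import instantiate_device_type_tests, skipCUDAIfNoCudnn
from common_utils import TestCase

class TestNNDeviceType(TestCase):
    @skipCUDAIfNoCudnn
    def test_rnn_retain_variables(self, device):
        # `self` is the device-specific test class, so it can answer questions
        # like cuDNN availability or the device type without common_cuda imports.
        ...

instantiate_device_type_tests(TestNNDeviceType, globals())
```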
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26791
Differential Revision: D17678325
Pulled By: mruberry
fbshipit-source-id: 1794735ede9bc9f36856e72b3804b136ad3e0de2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26290
Fixes #26206
Happily, I can also delete the dead Dense***Tensor cases, since they
are for the defunct THS backend.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D17404368
Pulled By: ezyang
fbshipit-source-id: 79d71ad40c4325c9f52d2825aceb65074d2e20e8
Summary:
- Moves several tests to TestNNDeviceType
- Merges helper base with TestNNDeviceType
<s>- Enables non-default stream for TestNN (like recent updates to TestTorch and TestCUDA)</s>
Reverted non-default stream due to failure of test_variable_sequence_cuda (main.TestNN).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26638
Differential Revision: D17543899
Pulled By: mruberry
fbshipit-source-id: 001fa191f5fe424f2e7adc378b8fb5ee7f264f16
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26599
These fail due to tolerance in equality comparison. Disable them for now.
ghstack-source-id: 90553855
Test Plan: unit tests
Differential Revision: D17517085
fbshipit-source-id: a4d9278e356318719ccd84047404915a97944f52
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26501
Instead of considering only the TensorTypeSet of the first argument, we collect all Tensor and TensorList arguments and union them together before computing the dispatch type id.
XLA companion patch at https://github.com/pytorch/xla/pull/1031
Billing of changes:
* ATenDispatch fallback code (i.e., what gets run if there is no entry for a function in the table) now lives out-of-line in a function `getFallbackOp`. This gave me an opportunity to write a more detailed error message, providing information about what registrations were available. There is a TODO in the fallback code, suggesting that we could automatically redispatch in the event that there is no handler for the key. But this is a bit of a design question, because it's not clear if automatic redispatch would cover up errors in the dispatch table (i.e., there *should* have been something registered at some key, but there wasn't.)
* Collection of Tensor/TensorList arguments is done using the trusty old IterArgs helper class. A minor bit of refactoring I had to do to get here was move the IterArgs functionality in torch/csrc/utils/variadic.h into ATen/core. There's some refactoring due on that file too (it has copies of some C++ helper pieces which already live in c10--you can't actually move the whole thing because it is literally incompatible with other code in the codebase). So instead of calling `type_set()` to get the type set of the dispatch argument, now we just call `at::detail::multi_dispatch_tensor_type_set` on all of the tensor/tensor list arguments.
* The code generator is adjusted to codegen collection of arguments as needed. There is a little bit of a hack in the code generator to turn 'self' arguments into '*this'. I think this may be duplicated with some logic somewhere else but I have to double check.
The new generated code looks like this:
```
inline Tensor & Tensor::copy_(const Tensor & src, bool non_blocking) const {
  static auto table = globalATenDispatch().getOpTable("aten::copy_(Tensor(a!) self, Tensor src, bool non_blocking=False) -> Tensor(a!)");
  return table->getOp<Tensor & (Tensor &, const Tensor &, bool)>(at::detail::multi_dispatch_tensor_type_set(*this, src))(const_cast<Tensor&>(*this), src, non_blocking);
}
```
The key difference is that previously we wrote `type_set()` as argument to getOp; now it is a call to `multi_dispatch_tensor_type_set` which collects the type ids together.
After turning on multi-dispatch, I had to refactor existing code which previously dispatched one place, but now dispatches somewhere else. The primary component affected by this is sparse.
* Binary operations (add/sub/mul/div/addmm) now dispatch to sparse kernels even if you did add(dense, sparse). So I delete all the sparse handling code from dense kernels, and bulk up the sparse error handling to handle when the first argument is dense. In the case of addmm, I can eliminate the bridge code entirely (well, not quite: more on this below). I also updated the dispatch on sparse to actually point at sparse kernels. Pay special attention to the handling of `div_` by scalar: previously this logic lived in the "dense" `div_` implementation, but there is actually not any sparse kernel we dispatch to. I solved this particular problem by making a redispatch, but another valid approach would have been to add specific dispatches for sparse div on scalar. This codepath is poorly tested because it is only exercised from C++.
* One minor annoyance is that because I now want separate dispatch for dense and sparse, I also need to replicate the `add`, `add_`, `add_out` trifecta on the sparse side. I opted for a compromise here: I wrote a new `add_sparse` trifecta, but reused the implementation between CPU and CUDA. This means that I have to do another dispatch once I get to `add_out`. The alternative would have been to do twice as many copies for CPU and CUDA (thereby eliminating the extra dispatch), but that seemed distinctly not worth it.
* A lot of kernels in sparse assumed that the dispatch argument must be sparse. This is no longer true with dispatch, so I converted the asserts into plain error checking. This also means that we've perturbed the error message in the case of TestSparseOneOff.test_cuda_sparse_cpu_dense_add (I just updated the saved error message)
* `addmm` is a little bit even more special: the bridge code also handled broadcasting. I replicated the broadcasting logic between CPU and CUDA implementations to avoid an extra dispatch.
* `_sparse_addmm` gave me a bit of trouble, because I had forgotten why we had `torch.sparse.addmm` in the first place. But in the end, its changes followed along with the structural changes I made in addmm. I opted for an extra dispatch here for simplicity.
* c10d has some Variable-Tensor confusion in its sparse code. I've worked around it by judiciously inserting "no variable type" guards, but a more correct fix would be to just solve the confusion entirely.
Benchmark:
Apply the following patch to the base commit and this commit:
```
diff --git a/aten/src/ATen/native/Const.cpp b/aten/src/ATen/native/Const.cpp
new file mode 100644
index 0000000000..b66f4d3ece
--- /dev/null
+++ b/aten/src/ATen/native/Const.cpp
@@ -0,0 +1,10 @@
+#include <ATen/ATen.h>
+
+namespace at {
+namespace native {
+
+Tensor _const5(const Tensor& self, const Tensor& second, const Tensor& third, const Tensor& fourth, const Tensor& fifth) {
+ return self;
+}
+
+}} // namespace at::native
diff --git a/aten/src/ATen/native/native_functions.yaml b/aten/src/ATen/native/native_functions.yaml
index b494ed7950..fddae638bb 100644
--- a/aten/src/ATen/native/native_functions.yaml
+++ b/aten/src/ATen/native/native_functions.yaml
@@ -5878,3 +5878,9 @@
dispatch:
CPU: im2col_backward_cpu
CUDA: im2col_backward_cuda
+
+# For benchmarking
+- func: _const5(Tensor self, Tensor second, Tensor third, Tensor fourth, Tensor fifth) -> Tensor
+ variants: function
+ dispatch:
+ CPU: _const5
```
Comparisons with timeit:
One-argument, representative case:
Before:
```
In [6]: %timeit x.reshape(1, 1)
1.46 µs ± 1.38 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [7]: %timeit x.reshape(1, 1)
1.48 µs ± 29.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [8]: %timeit x.reshape(1, 1)
1.52 µs ± 61.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
After:
```
In [3]: %timeit x.reshape(1, 1)
1.42 µs ± 1.34 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit x.reshape(1, 1)
1.43 µs ± 1.01 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit x.reshape(1, 1)
1.42 µs ± 0.982 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
Five-argument, synthetic case (we expect, with enough Tensor arguments, for there to be a slowdown, as we scale `O(n)` with number of arguments, compared to old dispatcher which is `O(1)` with number of arguments):
Before:
```
In [1]: import torch
In [2]: x = torch.zeros(1)
In [3]: %timeit torch._const5(x, x, x, x, x)
949 ns ± 1.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit torch._const5(x, x, x, x, x)
954 ns ± 1.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit torch._const5(x, x, x, x, x)
947 ns ± 0.601 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
After:
```
In [3]: %timeit torch._const5(x, x, x, x, x)
985 ns ± 9.11 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit torch._const5(x, x, x, x, x)
984 ns ± 1.17 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit torch._const5(x, x, x, x, x)
988 ns ± 0.555 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: zou3519
Differential Revision: D17499154
Pulled By: ezyang
fbshipit-source-id: 8ea237c2e935134b0f4f8d6cfd89c6a93037c02c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26468
Instead of considering only the TensorTypeSet of the first argument, we collect all Tensor and TensorList arguments and union them together before computing the dispatch type id.
XLA companion patch at https://github.com/pytorch/xla/pull/1031
Billing of changes:
* ATenDispatch fallback code (i.e., what gets run if there is no entry for a function in the table) now lives out-of-line in a function `getFallbackOp`. This gave me an opportunity to write a more detailed error message, providing information about what registrations were available. There is a TODO in the fallback code, suggesting that we could automatically redispatch in the event that there is no handler for the key. But this is a bit of a design question, because it's not clear if automatic redispatch would cover up errors in the dispatch table (i.e., there *should* have been something registered at some key, but there wasn't.)
* Collection of Tensor/TensorList arguments is done using the trusty old IterArgs helper class. A minor bit of refactoring I had to do to get here was move the IterArgs functionality in torch/csrc/utils/variadic.h into ATen/core. There's some refactoring due on that file too (it has copies of some C++ helper pieces which already live in c10--you can't actually move the whole thing because it is literally incompatible with other code in the codebase). So instead of calling `type_set()` to get the type set of the dispatch argument, now we just call `at::detail::multi_dispatch_tensor_type_set` on all of the tensor/tensor list arguments.
* The code generator is adjusted to codegen collection of arguments as needed. There is a little bit of a hack in the code generator to turn 'self' arguments into '*this'. I think this may be duplicated with some logic somewhere else but I have to double check.
The new generated code looks like this:
```
inline Tensor & Tensor::copy_(const Tensor & src, bool non_blocking) const {
  static auto table = globalATenDispatch().getOpTable("aten::copy_(Tensor(a!) self, Tensor src, bool non_blocking=False) -> Tensor(a!)");
  return table->getOp<Tensor & (Tensor &, const Tensor &, bool)>(at::detail::multi_dispatch_tensor_type_set(*this, src))(const_cast<Tensor&>(*this), src, non_blocking);
}
```
The key difference is that previously we wrote `type_set()` as argument to getOp; now it is a call to `multi_dispatch_tensor_type_set` which collects the type ids together.
After turning on multi-dispatch, I had to refactor existing code which previously dispatched one place, but now dispatches somewhere else. The primary component affected by this is sparse.
* Binary operations (add/sub/mul/div/addmm) now dispatch to sparse kernels even if you did add(dense, sparse). So I delete all the sparse handling code from dense kernels, and bulk up the sparse error handling to handle when the first argument is dense. In the case of addmm, I can eliminate the bridge code entirely (well, not quite: more on this below). I also updated the dispatch on sparse to actually point at sparse kernels. Pay special attention to the handling of `div_` by scalar: previously this logic lived in the "dense" `div_` implementation, but there is actually not any sparse kernel we dispatch to. I solved this particular problem by making a redispatch, but another valid approach would have been to add specific dispatches for sparse div on scalar. This codepath is poorly tested because it is only exercised from C++.
* One minor annoyance is that because I now want separate dispatch for dense and sparse, I also need to replicate the `add`, `add_`, `add_out` trifecta on the sparse side. I opted for a compromise here: I wrote a new `add_sparse` trifecta, but reused the implementation between CPU and CUDA. This means that I have to do another dispatch once I get to `add_out`. The alternative would have been to do twice as many copies for CPU and CUDA (thereby eliminating the extra dispatch), but that seemed distinctly not worth it.
* A lot of kernels in sparse assumed that the dispatch argument must be sparse. This is no longer true with dispatch, so I converted the asserts into plain error checking. This also means that we've perturbed the error message in the case of TestSparseOneOff.test_cuda_sparse_cpu_dense_add (I just updated the saved error message)
* `addmm` is a little bit even more special: the bridge code also handled broadcasting. I replicated the broadcasting logic between CPU and CUDA implementations to avoid an extra dispatch.
* `_sparse_addmm` gave me a bit of trouble, because I had forgotten why we had `torch.sparse.addmm` in the first place. But in the end, its changes followed along with the structural changes I made in addmm. I opted for an extra dispatch here for simplicity.
* c10d has some Variable-Tensor confusion in its sparse code. I've worked around it by judiciously inserting "no variable type" guards, but a more correct fix would be to just solve the confusion entirely.
Benchmark:
Apply the following patch to the base commit and this commit:
```
diff --git a/aten/src/ATen/native/Const.cpp b/aten/src/ATen/native/Const.cpp
new file mode 100644
index 0000000000..b66f4d3ece
--- /dev/null
+++ b/aten/src/ATen/native/Const.cpp
@@ -0,0 +1,10 @@
+#include <ATen/ATen.h>
+
+namespace at {
+namespace native {
+
+Tensor _const5(const Tensor& self, const Tensor& second, const Tensor& third, const Tensor& fourth, const Tensor& fifth) {
+ return self;
+}
+
+}} // namespace at::native
diff --git a/aten/src/ATen/native/native_functions.yaml b/aten/src/ATen/native/native_functions.yaml
index b494ed7950..fddae638bb 100644
--- a/aten/src/ATen/native/native_functions.yaml
+++ b/aten/src/ATen/native/native_functions.yaml
@@ -5878,3 +5878,9 @@
dispatch:
CPU: im2col_backward_cpu
CUDA: im2col_backward_cuda
+
+# For benchmarking
+- func: _const5(Tensor self, Tensor second, Tensor third, Tensor fourth, Tensor fifth) -> Tensor
+ variants: function
+ dispatch:
+ CPU: _const5
```
Comparisons with timeit:
One-argument, representative case:
Before:
```
In [6]: %timeit x.reshape(1, 1)
1.46 µs ± 1.38 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [7]: %timeit x.reshape(1, 1)
1.48 µs ± 29.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [8]: %timeit x.reshape(1, 1)
1.52 µs ± 61.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
After:
```
In [3]: %timeit x.reshape(1, 1)
1.42 µs ± 1.34 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit x.reshape(1, 1)
1.43 µs ± 1.01 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit x.reshape(1, 1)
1.42 µs ± 0.982 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
Five-argument, synthetic case (we expect, with enough Tensor arguments, for there to be a slowdown, as we scale `O(n)` with number of arguments, compared to old dispatcher which is `O(1)` with number of arguments):
Before:
```
In [1]: import torch
In [2]: x = torch.zeros(1)
In [3]: %timeit torch._const5(x, x, x, x, x)
949 ns ± 1.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit torch._const5(x, x, x, x, x)
954 ns ± 1.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit torch._const5(x, x, x, x, x)
947 ns ± 0.601 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
After:
```
In [3]: %timeit torch._const5(x, x, x, x, x)
985 ns ± 9.11 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit torch._const5(x, x, x, x, x)
984 ns ± 1.17 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit torch._const5(x, x, x, x, x)
988 ns ± 0.555 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: bddppq
Differential Revision: D17481256
Pulled By: ezyang
fbshipit-source-id: b3206936b4ca8938d45ea90fd71422e0d80b5f96
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25653
Instead of considering only the TensorTypeSet of the first argument, we collect all Tensor and TensorList arguments and union them together before computing the dispatch type id.
Billing of changes:
* ATenDispatch fallback code (i.e., what gets run if there is no entry for a function in the table) now lives out-of-line in a function `getFallbackOp`. This gave me an opportunity to write a more detailed error message, providing information about what registrations were available. There is a TODO in the fallback code, suggesting that we could automatically redispatch in the event that there is no handler for the key. But this is a bit of a design question, because it's not clear if automatic redispatch would cover up errors in the dispatch table (i.e., there *should* have been something registered at some key, but there wasn't.)
* Collection of Tensor/TensorList arguments is done using the trusty old IterArgs helper class. A minor bit of refactoring I had to do to get here was move the IterArgs functionality in torch/csrc/utils/variadic.h into ATen/core. There's some refactoring due on that file too (it has copies of some C++ helper pieces which already live in c10--you can't actually move the whole thing because it is literally incompatible with other code in the codebase). So instead of calling `type_set()` to get the type set of the dispatch argument, now we just call `at::detail::multi_dispatch_tensor_type_set` on all of the tensor/tensor list arguments.
* The code generator is adjusted to codegen collection of arguments as needed. There is a little bit of a hack in the code generator to turn 'self' arguments into '*this'. I think this may be duplicated with some logic somewhere else but I have to double check.
After turning on multi-dispatch, I had to refactor existing code which previously dispatched one place, but now dispatches somewhere else. The primary component affected by this is sparse.
* Binary operations (add/sub/mul/div/addmm) now dispatch to sparse kernels even if you did add(dense, sparse). So I delete all the sparse handling code from dense kernels, and bulk up the sparse error handling to handle when the first argument is dense. In the case of addmm, I can eliminate the bridge code entirely (well, not quite: more on this below). I also updated the dispatch on sparse to actually point at sparse kernels. Pay special attention to the handling of `div_` by scalar: previously this logic lived in the "dense" `div_` implementation, but there is actually not any sparse kernel we dispatch to. I solved this particular problem by making a redispatch, but another valid approach would have been to add specific dispatches for sparse div on scalar. This codepath is poorly tested because it is only exercised from C++.
* One minor annoyance is that because I now want separate dispatch for dense and sparse, I also need to replicate the `add`, `add_`, `add_out` trifecta on the sparse side. I opted for a compromise here: I wrote a new `add_sparse` trifecta, but reused the implementation between CPU and CUDA. This means that I have to do another dispatch once I get to `add_out`. The alternative would have been to do twice as many copies for CPU and CUDA (thereby eliminating the extra dispatch), but that seemed distinctly not worth it.
* A lot of kernels in sparse assumed that the dispatch argument must be sparse. This is no longer true with dispatch, so I converted the asserts into plain error checking. This also means that we've perturbed the error message in the case of TestSparseOneOff.test_cuda_sparse_cpu_dense_add (I just updated the saved error message)
* `addmm` is a little bit even more special: the bridge code also handled broadcasting. I replicated the broadcasting logic between CPU and CUDA implementations to avoid an extra dispatch.
* `_sparse_addmm` gave me a bit of trouble, because I had forgotten why we had `torch.sparse.addmm` in the first place. But in the end, its changes followed along with the structural changes I made in addmm. I opted for an extra dispatch here for simplicity.
* c10d has some Variable-Tensor confusion in its sparse code. I've worked around it by judiciously inserting "no variable type" guards, but a more correct fix would be to just solve the confusion entirely.
Benchmark:
Apply the following patch to the base commit and this commit:
```
diff --git a/aten/src/ATen/native/Const.cpp b/aten/src/ATen/native/Const.cpp
new file mode 100644
index 0000000000..b66f4d3ece
--- /dev/null
+++ b/aten/src/ATen/native/Const.cpp
@@ -0,0 +1,10 @@
+#include <ATen/ATen.h>
+
+namespace at {
+namespace native {
+
+Tensor _const5(const Tensor& self, const Tensor& second, const Tensor& third, const Tensor& fourth, const Tensor& fifth) {
+ return self;
+}
+
+}} // namespace at::native
diff --git a/aten/src/ATen/native/native_functions.yaml b/aten/src/ATen/native/native_functions.yaml
index b494ed7950..fddae638bb 100644
--- a/aten/src/ATen/native/native_functions.yaml
+++ b/aten/src/ATen/native/native_functions.yaml
@@ -5878,3 +5878,9 @@
dispatch:
CPU: im2col_backward_cpu
CUDA: im2col_backward_cuda
+
+# For benchmarking
+- func: _const5(Tensor self, Tensor second, Tensor third, Tensor fourth, Tensor fifth) -> Tensor
+ variants: function
+ dispatch:
+ CPU: _const5
```
Comparisons with timeit:
One-argument, representative case:
Before:
```
In [6]: %timeit x.reshape(1, 1)
1.46 µs ± 1.38 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [7]: %timeit x.reshape(1, 1)
1.48 µs ± 29.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [8]: %timeit x.reshape(1, 1)
1.52 µs ± 61.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
After:
```
In [3]: %timeit x.reshape(1, 1)
1.42 µs ± 1.34 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit x.reshape(1, 1)
1.43 µs ± 1.01 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit x.reshape(1, 1)
1.42 µs ± 0.982 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
Five-argument, synthetic case (we expect, with enough Tensor arguments, for there to be a slowdown, as we scale `O(n)` with number of arguments, compared to old dispatcher which is `O(1)` with number of arguments):
Before:
```
In [1]: import torch
In [2]: x = torch.zeros(1)
In [3]: %timeit torch._const5(x, x, x, x, x)
949 ns ± 1.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit torch._const5(x, x, x, x, x)
954 ns ± 1.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit torch._const5(x, x, x, x, x)
947 ns ± 0.601 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
After:
```
In [3]: %timeit torch._const5(x, x, x, x, x)
985 ns ± 9.11 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit torch._const5(x, x, x, x, x)
984 ns ± 1.17 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit torch._const5(x, x, x, x, x)
988 ns ± 0.555 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D17265918
Pulled By: ezyang
fbshipit-source-id: 221efe4e86a40f36abc81e2ebceaa7e251c90b3d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26077
As per #26071, we would like to get rid of calls to `Variable(...)`
where possible. This diff removes those calls from the test file test_nn.py. The
unit tests should all still pass as expected.
ghstack-source-id: 90086624
Test Plan: tests in `test_nn.py` should all pass.
Differential Revision: D17336484
fbshipit-source-id: 43fc7bd0b0be835ae89d06162ce1cbe4e0056d91