Commit Graph

1436 Commits

Author SHA1 Message Date
mingfeima
f1978b18f9 add mixed data type support for LayerNorm (#81851)
1. If the user runs a bfloat16 model with amp, `torch.autocast` will
keep module parameters in the accumulation dtype, which leaves `gamma` and `beta`
in float while input/output will be in bfloat16.
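
A minimal sketch of this mixed-dtype case (shapes chosen arbitrarily): a bfloat16 input normalized with float32 `gamma`/`beta`, which is the combination this PR adds support for.
```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 4, 8).bfloat16()    # bfloat16 activations
gamma = torch.ones(8)                  # float32 weight, as autocast keeps it
beta = torch.zeros(8)                  # float32 bias
y = F.layer_norm(x, (8,), weight=gamma, bias=beta)
print(y.dtype)                         # torch.bfloat16 with this PR
```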

2. If the user explicitly casts the model to bfloat16, for example:
```
  x = torch.randn(n, t, c).bfloat16()
  ln = nn.LayerNorm(c).bfloat16()
  y = ln(x)
```
The input/output and gamma/beta will all be in bfloat16.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81851
Approved by: https://github.com/ezyang
2022-12-01 04:48:34 +00:00
kshitij12345
8314d403a6 [test_nn] split multihead_attention from test_nn (#89748)
Ref: https://github.com/pytorch/pytorch/issues/63085
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89748
Approved by: https://github.com/albanD
2022-11-29 18:15:18 +00:00
Jiong Gong
620994cd7a Guard the boundary of index computed in compute_source_index_and_lambda (#89252)
Improve the fix in https://github.com/pytorch/pytorch/pull/89210
See discussion in https://github.com/pytorch/pytorch/issues/89212#issuecomment-1318911969
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89252
Approved by: https://github.com/mingfeima, https://github.com/weiwangmeta
2022-11-29 13:55:22 +00:00
Yuxin Wu
56e40fe054 Let SyncBatchNorm fallback to BN if not using distributed training (#89706)
Fixes #63662
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89706
Approved by: https://github.com/soumith
2022-11-27 05:55:24 +00:00
kshitij12345
d3c012f409 [test_nn] split pruning tests from test_nn (#89590)
Ref: https://github.com/pytorch/pytorch/issues/63085

Note: Doesn't need corresponding XLA PR as the migrated tests were not run on XLA (as they weren't in TestNNDeviceType).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89590
Approved by: https://github.com/albanD
2022-11-24 21:41:22 +00:00
Nikita Karetnikov
0a1a53083e [primTorch] Enable regex error testing for some refs (#87765)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87765
Approved by: https://github.com/mruberry
2022-11-23 23:36:27 +00:00
kshitij12345
1333fdcff1 [test_nn] split parametrization test from test_nn (#89552)
Ref: https://github.com/pytorch/pytorch/issues/63085

Note: Doesn't need corresponding XLA PR as the migrated tests were not run on XLA (as they weren't in TestNNDeviceType).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89552
Approved by: https://github.com/albanD
2022-11-23 17:27:40 +00:00
Kshiteej K
c651944f92 [test_nn] split hooks test from test_nn (#89201)
Ref: https://github.com/pytorch/pytorch/issues/63085

Note: Doesn't need corresponding XLA PR as the migrated tests were not run on XLA (as they weren't in TestNNDeviceType).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89201
Approved by: https://github.com/albanD
2022-11-23 08:39:45 +00:00
Kshiteej K
dd140fc351 [test_nn] move init tests from test_nn (#89202)
Ref: https://github.com/pytorch/pytorch/issues/63085

Note: Doesn't need corresponding XLA PR as the migrated tests were not run on XLA (as they weren't in TestNNDeviceType).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89202
Approved by: https://github.com/albanD
2022-11-23 08:30:51 +00:00
ecao
3beccbc299 Add BFloat16 support and optimization for mish, hardtanh backward, and silu on CPU (#82460)
### Description
* add BFloat16 support for mish and hardtanh backward on CPU.
* optimize the performance for silu
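
A minimal usage sketch (shapes chosen arbitrarily, not the benchmark code below) exercising the BFloat16 paths on CPU:
```python
import torch
import torch.nn.functional as F

x = torch.randn(10, 128, 10, 10, dtype=torch.bfloat16, requires_grad=True)
F.mish(x).sum().backward()                 # bf16 mish forward + backward
F.silu(x).sum().backward()                 # optimized bf16 silu path
F.hardtanh(x, -1.0, 1.0).sum().backward()  # bf16 hardtanh backward
```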

### Testing

- optimize the performance for silu: bfloat16

single socket (28 cores):
```
before: 1x128x1024  forward 0.090 s  backward  0.218 s
        10x128x1024 forward 0.146 s  backward  0.314 s

after:  1x128x1024   forward  0.064 s backward  0.100 s
        10x128x1024  forward  0.085 s backward  0.133 s
```
single core:
```
before: 1x128x1024   forward 0.300 s  backward  0.606 s
        10x128x1024  forward 2.825 s  backward  5.834 s

after:  1x128x1024   forward 0.156 s backward   0.239 s
        10x128x1024  forward 1.447 s backward   2.165 s
```

- Add BFloat16 support for mish and backward of hardtanh on CPU.

single socket (20 cores):
op | shape | fp32 forward / s | fp32 backward / s | bf16 forward / s | bf16 backward / s
-- | -- | -- | -- | -- | --
silu | [10, 128, 10, 10] | 4.41E-05 | 7.67E-05 | 5.32E-05 | 9.38E-05
silu | [10, 128, 80, 80] | 0.0008 | 0.001788 | 0.00067 | 0.001031
mish | [10, 128, 10, 10] | 0.000356 | 0.000427 | 0.000367 | 0.000436
mish | [10, 128, 80, 80] | 0.004527 | 0.005807 | 0.004757 | 0.005393
hardtanh | [10, 128, 10, 10] | / | 3.97E-05 | / | 4.45E-05
hardtanh | [10, 128, 80, 80] | / | 0.001748 | / | 0.000645

single core:
op | shape | fp32 forward / s | fp32 backward / s | bf16 forward / s | bf16 backward / s
-- | -- | -- | -- | -- | --
silu | [10, 128, 10, 10] | 1.17E-04 | 1.91E-04 | 1.35E-04 | 2.23E-04
silu | [10, 128, 80, 80] | 0.007434 | 0.013141 | 0.008464 | 0.013044
mish | [10, 128, 10, 10] | 0.00103 | 0.00122 | 0.00106 | 0.001227
mish | [10, 128, 80, 80] | 0.065629 | 0.078418 | 0.067779 | 0.077214
hardtanh | [10, 128, 10, 10] | / | 1.18E-04 | / | 9.30E-05
hardtanh | [10, 128, 80, 80] | / | 0.010773 | / | 0.005834

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82460
Approved by: https://github.com/mingfeima, https://github.com/malfet
2022-11-17 08:15:52 +00:00
ecao
44c9185f91 Fix empty input issue of convolution for channels last memory format (#86521)
Fixes an empty-input convolution issue: when the input is empty, e.g. of shape (0, 3, 3, 4), and the weight is in channels-last format, at::_unsafe_view will raise "view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead."
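
A minimal repro sketch of the case described above (module and shapes assumed for illustration):
```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 8, kernel_size=3, padding=1).to(memory_format=torch.channels_last)
x = torch.empty(0, 3, 3, 4)    # empty batch, contiguous input
y = conv(x)                    # previously raised the view/stride error quoted above
print(y.shape)                 # torch.Size([0, 8, 3, 4])
```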

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86521
Approved by: https://github.com/jgong5, https://github.com/malfet
2022-11-17 04:47:45 +00:00
Jerry Zhang
1adb7b9b84 [nn][utils] Preserve requires_grad from original weight and bias in fuse conv/linear bn weights (#89100)
Summary:
As titled: previously we just called `nn.Parameter`, which sets `requires_grad=True` by default; after
this PR we preserve the `requires_grad` of the original weight and bias.
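
A small sketch of the preserved flag, assuming `fuse_conv_bn_eval` from `torch.nn.utils.fusion` (which goes through these fusion helpers):
```python
import torch.nn as nn
from torch.nn.utils.fusion import fuse_conv_bn_eval

conv = nn.Conv2d(3, 8, 3).eval()
bn = nn.BatchNorm2d(8).eval()
conv.weight.requires_grad_(False)   # freeze the original weight
fused = fuse_conv_bn_eval(conv, bn)
print(fused.weight.requires_grad)   # False with this PR (previously always True)
```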

Test Plan:
python test/test_nn.py TestFusionUtils

Differential Revision: [D41343694](https://our.internmc.facebook.com/intern/diff/D41343694)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89100
Approved by: https://github.com/ngimel
2022-11-17 03:58:16 +00:00
Xiao Wang
f5df685090 Enable channels_last_3d on SyncBatchNorm (#88401)
This PR enables the use of the fast channels_last kernels for SyncBatchNorm with the channels_last_3d memory format.

With a small benchmark script here https://github.com/pytorch/pytorch/issues/88021#issuecomment-1299059859, on V100, I got

master:
```
DDP channels_last=False, run_forward_backward, time: 0.8945400714874268 sec
DDP channels_last=True, run_forward_backward, time: 1.4736433029174805 sec
```

This PR:
```
DDP channels_last=False, run_forward_backward, time: 0.8927242755889893 sec
DDP channels_last=True, run_forward_backward, time: 0.48697471618652344 sec
```

This PR is a follow-up of https://github.com/pytorch/pytorch/pull/46906

Close https://github.com/pytorch/pytorch/issues/88021
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88401
Approved by: https://github.com/ngimel
2022-11-15 19:25:53 +00:00
Grigory Sizov
7ad87f63e2 Support src_mask and src_key_padding_mask for Better Transformer (#88488)
Fixes T135842750 (follow-up for #87377)

## Description

At present, having both `src_key_padding_mask` and `src_mask` at the same time is not supported on the fastpath in Transformer and Multi-Head Attention.

This PR enables using both masks on the fastpath on CPU and GPU: if both masks are passed, we merge them into a 4D mask in Python and change mask type to 2 before passing downstream.

Downstream processing in native code is not changed, as it already supports 4D mask. Indeed, it is done depending on the device:
- on CUDA, by `SoftMax.cu::masked_softmax_cuda`. When mask type is 2, it calls either `dispatch_softmax_forward` -> `softmax_warp_forward` or `at::softmax` (depending on the input size). In both cases 4D mask is supported.
- on CPU, by `SoftMax.cpp::masked_softmax_cpp`. It calls `hosted_softmax` which supports 4D mask.
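
A minimal usage sketch (dimensions illustrative) of passing both masks to multi-head attention:
```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True).eval()
x = torch.randn(2, 5, 16)                               # (batch, seq, embed)
attn_mask = torch.zeros(5, 5, dtype=torch.bool)         # (L, L) "src_mask"
key_padding_mask = torch.zeros(2, 5, dtype=torch.bool)  # (B, L) "src_key_padding_mask"
with torch.no_grad():
    out, _ = mha(x, x, x, attn_mask=attn_mask,
                 key_padding_mask=key_padding_mask, need_weights=False)
```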

## Tests
- Extended `test_mask_check_fastpath` to check that fast path is indeed taken in Transformer when two masks are passed
- Added `test_multihead_self_attn_two_masks_fast_path_mock` to check that fast path is taken in MHA when two masks are passed
- Added `test_multihead_self_attn_two_masks_fast_path` to check that fast and slow paths give the same result when two masks are passed in MHA
- `test_masked_softmax_mask_types` now covers mask type 2
- `test_transformerencoderlayer_fast_path` (CPU smoke test) is expanded to the case of both masks provided simultaneously
- `test_masked_softmax_devices_parity` checks that mask type 2 is accepted by CPU and CUDA paths

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88488
Approved by: https://github.com/mikekgfb
2022-11-10 08:12:56 +00:00
Samantha Andow
87238e6491 [nn] add remove_duplicate flag to named_parameters (#759) (#88090)
Summary:
X-link: https://github.com/pytorch/torchrec/pull/759

Since the remove_duplicate flag was added to named_buffers in D39493161 (c12f829cce), this adds the same flag to named_parameters

Test Plan:
python test/test_nn.py -k test_buffers_and_named_buffers

OSS Tests

Differential Revision: D40801899

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88090
Approved by: https://github.com/albanD
2022-11-09 00:09:20 +00:00
Nikita Karetnikov
bbaa0637df Add error inputs to gaussian_nll_loss OpInfo (#88486)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88486
Approved by: https://github.com/lezcano
2022-11-05 20:10:54 +00:00
Philip Meier
bc73affdad prepare removal of deprecated functionality in torch.testing (#87969)
_Redo of #86586 with all BC breaking changes granularly placed into separate commits._

---

Per title. Deprecation happened on Feb 25, 2022 in c6f1bbc0ac, which made it into the 1.12 release. Since it is now 245 days later and the next release will be 1.14, the removals later in the stack comply with the [BC policy](https://github.com/pytorch/pytorch/wiki/PyTorch's-Python-Frontend-Backward-and-Forward-Compatibility-Policy#minimizing-the-disruption-of-bc-breaking-changes).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87969
Approved by: https://github.com/mruberry
2022-11-02 14:04:48 +00:00
Grigory Sizov
4c78c7c82a Enable src_mask in fast path of TransformerEncoderLayer (#87377)
## Issues
Fixes https://github.com/pytorch/pytorch/issues/81129#issuecomment-1179435674

## Description

Passing a 2D attention mask `src_mask` into the fast path of `TransformerEncoderLayer` on CPU was causing an error, so it was disabled in https://github.com/pytorch/pytorch/pull/81277. This PR rolls back that restriction, enabling `src_mask` on the fast path:

- Either attention mask `src_mask` of shape `(L, L)` or padding mask `src_key_padding_mask` of shape `(B, L)` are now allowed on the CPU fast path. If softmax is applied along the last dimension (as in multi-head attention), these masks are processed without expanding them to 4D. Instead, when iterating through the input, `Softmax.cpp::host_softmax` converts the index to match the mask dimensions, depending on the type.
- If softmax is applied along a dimension other than the last, `Softmax.cpp::masked_softmax_cpu` expands masks to 4D, converting them to `mask_type=2`. Theoretically one could also add special optimized cases for `dim=0, 1, 2` and process them without mask expansion, but I don't know how often that is used.
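
A small sketch (dimensions illustrative) of an attention-only mask going through the CPU fast path:
```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=16, nhead=4, batch_first=True).eval()
src = torch.randn(2, 5, 16)                     # (B, L, E)
src_mask = torch.zeros(5, 5, dtype=torch.bool)  # (L, L) attention mask
with torch.no_grad():
    out = layer(src, src_mask=src_mask)         # now allowed on the CPU fast path
```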

## Tests:
- `test_transformerencoderlayer_fast_path` is extended to cover both attention mask and padding mask
- `test_masked_softmax_mask_types_0_1` is added to ensure results from CPU softmax with attention and padding masks match the explicit slow calculation
- `test_masked_softmax_devices_parity` is added to ensure results from masked softmax on CPU and CUDA match

## Note
I had to replace `float` with `torch.get_default_dtype()` in a couple of tests for the following reason:
- `test_nn.py` [sets the default type to `torch.double`](https://github.com/pytorch/pytorch/blob/master/test/test_nn.py#L24-L26)
- If I execute `test_nn.py` and `test_transformers.py` in one `pytest` run, this default still holds for transformer tests
- Some tests in `test_transformers.py` which were previously following the slow path now switched to fast path, and hard-coded `float` started clashing with default `double`

Let me know if there is a better way around it - or maybe I'm not supposed to run tests with `pytest` like this

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87377
Approved by: https://github.com/mikekgfb, https://github.com/weiwangmeta, https://github.com/malfet
2022-10-31 19:59:36 +00:00
Kshiteej K
6735bf21c7 [test_nn] split convolution tests from test_nn (#87474)
Ref #63085

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87474
Approved by: https://github.com/albanD
2022-10-31 04:42:45 +00:00
Eddie Yan
c5cb6ec066 Allow 64bit indexing for channels-last upsample2d on CUDA (#87901)
#81665

CC @ngimel @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87901
Approved by: https://github.com/ngimel
2022-10-28 19:33:42 +00:00
eqy
4c8e1a9829 Fix 64bit indexing in vol2col (#87527)
Surfaced from #87354

CC @ngimel @ptrblck @maybeLee
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87527
Approved by: https://github.com/ngimel
2022-10-23 21:17:12 +00:00
Antonio Kim
6b59d9b566 Fix registration hooks (#87369)
There is a bug in the implementation of the registration hooks introduced in https://github.com/pytorch/pytorch/pull/86148 whereby if the hook returns a tensor, then the short circuiting logic:
```
value = hook(self, name, value) or value
```
Raises an exception
```
RuntimeError: Boolean value of Tensor with more than one value is ambiguous
```
Fix the logic so that it only checks whether the returned value is `None` before overriding.
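
A sketch of the corrected short-circuit as a standalone helper (helper name hypothetical):
```python
def _maybe_override(hook, module, name, value):
    out = hook(module, name, value)
    # explicit None check instead of `or`, so a returned Tensor never hits bool()
    return value if out is None else out
```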

Fixes #85837

CC: @albanD @jbschlosser
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87369
Approved by: https://github.com/albanD
2022-10-21 05:12:25 +00:00
Rui Zhu
4b757f4633 Assert if padding mask type is unexpected (#86353) (#87106)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86353

Fix the issue described in
https://github.com/pytorch/pytorch/issues/86120

Test Plan: buck test mode/opt caffe2/test:test_transformers -- test_train_with_long_type_pad

Differential Revision: D40129968

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87106
Approved by: https://github.com/malfet
2022-10-20 16:01:54 +00:00
Kshiteej K
54ee95c8ec [nn] module: full_backward_pre_hook (#86700)
Fixes https://github.com/pytorch/pytorch/issues/42824

* [x] Test
* [x] Doc
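
A minimal usage sketch of the new hook (hook body illustrative):
```python
import torch
import torch.nn as nn

lin = nn.Linear(4, 4)

def backward_pre_hook(module, grad_output):
    # runs before the module's backward; may return modified grad_output or None
    return grad_output

handle = lin.register_full_backward_pre_hook(backward_pre_hook)
lin(torch.randn(2, 4)).sum().backward()
handle.remove()
```
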
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86700
Approved by: https://github.com/soulitzer
2022-10-13 17:36:39 +00:00
CaoE
b79bac0e4d Make the data types of output and input consistent for batchnorm (#84410)
The TTS model crashes due to the following issue: when the input of BN is not contiguous and its data type differs from that of the parameters, BN raises `RuntimeError: !needs_dynamic_casting<func_t>::check(iter) INTERNAL ASSERT FAILED at "xxx/pytorch/aten/src/ATen/native/cpu/Loops.h":311, please report a bug to PyTorch`.

Make the data types of output and input consistent for batchnorm to fix the issue.
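
A sketch of the failing pattern (shapes assumed): a non-contiguous bfloat16 input fed to a float32 BatchNorm:
```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(3)                                      # float32 parameters/stats
x = torch.randn(2, 8, 8, 3).permute(0, 3, 1, 2).bfloat16()  # non-contiguous input
y = bn(x)   # previously hit the internal assert; output dtype now follows the input
```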

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84410
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet
2022-10-13 00:42:46 +00:00
Antonio Kim
09a676f639 Add hooks for register_buffer/module/parameter (#86148)
As described in the issue, this PR adds hooks to be run when `register_parameter`, `register_buffer` and `register_module` are called.
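
A minimal sketch of a parameter-registration hook; the module-level registration function name is assumed here and the hook body is illustrative:
```python
import torch.nn as nn
from torch.nn.modules.module import register_module_parameter_registration_hook

def log_param(module, name, param):
    print(f"registering parameter {name!r} on {type(module).__name__}")
    return None  # returning None keeps the original value

handle = register_module_parameter_registration_hook(log_param)
_ = nn.Linear(2, 2)  # triggers register_parameter for weight and bias
handle.remove()
```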

Fixes #85837

cc @albanD @mruberry @jbschlosser @walterddr @kshitij12345 @saketh-are
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86148
Approved by: https://github.com/albanD
2022-10-12 20:57:22 +00:00
Nikita Karetnikov
d56017a14f [primTorch] Add ref for triplet_margin_loss, improve triplet_margin_with_distance_loss (#85614)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85614
Approved by: https://github.com/lezcano, https://github.com/mruberry
2022-10-12 18:37:58 +00:00
Nikita Shulga
9eb4f9dd17 Tweak test tolerances to be compatible with A10G (#86538)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86538
Approved by: https://github.com/ngimel
2022-10-11 23:31:48 +00:00
Jerry Zhang
c12f829cce [nn] Add remove_duplicate flag to named_buffers (#674) (#85903)
Summary:
X-link: https://github.com/pytorch/torchrec/pull/674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84984

This allows named_buffers to return the same buffer object multiple times under different names, which is needed by internal use cases.
ghstack-source-id: 168589597
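
A small sketch of the new flag (module definition illustrative):
```python
import torch
import torch.nn as nn

class M(nn.Module):
    def __init__(self):
        super().__init__()
        t = torch.zeros(1)
        self.register_buffer("a", t)
        self.register_buffer("b", t)  # same tensor object under a second name

m = M()
print(len(list(m.named_buffers())))                        # 1: duplicates removed
print(len(list(m.named_buffers(remove_duplicate=False))))  # 2: both names returned
```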

Test Plan:
python test/test_nn.py -k test_buffers_and_named_buffers

Imported from OSS

Reviewed By: albanD

Differential Revision: D39493161

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85903
Approved by: https://github.com/albanD
2022-10-11 18:49:09 +00:00
Kshiteej K
e18d466f35 [test_nn] split lazy_modules from test_nn (#86526)
Ref: #63085

NOTE: We don't need an accompanying XLA PR as these tests run only on CPU.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86526
Approved by: https://github.com/albanD
2022-10-10 16:29:56 +00:00
Pearu Peterson
6b295cd046 Enable autograd on Linear with sparse COO weight (#86302)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86302
Approved by: https://github.com/cpuhrsch
2022-10-06 18:39:31 +00:00
Pearu Peterson
f104490d63 Support autograd on Linear with sparse compressed weight. (#86137)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86137
Approved by: https://github.com/cpuhrsch
2022-10-06 18:39:25 +00:00
Kshiteej K
6a5550fca4 [test_nn] split embedding tests from test_nn (#85892)
Ref https://github.com/pytorch/pytorch/issues/63085
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85892
Approved by: https://github.com/albanD
2022-09-30 21:45:40 +00:00
lezcano
787028cadb Implement col2im decomposition and fix im2col and add a few preconditions (#85541)
As per title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85541
Approved by: https://github.com/jansel
2022-09-30 09:31:53 +00:00
George Qi
85258ec17e Add mask_type=2 to masked_softmax for when mask.size() == input.size() (#85915)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85915
Approved by: https://github.com/cpuhrsch
2022-09-29 23:13:37 +00:00
Masaki Kozuki
ef0baba23f Use int64_t for nll_loss with cuda inputs (#85395)
Related #85005
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85395
Approved by: https://github.com/t-vi, https://github.com/lezcano
2022-09-29 17:02:04 +00:00
Mikayla Gawarecki
afaee00fec Add python nested_tensor and as_nested_tensor constructors in torch.nested (#85593)
Remove `torch.nested_tensor`, which has erroneous behavior wrt gradients (it could be either a leaf or not a leaf). Introduce `torch.nested.nested_tensor` and `torch.nested.as_nested_tensor` in the vein of `torch.tensor` and `torch.as_tensor`. Done in the nested `__init__.py` for now but can move to pybind in the future (when we want to load from numpy/nested lists).

Discussed offline with @cpuhrsch and pybind constructor (https://github.com/pytorch/pytorch/pull/85536) was more gnarly than expected, so we can move to that when we do need loading from numpy etc.
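
A minimal usage sketch of the two new constructors:
```python
import torch

a, b = torch.randn(2, 3), torch.randn(4, 3)
nt = torch.nested.nested_tensor([a, b])      # copies, always a leaf (like torch.tensor)
nt2 = torch.nested.as_nested_tensor([a, b])  # analogous to torch.as_tensor
print(nt.is_nested, nt2.is_nested)           # True True
```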

Differential Revision: [D39806622](https://our.internmc.facebook.com/intern/diff/D39806622)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85593
Approved by: https://github.com/drisspg, https://github.com/cpuhrsch
2022-09-28 20:15:02 +00:00
Weiyi Zheng
b2311192e6 [NN module] speed up _load_from_state_dict (#85743)
Fixes #61398

The original implementation is very slow when state_dict has many keys. This PR only passes the relevant keys to each child module.

Existing tests pass: `pytest test/test_nn.py -k state_dict`
I couldn't figure out a good way to write a new test for this behavior. I had a snippet (below), but it would be flaky if integrated into the main CI because it is a timing-based check.
I can verify, however, that the snippet took 30 s to run before this PR and only 0.5 s after it.

```python
    def test_load_state_dict_large(self):
        # build 4 levels of nesting with 10 copies at each level, i.e. 10**4
        # Linear(1, 1) modules -> 20k entries (weight + bias each) in the state dict
        import copy
        import time
        base_module = nn.Linear(1, 1)
        model = base_module
        for level in range(4):
            model = nn.Sequential(*[copy.deepcopy(model) for _ in range(10)])
        state_dict = model.state_dict()
        self.assertEqual(len(state_dict), 20000)
        st = time.time()
        model.load_state_dict(state_dict, strict=True)
        strict_load_time = time.time() - st
        # ~0.5 s with this PR (was ~30 s); allow generous headroom for slow CI machines
        self.assertLess(strict_load_time, 10)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85743
Approved by: https://github.com/albanD
2022-09-28 15:26:03 +00:00
Eddie Yan
2bc82163eb Reduce memory usage requirement of test_warp_softmax_64bit_indexing in test_nn.py (re-open of #85037) (#85373)
CC @ngimel @xwang233 @ptrblck

Adds fix for `get_tolerances`, tested locally on a dgx Volta.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85373
Approved by: https://github.com/ngimel
2022-09-22 07:34:47 +00:00
Mikayla Gawarecki
77f1f98479 Re-introduce torch.Tensor.to_padded_tensor (#85293)
Differential Revision: [D39629004](https://our.internmc.facebook.com/intern/diff/D39629004)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85293
Approved by: https://github.com/cpuhrsch
2022-09-21 18:45:56 +00:00
PyTorch MergeBot
53fdd60635 Revert "Reduce memory usage requirement of test_warp_softmax_64bit_indexing in test_nn.py (#85037)"
This reverts commit 66a9cba221.

Reverted https://github.com/pytorch/pytorch/pull/85037 on behalf of https://github.com/clee2000 due to broke test_warp_softmax_64bit_indexing_cuda_float32 and test_warp_softmax_64bit_indexing_cuda_float16 on rocm https://github.com/pytorch/pytorch/actions/runs/3085764744/jobs/4989643817
2022-09-20 00:13:41 +00:00
eqy
66a9cba221 Reduce memory usage requirement of test_warp_softmax_64bit_indexing in test_nn.py (#85037)
For reference: #84944

CC @xwang233 @ngimel @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85037
Approved by: https://github.com/ngimel, https://github.com/pmeier
2022-09-19 21:31:08 +00:00
Elias Ellison
f37069aac7 Re-enable fixed dynamo tests (#84969)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84969
Approved by: https://github.com/bdhirsh, https://github.com/ezyang
2022-09-16 15:36:52 +00:00
Michael Melesse
b6d6a78c12 [ROCM] test_batchnorm_cudnn_nhwc (#84603)
This PR enables test_batchnorm_cudnn_nhwc. It is a follow-up to https://github.com/pytorch/pytorch/pull/82512
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84603
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet
2022-09-14 15:50:14 +00:00
Mikayla Gawarecki
e217b30b0f Add torch.nested namespace (#84102)
First step towards #83775
- only `to_padded_tensor` is moved to the nested namespace for now
- following the schema used for `special`, `fft`, `linalg` and other namespaces, nested functions are registered in native_functions.yaml as `nested_{function_name}` and are bound to the desired Python name in
`torch/nested/__init__.py`, and the desired C++ name in `torch/csrc/api/include/torch/nested.h`.

~~**Question**: should we keep the documentation for `Tensor.to_padded_tensor` or can this deleted since it is shared by `torch.nested.to_padded_tensor`?~~

[generated nested docs](https://docs-preview.pytorch.org/84102/nested.html?highlight=nested#module-torch.nested)

Differential Revision: [D39361148](https://our.internmc.facebook.com/intern/diff/D39361148)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84102
Approved by: https://github.com/drisspg
2022-09-12 16:31:05 +00:00
Kshiteej K
6d6e04d6cc [test_nn] move dropout tests to test/nn/test_dropout.py (#84165)
Ref https://github.com/pytorch/pytorch/issues/63085
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84165
Approved by: https://github.com/albanD
2022-09-03 07:21:48 +00:00
Elias Ellison
f701cb04fb Test Dynamo CI w Fake Tensors (#84282)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84282
Approved by: https://github.com/anijain2305
2022-09-01 00:15:05 +00:00
lezcano
b106a04d76 Fix the edge case when y = 0 in kl_div (#82714)
Brought up in https://github.com/pytorch/pytorch/pull/80334#issuecomment-1193600883

We also prepare its opinfo to fix https://github.com/pytorch/pytorch/issues/80488
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82714
Approved by: https://github.com/albanD
2022-08-30 18:18:25 +00:00
Edward Z. Yang
ad44670fa1 Back out "Revert D38984222: Don't introduce new overload for SymInt (#83628)" (#84173)
Also Back out "Revert D39075159: [acc_tensor] Use SymIntArrayRef for overloaded empty.memory_format's signature"

Original commit changeset: dab4a9dba4fa
Original commit changeset: dcaf16c037a9

Original Phabricator Diff: D38984222
Original Phabricator Diff: D39075159

Also update Metal registrations for C++ registration changes.

Also update NNPI registration to account for tightened schema checking

Differential Revision: [D39084762](https://our.internmc.facebook.com/intern/diff/D39084762/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39084762/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84173
Approved by: https://github.com/Krovatkin
2022-08-29 18:01:07 +00:00
soulitzer
7088a98fba conv2d: require bias to have the same dtype as input and weight on cpu (#83686)
Fixes https://github.com/pytorch/pytorch/issues/83505

BC-breaking message:
- Previously we only required input and weight to have the same dtype on cpu (when input is non-complex). After this change, the dtype of bias is now also expected to have the same dtype. This change was necessary to improve the error message for certain combinations of inputs. This behavior now also matches that of convolution on cuda.
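
A sketch of the stricter check (values illustrative):
```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)              # float32 input
w = torch.randn(4, 3, 3, 3)              # float32 weight
b = torch.randn(4, dtype=torch.float64)  # mismatched bias dtype
F.conv2d(x, w, b, padding=1)             # now raises a clear dtype-mismatch error on CPU
```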

<details>
<summary>
Old plan
</summary>
Previously convolution (at least for slow_conv2d) did not perform type promotion, i.e. the output of `conv(int, int, float)` is an int, and that leads to the autograd assert.

This PR adds type promotion handling at the `at::native::conv2d` (this is a composite) level. We also need to correct or remove many tests that assume that conv errors when input types are mixed

Pros:
- Doing type promotion at this level avoids the complex path from having any special handling for mixed dtypes, and can potentially speed up mixed dtype inputs to now dispatch to faster kernels which are only capable of handling floats.

Cons:
- Doing type promotion at this level has the risk of introducing extra overhead when we would've dispatched to a kernel capable of handle mixed type anyway. I don't know if any of these exist at all though - it is possible that inputs with any non-float arguments are dispatched to the slow path.

If this approach is OK, we can proceed with the other convolutions as well:
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83686
Approved by: https://github.com/ngimel
2022-08-29 16:41:17 +00:00
Natalia Gimelshein
0ac2986d33 Fixes softmax indexing for large tensors (#84182)
Fixes #84144

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84182
Approved by: https://github.com/janeyx99
2022-08-29 04:29:09 +00:00
PyTorch MergeBot
c7edcd6968 Revert "Don't introduce new overload for SymInt (#83628)"
This reverts commit 9790d90e4b.

Reverted https://github.com/pytorch/pytorch/pull/83628 on behalf of https://github.com/malfet due to Breaks internal builds, see D39076487
2022-08-27 01:23:17 +00:00
Animesh Jain
6a58603956 Update Dynamo pin (#83829)
As title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83829
Approved by: https://github.com/ezyang
2022-08-26 20:49:43 +00:00
Edward Z. Yang
9790d90e4b Don't introduce new overload for SymInt (#83628)
Previously, we introduced new SymInt overloads for every function we wanted.  This led to a lot of boilerplate, and also a lot of confusion about how the overloads needed to be implemented.

This PR takes a simpler but more risky approach: just take the original function and changes its ints to SymInts.

This is BC-breaking in the following ways:

* The C++ API for registering implementations for aten operators will change from int64_t to SymInt whenever you make this change. Code generated registrations in PyTorch do not change as codegen handles the translation automatically, but manual registrations will need to follow the change.  Typically, if you now accept a SymInt where you previously only took int64_t, you have to convert it back manually.  This will definitely break XLA, see companion PR https://github.com/pytorch/xla/pull/3914 Note that not all dispatch keys get the automatic translation; all the composite keys and Meta keys are modified to take SymInt directly (because they should handle them directly), and so there are adjustments for this.

This is not BC-breaking in the following ways:

* The user facing C++ API remains compatible.  Even if a function changes from int to SymInt, the default C++ binding still takes only ints.  (e.g., at::empty(IntArrayRef, ...).  To call with SymInts, you must call at::empty_symint instead. This involved adding two more signatures to CppSignatureGroup; in many cases I refactored code to iterate over all signatures in the group instead of hard-coding the two that previously existed.
* This is TorchScript compatible; internally we treat SymInts as ints so there is no change to what happens at runtime in TorchScript. In particular, it's OK to reference an empty schema by its old type (using int types), as long as you're not doing string equality (which you shouldn't be), these parse to the same underyling type.

Structure of the PR:

* The general strategy of this PR is that, even when you write `SymInt` inside `native_functions.yaml`, sometimes, we will treat it *as if* it were an `int`. This idea pervades the codegen changes, where we have a translation from SymInt to c10::SymInt or int64_t, and this is controlled by a symint kwarg which I added and then audited all call sites to decide which I wanted. Here are some of the major places where we pick one or the other:
  * The C++ FunctionSchema representation represents `SymInt` as `int`. There are a few places we do need to know that we actually have a SymInt and we consult `real_type()` to get the real type in this case. In particular:
    * When we do schema validation of C++ operator registration, we must compare against true schema (as the C++ API will provide `c10::SymInt`, and this will only be accepted if the schema is `SymInt`. This is handled with cloneWithRealTypes before we check for schema differences.
    * In `toIValue` argument parsing, we parse against the true schema value. For backwards compatibility reasons, I do still accept ints in many places where Layout/SymInt/etc were expected. (Well, accepting int where SymInt is expected is not BC, it's just the right logic!)
  * In particular, because SymInt never shows up as type() in FunctionSchema, this means that we no longer need a dedicated Tag::SymInt. This is good, because SymInts never show up in mobile anyway.
* Changes to functorch/aten are mostly about tracking changes to the C++ API registration convention. Additionally, since SymInt overloads no longer exist, registrations for SymInt implementations are deleted. In many cases, the old implementations did not properly support SymInts; I did not add any new functionality with this PR, but I did try to annotate with TODOs where this is work to do. Finally, because the signature of `native::` API changed from int to SymInt, I need to find alternative APIs for people who were directly calling these functions to call. Typically, I insert a new dispatch call when perf doesn't matter, or use `at::compositeexplicitautograd` namespace to handle other caes.
* The change to `make_boxed_from_unboxed_functor.h` is so that we accept a plain IntList IValue anywhere a SymIntList is expected; these are read-only arguments so covariant typing is OK.
* I change how unboxing logic works slightly. Previously, we interpret the C++ type for Layout/etc directly as IntType JIT type, which works well because the incoming IValue is tagged as an integer. Now, we interpret the C++ type for Layout as its true type, e.g., LayoutType (change to `jit_type.h`), but then we accept an int IValue for it anyway. This makes it symmetric with SymInt, where we interpret the C++ type as SymIntType, and then accept SymInt and int IValues for it.
* I renamed the `empty.names` overload to `empty_names` to make it less confusing (I kept mixing it up with the real empty overload)
* I deleted the `empty.SymInt` overload, which ended up killing a pile of functions. (This was originally a separate PR but the profiler expect test was giving me grief so I folded it in.)
* I deleted the LazyDynamicOpsTest tests. These were failing after these changes, and I couldn't figure out why they used to be passing: they make use of `narrow_copy` which didn't actually support SymInts; they were immediately converted to ints.
* I bashed LTC into working. The patches made here are not the end of the story. The big problem is that SymInt translates into Value, but what if you have a list of SymInt? This cannot be conveniently represented in the IR today, since variadic Values are not supported. To work around this, I translate SymInt[] into plain int[] (this is fine for tests because LTC dynamic shapes never actually worked); but this will need to be fixed for proper LTC SymInt support. The LTC codegen also looked somewhat questionable; I added comments based on my code reading.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83628
Approved by: https://github.com/albanD, https://github.com/bdhirsh
2022-08-26 01:35:40 +00:00
zaf
2f04ba2c7c [quant][ao_migration] torch.nn.qat → torch.ao.nn.qat (#78716)
Context: In order to avoid the cluttering of the `torch.nn` namespace
the quantized modules namespace is moved to `torch.ao.nn`.

The list of the `nn.quantized` files that are being migrated:

- [X] `torch.nn.quantized` → `torch.ao.nn.quantized`
    - [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional`
    - [X] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules`
    - [X] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic`
    - [X] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference`
- [X] `torch.nn.quantizable` → `torch.ao.nn.quantizable`
- [X] [Current PR] `torch.nn.qat` → `torch.ao.nn.qat`
    - [X] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules`
    - [X] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic`
- [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic`
    - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules`
    - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat`
    - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized`
        - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules`
        - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic`

Majority of the files are just moved to the new location.
However, specific files need to be double checked:

- None
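
A small sketch of the effect on imports, assuming the old location keeps re-exporting the new modules for backward compatibility (as elsewhere in this migration series):
```python
import torch.nn.qat as old_qat
import torch.ao.nn.qat as new_qat

# both paths resolve to the same classes after the migration
print(old_qat.Linear is new_qat.Linear)  # True
```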

Differential Revision: [D36861197](https://our.internmc.facebook.com/intern/diff/D36861197/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36861197/)!

Differential Revision: [D36861197](https://our.internmc.facebook.com/intern/diff/D36861197)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78716
Approved by: https://github.com/jerryzh168
2022-08-25 16:50:38 +00:00
XiaobingSuper
a013597b32 fix oneDNN channels_last path issue (#83653)
Fixes #82060 (N>1 will call into the oneDNN path) and #80837. Both issues are caused by PyTorch and ideep defining channels last differently; this PR closes the gap by having ideep use the format flag given by the framework side.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83653
Approved by: https://github.com/mingfeima, https://github.com/malfet
2022-08-25 03:58:11 +00:00
PyTorch MergeBot
a7edf71360 Revert "Don't introduce new overload for SymInt (#83628)"
This reverts commit 8fae7027b3.

Reverted https://github.com/pytorch/pytorch/pull/83628 on behalf of https://github.com/malfet due to breaking internal builds, see https://www.internalfb.com/diff/D38984222
2022-08-25 00:49:40 +00:00
kshitij12345
7a8152530d move pooling test from test_nn to test/nn/test_pooling (#83915)
Ref #63085

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83915
Approved by: https://github.com/albanD
2022-08-24 16:17:50 +00:00
Ishan-Rajgarhia
7fdc2f70c6 Task: T129772171 remove assertEqualIgnoreTypes from test/test_nn.py (#83870)
See https://github.com/pytorch/pytorch/issues/38095
Replaced assertEqualIgnoreType with assertEqual
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83870
Approved by: https://github.com/kit1980
2022-08-24 02:45:52 +00:00
Edward Z. Yang
8fae7027b3 Don't introduce new overload for SymInt (#83628)
Previously, we introduced new SymInt overloads for every function we wanted.  This led to a lot of boilerplate, and also a lot of confusion about how the overloads needed to be implemented.

This PR takes a simpler but more risky approach: just take the original function and changes its ints to SymInts.

This is BC-breaking in the following ways:

* The C++ API for registering implementations for aten operators will change from int64_t to SymInt whenever you make this change. Code generated registrations in PyTorch do not change as codegen handles the translation automatically, but manual registrations will need to follow the change.  Typically, if you now accept a SymInt where you previously only took int64_t, you have to convert it back manually.  This will definitely break XLA, see companion PR https://github.com/pytorch/xla/pull/3914 Note that not all dispatch keys get the automatic translation; all the composite keys and Meta keys are modified to take SymInt directly (because they should handle them directly), and so there are adjustments for this.

This is not BC-breaking in the following ways:

* The user facing C++ API remains compatible.  Even if a function changes from int to SymInt, the default C++ binding still takes only ints.  (e.g., at::empty(IntArrayRef, ...).  To call with SymInts, you must call at::empty_symint instead. This involved adding two more signatures to CppSignatureGroup; in many cases I refactored code to iterate over all signatures in the group instead of hard-coding the two that previously existed.
* This is TorchScript compatible; internally we treat SymInts as ints so there is no change to what happens at runtime in TorchScript. In particular, it's OK to reference an empty schema by its old type (using int types), as long as you're not doing string equality (which you shouldn't be), these parse to the same underyling type.

Structure of the PR:

* The general strategy of this PR is that, even when you write `SymInt` inside `native_functions.yaml`, sometimes, we will treat it *as if* it were an `int`. This idea pervades the codegen changes, where we have a translation from SymInt to c10::SymInt or int64_t, and this is controlled by a symint kwarg which I added and then audited all call sites to decide which I wanted. Here are some of the major places where we pick one or the other:
  * The C++ FunctionSchema representation represents `SymInt` as `int`. There are a few places we do need to know that we actually have a SymInt and we consult `real_type()` to get the real type in this case. In particular:
    * When we do schema validation of C++ operator registration, we must compare against true schema (as the C++ API will provide `c10::SymInt`, and this will only be accepted if the schema is `SymInt`. This is handled with cloneWithRealTypes before we check for schema differences.
    * In `toIValue` argument parsing, we parse against the true schema value. For backwards compatibility reasons, I do still accept ints in many places where Layout/SymInt/etc were expected. (Well, accepting int where SymInt is expected is not BC, it's just the right logic!)
  * In particular, because SymInt never shows up as type() in FunctionSchema, this means that we no longer need a dedicated Tag::SymInt. This is good, because SymInts never show up in mobile anyway.
* Changes to functorch/aten are mostly about tracking changes to the C++ API registration convention. Additionally, since SymInt overloads no longer exist, registrations for SymInt implementations are deleted. In many cases, the old implementations did not properly support SymInts; I did not add any new functionality with this PR, but I did try to annotate with TODOs where this is work to do. Finally, because the signature of `native::` API changed from int to SymInt, I need to find alternative APIs for people who were directly calling these functions to call. Typically, I insert a new dispatch call when perf doesn't matter, or use `at::compositeexplicitautograd` namespace to handle other caes.
* The change to `make_boxed_from_unboxed_functor.h` is so that we accept a plain IntList IValue anywhere a SymIntList is expected; these are read-only arguments so covariant typing is OK.
* I change how unboxing logic works slightly. Previously, we interpret the C++ type for Layout/etc directly as IntType JIT type, which works well because the incoming IValue is tagged as an integer. Now, we interpret the C++ type for Layout as its true type, e.g., LayoutType (change to `jit_type.h`), but then we accept an int IValue for it anyway. This makes it symmetric with SymInt, where we interpret the C++ type as SymIntType, and then accept SymInt and int IValues for it.
* I renamed the `empty.names` overload to `empty_names` to make it less confusing (I kept mixing it up with the real empty overload)
* I deleted the `empty.SymInt` overload, which ended up killing a pile of functions. (This was originally a separate PR but the profiler expect test was giving me grief so I folded it in.)
* I deleted the LazyDynamicOpsTest tests. These were failing after these changes, and I couldn't figure out why they used to be passing: they make use of `narrow_copy` which didn't actually support SymInts; they were immediately converted to ints.
* I bashed LTC into working. The patches made here are not the end of the story. The big problem is that SymInt translates into Value, but what if you have a list of SymInt? This cannot be conveniently represented in the IR today, since variadic Values are not supported. To work around this, I translate SymInt[] into plain int[] (this is fine for tests because LTC dynamic shapes never actually worked); but this will need to be fixed for proper LTC SymInt support. The LTC codegen also looked somewhat questionable; I added comments based on my code reading.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83628
Approved by: https://github.com/albanD, https://github.com/bdhirsh
2022-08-23 22:04:07 +00:00
Khushi Agrawal
9095030239 [fix] edge case in MaxPool1d and add ErrorInputs (#83553)
Fixes #83224

cc @kshitij12345 @albanD!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83553
Approved by: https://github.com/albanD
2022-08-23 19:23:39 +00:00
Kshiteej K
dd67d52b57 [nn] split rnn_utils test from test_nn.py (#83675)
Ref: https://github.com/pytorch/pytorch/issues/63085
Proposed folder structure
```
-> test
  -> nn
    -> test_conv.py
    -> test_pooling.py
    -> .....
```

This PR: Moves test related RNN utilities to a different file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83675
Approved by: https://github.com/albanD
2022-08-23 08:34:39 +00:00
XiaobingSuper
658f958bc4 fix upsample bf16 issue for channels last path by using high precision to compute index (#83847)
Given the following case:
```
import torch
a = torch.ones(1, 3, 320, 480).bfloat16().to(memory_format=torch.channels_last)
out_bf16 = torch.nn.functional.interpolate(a, size = (640, 960), scale_factor = None, mode = 'bilinear', align_corners = False, recompute_scale_factor= None, antialias = False)
out_fp32= torch.nn.functional.interpolate(a.float(), size = (640, 960), scale_factor = None, mode = 'bilinear', align_corners = False, recompute_scale_factor= None, antialias = False)
print(out_bf16[0, 2, :, :])
print(out_fp32[0, 2, :, :])
```
the boundary of bfloat16 output gets a wrong value:
```
tensor([[1.0000e+00, 1.0000e+00, 1.0000e+00,  ..., 1.0000e+00, 1.0000e+00,
         1.0000e+00],
        [1.0000e+00, 1.0000e+00, 1.0000e+00,  ..., 1.0000e+00, 1.0000e+00,
         1.0000e+00],
        [1.0000e+00, 1.0000e+00, 1.0000e+00,  ..., 1.0000e+00, 1.0000e+00,
         1.0000e+00],
        ...,
        [1.0000e+00, 1.0000e+00, 1.0000e+00,  ..., 1.0000e+00, 1.0000e+00,
         1.0000e+00],
        [1.0000e+00, 1.0000e+00, 1.0000e+00,  ..., 1.0000e+00, 1.0000e+00,
         1.0000e+00],
        [0.0000e+00, 0.0000e+00, 1.8367e-40,  ..., 0.0000e+00, 0.0000e+00,
         0.0000e+00]], dtype=torch.bfloat16)
tensor([[1., 1., 1.,  ..., 1., 1., 1.],
        [1., 1., 1.,  ..., 1., 1., 1.],
        [1., 1., 1.,  ..., 1., 1., 1.],
        ...,
        [1., 1., 1.,  ..., 1., 1., 1.],
        [1., 1., 1.,  ..., 1., 1., 1.],
        [1., 1., 1.,  ..., 1., 1., 1.]])

```

The expected behavior is that the bfloat16 output values should also be one. The main reason is that we use low precision to compute the index, see
fcb124406b/aten/src/ATen/native/UpSample.h (L448); we should use high precision for this computation, as the GPU path does:
fcb124406b/aten/src/ATen/native/cuda/UpSample.cuh (L123)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83847
Approved by: https://github.com/frank-wei
2022-08-23 00:53:37 +00:00
PyTorch MergeBot
4cbb1986fe Revert "[quant][ao_migration] torch.nn.qat → torch.ao.nn.qat (#78716)"
This reverts commit 7cd2fa1d38.

Reverted https://github.com/pytorch/pytorch/pull/78716 on behalf of https://github.com/janeyx99 due to sorry, reverting so https://github.com/pytorch/pytorch/pull/78713 could be cleanly reverted
2022-08-22 07:23:24 +00:00
zaf
7cd2fa1d38 [quant][ao_migration] torch.nn.qat → torch.ao.nn.qat (#78716)
Context: In order to avoid the cluttering of the `torch.nn` namespace
the quantized modules namespace is moved to `torch.ao.nn`.

The list of the `nn.quantized` files that are being migrated:

- [X] `torch.nn.quantized` → `torch.ao.nn.quantized`
    - [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional`
    - [X] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules`
    - [X] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic`
    - [X] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference`
- [X] `torch.nn.quantizable` → `torch.ao.nn.quantizable`
- [X] [Current PR] `torch.nn.qat` → `torch.ao.nn.qat`
    - [X] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules`
    - [X] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic`
- [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic`
    - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules`
    - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat`
    - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized`
        - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules`
        - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic`

Majority of the files are just moved to the new location.
However, specific files need to be double checked:

- None

Differential Revision: [D36861197](https://our.internmc.facebook.com/intern/diff/D36861197/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36861197/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78716
Approved by: https://github.com/jerryzh168
2022-08-22 05:33:23 +00:00
Rui Zhu
e0f2eba93d Move odd num_head in TransformerEncoder to slow_path (#83483)
Summary: an odd nhead is not supported by masked softmax, so we just move those cases to the old slow_path

Test Plan: CI

Differential Revision: D38720086

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83483
Approved by: https://github.com/erichan1
2022-08-20 10:02:08 +00:00
Jeff Daily
d52d2bd5a9 [ROCm] MIOpen fused convolution relu (#82002)
Adds MIOpen fused convolution relu for fp32 and contiguous memory format.  Adds fallbacks for conv + z + bias + relu, fp16, and channels last until MIOpen adds these features.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82002
Approved by: https://github.com/ngimel, https://github.com/malfet
2022-08-16 20:49:33 +00:00
Nicolas Macchioni
b236352036 Add mask identifier for multiplexed src_mask/src_key_padding_mask in BT (#81947)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81947

Transformer fastpath multiplexes two arguments, src_mask [seq_len x seq_len] and src_key_padding_mask [batch_size x seq_len], and later deduces the type based on mask shape.

In the event that batch_size == seq_len, any src_mask is wrongly interpreted as a src_key_padding_mask. This is fixed by requiring that a mask_type identifier be supplied whenever batch_size == seq_len.

Additionally, added support for src_mask in masked_softmax CPU path.

Test Plan: existing unit tests + new unit tests (batch_size == seq_len)

Differential Revision: D37932240

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81947
Approved by: https://github.com/zrphercule
2022-08-09 23:42:16 +00:00
Sergii Dymchenko
7390ae837c Resolve TODO for GroupNorm numerical issues (#82423)
Looks like the numerical issues are resolved now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82423
Approved by: https://github.com/ngimel
2022-08-03 19:42:26 +00:00
Jiayi Sun
15a284b09e optimize softmax backward and logsoftmax backward (#80114)
Currently, if we run softmax_backward/logsoftmax_backward along a dimension that is not the last one, the calculation falls back to a [scalar version](32593ef2dd/aten/src/ATen/native/SoftMax.cpp (L220-L287)). We find that the calculation can actually be vectorized along the inner_size dim.

Changes we made:

Use vectorized softmax_backward_kernel/log_softmax_backward_kernel instead of host_softmax_backward when not along the last dim.
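
A small sketch of the targeted case (shapes illustrative), where softmax is taken along a non-last dim:
```python
import torch
import torch.nn.functional as F

x = torch.randn(32, 128, 1024, requires_grad=True)
y = F.log_softmax(x, dim=1)  # dim != last, so backward previously used the scalar path
y.sum().backward()
```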

We collected the benchmark data of softmax_backward and logsoftmax_backward for BFloat16 and Float32 data type by using the operator_benchmark tool of PyTorch on the platform of Intel(R) Xeon(R) Platinum 8260L CPU @ 2.40GHz.
Number of cores: 24 cores(1 socket)
[softmax_benchmark_32593ef.log](https://github.com/pytorch/pytorch/files/8962956/softmax_benchmark_32593ef.log)
[softmax_benchmark_the_pr.log](https://github.com/pytorch/pytorch/files/8962958/softmax_benchmark_the_pr.log)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80114
Approved by: https://github.com/frank-wei
2022-08-03 00:36:28 +00:00
mingfeima
b019a41674 fix bug for thnn_conv2d when input's C is 1 and weight is channels last (#82392)
To fix https://github.com/pytorch/pytorch/issues/82060

When `input` is not explicitly converted to channels last while `conv` has been, the output should still be in channels last. The root cause is that when the input has an IC of 1, `compute_columns2d` from `aten/src/ATen/native/ConvolutionMM2d.cpp` considers it channels first.

We do have logic to make sure both input and weight have the same memory format even if they are given differently, like:
```
auto input = self.contiguous(memory_format);
auto weight = weight_.contiguous(memory_format);
```

But for an N1HW input, `.contiguous(MemoryFormat::ChannelsLast)` does not change its strides, and its `suggest_memory_format()` still returns `MemoryFormat::Contiguous`. That's how it goes wrong.
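
A sketch of the affected combination (module and shapes assumed): a contiguous input with a single input channel and a channels-last weight:
```python
import torch
import torch.nn as nn

conv = nn.Conv2d(1, 8, kernel_size=3, padding=1).to(memory_format=torch.channels_last)
x = torch.randn(2, 1, 8, 8)  # N1HW input; suggest_memory_format() says contiguous
out = conv(x)                # output should now match the channels-first reference
```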

Also updated the corresponding test cases; without this patch, the new test case fails on the forward path and hits a runtime error on the backward path.

Attaching the old failure log from the forward path:
```
FAIL: test_conv_thnn_nhwc_cpu_float32 (__main__.TestNNDeviceTypeCPU)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/mingfeim/anaconda3/envs/pytorch-test-cpu/lib/python3.7/site-packages/torch/testing/_internal/common_device_type.py", line 377, in instantiated_test
    result = test(self, **param_kwargs)
  File "/home/mingfeim/anaconda3/envs/pytorch-test-cpu/lib/python3.7/site-packages/torch/testing/_internal/common_device_type.py", line 974, in only_fn
    return fn(slf, *args, **kwargs)
  File "test/test_nn.py", line 19487, in test_conv_thnn_nhwc
    input_format=torch.contiguous_format, weight_format=torch.channels_last)
  File "test/test_nn.py", line 19469, in helper
    self.assertEqual(out, ref_out, exact_dtype=False)
  File "/home/mingfeim/anaconda3/envs/pytorch-test-cpu/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 2376, in assertEqual
    msg=(lambda generated_msg: f"{generated_msg} : {msg}") if isinstance(msg, str) and self.longMessage else msg,
  File "/home/mingfeim/anaconda3/envs/pytorch-test-cpu/lib/python3.7/site-packages/torch/testing/_comparison.py", line 1093, in assert_equal
    raise error_metas[0].to_error(msg)
AssertionError: Tensor-likes are not close!

Mismatched elements: 988 / 1024 (96.5%)
Greatest absolute difference: 42.0 at index (1, 2, 6, 6) (up to 1e-05 allowed)
Greatest relative difference: inf at index (0, 0, 2, 1) (up to 1.3e-06 allowed)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82392
Approved by: https://github.com/jbschlosser
2022-07-28 14:20:52 +00:00
Khushi Agrawal
050aec1805 [nn] add pop to sequential and ModuleList (#81601)
Follows #71329

cc @kshitij12345!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81601
Approved by: https://github.com/albanD
2022-07-25 19:32:32 +00:00
Ansh Radhakrishnan
110cd724fc [nn] Add support for +=, * and *= operations for nn.Sequential objects (#81279)
Fixes #71329
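
A minimal usage sketch of the new operators:
```python
import torch.nn as nn

a = nn.Sequential(nn.Linear(2, 2))
b = nn.Sequential(nn.ReLU())
a += b           # in-place append of b's modules
c = a * 2        # new Sequential with the modules repeated twice
a *= 2           # in-place repetition
print(len(a), len(c))  # 4 4
```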

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81279
Approved by: https://github.com/albanD
2022-07-25 15:48:47 +00:00
soulitzer
f595467e5c Reenable slow gradcheck and make it pass (#80514)
Context: For a while slow gradcheck CI was skipping nearly all tests and this hid the fact that it should've been failing and timing out (10+h runtime for TestGradients). The CI configuration has since been fixed to correct this, revealing the test failures. This PR reenables slow gradcheck CI and makes it pass again.

This PR:
- makes slow and failing tests run in fast gradcheck mode only
- reduces the input size, for slow gradcheck only, for unary/binary ufuncs (alternatively, the test could be skipped entirely)
- skips entire test files on the slow gradcheck runner if they don't use gradcheck (test_ops, test_meta, test_decomp, test_ops_jit)
- reduces the input size for some ops

Follow ups:
1. Investigate slow mode failures https://github.com/pytorch/pytorch/issues/80411
2. See if we can re-enable slow gradcheck tests for some of the slow tests by reducing the sizes of their inputs

The following are failing in slow mode, they are now running in fast mode only.
```
test_fn_fwgrad_bwgrad___rmod___cuda_float64
test_fn_fwgrad_bwgrad_linalg_householder_product_cuda_complex128
test_fn_fwgrad_bwgrad__masked_prod_cuda_complex128
test_fn_fwgrad_bwgrad__masked_prod_cuda_float64
test_fn_fwgrad_bwgrad_linalg_matrix_power_cuda_complex128
test_fn_fwgrad_bwgrad_cat_cuda_complex128
test_fn_fwgrad_bwgrad_linalg_lu_factor_ex_cuda_float64
test_fn_fwgrad_bwgrad_copysign_cuda_float64
test_fn_fwgrad_bwgrad_cholesky_inverse_cuda_complex128
test_fn_fwgrad_bwgrad_float_power_cuda_complex128
test_fn_fwgrad_bwgrad_fmod_cuda_float64
test_fn_fwgrad_bwgrad_float_power_cuda_float64
test_fn_fwgrad_bwgrad_linalg_lu_cuda_float64
test_fn_fwgrad_bwgrad_remainder_cuda_float64
test_fn_fwgrad_bwgrad_repeat_cuda_complex128
test_fn_fwgrad_bwgrad_prod_cuda_complex128
test_fn_fwgrad_bwgrad_slice_scatter_cuda_float64
test_fn_fwgrad_bwgrad_tile_cuda_complex128
test_fn_fwgrad_bwgrad_pow_cuda_float64
test_fn_fwgrad_bwgrad_pow_cuda_complex128
test_fn_fwgrad_bwgrad_fft_*
test_fn_fwgrad_bwgrad_zero__cuda_complex128
test_fn_gradgrad_linalg_lu_factor_cuda_float64
test_fn_grad_div_trunc_rounding_cuda_float64
test_fn_grad_div_floor_rounding_cuda_float64
```

Marks the OpInfos for the following ops that run slowly in slow gradcheck as `fast_gradcheck` only (the second column is the runtime in seconds):
```
0  918.722  test_fn_fwgrad_bwgrad_nn_functional_conv_transpose3d_cuda_float64
1  795.042  test_fn_fwgrad_bwgrad_nn_functional_unfold_cuda_complex128
2  583.63  test_fn_fwgrad_bwgrad_nn_functional_max_pool3d_cuda_float64
3  516.946  test_fn_fwgrad_bwgrad_svd_cuda_complex128
4  503.179  test_fn_fwgrad_bwgrad_linalg_svd_cuda_complex128
5  460.985  test_fn_fwgrad_bwgrad_linalg_lu_cuda_complex128
6  401.04  test_fn_fwgrad_bwgrad_linalg_lstsq_grad_oriented_cuda_complex128
7  353.671  test_fn_fwgrad_bwgrad_nn_functional_max_pool2d_cuda_float64
8  321.903  test_fn_fwgrad_bwgrad_nn_functional_gaussian_nll_loss_cuda_float64
9  307.951  test_fn_fwgrad_bwgrad_stft_cuda_complex128
10  266.104  test_fn_fwgrad_bwgrad_svd_lowrank_cuda_float64
11  221.032  test_fn_fwgrad_bwgrad_istft_cuda_complex128
12  183.741  test_fn_fwgrad_bwgrad_lu_unpack_cuda_complex128
13  132.019  test_fn_fwgrad_bwgrad_nn_functional_unfold_cuda_float64
14  125.343  test_fn_fwgrad_bwgrad_nn_functional_pad_constant_cuda_complex128
15  124.2  test_fn_fwgrad_bwgrad_kron_cuda_complex128
16  123.721  test_fn_fwgrad_bwgrad_pca_lowrank_cuda_float64
17  121.074  test_fn_fwgrad_bwgrad_nn_functional_max_unpool3d_cuda_float64
18  119.387  test_fn_fwgrad_bwgrad_rot90_cuda_complex128
19  112.889  test_fn_fwgrad_bwgrad__masked_normalize_cuda_complex128
20  107.541  test_fn_fwgrad_bwgrad_dist_cuda_complex128
21  106.727  test_fn_fwgrad_bwgrad_diff_cuda_complex128
22  104.588  test_fn_fwgrad_bwgrad__masked_cumprod_cuda_complex128
23  100.135  test_fn_fwgrad_bwgrad_nn_functional_feature_alpha_dropout_with_train_cuda_float64
24  88.359  test_fn_fwgrad_bwgrad_mH_cuda_complex128
25  86.214  test_fn_fwgrad_bwgrad_nn_functional_max_unpool2d_cuda_float64
26  83.037  test_fn_fwgrad_bwgrad_nn_functional_bilinear_cuda_float64
27  79.987  test_fn_fwgrad_bwgrad__masked_cumsum_cuda_complex128
28  77.822  test_fn_fwgrad_bwgrad_diag_embed_cuda_complex128
29  76.256  test_fn_fwgrad_bwgrad_mT_cuda_complex128
30  74.039  test_fn_fwgrad_bwgrad_linalg_lu_solve_cuda_complex128
```
```
0  334.142  test_fn_fwgrad_bwgrad_unfold_cuda_complex128
1  312.791  test_fn_fwgrad_bwgrad_linalg_lu_factor_cuda_complex128
2  121.963  test_fn_fwgrad_bwgrad_nn_functional_max_unpool3d_cuda_float64
3  108.085  test_fn_fwgrad_bwgrad_diff_cuda_complex128
4  89.418  test_fn_fwgrad_bwgrad_nn_functional_max_unpool2d_cuda_float64
5  72.231  test_fn_fwgrad_bwgrad___rdiv___cuda_complex128
6  69.433  test_fn_fwgrad_bwgrad___getitem___cuda_complex128
7  68.582  test_fn_fwgrad_bwgrad_ldexp_cuda_complex128
8  68.572  test_fn_fwgrad_bwgrad_linalg_pinv_cuda_complex128
9  67.585  test_fn_fwgrad_bwgrad_nn_functional_glu_cuda_float64
10  66.567  test_fn_fwgrad_bwgrad_lu_cuda_float64
```
```
0  630.13  test_fn_gradgrad_nn_functional_conv2d_cuda_complex128
1  81.086  test_fn_gradgrad_linalg_solve_triangular_cuda_complex128
2  71.332  test_fn_gradgrad_norm_cuda_complex128
3  64.308  test_fn_gradgrad__masked_std_cuda_complex128
4  59.519  test_fn_gradgrad_div_no_rounding_mode_cuda_complex128
5  58.836  test_fn_gradgrad_nn_functional_adaptive_avg_pool3
```

Reduces the sizes of the inputs for:
- diff
- diag_embed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80514
Approved by: https://github.com/albanD
2022-07-22 02:05:37 +00:00
Saketh Are
445ee5620e Simplify torch.nn.grad by calling into aten::convolution_backward (#81839)
`torch.nn.grad` has its own implementations of gradients for conv1d, conv2d, and conv3d. This PR simplifies them by calling into the unified `aten::convolution_backward` backend instead.

The existing implementation of conv2d_weight is incorrect for some inputs (see issue #51430). This PR fixes the issue.

This PR expands coverage in test_nn to include conv1d_weight, conv2d_weight, and conv3d_weight, which were previously untested. It also expands the cases for conv2d to cover issue #51430.

Fixes #51430
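
A small usage sketch of the helpers touched here (shapes are made up); they now route through `aten::convolution_backward`:
```python
import torch
import torch.nn.functional as F
from torch.nn import grad as nn_grad

x = torch.randn(2, 3, 8, 8, requires_grad=True)
w = torch.randn(4, 3, 3, 3, requires_grad=True)
out = F.conv2d(x, w, padding=1)
gout = torch.randn_like(out)

gw = nn_grad.conv2d_weight(x, w.shape, gout, padding=1)   # gradient w.r.t. the weight
gi = nn_grad.conv2d_input(x.shape, w, gout, padding=1)    # gradient w.r.t. the input
```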

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81839
Approved by: https://github.com/albanD
2022-07-21 19:34:27 +00:00
Khushi Agrawal
dced803339 [nn] add insert method to sequential class (#81402)
Follows #71329

cc @kshitij12345
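
A brief sketch of the new `insert` (modules are illustrative):
```python
import torch.nn as nn

seq = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 2))
seq.insert(1, nn.ReLU())   # list.insert-style: place a module at index 1
```
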
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81402
Approved by: https://github.com/albanD
2022-07-20 14:45:52 +00:00
Khushi Agrawal
2c0b11b43b [nn] implement extend method to sequential class (#81179)
Follows #71329

cc @kshitij12345 :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81179
Approved by: https://github.com/albanD
2022-07-20 05:33:41 +00:00
PyTorch MergeBot
f82b19f15b Revert "Disable use_mkldnn when input is not contiguous for oneDNN (#80864)"
This reverts commit 4655c3bace.

Reverted https://github.com/pytorch/pytorch/pull/80864 on behalf of https://github.com/janeyx99 due to Reverting due for a perf regression https://github.com/pytorch/benchmark/issues/1040
2022-07-19 18:58:52 +00:00
yanbing-j
4655c3bace Disable use_mkldnn when input is not contiguous for oneDNN (#80864)
Fixes [#80837](https://github.com/pytorch/pytorch/issues/80837).
This PR is to disable use_mkldnn when input is not contiguous for oneDNN requirement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80864
Approved by: https://github.com/malfet
2022-07-17 14:58:26 +00:00
Rui Zhu
b22166fd62 Add a small fastpath test for native mha (#81432)
Summary: We did not have a small fast-path passing test for MHA before; this diff adds one for better testing

Test Plan: buck build mode/dev-nosan -c fbcode.platform=platform009 -c fbcode.enable_gpu_sections=true caffe2/test:nn && buck-out/dev/gen/caffe2/test/nn\#binary.par -r test_multihead_attn_fast_path_small_test

Differential Revision: D37834319

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81432
Approved by: https://github.com/erichan1
2022-07-15 23:54:40 +00:00
Eric Han
23088fcfdf disable src mask for transformer and multiheadattention fastpath (#81277)
Disable fastpath if src_mask passed to TransformerEncoderLayer and MultiheadAttention.
- Refactored test_transformerencoder from test_nn.py to test_transformers.py. Added a src_mask test there.
- Added a specific src_mask test in test_transformers.py

Fixes https://github.com/pytorch/pytorch/issues/81129
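
An illustrative sketch of the call pattern affected here (sizes are made up); with `src_mask` set, the reference (non-fastpath) implementation is used:
```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=16, nhead=4, batch_first=True).eval()
src = torch.randn(2, 5, 16)
mask = torch.zeros(5, 5, dtype=torch.bool)   # (L, L) attention mask

with torch.no_grad():
    out = layer(src, src_mask=mask)   # fastpath is skipped because a mask was provided
```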

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81277
Approved by: https://github.com/zrphercule
2022-07-15 20:55:17 +00:00
n.zhuravlev
7af0200a46 Add deepcopy functionality to parametrized modules (#80811)
Fixes #69413

After applying parametrization to any `nn.Module` we lose the ability to create a deepcopy of it, e.g. it becomes impossible to wrap a module with an `AveragedModel`.
Specifically, the problem is that `deepcopy` tries to invoke `__getstate__` if the object hasn't implemented its own `__deepcopy__` magic method, but we don't allow serialization of parametrized modules: `__getstate__` raises an error.
My solution is just to create a default `__deepcopy__` method when it doesn't exist yet.
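
A minimal sketch of the user-facing effect (the parametrization below is just an example):
```python
import copy
import torch
import torch.nn as nn
from torch.nn.utils import parametrize

class Symmetric(nn.Module):
    def forward(self, X):
        # keep the weight symmetric
        return X.triu() + X.triu(1).transpose(-1, -2)

m = nn.Linear(4, 4)
parametrize.register_parametrization(m, "weight", Symmetric())
m_copy = copy.deepcopy(m)   # previously failed via __getstate__; now handled by the default __deepcopy__
```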

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80811
Approved by: https://github.com/pearu, https://github.com/albanD
2022-07-15 09:06:45 +00:00
Khushi Agrawal
3da8c909da [nn] add + operator for torch.nn.Sequential to concatenate (#81170)
Fixes #78512

#### TODO
- [x] add tests

cc @kshitij12345!
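
A tiny sketch of the new concatenation (layers are made up):
```python
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(4, 8), nn.ReLU())
head = nn.Sequential(nn.Linear(8, 2))
model = encoder + head   # new nn.Sequential containing all three modules
```
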
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81170
Approved by: https://github.com/albanD
2022-07-11 17:49:58 +00:00
eqy
3b78c5682b Don't implicitly convert to channels-first in MaxPool3D on CUDA (#80748)
MaxPool3D currently converts inputs implicitly to channels-first (via `.contiguous()`) which may yield unexpected regressions in workloads that expect a full channels-last path. This PR preserves the channels-last format in MaxPool3D while attempting to avoid seriously regressing performance.

Currently, the typical case (kernel size == 2 == stride) looks good, but larger kernel sizes (>4) or the unusual case of stride 1 can sometimes be slower than converting to channels-first before doing MaxPool3D.

Additionally, this PR adds a test for 64-bit indexing in the backward pass, as testing of these changes uncovered an illegal memory access (IMA) for large tensors when doing the backward pass with MaxPool3D.
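
An illustrative snippet of the behavior being preserved (sizes arbitrary, needs a CUDA device):
```python
import torch
import torch.nn.functional as F

x = torch.randn(16, 64, 16, 16, 16, device="cuda").to(memory_format=torch.channels_last_3d)
y = F.max_pool3d(x, kernel_size=2, stride=2)
# With this change the output is expected to stay channels-last instead of being made contiguous.
print(y.is_contiguous(memory_format=torch.channels_last_3d))
```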

Performance comparison on A6000:

```
[------------------------------------- max_pool3d ---------------------------------------------------------]
                                          |  channels_last=False  |   curr ch_last=True |  new ch_last=True
1 threads: ---------------------------------------------------------------------------- ---------------------
      [64, 256, 32, 32, 32] 4x4 stride 4  |        20093.5        |       34823.4       |       20640.0
      [64, 256, 32, 32, 32] 4x4 stride 2  |        28623.7        |       42625.6       |       27935.5
      [64, 256, 32, 32, 32] 4x4 stride 1  |        68177.5        |       79147.2       |       85604.8
      [64, 256, 32, 32, 32] 2x2 stride 4  |        17237.7        |       32071.3       |       16641.6
      [64, 256, 32, 32, 32] 2x2 stride 2  |        25252.5        |       39993.2       |       25054.8
      [64, 256, 32, 32, 32] 2x2 stride 1  |        43185.2        |       58164.6       |       48416.9
      [64, 256, 16, 16, 16] 4x4 stride 4  |         3017.7        |        3952.4       |        2593.8
      [64, 256, 16, 16, 16] 4x4 stride 2  |         4581.5        |        5384.3       |        3294.3
      [64, 256, 16, 16, 16] 4x4 stride 1  |        11334.1        |       11534.7       |        8651.1
      [64, 256, 16, 16, 16] 2x2 stride 4  |         2346.9        |        3304.6       |        2098.8
      [64, 256, 16, 16, 16] 2x2 stride 2  |         3550.8        |        4526.5       |        3143.6
      [64, 256, 16, 16, 16] 2x2 stride 1  |         6898.1        |        7816.0       |        5820.8
      [64, 256, 4, 4, 4] 4x4 stride 4     |          191.5        |         176.3       |          77.5
      [64, 256, 4, 4, 4] 4x4 stride 2     |          191.8        |         176.8       |          94.1
      [64, 256, 4, 4, 4] 4x4 stride 1     |          191.3        |         176.4       |          97.3
      [64, 256, 4, 4, 4] 2x2 stride 4     |           96.4        |         114.4       |          93.6
      [64, 256, 4, 4, 4] 2x2 stride 2     |          172.1        |         178.6       |          93.7
      [64, 256, 4, 4, 4] 2x2 stride 1     |          263.0        |         279.4       |          92.4
      [64, 64, 32, 32, 32] 4x4 stride 4   |         5033.2        |        7208.3       |        5167.5
      [64, 64, 32, 32, 32] 4x4 stride 2   |         7216.1        |        9218.7       |        6637.1
      [64, 64, 32, 32, 32] 4x4 stride 1   |        17192.1        |       18392.9       |       20489.0
      [64, 64, 32, 32, 32] 2x2 stride 4   |         4318.0        |        6511.2       |        4193.1
      [64, 64, 32, 32, 32] 2x2 stride 2   |         6324.4        |        8657.7       |        6263.6
      [64, 64, 32, 32, 32] 2x2 stride 1   |        10855.0        |       13040.2       |       12055.9
      [64, 64, 16, 16, 16] 4x4 stride 4   |          764.1        |         975.6       |         671.3
      [64, 64, 16, 16, 16] 4x4 stride 2   |         1163.1        |        1333.4       |         833.6
      [64, 64, 16, 16, 16] 4x4 stride 1   |         2890.0        |        2898.5       |        2209.8
      [64, 64, 16, 16, 16] 2x2 stride 4   |          593.5        |         811.2       |         536.3
      [64, 64, 16, 16, 16] 2x2 stride 2   |          895.9        |        1112.3       |         794.5
      [64, 64, 16, 16, 16] 2x2 stride 1   |         1742.5        |        1968.0       |        1475.2
      [64, 64, 4, 4, 4] 4x4 stride 4      |          101.1        |         112.2       |          93.4
      [64, 64, 4, 4, 4] 4x4 stride 2      |           96.7        |         114.6       |          92.5
      [64, 64, 4, 4, 4] 4x4 stride 1      |           98.9        |         111.9       |          96.5
      [64, 64, 4, 4, 4] 2x2 stride 4      |          100.1        |         107.1       |          94.2
      [64, 64, 4, 4, 4] 2x2 stride 2      |           96.6        |         108.0       |          94.5
      [64, 64, 4, 4, 4] 2x2 stride 1      |           96.7        |         107.9       |          95.2
      [64, 3, 32, 32, 32] 4x4 stride 4    |          250.1        |         326.6       |         278.0
      [64, 3, 32, 32, 32] 4x4 stride 2    |          350.4        |         414.0       |         323.2
      [64, 3, 32, 32, 32] 4x4 stride 1    |          825.6        |         846.9       |         982.5
      [64, 3, 32, 32, 32] 2x2 stride 4    |          213.3        |         289.8       |         219.9
      [64, 3, 32, 32, 32] 2x2 stride 2    |          308.2        |         384.9       |         305.9
      [64, 3, 32, 32, 32] 2x2 stride 1    |          523.5        |         594.7       |         589.9
      [64, 3, 16, 16, 16] 4x4 stride 4    |          103.8        |         116.7       |          93.0
      [64, 3, 16, 16, 16] 4x4 stride 2    |          100.9        |         108.3       |          93.3
      [64, 3, 16, 16, 16] 4x4 stride 1    |          139.4        |         140.7       |         104.8
      [64, 3, 16, 16, 16] 2x2 stride 4    |           97.5        |         114.7       |          92.7
      [64, 3, 16, 16, 16] 2x2 stride 2    |           97.4        |         108.8       |          91.7
      [64, 3, 16, 16, 16] 2x2 stride 1    |           99.9        |         108.0       |          94.1
      [64, 3, 4, 4, 4] 4x4 stride 4       |           97.2        |         110.2       |          94.7
      [64, 3, 4, 4, 4] 4x4 stride 2       |          105.7        |         107.4       |          92.8
      [64, 3, 4, 4, 4] 4x4 stride 1       |           98.0        |         110.0       |          93.7
      [64, 3, 4, 4, 4] 2x2 stride 4       |           98.3        |         116.7       |          93.0
      [64, 3, 4, 4, 4] 2x2 stride 2       |           98.6        |         107.5       |          92.8
      [64, 3, 4, 4, 4] 2x2 stride 1       |          100.6        |         110.3       |          94.0
      [16, 256, 32, 32, 32] 4x4 stride 4  |         5034.2        |        8838.0       |        5165.9
      [16, 256, 32, 32, 32] 4x4 stride 2  |         7236.3        |       10869.9       |        7038.2
      [16, 256, 32, 32, 32] 4x4 stride 1  |        17385.4        |       21401.6       |       21900.7
      [16, 256, 32, 32, 32] 2x2 stride 4  |         4318.7        |        8101.2       |        4172.9
      [16, 256, 32, 32, 32] 2x2 stride 2  |         6324.0        |       10147.5       |        6279.7
      [16, 256, 32, 32, 32] 2x2 stride 1  |        10899.7        |       14826.0       |       12256.3
      [16, 256, 16, 16, 16] 4x4 stride 4  |          765.4        |        1012.7       |         675.6
      [16, 256, 16, 16, 16] 4x4 stride 2  |         1162.8        |        1376.9       |         843.4
      [16, 256, 16, 16, 16] 4x4 stride 1  |         2928.9        |        2969.8       |        2222.5
      [16, 256, 16, 16, 16] 2x2 stride 4  |          593.5        |         845.8       |         534.2
      [16, 256, 16, 16, 16] 2x2 stride 2  |          896.9        |        1152.2       |         796.9
      [16, 256, 16, 16, 16] 2x2 stride 1  |         1750.2        |        2009.4       |        1481.8
      [16, 256, 4, 4, 4] 4x4 stride 4     |           96.6        |         107.1       |          92.7
      [16, 256, 4, 4, 4] 4x4 stride 2     |           97.9        |         114.9       |          93.8
      [16, 256, 4, 4, 4] 4x4 stride 1     |           98.2        |         115.6       |          94.0
      [16, 256, 4, 4, 4] 2x2 stride 4     |           97.0        |         106.7       |          93.8
      [16, 256, 4, 4, 4] 2x2 stride 2     |           96.8        |         108.1       |          93.3
      [16, 256, 4, 4, 4] 2x2 stride 1     |           95.8        |         120.9       |          95.7
      [16, 64, 32, 32, 32] 4x4 stride 4   |         1266.4        |        1815.4       |        1312.3
      [16, 64, 32, 32, 32] 4x4 stride 2   |         1818.5        |        2328.0       |        1678.9
      [16, 64, 32, 32, 32] 4x4 stride 1   |         4352.9        |        4649.3       |        5204.6
      [16, 64, 32, 32, 32] 2x2 stride 4   |         1090.0        |        1631.2       |        1060.8
      [16, 64, 32, 32, 32] 2x2 stride 2   |         1589.4        |        2141.1       |        1576.4
      [16, 64, 32, 32, 32] 2x2 stride 1   |         2733.5        |        3286.0       |        3041.6
      [16, 64, 16, 16, 16] 4x4 stride 4   |          201.7        |         259.6       |         175.0
      [16, 64, 16, 16, 16] 4x4 stride 2   |          301.0        |         350.1       |         226.3
      [16, 64, 16, 16, 16] 4x4 stride 1   |          740.1        |         748.7       |         570.6
      [16, 64, 16, 16, 16] 2x2 stride 4   |          156.0        |         214.8       |         140.8
      [16, 64, 16, 16, 16] 2x2 stride 2   |          232.3        |         292.3       |         208.7
      [16, 64, 16, 16, 16] 2x2 stride 1   |          449.1        |         504.0       |         382.1
      [16, 64, 4, 4, 4] 4x4 stride 4      |           97.5        |         111.4       |          94.5
      [16, 64, 4, 4, 4] 4x4 stride 2      |           98.8        |         111.9       |          94.4
      [16, 64, 4, 4, 4] 4x4 stride 1      |           98.2        |         112.0       |          95.2
      [16, 64, 4, 4, 4] 2x2 stride 4      |           99.7        |         111.0       |          94.0
      [16, 64, 4, 4, 4] 2x2 stride 2      |          100.3        |         110.0       |          93.2
      [16, 64, 4, 4, 4] 2x2 stride 1      |           97.5        |         107.6       |          93.5
      [16, 3, 32, 32, 32] 4x4 stride 4    |          100.5        |         117.1       |          95.7
      [16, 3, 32, 32, 32] 4x4 stride 2    |           97.5        |         121.3       |          92.5
      [16, 3, 32, 32, 32] 4x4 stride 1    |          216.0        |         227.4       |         258.4
      [16, 3, 32, 32, 32] 2x2 stride 4    |           97.1        |         109.0       |          91.9
      [16, 3, 32, 32, 32] 2x2 stride 2    |           95.8        |         108.5       |          92.9
      [16, 3, 32, 32, 32] 2x2 stride 1    |          139.4        |         161.2       |         157.8
      [16, 3, 16, 16, 16] 4x4 stride 4    |           96.4        |         113.6       |          91.9
      [16, 3, 16, 16, 16] 4x4 stride 2    |           97.4        |         108.1       |          93.5
      [16, 3, 16, 16, 16] 4x4 stride 1    |           99.0        |         107.5       |          92.1
      [16, 3, 16, 16, 16] 2x2 stride 4    |           96.9        |         118.1       |          93.4
      [16, 3, 16, 16, 16] 2x2 stride 2    |           97.3        |         106.7       |          95.8
      [16, 3, 16, 16, 16] 2x2 stride 1    |           98.8        |         109.2       |          93.8
      [16, 3, 4, 4, 4] 4x4 stride 4       |           97.8        |         108.0       |          94.2
      [16, 3, 4, 4, 4] 4x4 stride 2       |           92.7        |         108.0       |          93.9
      [16, 3, 4, 4, 4] 4x4 stride 1       |           97.8        |         107.6       |          93.5
      [16, 3, 4, 4, 4] 2x2 stride 4       |          100.3        |         107.7       |          94.3
      [16, 3, 4, 4, 4] 2x2 stride 2       |           97.2        |         107.5       |          96.1
      [16, 3, 4, 4, 4] 2x2 stride 1       |           98.1        |         111.1       |          93.8

Times are in microseconds (us).
```

Performance comparison on V100:
(these times have been updated after working around some noisy measurements in my setup)
```
[------------------------------------- max_pool3d ---------------------------------------------------------]
                                          |  channels_last=False  |  curr ch_last=True |  new ch_last=True
1 threads: -------------------------------------------------------------------------------------------------
      [64, 256, 32, 32, 32] 4x4 stride 4  |        15810.7        |       33807.7      |        16452.9
      [64, 256, 32, 32, 32] 4x4 stride 2  |        24422.7        |       42515.3      |        27700.3
      [64, 256, 32, 32, 32] 4x4 stride 1  |        71756.0        |       89916.5      |       106464.0
      [64, 256, 32, 32, 32] 2x2 stride 4  |        12102.9        |       30210.4      |        11319.8
      [64, 256, 32, 32, 32] 2x2 stride 2  |        19101.7        |       37210.8      |        20373.3
      [64, 256, 32, 32, 32] 2x2 stride 1  |        41418.0        |       59650.5      |        53009.2
      [64, 256, 16, 16, 16] 4x4 stride 4  |         2362.0        |        4210.3      |         2114.0
      [64, 256, 16, 16, 16] 4x4 stride 2  |         4102.4        |        5897.4      |         3179.7
      [64, 256, 16, 16, 16] 4x4 stride 1  |        11339.3        |       13116.6      |        10032.6
      [64, 256, 16, 16, 16] 2x2 stride 4  |         1709.7        |        3506.7      |         1423.6
      [64, 256, 16, 16, 16] 2x2 stride 2  |         2966.6        |        4760.8      |         2499.3
      [64, 256, 16, 16, 16] 2x2 stride 1  |         6998.4        |        8797.3      |         6152.0
      [64, 256, 4, 4, 4] 4x4 stride 4     |          173.0        |         176.3      |          127.9
      [64, 256, 4, 4, 4] 4x4 stride 2     |          149.1        |         176.3      |          125.5
      [64, 256, 4, 4, 4] 4x4 stride 1     |          150.0        |         177.2      |          125.6
      [64, 256, 4, 4, 4] 2x2 stride 4     |          158.0        |         192.7      |          127.9
      [64, 256, 4, 4, 4] 2x2 stride 2     |          169.7        |         199.2      |          125.3
      [64, 256, 4, 4, 4] 2x2 stride 1     |          289.6        |         318.2      |          116.5
      [64, 64, 32, 32, 32] 4x4 stride 4   |         3914.4        |        6993.3      |         4141.4
      [64, 64, 32, 32, 32] 4x4 stride 2   |         6107.4        |        9186.4      |         6378.5
      [64, 64, 32, 32, 32] 4x4 stride 1   |        17920.0        |       20993.5      |        23891.1
      [64, 64, 32, 32, 32] 2x2 stride 4   |         3029.7        |        6112.6      |         2895.6
      [64, 64, 32, 32, 32] 2x2 stride 2   |         4787.8        |        7870.6      |         4724.8
      [64, 64, 32, 32, 32] 2x2 stride 1   |        10366.4        |       13446.4      |        12603.8
      [64, 64, 16, 16, 16] 4x4 stride 4   |          605.8        |         962.9      |          499.7
      [64, 64, 16, 16, 16] 4x4 stride 2   |         1037.0        |        1394.8      |          791.6
      [64, 64, 16, 16, 16] 4x4 stride 1   |         2835.4        |        3191.8      |         2484.3
      [64, 64, 16, 16, 16] 2x2 stride 4   |          438.6        |         795.7      |          368.6
      [64, 64, 16, 16, 16] 2x2 stride 2   |          749.1        |        1108.0      |          612.0
      [64, 64, 16, 16, 16] 2x2 stride 1   |         1756.4        |        2112.2      |         1538.5
      [64, 64, 4, 4, 4] 4x4 stride 4      |          132.6        |         163.9      |          115.4
      [64, 64, 4, 4, 4] 4x4 stride 2      |          129.3        |         153.7      |          117.8
      [64, 64, 4, 4, 4] 4x4 stride 1      |          128.0        |         153.8      |          117.6
      [64, 64, 4, 4, 4] 2x2 stride 4      |          128.2        |         154.1      |          117.5
      [64, 64, 4, 4, 4] 2x2 stride 2      |          130.5        |         157.3      |          117.6
      [64, 64, 4, 4, 4] 2x2 stride 1      |          128.8        |         156.4      |          120.6
      [64, 3, 32, 32, 32] 4x4 stride 4    |          200.4        |         261.0      |          228.8
      [64, 3, 32, 32, 32] 4x4 stride 2    |          305.3        |         366.5      |          344.4
      [64, 3, 32, 32, 32] 4x4 stride 1    |          860.9        |         922.1      |         1136.0
      [64, 3, 32, 32, 32] 2x2 stride 4    |          157.0        |         216.9      |          158.1
      [64, 3, 32, 32, 32] 2x2 stride 2    |          240.5        |         300.9      |          247.7
      [64, 3, 32, 32, 32] 2x2 stride 1    |          503.5        |         565.1      |          609.8
      [64, 3, 16, 16, 16] 4x4 stride 4    |          136.0        |         159.0      |          120.3
      [64, 3, 16, 16, 16] 4x4 stride 2    |          131.2        |         156.9      |          120.0
      [64, 3, 16, 16, 16] 4x4 stride 1    |          146.6        |         158.5      |          123.8
      [64, 3, 16, 16, 16] 2x2 stride 4    |          133.8        |         158.4      |          117.1
      [64, 3, 16, 16, 16] 2x2 stride 2    |          132.1        |         160.8      |          117.9
      [64, 3, 16, 16, 16] 2x2 stride 1    |          133.7        |         174.4      |          118.0
      [64, 3, 4, 4, 4] 4x4 stride 4       |          156.8        |         166.2      |          119.4
      [64, 3, 4, 4, 4] 4x4 stride 2       |          126.8        |         150.4      |          118.2
      [64, 3, 4, 4, 4] 4x4 stride 1       |          125.2        |         151.7      |          117.8
      [64, 3, 4, 4, 4] 2x2 stride 4       |          127.3        |         152.7      |          116.2
      [64, 3, 4, 4, 4] 2x2 stride 2       |          128.6        |         153.3      |          114.6
      [64, 3, 4, 4, 4] 2x2 stride 1       |          128.6        |         153.5      |          114.7
      [16, 256, 32, 32, 32] 4x4 stride 4  |         3921.7        |        8445.7      |         4064.7
      [16, 256, 32, 32, 32] 4x4 stride 2  |         6111.7        |       10630.0      |         6944.4
      [16, 256, 32, 32, 32] 4x4 stride 1  |        17938.9        |       22896.8      |        26648.7
      [16, 256, 32, 32, 32] 2x2 stride 4  |         3029.6        |        7552.7      |         2840.9
      [16, 256, 32, 32, 32] 2x2 stride 2  |         4788.0        |        9322.1      |         5110.5
      [16, 256, 32, 32, 32] 2x2 stride 1  |        10363.7        |       14885.9      |        13213.6
      [16, 256, 16, 16, 16] 4x4 stride 4  |          606.0        |        1059.1      |          535.9
      [16, 256, 16, 16, 16] 4x4 stride 2  |         1037.5        |        1491.5      |          822.3
      [16, 256, 16, 16, 16] 4x4 stride 1  |         2835.4        |        3306.8      |         2522.8
      [16, 256, 16, 16, 16] 2x2 stride 4  |          438.6        |         892.3      |          369.0
      [16, 256, 16, 16, 16] 2x2 stride 2  |          749.2        |        1203.7      |          638.7
      [16, 256, 16, 16, 16] 2x2 stride 1  |         1756.1        |        2212.5      |         1547.0
      [16, 256, 4, 4, 4] 4x4 stride 4     |          159.6        |         187.6      |          117.6
      [16, 256, 4, 4, 4] 4x4 stride 2     |          161.1        |         185.5      |          117.3
      [16, 256, 4, 4, 4] 4x4 stride 1     |          160.0        |         148.1      |          117.8
      [16, 256, 4, 4, 4] 2x2 stride 4     |          123.9        |         148.3      |          117.6
      [16, 256, 4, 4, 4] 2x2 stride 2     |          126.0        |         151.7      |          117.4
      [16, 256, 4, 4, 4] 2x2 stride 1     |          127.1        |         152.3      |          117.9
      [16, 64, 32, 32, 32] 4x4 stride 4   |          983.5        |        1756.7      |         1067.8
      [16, 64, 32, 32, 32] 4x4 stride 2   |         1542.4        |        2315.2      |         1621.5
      [16, 64, 32, 32, 32] 4x4 stride 1   |         4498.7        |        5273.4      |         6006.7
      [16, 64, 32, 32, 32] 2x2 stride 4   |          767.2        |        1543.4      |          736.7
      [16, 64, 32, 32, 32] 2x2 stride 2   |         1207.8        |        1981.5      |         1197.0
      [16, 64, 32, 32, 32] 2x2 stride 1   |         2603.3        |        3367.5      |         3161.9
      [16, 64, 16, 16, 16] 4x4 stride 4   |          169.5        |         264.6      |          142.8
      [16, 64, 16, 16, 16] 4x4 stride 2   |          274.6        |         368.9      |          216.8
      [16, 64, 16, 16, 16] 4x4 stride 1   |          723.3        |         820.4      |          643.2
      [16, 64, 16, 16, 16] 2x2 stride 4   |          131.4        |         216.0      |          116.1
      [16, 64, 16, 16, 16] 2x2 stride 2   |          199.9        |         295.0      |          166.8
```
CC @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80748
Approved by: https://github.com/ngimel
2022-07-08 04:26:01 +00:00
Michael Gschwind
25449292a0 Run mask test with and without nested tensor (#81008)
Summary: Run mask test with and without nested tensor

Test Plan: sandcastle

Differential Revision: D37665532

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81008
Approved by: https://github.com/malfet
2022-07-07 23:54:37 +00:00
Animesh Jain
1d90d6ee60 Setup for running PyTorch tests with TorchDynamo and skips for known failing tests (#80106)
@ezyang I am going to keep adding more skips in this PR for now. And once we have the CI running, I will replace with the appropriate decorators.

cc @mlazos , we should add those tests in test_ops.py in this PR as well

cc @jansel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80106
Approved by: https://github.com/ezyang, https://github.com/jansel
2022-07-07 18:57:33 +00:00
albanD
c8d64ba5ec Allow register float16 weight_norm on cpu and speed up test (#80600)
Fixes https://github.com/pytorch/pytorch/issues/80599

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80600
Approved by: https://github.com/malfet
2022-06-30 13:50:39 +00:00
otaj
db52e4b7d9 Bugfix/weakref (#80139)
Fixes #78580

I'm back! :)

cc @albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80139
Approved by: https://github.com/albanD
2022-06-28 14:51:42 +00:00
Rohit Goswami
72e40d2bc7 BUG: Evade segfault by throwing a RuntimeError for nn.ChannelShuffle and empty input tensors (#77029)
Fixes #76616.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77029
Approved by: https://github.com/kshitij12345, https://github.com/jbschlosser
2022-06-23 21:14:02 +00:00
Michael Gschwind
bcc02769be Check for contiguous well-formed mask (#79927)
Summary: Check for contiguous well-formed mask

Test Plan: sandcastle, github CI

Reviewed By: frank-wei

Differential Revision: D37301243

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79927
Approved by: https://github.com/jbschlosser
2022-06-23 15:41:04 +00:00
Alex Hedges
cb2b7b1e57 Fix code that triggers BytesWarning (#79868)
Fixes #74812.

I have fixed the multiple instances in the repository that trigger
`BytesWarning`, and I have enabled the `-bb` option when tests are run
to prevent regressions.
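
For illustration, the kind of code that `-bb` now surfaces as a hard error (run with `python -bb`):
```python
# Under `python -bb`, implicit bytes/str comparisons raise BytesWarning as an error
# instead of silently evaluating to False.
print(b"abc" == "abc")
```
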
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79868
Approved by: https://github.com/janeyx99
2022-06-21 01:12:21 +00:00
Joel Benjamin Schlosser
5953fd9133 Revert behavior of Dropout2d on 3D inputs to 1D channel-wise dropout behavior & warn
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79549

Approved by: https://github.com/ngimel, https://github.com/albanD
2022-06-15 14:56:43 +00:00
Joel Benjamin Schlosser
2d73c8e6e0 Add Dropout1d module
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79545

Approved by: https://github.com/ngimel, https://github.com/albanD
2022-06-15 14:39:07 +00:00
Kurt Mohler
4cfd09d7bc Reland: Add index value checking to MaxUnpool2d and MaxUnpool3d (#78280)
Relanding #70545
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78280
Approved by: https://github.com/jbschlosser
2022-06-03 20:09:07 +00:00
samdow
b7cb4eae6b Fix embedding jvp support by making embedding_renorm ignore forward mode AD (#78560)
On functorch, we started seeing [embedding forward mode fail](https://github.com/pytorch/functorch/pull/816). From looking at it, we figured out that [embedding recently got forward mode support enabled](369d9f4137), and that forward mode with embedding and [max_norm doesn't work with gradcheck](https://github.com/pytorch/pytorch/blob/master/torch/testing/_internal/common_methods_invocations.py#L8877-L8881), so it isn't checked.

What was happening is that `embedding_renorm` was setting `torch.no_grad()`, which only turns off backward-mode AD, so functorch's jvp tests were still using forward-mode AD during the `embedding_renorm` call. This change makes it so that we don't use forward mode during the `embedding_renorm` call.
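
A small sketch of the distinction at play (not the embedding kernel itself): `torch.no_grad()` only disables backward-mode graph recording, while forward-mode tangents keep propagating:
```python
import torch
import torch.autograd.forward_ad as fwAD

x, tangent = torch.randn(3), torch.ones(3)
with fwAD.dual_level():
    dual = fwAD.make_dual(x, tangent)
    with torch.no_grad():                 # backward-mode AD is off here...
        y = dual * 2
    print(fwAD.unpack_dual(y).tangent)    # ...but the tangent still propagates (2 * tangent)
```
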
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78560
Approved by: https://github.com/soulitzer, https://github.com/albanD
2022-06-03 19:14:51 +00:00
Eddie Yan
14b0e9e75f [cuDNN] Don't enforce bitwise exact results in test_conv_transposed_large_cuda (#78147)
`test_conv_transposed_large` expects bitwise perfect results in fp16 on CUDA, but this behavior isn't guaranteed by cuDNN (e.g., in the case of FFT algos).

This PR just changes the tolerance on the test to account for these cases.

CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78147
Approved by: https://github.com/ngimel
2022-06-03 19:03:24 +00:00
Eddie Yan
b740a99b9e [cuDNN][TF32] Threshold adjustments for TF32 on >=sm80 (#78437)
CC @ptrblck @mcarilli

Change to transformer multilayer test can potentially be swapped in favor of an rtol change? (see also: #75612).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78437
Approved by: https://github.com/ngimel
2022-06-03 01:02:56 +00:00
PyTorch MergeBot
d578197747 Revert "Fix embedding jvp support by making embedding_renorm ignore forward mode AD (#78560)"
This reverts commit ce7c7bb2a9.

Reverted https://github.com/pytorch/pytorch/pull/78560 on behalf of https://github.com/malfet due to broke XLA (on CI and trunk), see ce7c7bb2a9
2022-06-02 17:40:34 +00:00
samdow
ce7c7bb2a9 Fix embedding jvp support by making embedding_renorm ignore forward mode AD (#78560)
On functorch, we started seeing [embedding forward mode fail](https://github.com/pytorch/functorch/pull/816). From looking at it, we figured out that [embedding recently got forward mode support enabled](369d9f4137), and that forward mode with embedding and [max_norm doesn't work with gradcheck](https://github.com/pytorch/pytorch/blob/master/torch/testing/_internal/common_methods_invocations.py#L8877-L8881), so it isn't checked.

What was happening is that `embedding_renorm` was setting `torch.no_grad()`, which only turns off backward-mode AD, so functorch's jvp tests were still using forward-mode AD during the `embedding_renorm` call. This change makes it so that we don't use forward mode during the `embedding_renorm` call.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78560
Approved by: https://github.com/soulitzer, https://github.com/albanD
2022-06-02 13:40:21 +00:00
Edward Z. Yang
c20969c40c Fix ParameterList printing meta tensor
Fixes https://github.com/pytorch/pytorch/issues/78250

There are actually two bugs.  First, the crash is caused
by TensorOptions::backend incorrectly reporting noexcept when
it can fail.  Second, ParameterList is using torch.tensortype
for no good reason; we can just print the dtype instead.
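
A minimal repro-style sketch of the user-visible fix:
```python
import torch
from torch import nn

params = nn.ParameterList([nn.Parameter(torch.empty(2, 3, device="meta"))])
print(params)   # previously crashed; now the repr reports the dtype instead of a backend-derived tensor type
```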

Signed-off-by: Edward Z. Yang <ezyangfb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/78529

Approved by: https://github.com/albanD
2022-06-01 00:46:52 +00:00
mikeiovine
d6db5ea50d Back out "add mixed data type mode for LayerNorm forward path"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78298

Also back out "improve LayerNorm bfloat16 performance on CPU".

These layer norm changes seem fine, but they are causing `LayerNorm` to not use AVX2 instructions, which is causing performance on internal models to degrade. More investigation is needed to find the true root cause, but we should unland to mitigate the issue ASAP.

I left `mixed_data_type.h` around since there are some other files depending on it.

Differential Revision: [D36675352](https://our.internmc.facebook.com/intern/diff/D36675352/)

Approved by: https://github.com/tenpercent
2022-05-26 02:54:13 +00:00
PyTorch MergeBot
c50089712c Revert "Add index value checking to MaxUnpool2d and MaxUnpool3d (#70545)"
This reverts commit 53ef66bb59.

Reverted https://github.com/pytorch/pytorch/pull/70545 on behalf of https://github.com/malfet due to as it broke cuda-10.2 test on trunk, see 53ef66bb59
2022-05-23 23:58:43 +00:00
Kurt Mohler
53ef66bb59 Add index value checking to MaxUnpool2d and MaxUnpool3d (#70545)
Fixes #68727

cc @mruberry @jbschlosser @walterddr @kshitij12345 @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70545
Approved by: https://github.com/ngimel
2022-05-23 21:08:25 +00:00
yuguo68
c186250d95 raise error when groups is not positive in Conv modules
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77919

Approved by: https://github.com/jbschlosser
2022-05-23 20:35:00 +00:00
Jeff Daily
9aed30d3ad [ROCm] support benchmark flag for MIOpen (#77438)
Fixes #68172.  Generally, this corrects the flaky behavior of multiple convolution unit tests seen on ROCm.

The MIOpen integration has been forcing benchmark=True even when `torch._C._set_cudnn_benchmark(False)` is called, typically via `torch.backends.cudnn.set_flags(enabled=True, benchmark=False)`.  We now add support for MIOpen immediate mode to avoid benchmarking during MIOpen solution selection.
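
For context, the user-facing knob involved (MIOpen reads the same flag as cuDNN):
```python
import torch

# Ask for deterministic algorithm selection instead of benchmarking;
# with this change MIOpen honors it by using immediate mode.
torch.backends.cudnn.benchmark = False
```
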
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77438
Approved by: https://github.com/ngimel, https://github.com/malfet
2022-05-23 17:10:24 +00:00
zrphercule
734a97a7c8 Revert "Revert "Switch to use nested tensor by-default in Transformer… (#77924)
…Encoder (#77217)""

This reverts commit 0d6fa91d1b.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77924
Approved by: https://github.com/atalman
2022-05-20 11:44:03 +00:00
George Qi
f9db8b72ac MHA forward pass bug fix
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77761

Approved by: https://github.com/jbschlosser
2022-05-19 01:21:24 +00:00
Joel Benjamin Schlosser
8881d7ac6c Support no-batch-dim for CrossEntropyLoss with prob target
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77653

Approved by: https://github.com/albanD
2022-05-18 19:51:09 +00:00
Nikita Vedeneev
a760dc2687 binary_cross_entropy: double backward wrt target (#77416)
As per title. An effort to make `binary_cross_entropy` all around differentiable.
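
A hedged sketch of the kind of computation this enables, differentiating through the target gradient (values are arbitrary):
```python
import torch
import torch.nn.functional as F

inp = torch.rand(5, requires_grad=True)
target = torch.rand(5, requires_grad=True)
loss = F.binary_cross_entropy(inp, target)

g_inp, g_tgt = torch.autograd.grad(loss, (inp, target), create_graph=True)
(gg,) = torch.autograd.grad(g_tgt.sum(), inp)   # second-order term through the target gradient
```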

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77416
Approved by: https://github.com/soulitzer
2022-05-18 10:29:27 +00:00
Rui Zhu
4e2f5507d0 Add support for TxT mask layout for masked_softmax in BetterTransformer (#77607)
Summary: Expand mask to BxHxDxD when mask is DxD layout

Test Plan: buck build mode/opt -c fbcode.platform=platform009 -c fbcode.enable_gpu_sections=true caffe2/test:nn && buck-out/opt/gen/caffe2/test/nn\#binary.par -r masked_softmax_DxD

Differential Revision: D36428170

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77607
Approved by: https://github.com/cpuhrsch
2022-05-18 01:31:05 +00:00
PyTorch MergeBot
d8b80edade Revert "Use weakref.proxy when saving module to internal dictionaries to not increase refcount (#76435)"
This reverts commit 1aa3cbb83b.

Reverted https://github.com/pytorch/pytorch/pull/76435 on behalf of https://github.com/jbschlosser
2022-05-17 17:51:26 +00:00
mingfeima
c003494754 add channels last support for PixelShuffle and PixelUnshuffle
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50573

Approved by: https://github.com/VitalyFedyunin
2022-05-17 17:33:49 +00:00
Edward Z. Yang
b5bc954a71 Fix optional dtype/layout/memory_format pycall; fix memory format
Double-header bug fix:

- As reported by jansel, dtypes are still showing up as integers
  when the schema is an optional dtype.  This is simple enough to
  fix and I added a test for it.  But while I was at it...

- I noticed that the THPMemoryFormat_new idiom with "unused" name
  doesn't actually work: the repr of the returned memory format
  object is wrong, and this shows up when we try to log the args/kwargs.
  So I fixed memory format to do it properly along with everything
  else.

Fixes https://github.com/pytorch/pytorch/issues/77135

Signed-off-by: Edward Z. Yang <ezyangfb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77543

Approved by: https://github.com/albanD, https://github.com/jansel
2022-05-16 16:46:08 +00:00
mingfeima
8c50414233 add BFloat16 support for BatchNorm on CPU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77496

Approved by: https://github.com/frank-wei
2022-05-16 16:31:18 +00:00
mingfeima
6fa20bdfe8 add native kernel for weight_norm on CPU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73845

Approved by: https://github.com/frank-wei
2022-05-16 06:36:24 +00:00
PyTorch MergeBot
93a969221d Revert "add BFloat16 support for BatchNorm on CPU"
This reverts commit 7c8911ca7a.

Reverted https://github.com/pytorch/pytorch/pull/74410 on behalf of https://github.com/albanD
2022-05-14 14:28:58 +00:00
mingfeima
7c8911ca7a add BFloat16 support for BatchNorm on CPU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74410

Approved by: https://github.com/frank-wei
2022-05-14 07:49:00 +00:00
Rohan Varma
a275491c6f [Reland] load_state_dict post hook (#77392)
Reland of https://github.com/pytorch/pytorch/pull/76823 with fixes to call `__setstate__` for softmax/softmin/logsoftmax as per discussion with @albanD and @jbschlosser. Original description:

Implements `register_load_state_dict_post_hook` API as discussed in https://github.com/pytorch/pytorch/issues/75287.

Unittests cover:
- Ensuring hooks are called with the correct module
- Hook is called with `IncompatibleKeys` field
- If hook modifies this, load_state_dict returns the modified result
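
A minimal sketch of the hook API (the hook body is illustrative):
```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)

def report_keys(module, incompatible_keys):
    # incompatible_keys exposes .missing_keys and .unexpected_keys;
    # mutating it changes what load_state_dict ultimately returns.
    print(incompatible_keys.missing_keys, incompatible_keys.unexpected_keys)

model.register_load_state_dict_post_hook(report_keys)
model.load_state_dict(model.state_dict())
```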

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77392
Approved by: https://github.com/jbschlosser
2022-05-14 06:06:23 +00:00
mingfeima
59b56ba785 improve group_norm channels last performance on CPU
add channels_last_3d memory format support

add BFloat16 support on CPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69067

Approved by: https://github.com/VitalyFedyunin
2022-05-14 03:13:02 +00:00
Kulin Seth
e011a8e18b Enable PyTorch operations on MPS Backend. (#77343)
Add PyTorch operations to MPS backend.

- https://github.com/pytorch/pytorch/issues/77394
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77343
Approved by: https://github.com/albanD
2022-05-13 18:28:53 +00:00
mingfeima
2b7943c47c fix torchvision failed case test_classification_model on slow_conv2d
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77347

Approved by: https://github.com/datumbox, https://github.com/frank-wei
2022-05-13 08:04:08 +00:00
PyTorch MergeBot
d92b0a51aa Revert "Load state dict post hook"
This reverts commit 56bed0dcfe.

Reverted https://github.com/pytorch/pytorch/pull/76823 on behalf of https://github.com/rohan-varma
2022-05-12 21:00:49 +00:00
ecao
37c6017831 Add BFloat16 support for GLU, and randperm operators on CPU (#61944)
add BFloat16 support for GLU and randperm operators on CPU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61944
Approved by: https://github.com/frank-wei
2022-05-12 17:41:57 +00:00
yanbing-j
4f82f439d1 Enable BFloat16 ELU, SELU and CELU in CPU path (#62546)
Enable BFloat16 ELU, SELU and CELU in CPU path. SELU and CELU will call ELU implementation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62546
Approved by: https://github.com/frank-wei
2022-05-12 16:56:57 +00:00
mingfeima
3b56efd4e1 add mixed data type mode for LayerNorm forward path
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73844

Approved by: https://github.com/frank-wei
2022-05-12 03:35:06 +00:00
otaj
1aa3cbb83b Use weakref.proxy when saving module to internal dictionaries to not increase refcount (#76435)
Fixes #76434

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76435
Approved by: https://github.com/jbschlosser
2022-05-11 18:40:59 +00:00
mingfeima
3d0e6f169c add channels last support for slow_conv_dilated2d
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70665

Approved by: https://github.com/VitalyFedyunin
2022-05-11 15:28:50 +00:00
Rui Zhu
533b44a280 Add _native nested_tensor_from_mask (#76942)
Summary: Allows users to convert to nested tensors more easily. Some implementation details might change based on user needs.

Test Plan: buck test mode/dev caffe2/test:nn -- test_nested_tensor_from_mask

Differential Revision: D36191182

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76942
Approved by: https://github.com/jbschlosser
2022-05-11 05:19:36 +00:00
mingfeima
3d561ee926 add channels last support for thnn_conv2d (non-dilated)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68101

Approved by: https://github.com/VitalyFedyunin
2022-05-11 00:09:45 +00:00
neverix
87e543da9b Add load_state_dict error message for non-dicts (#77197)
Fixes #76886
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77197
Approved by: https://github.com/jbschlosser
2022-05-10 22:11:51 +00:00
Aidyn-A
a127c584a0 Fix max pool forward nhwc (#76597)
Fixes issue #76432.

Added dilation to loops in CUDA kernel.

cc @ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76597
Approved by: https://github.com/ngimel
2022-05-10 17:39:48 +00:00
mingfeima
8d4e069e66 add BFloat16 support for UpSample on CPU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76935

Approved by: https://github.com/frank-wei
2022-05-10 16:56:41 +00:00
Scott Wolchok
e5915a2216 [PyTorch] Don't enter MHA fast path when bias & query dtypes don't match
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76879

The fast path does not support this: transform_bias_rescale_qkv will try to grab bias.data_ptr() assuming the dtypes are the same. (Also, I have no idea how this happens.)

Differential Revision: [D36156872](https://our.internmc.facebook.com/intern/diff/D36156872/)

Approved by: https://github.com/cpuhrsch
2022-05-09 18:21:04 +00:00
Rohan Varma
56bed0dcfe Load state dict post hook
Implements `register_load_state_dict_post_hook` API as discussed in https://github.com/pytorch/pytorch/issues/75287.

Unittests cover:
- Ensuring hooks are called with the correct module
- Hook is called with `IncompatibleKeys` field
- If hook modifies this, load_state_dict returns the modified result

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76823
Approved by: https://github.com/albanD
2022-05-05 19:27:05 +00:00
lkct
b8776e143f Fix false DeprecationWarning in Module.state_dict
Fixes #75404

TODO:
- [x] add tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75507
Approved by: https://github.com/jbschlosser
2022-05-04 20:08:23 +00:00
Nikita Shulga
b074bffa41 Revert D28836788: add BFloat16 support for UpSample on CPU
Test Plan: revert-hammer

Differential Revision:
D28836788 (1399d83bc0)

Original commit changeset: 63dc45e5bb91

Original Phabricator Diff: D28836788 (1399d83bc0)

fbshipit-source-id: 92733af87cba87aed800473ff44ca6d7af037da9
(cherry picked from commit 1c9fc492503b768a343723e4cf347b30bf5dcfc2)
2022-05-02 23:13:39 +00:00
mingfeima
1399d83bc0 add BFloat16 support for UpSample on CPU (#58297)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58297

Test Plan: Imported from OSS

Reviewed By: mikaylagawarecki

Differential Revision: D28836788

Pulled By: VitalyFedyunin

fbshipit-source-id: 63dc45e5bb91964d5ff1110262228718289435d1
(cherry picked from commit 8a37d607d6a89ccb50364cf54a6f26ca8d05cab9)
2022-05-02 22:33:26 +00:00
Scott Wolchok
e816e17655 [PyTorch] Add native fast path for transformer encoder inference (#76333)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76333

The current PyTorch multi-head attention and transformer
implementations are slow. This should speed them up for inference.
ghstack-source-id: 154737857

(Note: this ignores all push blocking failures!)

Test Plan: CI

Reviewed By: cpuhrsch

Differential Revision: D35239925

fbshipit-source-id: 5a7eb8ff79bc6afb4b7d45075ddb2a24a6e2df28
2022-04-26 12:58:03 -04:00
Jon Janzen
2387efd356 Revert "[PyTorch] Add native fast path for transformer encoder inference"
This reverts commit b369b89f23.

This has internal changes and should not have been landed via mergebot.

Ref: https://github.com/pytorch/pytorch/pull/75809#issuecomment-1108717166
2022-04-25 11:40:02 -04:00
Scott Wolchok
b369b89f23 [PyTorch] Add native fast path for transformer encoder inference
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75809

The current PyTorch multi-head attention and transformer
implementations are slow. This should speed them up for inference.

Differential Revision: [D35239925](https://our.internmc.facebook.com/intern/diff/D35239925/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D35239925/)!

Approved by: https://github.com/ezyang
2022-04-25 06:11:36 +00:00
Peter Bell
cb37e7a080 Remove F.pad python implementation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73433

Approved by: https://github.com/albanD, https://github.com/jbschlosser
2022-04-23 00:13:20 +00:00
Joel Benjamin Schlosser
041e6e750a Fix to support no-batch-dim inputs in ConvTransposeNd._output_padding
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76151

Approved by: https://github.com/albanD
2022-04-22 19:25:09 +00:00
Nikita Vedeneev
9e137ee583 more numerically stable cosine_similarity
**Previous behavior**: compute inner product, then normalize.
**This patch**: first normalize, then compute inner product. This should be more numerically stable because it avoids losing precision in inner product for inputs with large norms.
By design this ensures that cosine similarity stays within `[-1.0, +1.0]`, so it should fix [#29442](https://github.com/pytorch/pytorch/issues/29442).

P.S. I had to change tests because this implementation handles division by 0 differently.
This PR computes cosine similarity as follows: <x/max(eps, ||x||), y/max(eps, ||y||)>.
Let f(x,y) = <x,y>/(||x|| * ||y||), then
df/dx = y/(||x|| * ||y||) - (||y||/||x|| * <x,y> * x)/(||x|| * ||y||)^2.
The changed test checks division by zero in backward when x=0 and y != 0.
For this case the non-zero part of the gradient is just y / (||x|| * ||y||).
The previous test evaluates y/(||x|| * ||y||) to y / eps, and this PR to 1/eps * y/||y||.
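
An illustrative check of the user-visible guarantee (values arbitrary, large norms on purpose):
```python
import torch
import torch.nn.functional as F

x = torch.randn(4, 8) * 1e18
y = torch.randn(4, 8) * 1e18
sim = F.cosine_similarity(x, y, dim=1)
assert sim.abs().le(1.0).all()   # normalize-first keeps results inside [-1, 1] by construction
```
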
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31378
Approved by: https://github.com/ezyang, https://github.com/albanD
2022-04-22 09:28:50 +00:00
arindamroy-eng
7478ce187a ROCM:Unskip more tests for ROCM5.0
Re-enabling more tests that now work on ROCm 5.0.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75353
Approved by: https://github.com/ezyang
2022-04-19 19:45:55 +00:00
George Qi
f5517761aa add operator header
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71502

Approved by: https://github.com/zrphercule, https://github.com/cpuhrsch
2022-04-19 15:23:25 +00:00
yanbing-j
dc2e630341 Optimize PReLU (float32) and enable PReLU BFloat16 support in CPU path (#63634)
Summary:
In this PR, we optimize the PReLU op in the CPU path and enable BFloat16 support based on the optimized PReLU.

The original implementation uses parallel_for to accelerate the operation, but does not use vectorization. It can be optimized with TensorIterator, which provides both parallelization and vectorization.

The difference between PReLU and other activation ops is that PReLU supports a learnable parameter `weight`. When called without arguments, nn.PReLU() uses a single parameter `weight` across all input channels. If called with nn.PReLU(nChannels), a separate `weight` is used for each input channel. So we cannot simply use TensorIterator because `weight` is different for each input channel.

In order to use TensorIterator, `weight` should be broadcast to the `input` shape. With vectorization and parallel_for, this implementation is much faster than the original one. Another advantage is that we don't need to separate the `share weights` and `multiple weights` cases in the implementation.
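
For reference, the two weight configurations discussed above (sizes are made up):
```python
import torch
import torch.nn as nn

x = torch.randn(8, 4, 32)
shared = nn.PReLU()     # a single learnable weight shared across all channels
per_ch = nn.PReLU(4)    # one learnable weight per input channel, broadcast against the input internally
y1, y2 = shared(x), per_ch(x)
```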

We compare the performance of the PReLU implementation in public PyTorch with the optimized PReLU in this PR, including fp32/bf16, forward/backward, and share-weights/multiple-weights configurations. bf16 in public PyTorch directly reuses `Vectorized<scalar_t>` for `BFloat16`.

Share weights:
![image](https://user-images.githubusercontent.com/61222868/130403002-ef271bee-0cae-460b-b796-46853599c210.png)

![image](https://user-images.githubusercontent.com/61222868/130403028-96753102-bea3-44c2-8656-2526469e0627.png)

Multiple weights:
![image](https://user-images.githubusercontent.com/61222868/130403059-a3418eb2-9546-471f-b057-15bc0e46f0d0.png)

![image](https://user-images.githubusercontent.com/61222868/130403070-8c620db9-f354-4ddd-b5d5-4557e10ea77a.png)

cc albanD mruberry jbschlosser walterddr

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63634

Reviewed By: yinghai

Differential Revision: D34031616

Pulled By: frank-wei

fbshipit-source-id: 04e2a0f9e92c658fba7ff56b1010eacb7e8ab44c
(cherry picked from commit ed262b15487557720bb0d498f9f2e8fcdba772d9)
2022-04-15 21:46:24 +00:00
PyTorch MergeBot
e8ed042043 Revert "Optimize PReLU (float32) and enable PReLU BFloat16 support in CPU path"
This reverts commit 263c4c2a95.

Reverted https://github.com/pytorch/pytorch/pull/63634 on behalf of https://github.com/seemethere
2022-04-15 21:41:51 +00:00
yanbing-j
263c4c2a95 Optimize PReLU (float32) and enable PReLU BFloat16 support in CPU path
In this PR, we optimize the PReLU op in the CPU path and enable BFloat16 support based on the optimized PReLU.

The original implementation uses parallel_for to accelerate the operation, but does not use vectorization. It can be optimized with TensorIterator, which provides both parallelization and vectorization.

The difference between PReLU and other activation ops is that PReLU supports a learnable parameter `weight`. When called without arguments, nn.PReLU() uses a single parameter `weight` across all input channels. If called with nn.PReLU(nChannels), a separate `weight` is used for each input channel. So we cannot simply use TensorIterator because `weight` is different for each input channel.

In order to use TensorIterator, `weight` should be broadcast to the `input` shape. With vectorization and parallel_for, this implementation is much faster than the original one. Another advantage is that we don't need to separate the `share weights` and `multiple weights` cases in the implementation.

We compare the performance of the PReLU implementation in public PyTorch with the optimized PReLU in this PR, including fp32/bf16, forward/backward, and share-weights/multiple-weights configurations. bf16 in public PyTorch directly reuses `Vectorized<scalar_t>` for `BFloat16`.

Share weights:
![image](https://user-images.githubusercontent.com/61222868/130403002-ef271bee-0cae-460b-b796-46853599c210.png)

![image](https://user-images.githubusercontent.com/61222868/130403028-96753102-bea3-44c2-8656-2526469e0627.png)

Multiple weights:
![image](https://user-images.githubusercontent.com/61222868/130403059-a3418eb2-9546-471f-b057-15bc0e46f0d0.png)

![image](https://user-images.githubusercontent.com/61222868/130403070-8c620db9-f354-4ddd-b5d5-4557e10ea77a.png)

cc @albanD @mruberry @jbschlosser @walterddr
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63634
Approved by: https://github.com/frank-wei, https://github.com/seemethere
2022-04-15 20:34:58 +00:00
Scott Wolchok
56f801e788 [PyTorch] Add test for all-masked case for native softmax
It returns all NaNs. The CUDA implementation required a fix for this.

Differential Revision: [D35327730](https://our.internmc.facebook.com/intern/diff/D35327730/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75803

Approved by: https://github.com/ngimel
2022-04-14 21:30:57 +00:00
Scott Wolchok
d4c527e738 [PyTorch] Run test_transformerencoderlayer_gelu on CUDA
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75347

Preparing to add native fast path; need to test on CUDA!

Differential Revision: [D35327729](https://our.internmc.facebook.com/intern/diff/D35327729/)

Approved by: https://github.com/ngimel
2022-04-14 21:30:57 +00:00