Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54812
Needed for quantization, since different attributes might refer to the same module instance.
Test Plan: Imported from OSS
Reviewed By: vkuzo
Differential Revision: D27408376
fbshipit-source-id: cada85c4a1772d3dd9502c3f6f9a56d690d527e7
Summary:
Fixes https://github.com/pytorch/pytorch/issues/25100#43112
EDIT: Pardon my inexperience, since this is my first PR here: I did not realize the doc should not have any trailing whitespace, nor that `[E712] comparison to False should be 'if cond is False:' or 'if not cond:'` applies; both are now fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55285
Reviewed By: mruberry
Differential Revision: D27765694
Pulled By: jbschlosser
fbshipit-source-id: c34774fa065d67c0ac130de20a54e66e608bdbf4
Summary:
This PR adds a `padding_idx` parameter to `nn.EmbeddingBag` and `nn.functional.embedding_bag`. As with `nn.Embedding`'s `padding_idx` argument, if an embedding's index is equal to `padding_idx` it is ignored, so it is not included in the reduction.
This PR does not add support for `padding_idx` for quantized or ONNX `EmbeddingBag` for opset10/11 (opset9 is supported). In these cases, an error is thrown if `padding_idx` is provided.
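A minimal usage sketch (the shapes and values below are illustrative, not taken from the tests in this PR):
```python
import torch
import torch.nn as nn

# Indices equal to padding_idx are skipped when reducing each bag.
bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=3, mode='mean', padding_idx=0)
indices = torch.tensor([0, 2, 4, 0, 5])  # the two 0 entries are treated as padding
offsets = torch.tensor([0, 3])           # two bags: [0, 2, 4] and [0, 5]
out = bag(indices, offsets)              # shape (2, 3); each mean ignores the padding rows
```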
Fixes https://github.com/pytorch/pytorch/issues/3194
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49237
Reviewed By: walterddr, VitalyFedyunin
Differential Revision: D26948258
Pulled By: jbschlosser
fbshipit-source-id: 3ca672f7e768941f3261ab405fc7597c97ce3dfc
Summary:
Fixes https://github.com/pytorch/pytorch/issues/25100#43112
EDIT: Pardon my inexperience, since this is my first PR here: I did not realize the doc should not have any trailing whitespace, nor that `[E712] comparison to False should be 'if cond is False:' or 'if not cond:'` applies; both are now fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55285
Reviewed By: ngimel
Differential Revision: D27710107
Pulled By: jbschlosser
fbshipit-source-id: c4363a4604548c0d84628c4997dd23d6b3afb4d9
Summary:
This PR adds the functionality to use channels_last_3d, a.k.a. NDHWC, in Conv3d. It's only enabled when the cuDNN version is greater than or equal to 8.0.5.
Todo:
- [x] add memory_format test
- [x] add random shapes functionality test
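A rough usage sketch, assuming a CUDA build with cuDNN >= 8.0.5 (shapes are illustrative):
```python
import torch

conv = torch.nn.Conv3d(8, 16, kernel_size=3).cuda().to(memory_format=torch.channels_last_3d)
x = torch.randn(2, 8, 8, 16, 16, device='cuda').to(memory_format=torch.channels_last_3d)
out = conv(x)
# With the NDHWC path enabled, the output should also be channels_last_3d.
print(out.is_contiguous(memory_format=torch.channels_last_3d))
```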
Close https://github.com/pytorch/pytorch/pull/52547
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48430
Reviewed By: mrshenli
Differential Revision: D27641452
Pulled By: ezyang
fbshipit-source-id: 0e98957cf30c50c3390903d307dd43bdafd28880
Summary:
There was an error when removing a parametrization with `leave_parametrized=True`. It had escaped the previous tests. This PR should fix that.
**Edit.**
I also took this chance to fix a few mistakes in the documentation and to write `set_original_` in a more compact way.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55456
Reviewed By: mrshenli
Differential Revision: D27620481
Pulled By: albanD
fbshipit-source-id: f1298ddbcf24566ef48850c62a1eb4d8a3576152
Summary:
The non-backwards-compatible change introduced in https://github.com/pytorch/pytorch/pull/53843 is tripping up a lot of code. Better to set it to False initially and then potentially flip to True in a later version to give people time to adapt.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55169
Reviewed By: mruberry
Differential Revision: D27511150
Pulled By: jbschlosser
fbshipit-source-id: 1ac018557c0900b31995c29f04aea060a27bc525
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48917
- max_pool2d channels last support, forward path
- max_pool2d channels last support, backward path
- vectorize channels last forward path
- rename the header file
- fix Windows build
- combine PoolingKernel.h into Pool.h
- add data type check
- loosen test_max_pool2d_nhwc to also cover the CPU device
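A small sketch of the behavior this targets (not one of the tests in this PR):
```python
import torch

# max_pool2d on a channels-last (NHWC) CPU tensor; the output is expected to stay channels-last.
x = torch.randn(2, 3, 8, 8).contiguous(memory_format=torch.channels_last)
y = torch.nn.functional.max_pool2d(x, kernel_size=2)
print(y.is_contiguous(memory_format=torch.channels_last))
```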
Test Plan: Imported from OSS
Reviewed By: glaringlee
Differential Revision: D25399470
Pulled By: VitalyFedyunin
fbshipit-source-id: b49b9581f1329a8c2b9c75bb10f12e2650e4c65a
Summary:
This PR enables using MIOpen for RNN FP16 on ROCm.
It does this by altering `use_miopen` to allow fp16. In the special case where LSTMs use projections, we fall back to the default implementation, as projections are not implemented in MIOpen at this time. We send out a warning once to let the user know.
We then remove the various asserts that are no longer necessary since we handle the case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52475
Reviewed By: H-Huang
Differential Revision: D27449150
Pulled By: malfet
fbshipit-source-id: 06499adb94f28d4aad73fa52890d6ba361937ea6
Summary:
Skips the tests indicated as failing in https://github.com/pytorch/pytorch/issues/54535.
During the ROCm CI upgrade from 4.0.1 to 4.1, some tests regressed. Specifically, FFT tests in test_spectral_ops.py and test_grid_sample in test_nn.py. In order to keep a passing CI signal, we need to disable these temporarily.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54536
Reviewed By: H-Huang
Differential Revision: D27442974
Pulled By: malfet
fbshipit-source-id: 07dffb957757a5fc7afaa5bf78b935a427251ef4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54901
Some subtleties:
- Need to make sure not to clobber composite definitions when
deciding when to generate
- I was lazy and so I didn't make inplace on TensorList work,
nor did I make inplace functions that returned void work
- A few tests started complaining that these noop meta functions
weren't raising the errors they needed. This is tracked
in https://github.com/pytorch/pytorch/issues/54897
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: jbschlosser
Differential Revision: D27407232
Pulled By: ezyang
fbshipit-source-id: 5e706a267496368acdafd128942c310954e43d29
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54452
The assertion that fails in the issue is necessary to appease mypy. Instead, I fix `_ntuple` to always return a `tuple`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54911
Reviewed By: H-Huang
Differential Revision: D27411088
Pulled By: jbschlosser
fbshipit-source-id: 7f5045c58dd4f5f3b07b4826d9b4ca85606c5bce
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53655
Currently EmbeddingBag and its variants support either int32 or int64 indices/offsets. We have use cases with a mix of int32 and int64 indices/offsets, which is not supported yet. To avoid introducing too many branches, we simply cast the offsets' type to the indices' type when they are not the same.
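A sketch of the mixed-dtype case this enables (values are illustrative):
```python
import torch

# int32 indices with int64 offsets; the offsets are cast to the indices' dtype internally.
weight = torch.randn(10, 3)
indices = torch.tensor([1, 2, 4, 5], dtype=torch.int32)
offsets = torch.tensor([0, 2], dtype=torch.int64)
out = torch.nn.functional.embedding_bag(indices, weight, offsets, mode='sum')  # shape (2, 3)
```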
Test Plan: unit tests
Reviewed By: qizzzh
Differential Revision: D26820202
fbshipit-source-id: 3e8f09523329ea12393ea92ee9a6315aa40a0b7f
Summary:
**BC-breaking note**: This change throws errors for cases that used to silently pass. The old behavior can be obtained by setting `error_if_nonfinite=False`
Fixes https://github.com/pytorch/pytorch/issues/46849
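A minimal sketch of opting back into the old behavior, assuming the flag is the `error_if_nonfinite` keyword of `torch.nn.utils.clip_grad_norm_`:
```python
import torch

p = torch.nn.Parameter(torch.randn(3))
p.grad = torch.tensor([float('nan'), 1.0, 2.0])  # a non-finite gradient
# With error_if_nonfinite=False this clips silently instead of raising.
total_norm = torch.nn.utils.clip_grad_norm_([p], max_norm=1.0, error_if_nonfinite=False)
```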
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53843
Reviewed By: malfet
Differential Revision: D27291838
Pulled By: jbschlosser
fbshipit-source-id: 216d191b26e1b5919a44a3af5cde6f35baf825c4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54744
Fixes https://github.com/pytorch/pytorch/issues/54590
After the porting the upsample operators to be structured, they now forward memory_format information to the output. This is a problem for the cuda kernels, which are not implemented to deal with `torch.channels_last` memory format. The operators are:
* upsample_nearest2d
* upsample_bilinear2d
* upsample_nearest3d
* upsample_trilinear3d
This fix just allocates a temporary, contiguous output tensor when that happens, writes the results to the temporary and copies the results back to the output tensor.
I held off on adding tests to get the fix out quickly, but I wrote a script and ran some manual tests that basically just assert that the outputs are the same for CPU and CUDA, within some threshold. I ran it for all 4 operators:
```
import torch
def basically_equal(t1, t2):
    epsilon = 1e-4
    diffs = torch.abs(t1 - t2)
    print(torch.all(diffs < epsilon))
# upsample 2d
a = torch.arange(48).reshape(2, 2, 3, 4).contiguous(memory_format=torch.channels_last).float()
out_cpu = torch.nn.functional.interpolate(a, scale_factor=2, mode='nearest')
out_cuda = torch.nn.functional.interpolate(a.to('cuda'), scale_factor=2, mode='nearest')
basically_equal(out_cpu, out_cuda.to("cpu"))
out_cpu = torch.nn.functional.interpolate(a, scale_factor=2, mode='bilinear', align_corners=True)
out_cuda = torch.nn.functional.interpolate(a.to('cuda'), scale_factor=2, mode='bilinear', align_corners=True)
basically_equal(out_cpu, out_cuda.to("cpu"))
# upsample 3d
a = torch.arange(96).reshape(2, 2, 2, 3, 4).contiguous(memory_format=torch.channels_last_3d).float()
out_cpu = torch.nn.functional.interpolate(a, scale_factor=3, mode='nearest')
out_cuda = torch.nn.functional.interpolate(a.to('cuda'), scale_factor=3, mode='nearest')
basically_equal(out_cpu, out_cuda.to("cpu"))
out_cpu = torch.nn.functional.interpolate(a, scale_factor=3, mode='trilinear', align_corners=True)
out_cuda = torch.nn.functional.interpolate(a.to('cuda'), scale_factor=3, mode='trilinear', align_corners=True)
basically_equal(out_cpu, out_cuda.to("cpu"))
```
prints
```
tensor(True)
tensor(True)
tensor(True)
tensor(True)
```
One thing that was weird: `upsample_bilinear2d` and `upsample_trilinear3d` were only accurate across CPU/CUDA with an epsilon of `1e-4`. That tentatively sounds close enough to say that CUDA isn't "wrong" (?), but that's not exactly "equal"... I also ran the script before my change, and `bilinear2d` and `trilinear3d` were likewise the same across CPU/CUDA with an epsilon of `1e-4`.
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D27351393
Pulled By: bdhirsh
fbshipit-source-id: b33f46e4855dc8b49b363770190b639beebbf5a7
Summary:
The fallback thnn 2d convolution uses `im2col` to get patches and `gemm` to implement convolution.
It has a shortcut to use `gemm` directly for kernel size 1, but this only works for stride == 1 and padding == 0.
This PR adds checks for stride == 1 and padding == 0 when determining whether `im2col` can be skipped.
Fixes https://github.com/pytorch/pytorch/issues/54036
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54080
Reviewed By: ejguan
Differential Revision: D27170482
Pulled By: zou3519
fbshipit-source-id: 055d6502239d34945934de409d78144d8a5c56f4
Summary:
Also modify the `tf32_on_and_off` decorator to make it support functions without a `device` argument.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52871
Reviewed By: ngimel
Differential Revision: D27286674
Pulled By: mruberry
fbshipit-source-id: 14f6d558271bd6a1d0bc40691c170d47e81de1ff
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45667
First part of #3867 (Pooling operators still to do)
This adds a `padding='same'` mode to the interface of `conv{n}d` and `nn.Conv{n}d`. This should match the behaviour of `tensorflow`. I couldn't find it explicitly documented, but through experimentation I found `tensorflow` returns the shape `ceil(len/stride)` and always adds any extra asymmetric padding onto the right side of the input.
Since the `native_functions.yaml` schema doesn't seem to support strings or enums, I've moved the function interface into python and it now dispatches between the numerically padded `conv{n}d` and the `_conv{n}d_same` variant. Underscores because I couldn't see any way to avoid exporting a function into the `torch` namespace.
A note on asymmetric padding. The total padding required can be odd if both the kernel-length is even and the dilation is odd. mkldnn has native support for asymmetric padding, so there is no overhead there, but for other backends I resort to padding the input tensor by 1 on the right hand side to make the remaining padding symmetrical. In these cases, I use `TORCH_WARN_ONCE` to notify the user of the performance implications.
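A short sketch of the module-level interface (kernel size and shapes are illustrative):
```python
import torch

# With stride=1, 'same' padding keeps the output length equal to the input length.
conv = torch.nn.Conv1d(4, 8, kernel_size=3, padding='same')
x = torch.randn(1, 4, 10)
print(conv(x).shape)  # torch.Size([1, 8, 10])
```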
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D27170744
Pulled By: jbschlosser
fbshipit-source-id: b3d8a0380e0787ae781f2e5d8ee365a7bfd49f22
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53665
ngimel pointed out to me where we already test the behavior of the `Upsample` ops in `test_nn.py`. This PR deletes my bespoke tests in `test_torch.py` and updates those in `test_nn.py` to test memory format properly.
There were two reasons the original test didn't pick up on a memory format regression:
- They didn't test the memory format of the output tensor explicitly, i.e. `output.is_contiguous(memory_format=...)`
- Even with that change, the test tensors were too simple to fail the tests. From some trial and error, it looks like one of the first two dimensions in the inputs needs to be > 1 in order for the `channels_last` memory format to actually re-order the strides.
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D26929683
Pulled By: bdhirsh
fbshipit-source-id: d17bc660ff031e9b3e2c93c60a9e9308e56ea612
Summary:
Provides the implementation for feature request issue https://github.com/pytorch/pytorch/issues/28937.
Adds the `Parametrization` functionality and implements `Pruning` on top of it.
It adds the `auto` mode, in which the parametrization is computed just once per forward pass. The previous implementation computed the pruning on every forward, which is not optimal when pruning RNNs, for example.
It implements a caching mechanism for parameters. This is implemented through the mechanism proposed at the end of the discussion https://github.com/pytorch/pytorch/issues/7313. In particular, it assumes that the user will not manually change the updated parameters between the call to `backwards()` and the `optimizer.step()`. If they do so, they would need to manually call the `.invalidate()` function provided in the implementation. This could be made into a function that gets a model and invalidates all the parameters in it. It might be the case that this function has to be called in the `.cuda()` and `.to` and related functions.
As described in https://github.com/pytorch/pytorch/issues/7313, this could be used, to implement in a cleaner way the `weight_norm` and `spectral_norm` functions. It also allows, as described in https://github.com/pytorch/pytorch/issues/28937, for the implementation of constrained optimization on manifolds (i.e. orthogonal constraints, positive definite matrices, invertible matrices, weights on the sphere or the hyperbolic space...)
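A minimal sketch of the kind of usage this enables, assuming the registration API that eventually landed as `torch.nn.utils.parametrize`:
```python
import torch
from torch import nn
import torch.nn.utils.parametrize as parametrize

# A parametrization that constrains a Linear weight to be symmetric.
class Symmetric(nn.Module):
    def forward(self, X):
        return X.triu() + X.triu(1).transpose(-1, -2)

layer = nn.Linear(3, 3)
parametrize.register_parametrization(layer, "weight", Symmetric())
print(torch.allclose(layer.weight, layer.weight.T))  # True
```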
TODO (when implementation is validated):
- More thorough test
- Documentation
Resolves https://github.com/pytorch/pytorch/issues/28937
albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33344
Reviewed By: zhangguanheng66
Differential Revision: D26816708
Pulled By: albanD
fbshipit-source-id: 07c8f0da661f74e919767eae31335a9c60d9e8fe
Summary:
Fixes https://github.com/pytorch/pytorch/issues/38137
As mentioned in the issue, this is a workaround for [python issue 43367](https://bugs.python.org/issue43367). There are a number of other places where `sys.modules` is modified, if something changes in python perhaps those should be reviewed as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53107
Reviewed By: zou3519
Differential Revision: D26753571
Pulled By: ezyang
fbshipit-source-id: 2bda03bab39ff9ca58ce4bc13befe021da91b9c4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52671
The code is written with the assumption that `new_size` is an unsigned value,
and when the function is called with a negative value it silently returns a nullptr rather than raising an exception.
Fix the above-mentioned logic by converting `new_size` to an unsigned type and letting the cpu_allocator raise an exception on a negative alloc.
Unroll the nested if blocks by returning early if `new_size` is 0.
Add TestNN.test_adaptive_pooling_size_overflow to indirectly validate the fix.
Fixes https://github.com/pytorch/pytorch/issues/50960
Test Plan: Imported from OSS
Reviewed By: walterddr
Differential Revision: D26607549
Pulled By: malfet
fbshipit-source-id: e3d4f7548b098f24fa5aba42d8f4e9288ece1e2e
Summary:
Fixes https://github.com/pytorch/pytorch/issues/52257
## Background
Reverts MHA behavior for `bias` flag to that of v1.5: flag enables or disables both in and out projection biases.
Updates type annotations for both in and out projections biases from `Tensor` to `Optional[Tensor]` for `torch.jit.script` usage.
Note: With this change, `_LinearWithBias` defined in `torch/nn/modules/linear.py` is no longer utilized. Completely removing it would require updates to quantization logic in the following files:
```
test/quantization/test_quantized_module.py
torch/nn/quantizable/modules/activation.py
torch/nn/quantized/dynamic/modules/linear.py
torch/nn/quantized/modules/linear.py
torch/quantization/quantization_mappings.py
```
This PR takes a conservative initial approach and leaves these files unchanged.
**Is it safe to fully remove `_LinearWithBias`?**
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52537
Test Plan:
```
python test/test_nn.py TestNN.test_multihead_attn_no_bias
```
## BC-Breaking Note
In v1.6, the behavior of `MultiheadAttention`'s `bias` flag was incorrectly changed to affect only the in projection layer. That is, setting `bias=False` would fail to disable the bias for the out projection layer. This regression has been fixed, and the `bias` flag now correctly applies to both the in and out projection layers.
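A quick sketch of the restored behavior (attribute names as in the current `MultiheadAttention` implementation):
```python
import torch

# With the fix, bias=False disables both the in-projection and out-projection biases.
mha = torch.nn.MultiheadAttention(embed_dim=8, num_heads=2, bias=False)
print(mha.in_proj_bias is None, mha.out_proj.bias is None)  # True True
```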
Reviewed By: bdhirsh
Differential Revision: D26583639
Pulled By: jbschlosser
fbshipit-source-id: b805f3a052628efb28b89377a41e06f71747ac5b
Summary:
Some minor improvement for lazy modules introduced in https://github.com/pytorch/pytorch/issues/44538, https://github.com/pytorch/pytorch/issues/47350 and https://github.com/pytorch/pytorch/issues/51548.
This PR mainly turns the bias into an `UninitializedParameter`; instead of creating empty tensors like
```python
self.bias = Parameter(torch.Tensor(0))
self.bias = UninitializedParameter()
```
I think it would be better to
```python
self.register_parameter('bias', None)
self.bias = UninitializedParameter()
```
In addition, I changed the constructor of `LazyBatchNorm` from
```python
self.running_mean = UninitializedBuffer()
```
to
```python
self.register_buffer('running_mean', UninitializedBuffer())
```
as the original one would not change the underlying `self._buffers`.
Thank you for your time on reviewing this PR :).
Gently ping albanD, mruberry
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52212
Reviewed By: jbschlosser
Differential Revision: D26504508
Pulled By: albanD
fbshipit-source-id: 7094d0bb4fa9e2a40a07b79d350ea12a6ebfd080
Summary:
Temporarily disabling OneDNN conv for group size = 24, as the OneDNN update came too late to be fully tested https://github.com/pytorch/pytorch/issues/50042
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52327
Reviewed By: agolynski
Differential Revision: D26474186
Pulled By: VitalyFedyunin
fbshipit-source-id: 8d6964d33c8dcab70e207088c3940810eabbd068
Summary:
This pull request (https://github.com/pytorch/pytorch/issues/40801) has become an important part of recent 3D models, brings a significant improvement in speed, and has also been open for a while, so I decided to resolve the previous review comments and modify it a bit so that it can be merged into the latest version of PyTorch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51027
Reviewed By: albanD
Differential Revision: D26414116
Pulled By: ngimel
fbshipit-source-id: 562c099f4d7f6d603a9c2f2e2a518bc577b0d8ee
Summary:
Adding CUDA 11.2 to Windows CI.
Disabled tests:
The following ran into `CUDA error: misaligned address` for CUDA 11.2: (issue linked below)
`test_where_scalar_valid_combination_cuda_complex128` in test_torch.py
`test_sgn_complex_cuda` in test_autograd.py
The following ran into `CUDA error: too many resources requested for launch` for CUDA 11.2: (https://github.com/pytorch/pytorch/issues/52002)
`test_EmbeddingBag_per_sample_weights_and_new_offsets_cuda_int64_float64`
`test_EmbeddingBag_per_sample_weights_and_offsets_cuda_int64_float64`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51598
Reviewed By: mrshenli
Differential Revision: D26344965
Pulled By: janeyx99
fbshipit-source-id: 3c9a4ed16d748969e96593220ec0a9f33e1ffcef
Summary:
For unsupported input, we should not do the check in a parallel region; this PR first does the dtype check and then does the parallel for.
Fixes https://github.com/pytorch/pytorch/issues/51352.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51443
Reviewed By: izdeby
Differential Revision: D26305584
Pulled By: ngimel
fbshipit-source-id: 6faa3148af5bdcd7246771c0ecb4db2b31ac82c6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50794
Original commit changeset: b4a7948088c0
There are some subtle extra tweaks on top of the original. I can unbundle them, but I've opted to keep them with the port because it's the easiest way to make sure the changes are exercised.
* There's a bugfix in the codegen to test if a dispatch key is structured *before* short circuiting because the dispatch key was missing in the table. This accounts for mixed structured-nonstructured situations where the dispatch table is present, but the relevant structured key isn't (because the dispatch table only exists to register, e.g., QuantizedCPU)
* Dispatch tables for functions which delegate to structured kernels don't have Math entries generated for them.
* It's now illegal to specify a structured dispatch key in a delegated structured kernel (it will be ignored!); `add` is now fixed to follow this.
* There are some extra sanity checks for NativeFunctions validation
* Finally, unlike the original PR, I switched the .vec variant of upsample_nearest2d to also be DefaultBackend, bringing it inline with upsample_nearest1d.
ghstack-source-id: 120038038
Test Plan:
```
buck test mode/dev //coreai/tiefenrausch:python_tests -- --exact 'coreai/tiefenrausch:python_tests - test_can_run_local_async_inference_cpu (coreai.tiefenrausch.tests.python_test.TiefenrauschPY)' --run-disabled
```
Reviewed By: ngimel
Differential Revision: D25962873
fbshipit-source-id: d29a9c97f15151db3066ae5efe7a0701e6dc05a3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50739
This does not turn on batched grad testing for autogenerated NewModuleTest
tests and CriterionTest tests. Those are coming later.
Test Plan: - run tests
Reviewed By: ejguan
Differential Revision: D25997677
Pulled By: zou3519
fbshipit-source-id: b4b2d68e0f99c3d573faf237e1e531d0b3fced40
Summary:
Fixes [#24991](https://github.com/pytorch/pytorch/issues/24991)
I used a value of 0.75 as suggested in the forums by Thomas: https://discuss.pytorch.org/t/calculate-gain-tanh/20854/6
I verified that the value keeps the gradient stable for a 100-layer network.
Code to reproduce (from [jpeg729](https://discuss.pytorch.org/t/calculate-gain-tanh/20854/4)):
```python
import torch
import torch.nn.functional as F
import sys
a = torch.randn(1000,1000, requires_grad=True)
b = a
print(f"in: {a.std().item():.4f}")
for i in range(100):
    l = torch.nn.Linear(1000, 1000, bias=False)
    torch.nn.init.xavier_normal_(l.weight, torch.nn.init.calculate_gain("selu"))
    b = getattr(F, 'selu')(l(b))
    if i % 10 == 0:
        print(f"out: {b.std().item():.4f}", end=" ")
        a.grad = None
        b.sum().backward(retain_graph=True)
        print(f"grad: {a.grad.abs().mean().item():.4f}")
```
Output:
```
in: 1.0008
out: 0.7968 grad: 0.6509
out: 0.3127 grad: 0.2760
out: 0.2404 grad: 0.2337
out: 0.2062 grad: 0.2039
out: 0.2056 grad: 0.1795
out: 0.2044 grad: 0.1977
out: 0.2005 grad: 0.2045
out: 0.2042 grad: 0.2273
out: 0.1944 grad: 0.2034
out: 0.2085 grad: 0.2464
```
I included the necessary documentation change, and it passes the _test_calculate_gain_nonlinear_ unittest.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50664
Reviewed By: mruberry
Differential Revision: D25942217
Pulled By: ngimel
fbshipit-source-id: 29ff1be25713484fa7c516df71b12fdaecfb9af8
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42588
The contiguity check used to be for the memory format suggested by `grad_output->suggest_memory_format()`, but the invariant guaranteed by derivatives.yaml is based on `input->suggest_memory_format()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50659
Reviewed By: mruberry
Differential Revision: D25938921
Pulled By: ngimel
fbshipit-source-id: a945bfef6ce3d91b17e7ff96babe89ffd508939a
Summary:
Building on top of the work of anjali411 (https://github.com/pytorch/pytorch/issues/46640)
Things added in this PR:
1. Modify backward and double-backward formulas
2. Add complex support for `new module tests` and criterion tests (and add complex tests for L1)
3. Modify some existing tests to support complex
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49912
Reviewed By: zhangguanheng66
Differential Revision: D25853036
Pulled By: soulitzer
fbshipit-source-id: df619f1b71c450ab2818eb17804e0c55990aa8ad
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49726
Just cleaned up the unnecessary `ModuleAttributeError`
BC-breaking note:
`ModuleAttributeError` was added in the previous unsuccessful [PR](https://github.com/pytorch/pytorch/pull/49879) and removed here. If a user catches `ModuleAttributeError` specifically, this will no longer work. They should catch `AttributeError` instead.
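A minimal sketch of the recommended pattern after this change:
```python
import torch

m = torch.nn.Linear(2, 2)
try:
    m.missing_attribute  # accessing an attribute that does not exist
except AttributeError as e:  # ModuleAttributeError is gone; catch AttributeError instead
    print("caught:", e)
```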
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50298
Reviewed By: mrshenli
Differential Revision: D25907620
Pulled By: jbschlosser
fbshipit-source-id: cdfa6b1ea76ff080cd243287c10a9d749a3f3d0a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48378
This commit adds support for accepting custom importance scores to use for pruning mask computation, rather than only using the parameter.
This is useful if one wants to prune based on scores from different technique such as activations, gradients, weighted scoring of parameters, etc.
An alternative to the above approach would be to pass a custom mask to the already available interface. However, accepting importance scores is easier, since it can leverage the mask computation logic that has already been baked in.
In addition, the commit also makes some minor lint fixes.
Test Plan:
* Unit tests
* Circle CI
Differential Revision: D24997355
fbshipit-source-id: 30797897977b57d3e3bc197987da20e88febb1fa
Summary:
Fixes https://github.com/pytorch/pytorch/issues/598
This is BC-breaking as we now explicitly don't call the hook when there are no Tensors at the top level of the output.
This feature was not working anyway, as the returned grad_input/grad_output were wrong (not respecting the output structure, and wrong inputs for multi-Node Modules).
This is also BC-breaking as we now report the correct gradients for `nn.Module`s that contain multiple autograd `Node`s while we use to return bad results before.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46163
Reviewed By: ailzhang, mruberry
Differential Revision: D24894180
Pulled By: albanD
fbshipit-source-id: e1b5d193d2818eb2f51e2a2722c7405c8bd13c2b
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46213
I didn't yet update the documentation; I will add those changes soon. A few other things that I didn't do, but want to clarify whether I maybe should:
1. I didn't expose projections in c++ API: torch/csrc/api/src/nn/modules/rnn.cpp. Let me know if this is desirable and I will add those changes.
2. I didn't expose projections in "lstm_cell" function and "_thnn_differentiable_lstm_cell_backward" functions from aten/src/ATen/native/RNN.cpp. As far as I understand, they are not needed for nn.LSTM CPU execution. For lstm_cell, projections don't bring any real benefit, since if cell is used separately, it can be easily added in Python. For "_thnn_differentiable_lstm_cell_backward", I'm actually not sure where exactly that function is used, so I also disabled projections there for now. Please let me know if I should change that.
3. I added check that projections are not supported for quantized LSTMs to quantized_lstm_<data/input> functions. But I didn't add any checks to LSTMCell code. It seems that since I disabled projections in "lstm_cell" function, they should also not be available for quantized models through any other API than quantized_lstm_<data/input>. Please let me know if I'm not correct and I will add checks to other places.
4. Projections are not supported for CuDNN versions < 7.1.2. Should I add the check for CuDNN version and disable projections in that case? If so, what will be the best way to do that?
5. Currently I added projection weight as the last weight, so the layout is "w_ih, w_hh, b_ih, b_hh, w_hr". This breaks the assumption that biases come after weights and thus I had to add additional if-s in various places. Alternative way would be to have "w_ih, w_hh, w_hr, b_ih, b_hh" layout, in which case the assumption will be true. But in that case I will need to split the loop in get_parameters function from aten/src/ATen/native/cudnn/RNN.cpp. And in some cases, I will still need to add an "undefined" tensor in the 3rd position, because we get all 5 weights from CuDNN most of the time. So I'm not sure which way is better. Let me know if you think I should change to the weights-then-biases layout.
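For reference, a small usage sketch, assuming the projection size is exposed as the `proj_size` constructor argument:
```python
import torch

lstm = torch.nn.LSTM(input_size=10, hidden_size=20, proj_size=5, batch_first=True)
x = torch.randn(3, 7, 10)
out, (h, c) = lstm(x)
# The projected hidden state has size proj_size; the cell state keeps hidden_size.
print(out.shape, h.shape, c.shape)  # (3, 7, 5), (1, 3, 5), (1, 3, 20)
```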
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47725
Reviewed By: zou3519
Differential Revision: D25449794
Pulled By: ngimel
fbshipit-source-id: fe6ce59e481d1f5fd861a8ff7fa13d1affcedb0c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49187
Expands the implementation of PixelShuffle to support any number of batch dimensions
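A short sketch of what the expanded support allows (shapes are illustrative):
```python
import torch

# PixelShuffle applied to an input with two leading batch dimensions.
ps = torch.nn.PixelShuffle(upscale_factor=2)
x = torch.randn(5, 3, 8, 4, 4)  # batch dims (5, 3), 8 channels, 4x4 spatial
print(ps(x).shape)              # torch.Size([5, 3, 2, 8, 8])
```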
Test Plan: `buck test caffe2/test:nn -- test_pixel_shuffle`
Reviewed By: mruberry
Differential Revision: D25399058
fbshipit-source-id: ab0a7f593b276cafc9ebb46a177e2c1dce56d0de
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48916
- optimize adaptive average pool2d forward path
- optimize adaptive average pool2d backward path
- remove unused headers
- minor change
- minor change
- rename the header; add adaptive max pooling in the future
- minor change
- loosen the adaptive_pool2d test on nhwc to cover both CUDA and CPU devices
- minor change
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D25399469
Pulled By: VitalyFedyunin
fbshipit-source-id: 86f9fda35194f21144bd4667b778c861c05a5bac
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46983.
The solution is based of two components:
1. The introduction of the `_initialized` attribute. This will be used during ParameterList/Dict creation methods `__init__` (introduced in https://github.com/pytorch/pytorch/issues/47772) and `__setstate__` to not trigger warnings when setting general `Module` attributes.
2. The introduction of the `not hasattr(self, key)` check to avoid triggering warnings when changing general `Module` attributes such as `.training` during the `train()` and `eval()` methods.
Tests related to the fix are added.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48315
Reviewed By: mrshenli
Differential Revision: D25130217
Pulled By: albanD
fbshipit-source-id: 79e2abf1eab616f5de74f75f370c2fe149bed4cb
Summary:
Fixed test:
- `test_is_nonzero`, this is asserting exact match, which is flaky when `TORCH_SHOW_CPP_STACKTRACES=1`, I changed this to non-exact assert
- `test_pinverse` TF32
- `test_symeig` TF32
- `test_triangular_solve_batched_many_batches_cpu_float64` precision on CPU BLAS
- `test_qr` TF32, as well as a tensor factory that forgot a `dtype=dtype`
- `test_lu` TF32
- `ConvTranspose2d` TF32
- `Conv3d_1x1x1_no_bias` TF32
- `Transformer*` TF32
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46941
Reviewed By: heitorschueroff
Differential Revision: D24852725
Pulled By: mruberry
fbshipit-source-id: ccd4740cc643476178d81059d1c78da34e5082ed
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46758
It's in general helpful to support int32 indices and offsets, especially when such tensors are large and need to be transferred to accelerator backends. Since it may not be very useful to support the combination of int32 indices and int64 offsets, here we enforce that these two must have the same type.
Test Plan: unit tests
Reviewed By: ngimel
Differential Revision: D24470808
fbshipit-source-id: 94b8a1d0b7fc9fe3d128247aa042c04d7c227f0b
Summary:
Fix https://github.com/pytorch/pytorch/issues/44601
I added the bicubic grid sampler on both the CPU and CUDA sides, but not yet in AVX2.
There is a [colab notebook](https://colab.research.google.com/drive/1mIh6TLLj5WWM_NcmKDRvY5Gltbb781oU?usp=sharing) showing some test results. The notebook uses bilinear mode for the test, since I could only use a distributed version of PyTorch in it. You could just download it and modify `mode_torch=bicubic` to show the results.
There is some duplicate code for getting and setting values, since the helper function used in bilinear first clips the coordinate beyond the boundary and then gets or sets the value. However, in bicubic, more points need to be considered. I could refactor that part after making sure the overall calculation is correct.
Thanks
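A minimal sketch of the new mode (an identity grid, so the output roughly reproduces the input):
```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 8, 8)
theta = torch.tensor([[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]])  # identity affine transform
grid = F.affine_grid(theta, size=(1, 1, 8, 8), align_corners=False)
y = F.grid_sample(x, grid, mode='bicubic', align_corners=False)
print(y.shape)  # torch.Size([1, 1, 8, 8])
```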
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44780
Reviewed By: mrshenli
Differential Revision: D24681114
Pulled By: mruberry
fbshipit-source-id: d39c8715e2093a5a5906cb0ef040d62bde578567
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46558
This PR fixes a bug with how pooling output shape was computed.
## BC Breaking Notes
Previously, a bug in the pooling code allowed a sliding window to be entirely off bounds. Now, sliding windows must start inside the input or left padding (not right padding, see https://github.com/pytorch/pytorch/issues/46929) and may only go off-bounds if ceil_mode=True.
Fixes #45357
TODO
- [x] Ensure existing tests are checking for the correct output size
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D24633372
Pulled By: heitorschueroff
fbshipit-source-id: 55925243a53df5d6131a1983076f11cab7516d6b
Summary:
This PR disables the test_softmax and test_softmax_results in test_nn.py that were enabled in https://github.com/pytorch/pytorch/issues/46363. The softmax tests are causing failure on gfx906 machines. Disabling those until we root cause and fix them on 906.
cc: jeffdaily ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46793
Reviewed By: izdeby
Differential Revision: D24539211
Pulled By: ezyang
fbshipit-source-id: 633cb9dc497ad6359af85b85a711c4549d772b2a
Summary:
This pull request enables the following tests on ROCm:
* TestCuda.test_tiny_half_norm_
* TestNNDeviceTypeCUDA.test_softmax_cuda_float16
* TestNNDeviceTypeCUDA.test_softmax_cuda_float32
* TestNNDeviceTypeCUDA.test_softmax_results_cuda_float16
* TestNNDeviceTypeCUDA.test_softmax_results_cuda_float32
The earlier failures, because of which the tests were skipped, were because of a precision issue for FP16 compute on MI25 hardware with ROCm 3.7 and older. The fix was delivered in the compiler in ROCm 3.8.
The pull request fixes https://github.com/pytorch/pytorch/issues/37493
cc: jeffdaily ezyang malfet mruberry
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46363
Reviewed By: heitorschueroff
Differential Revision: D24325639
Pulled By: ezyang
fbshipit-source-id: a7dbb238cf38c04b6592baad40b4d71725a358c9
Summary:
Close https://github.com/pytorch/pytorch/issues/31690
I have verified the functionality of ConvTranspose2d (with this PR) on roughly 32,000 random shapes on V100, A100, using cuDNN 8.0.4 and CUDA 11.1. The 32,000 shapes contain 4x8,000 of (fp16, fp32) x (nchw, nhwc) each.
The random shapes are sampled from
```jsonc
{
    "batch_size": {"low": 1, "high": 8},
    "in_channels": {"low": 16, "high": 128},
    "out_channels": {"low": 16, "high": 128},
    "height": {"low": 16, "high": 224},
    "stride": {"set": [[1, 1], [2, 2]]},
    "padding": {"set": [[0, 0]]},
    "output_padding": {"set": [[0, 0], [1, 1], [0, 1], [1, 0]]},
    "kernel_size": {"set": [[3, 3], [1, 1], [1, 3], [3, 1], [2, 2]]},
    "dilation": {"set": [[1, 1]]},
    "deterministic": {"set": [true, false]},
    "benchmark": {"set": [true, false]},
    "allow_tf32": {"set": [true, false]},
    "groups": {"set": [1, IN_CHANNELS]}
}
```
- Input `width` is the same as `height`.
- `groups` can be either 1, or the same as `in_channels` (grouped convolution). When `groups` is 1, `out_channels` is random; when `groups` is the same as `in_channels`, `out_channels` is also the same as `in_channels`
All of the checked shapes can be found in csv files here https://github.com/xwang233/code-snippet/tree/master/convtranspose2d-dilation/functionality-check-cudnn8.0.4.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46290
Reviewed By: mruberry
Differential Revision: D24422091
Pulled By: ngimel
fbshipit-source-id: 9f0120f2995ae1575c0502f1b2742390d7937b24
Summary:
Follow-up of https://github.com/pytorch/pytorch/issues/46461 with a similar goal
Makes them more readable and possibly faster. Care has to be taken because `list(map(...))` builds the list immediately, while `(x for x in xs)` is a generator expression which gets evaluated lazily. This is a benefit in some cases where it is not required to actually create the list of values in memory (e.g. when passing to `tuple`, `extend`, or `join`).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46462
Reviewed By: zou3519
Differential Revision: D24422343
Pulled By: ezyang
fbshipit-source-id: 252e33499c92ac0b15238f2df32681dbbda2b237
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46572
When `num_samples == 0`, the grid becomes zero. Although CUDA just silently proceeds, `cudaGetLastError()` will complain with `Error: invalid configuration argument`, so it's actually failing in some future place that becomes really hard to debug.
Reviewed By: jianyuh
Differential Revision: D24409874
fbshipit-source-id: ca54de13b1ab48204bbad265e3f55b56b94a1a2f
Summary:
This PR makes it possible to cast the parameters of nn.Module to complex dtypes.
The following code works with the proposed changes.
```python
In [1]: import torch
In [2]: lin = torch.nn.Linear(5, 1).to(torch.complex64)
In [3]: lin(torch.zeros(3, 5, dtype=torch.complex64))
Out[3]:
tensor([[-0.1739+0.j],
        [-0.1739+0.j],
        [-0.1739+0.j]], grad_fn=<AddmmBackward>)
```
Fixes https://github.com/pytorch/pytorch/issues/43477.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44788
Reviewed By: zou3519
Differential Revision: D24307225
Pulled By: anjali411
fbshipit-source-id: dacc4f5c8c9a99303f74d1f5d807cd657b3b69b5
Summary:
Retake on https://github.com/pytorch/pytorch/issues/40493 after all the feedback from albanD
This PR implements the generic Lazy mechanism and a sample `LazyLinear` layer with the `UninitializedParameter`.
The main differences from the previous PR are two:
Now `torch.nn.Module` remains untouched.
We don't require an explicit initialization or a dummy forward pass before starting the training or inference of the actual module, making this much simpler to use from the user side.
As we discussed offline, there was the suggestion of not using a mixin, but changing the `__class__` attribute of `LazyLinear` to become `Linear` once it's completely initialized. While this can be useful, for the time being we need `LazyLinear` to be a `torch.nn.Module` subclass, since there are many checks that rely on the modules being instances of `torch.nn.Module`.
This can cause problems when we create complex modules such as
```
class MyNetwork(torch.nn.Module):
    def __init__(self):
        super(MyNetwork, self).__init__()
        self.conv = torch.nn.Conv2d(20, 4, 2)
        self.linear = torch.nn.LazyLinear(10)

    def forward(self, x):
        y = self.conv(x).clamp(min=0)
        return self.linear(y)
```
Here, when the `__setattr__` function is called at the time `LazyLinear` is registered, it won't be added to the child modules of `MyNetwork`, so we have to do it manually later, but currently there is no way to do such a thing, as we can't access the parent module from `LazyLinear` once it becomes the `Linear` module. (We can add a workaround to this if needed.)
TODO:
Add convolutions once the design is OK
Fix docstrings
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44538
Reviewed By: ngimel
Differential Revision: D24162854
Pulled By: albanD
fbshipit-source-id: 6d58dfe5d43bfb05b6ee506e266db3cf4b885f0c
Summary:
This PR patches the ReplicationPad modules in `torch.nn` to be compatible with 0-dim batch sizes.
EDIT: this is part of the work on gh-12013 (make all nn layers accept empty batch size)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39137
Reviewed By: albanD
Differential Revision: D24131386
Pulled By: ngimel
fbshipit-source-id: 3d93057cbe14d72571943c8979d5937e4bbf743a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45474
When batchnorm affine is set to False, weight and bias are set to None, which is not supported in this case. Added a fix to set the weight to 1 and the bias to 0 if they are not set.
Test Plan: Add unit test for testing fusing conv, batchnorm where batchnorm is in affine=False mode.
Reviewed By: z-a-f
Differential Revision: D23977080
fbshipit-source-id: 2782be626dc67553f3d27d8f8b1ddc7dea022c2a
Summary:
- The thresholds of some tests are bumped up. Depending on the random generator, sometimes these tests fail with things like 0.0059 is not smaller than 0.005. I ran `test_nn.py` and `test_torch.py` 10+ times to check that these are no longer flaky.
- Add `tf32_on_and_off` to new `matrix_exp` tests.
- Disable TF32 on test suites other than `test_nn.py` and `test_torch.py`
cc: ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44240
Reviewed By: mruberry
Differential Revision: D23882498
Pulled By: ngimel
fbshipit-source-id: 44a9ec08802c93a2efaf4e01d7487222478b6df8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43680
As discussed [here](https://github.com/pytorch/pytorch/issues/43342),
adding in a Python-only implementation of the triplet-margin loss that takes a
custom distance function. Still discussing whether this is necessary to add to
PyTorch Core.
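A small usage sketch, assuming the functional form landed as `triplet_margin_with_distance_loss`:
```python
import torch
import torch.nn.functional as F

# A custom distance: cosine distance instead of the default pairwise p-norm.
def cosine_distance(x, y):
    return 1.0 - F.cosine_similarity(x, y)

anchor, positive, negative = torch.randn(8, 16), torch.randn(8, 16), torch.randn(8, 16)
loss = F.triplet_margin_with_distance_loss(
    anchor, positive, negative, distance_function=cosine_distance, margin=1.0
)
```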
Test Plan:
python test/run_tests.py
Imported from OSS
Reviewed By: albanD
Differential Revision: D23363898
fbshipit-source-id: 1cafc05abecdbe7812b41deaa1e50ea11239d0cb
Summary:
Fix https://discuss.pytorch.org/t/illegal-memory-access-when-i-use-groupnorm/95800
`dX` is a Tensor, comparing `dX` with `nullptr` was wrong.
cc BIT-silence who wrote the kernel.
The test couldn't pass with `rtol=0` and `x.requires_grad=True`, so I had to update that to `1e-5`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44863
Reviewed By: mruberry
Differential Revision: D23754101
Pulled By: BIT-silence
fbshipit-source-id: 2eb0134dd489480e5ae7113a7d7b84629104cd49
Summary:
This PR adds dilation to _ConvTransposeNd._output_padding method and tests using a bunch of different sized inputs.
Fixes https://github.com/pytorch/pytorch/issues/14272
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43793
Reviewed By: zou3519
Differential Revision: D23493313
Pulled By: ezyang
fbshipit-source-id: bca605c428cbf3a97d3d24316d8d7fde4bddb307
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44398
These end up executing the same tests, so no reason to have them separate.
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D23600855
Pulled By: gchanan
fbshipit-source-id: 0952492771498bf813f1bf8e1d7c8dce574ec965
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44382
This is to fix a typo introduced in #44032.
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D23601316
Pulled By: glaringlee
fbshipit-source-id: 17d6de5900443ea46c7a6ee9c7614fe6f2d92890
Summary:
Previously, `at::native::embedding` implicitly assumed that the `weight` argument would be 1-D or greater. Given a 0-D tensor, it would segfault. This change makes it throw a RuntimeError instead.
Fixes https://github.com/pytorch/pytorch/issues/41780
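A minimal sketch of the changed behavior:
```python
import torch

try:
    # A 0-D weight now raises a RuntimeError instead of segfaulting.
    torch.nn.functional.embedding(torch.tensor([0]), torch.tensor(1.0))
except RuntimeError as e:
    print("raised RuntimeError:", e)
```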
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42550
Reviewed By: smessmer
Differential Revision: D23040744
Pulled By: albanD
fbshipit-source-id: d3d315850a5ee2d2b6fcc0bdb30db2b76ffffb01
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41656
For the CPU version, this is a regression introduced in https://github.com/pytorch/pytorch/issues/10980 which vectorized the `grid_sampler_2d` implementation. It uses the AVX2 gather intrinsic which for `float` requires 32-bit indexing to match the number of floats in the AVX register. There is also an `i64gather_ps` variant but this only utilizes half of the vector width so would be expected to give worse performance in the more likely case where 32-bit indexing is acceptable. So, I've left the optimised AVX version as-is and reinstated the old non-vectorized version as a fallback.
For the CUDA version, this operation has never supported 32-bit indexing so this isn't a regression. I've templated the kernel on index type and added 64-bit variants. Although I gather in some places a simple `TORCH_CHECK(canUse32BitIndexMath(...))` is used instead. So, there is a decision to be made here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41923
Reviewed By: glaringlee
Differential Revision: D22925931
Pulled By: zou3519
fbshipit-source-id: 920816107aae26360c5e7f4e9c729fa9057268bb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42215
Specifically on https://github.com/pytorch/pytorch/pull/27477#discussion_r371402079
We would like to support include_last=True overall for other reduction types like mean and max. It now causes further code fragmentation in DPER (https://www.internalfb.com/intern/diff/D22794469/).
More details: https://www.internalfb.com/intern/diff/D22794469/?dest_fbid=309597093427021&transaction_id=631457624153457
ghstack-source-id: 108733009
Test Plan:
```
buck test mode/dev-nosan //caffe2/test:nn -- "test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu"
```
```
(base) [jianyuhuang@devbig281.ftw3.facebook.com: ~/fbsource/fbcode/caffe2/test] $ TORCH_SHOW_CPP_STACKTRACES=1 buck test mode/dev-nosan //caffe2/test:nn -- "test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu" --print-passing-details
Parsing buck files: finished in 1.2 sec
Building: finished in 5.5 sec (100%) 10130/10130 jobs, 2 updated
Total time: 6.7 sec
More details at https://www.internalfb.com/intern/buck/build/dbdc2063-69d8-45cb-9146-308a9e8505ef
First unknown argument: --print-passing-details.
Falling back to TestPilot classic.
Trace available for this run at /tmp/testpilot.20200728-195414.1422748.log
TestPilot test runner for Facebook. See https://fburl.com/testpilot for details.
Testpilot build revision cd2638f1f47250eac058b8c36561760027d16add fbpkg f88726c8ebde4ba288e1172a348c7f46 at Mon Jul 27 18:11:43 2020 by twsvcscm from /usr/local/fbprojects/packages/testinfra.testpilot/887/t.par
Discovering tests
Running 1 test
Started new test run: https://our.intern.facebook.com/intern/testinfra/testrun/844425097242375
✓ caffe2/test:nn - test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu (test_nn.TestNNDeviceTypeCPU) 0.162 1/1 (passed)
Test output:
> /data/users/jianyuhuang/fbsource/fbcode/buck-out/dev/gen/caffe2/test/nn#binary,link-tree/torch/_utils_internal.py:103: DeprecationWarning: This is a NOOP in python >= 3.7, its just too dangerous with how we write code at facebook. Instead we patch os.fork and multiprocessing which can raise exceptions if a deadlock would happen.
> threadSafeForkRegisterAtFork()
> /usr/local/fbcode/platform007/lib/python3.7/importlib/_bootstrap.py:219: ImportWarning: can't resolve package from __spec__ or __package__, falling back on __name__
and __path__
> return f(*args, **kwds)
> test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu (test_nn.TestNNDeviceTypeCPU) ... Couldn't download test skip set, leaving all tests enabled...
> ok
>
> ----------------------------------------------------------------------
> Ran 1 test in 0.162s
>
> OK
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/844425097242375
Summary (total time 5.54s):
PASS: 1
FAIL: 0
SKIP: 0
FATAL: 0
TIMEOUT: 0
OMIT: 0
Did _not_ run with tpx. See https://fburl.com/tpx for details.
```
Reviewed By: dzhulgakov
Differential Revision: D22801881
fbshipit-source-id: 80a624465727081bb9bf55c28419695a3d79c6e5
Summary:
Reland PR https://github.com/pytorch/pytorch/issues/40056
A new overload of upsample_linear1d_backward_cuda was added in a recent commit, so I had to add the nondeterministic alert to it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41538
Reviewed By: zou3519
Differential Revision: D22608376
Pulled By: ezyang
fbshipit-source-id: 54a2aa127e069197471f1feede6ad8f8dc6a2f82
Summary:
This PR implements a feature extension discussed in https://github.com/pytorch/pytorch/issues/41516.
I followed this other PR https://github.com/pytorch/pytorch/issues/22245 to add this other module. While I was at it, I also added the `extra_repr()` method in `Flatten`, which was missing.
I see there are no unit tests for these modules. Should I add those too? If so, what is the best place to put them?
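A quick sketch, assuming the module added here is `nn.Unflatten`:
```python
import torch

x = torch.randn(2, 3, 4, 5)
flat = torch.nn.Flatten(start_dim=1)(x)                               # shape (2, 60)
unflat = torch.nn.Unflatten(dim=1, unflattened_size=(3, 4, 5))(flat)  # round-trips the shape
print(unflat.shape)  # torch.Size([2, 3, 4, 5])
```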
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41564
Reviewed By: gchanan
Differential Revision: D22636766
Pulled By: albanD
fbshipit-source-id: f9efdefd3ffe7d9af9482087625344af8f990943
Summary:
This test function is confusing since our `assertEqual` behavior allows for tolerance to be specified, and this is a redundant mechanism.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41514
Reviewed By: ngimel
Differential Revision: D22569348
Pulled By: mruberry
fbshipit-source-id: 2b2ff8aaa9625a51207941dfee8e07786181fe9f
Summary:
BCELoss currently uses different broadcasting semantics than numpy. Since previous versions of PyTorch have thrown a warning in these cases telling the user that input sizes should match, and since the CUDA and CPU results differ when sizes do not match, it makes sense to upgrade the size mismatch warning to an error.
We can consider supporting numpy broadcasting semantics in BCELoss in the future if needed.
Closes https://github.com/pytorch/pytorch/issues/40023
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41426
Reviewed By: zou3519
Differential Revision: D22540841
Pulled By: ezyang
fbshipit-source-id: 6c6d94c78fa0ae30ebe385d05a9e3501a42b3652
Summary:
Closes https://github.com/pytorch/pytorch/issues/36977
This avoids the division by zero that was causing NaNs to appear in the output. `AvgPooling2d` and `AvgPooling3d` both had this issue on CPU and CUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41368
Reviewed By: ailzhang
Differential Revision: D22520013
Pulled By: ezyang
fbshipit-source-id: 3ece7829f858f5bc17c2c1d905266ac510f11194
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39342
Many networks such as resnet have adds followed by relu. This op is the
first step in enabling this fused implementation.
Once we have the fused add_relu op, a JIT pass will be written to
replace add + relu patterns with add_relu.
Test Plan:
python test/test_nn.py TestAddRelu
Imported from OSS
Differential Revision: D21822397
fbshipit-source-id: 03df83a3e46ddb48a90c5a6f755227a7e361a0e8
Summary:
fix https://github.com/pytorch/pytorch/issues/40227
Removed the sorting operation in the ModuleDict class and updated the docstring.
Also removed a sort operation in the corresponding unit test, which would otherwise cause the unit test to fail.
BC Note: For Python versions after 3.6, a plain dict preserves the insertion order of its keys.
Example:
For a Python 3.6+ user initializing a ModuleDict instance using a plain Python dict:
{
"b": torch.nn.MaxPool2d(3),
"a": torch.nn.MaxPool2d(3)
}
they will get a ModuleDict which preserves the order:
ModuleDict(
(b): MaxPool2d(kernel_size=3, stride=3, padding=0, dilation=1, ceil_mode=False)
(a): MaxPool2d(kernel_size=3, stride=3, padding=0, dilation=1, ceil_mode=False)
)
For a Python 3.5 user, if we maintain the same input, then the output ModuleDict could be:
ModuleDict(
(a): MaxPool2d(kernel_size=3, stride=3, padding=0, dilation=1, ceil_mode=False)
(b): MaxPool2d(kernel_size=3, stride=3, padding=0, dilation=1, ceil_mode=False)
)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40905
Differential Revision: D22357480
Pulled By: albanD
fbshipit-source-id: 0e2502769647bb64f404978243ca1ebe5346d573
Summary:
This PR aims at tackling https://github.com/pytorch/pytorch/issues/37823 by:
- ensuring that buffers will be used for normalization computation but won't be updated, when buffers are not None, and `track_running_stats=False`
- adding a corresponding unittest to ensure expected behaviour
Any feedback is welcome!
_Note: we might want to update the docstrings of `BatchNorm*d`, feel free to share any suggestion!_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38084
Differential Revision: D22047871
Pulled By: ezyang
fbshipit-source-id: 5acbcad9773e7901f26d625db71d43d7dc236d3e
Summary:
This allows registering hooks that will be executed for every module.
This idea arose in a discussion with tkerola, and niboshi kindly proposed this approach.
The use case for this is to avoid boilerplate code when registering the same hook for all the modules in a complex model, the internal use-case was to allow every model to accept a NumPy array in the forward pass in a simpler way. Other use cases involve general mechanisms for plotting or tracing & debugging.
Currently, this is shared for all the modules but this can be worked out to have the hooks shared only per type of module.
If this functionality is not needed feel free to close the PR.
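A minimal sketch, assuming the global-hook entry point landed as `register_module_forward_hook`:
```python
import torch

# A hook run after every module's forward; here it just prints output shapes.
def print_shapes(module, inputs, output):
    print(type(module).__name__, tuple(output.shape))

handle = torch.nn.modules.module.register_module_forward_hook(print_shapes)
model = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.ReLU())
model(torch.randn(2, 4))
handle.remove()  # detach the hook when done
```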
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38972
Differential Revision: D22091364
Pulled By: albanD
fbshipit-source-id: 204ff5f9e119eff5bdd9140c64cb5dc467bb23a2
Summary:
This updates assertEqual and assertEqual-like functions to require that either both or neither of atol and rtol be specified. This should improve clarity around handling precision in the test suite, and it allows us to remove the legacy positional atol argument from assertEqual. In addition, the "message" kwarg is replaced with a kwarg-only "msg" argument whose name is consistent with unittest's assertEqual argument.
In the future we could make "msg" an optional third positional argument to be more consistent with unittest's assertEqual, but requiring it be specified should be clear, and we can easily update the signature to make "msg" an optional positional argument in the future, too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38872
Differential Revision: D21740237
Pulled By: mruberry
fbshipit-source-id: acbc027aa1d7877a49664d94db9a5fff91a07042
Summary:
Fix https://github.com/pytorch/pytorch/issues/38764
The current problem is that the `top_diff` and `top_mask` pointers are shifted "accumulatively" with the for-n and for-c loops. This may cause overflow and illegal memory access when the loop counts are greater than one, that is, n > 65535 or c > 65535 (the case in https://github.com/pytorch/pytorch/issues/38764). Since neither n > 65535 nor c > 65535 is common, this has not been seen before. The simple fix is to use new pointer variables for the n & c offsets instead of directly modifying `top_diff` or `top_mask`.
However, I think the current nchw max_pool2d GPU impl still has plenty of room for performance improvement. We can check that in a later PR if needed.
Slightly clean up the indentation. Also add tests to use CPU impl as a reference check.
cc skrah
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38953
Differential Revision: D21721930
Pulled By: ezyang
fbshipit-source-id: fef7d911d814f8ed9fd67c60cabe5d52f8fd3d57
Summary:
This updates assertEqual and assertEqual-like functions to require that either both or neither of atol and rtol be specified. This should improve clarity around handling precision in the test suite, and it allows us to remove the legacy positional atol argument from assertEqual. In addition, the "message" kwarg is replaced with a kwarg-only "msg" argument whose name is consistent with unittest's assertEqual argument.
In the future we could make "msg" an optional third positional argument to be more consistent with unittest's assertEqual, but requiring it be specified should be clear, and we can easily update the signature to make "msg" an optional positional argument in the future, too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38872
Differential Revision: D21717199
Pulled By: mruberry
fbshipit-source-id: 9feb856f94eee911b44f6c7140a1d07c1b026d3a
Summary:
Fixes https://github.com/pytorch/pytorch/issues/38839. Previously, if magnitude of input values was large, when computing `max+log(sum)` the `log(sum)` value was essentially ignored, now the result is computed as
`x-max-log(sum)` which has a better chance of preserving accuracy.
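A small numerical sketch of the formula above (not part of the original change):
```python
import torch

# With large-magnitude inputs, exp(x) overflows, but x - max - log(sum(exp(x - max)))
# stays finite and keeps the log(sum) term from being swallowed by max.
x = torch.tensor([1000.0, 1000.0, 1000.0], dtype=torch.float64)
m = x.max()
stable = x - m - torch.log(torch.exp(x - m).sum())
print(stable)                       # tensor([-1.0986, -1.0986, -1.0986], dtype=torch.float64)
print(torch.log_softmax(x, dim=0))  # matches the stabilized computation
```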
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38945
Differential Revision: D21712483
Pulled By: ngimel
fbshipit-source-id: c1a3599ed981ba7a7fd130cbd7040a706b7eace0
Summary:
CC ezyang xw285cornell sunway513
Commit 59d92e442b (https://github.com/pytorch/pytorch/issues/38557) has caused this test to regularly fail on ROCm CI gfx900 hosts. Skipping test until root cause analysis can complete.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38724
Differential Revision: D21645815
Pulled By: xw285cornell
fbshipit-source-id: 4087e9565710c271ca5c026a5ae0c5132e56f44d
Summary:
Per title.
We move all the individual gradient norms to a single device before stacking (a no-op if all the gradients are already on a single device); `clip_coef` is copied to the device of each gradient, which may be suboptimal as there could be multiple copies, but no worse than when we were synchronizing for each parameter. In the simple case where all gradients are on a single device, there should be no synchronization.
Also, we no longer error out if the parameter list is empty or none of the parameters have gradients; we return a total_norm of 0 instead.
Fixes https://github.com/pytorch/pytorch/issues/38605
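A brief usage sketch of the behavior described above (illustrative only):
```python
import torch
from torch import nn
from torch.nn.utils import clip_grad_norm_

# Normal case: a single total_norm is returned for all parameters.
lin = nn.Linear(4, 2)
lin(torch.randn(8, 4)).sum().backward()
total_norm = clip_grad_norm_(lin.parameters(), max_norm=1.0)
print(float(total_norm))

# Empty parameter list: now returns a total_norm of 0 instead of raising.
print(float(clip_grad_norm_([], max_norm=1.0)))  # 0.0
```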
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38615
Reviewed By: ailzhang
Differential Revision: D21634588
Pulled By: ngimel
fbshipit-source-id: ea4d08d4f3445438260052820c7ca285231a156b
Summary:
Edit: this has been updated to reflect the PR's current status, which has changed after review.
This PR updates the behavior of the assertEqual, assertNotEqual, and assert_allclose to be consistent with each other and torch.isclose. It corrects several additional bugs in the current implementations and adds extensive testing and comments, too.
These updates follow from changes to assertEqual like https://github.com/pytorch/pytorch/pull/34258 and https://github.com/pytorch/pytorch/pull/37069, and from our discussion of torch.isclose for complex tensors (see https://github.com/pytorch/pytorch/issues/36462), where we decided to implement a NumPy-compatible mathematical notion of "closeness" for complex tensors that is not a great fit for our testing framework.
The detailed changelist is:
- New test framework functions for comparing tensors and scalars
- Tensors are compared using isclose; the real and imaginary parts of complex tensors are compared independently (a short sketch follows this list)
- Scalars are compared using the same algorithm
- assertEqual and assert_allclose now use this common comparison function, instead of each implementing their own with divergent behavior
- assertEqual-like debug messages are now available for all tensor and scalar comparisons, with additional context when comparing the components of sparse, quantized, and complex tensors
- Extensive testing of the comparison behavior and debug messages
- Small Updates
- assertEqual now takes an "exact_device" argument, analogous to "exact_dtype", which should be useful in multidevice tests
- assertEqual now takes an "equal_nan" argument for argument consistency with torch.isclose
- assertEqual no longer takes the "allow_inf" keyword, which misleadingly only applied to scalar comparisons, was only ever set (rarely) to true, and is not supported by torch.isclose
- Bug fixes:
- the exact_dtype attribute has been removed (no longer needed after https://github.com/pytorch/pytorch/pull/38103)
- message arguments passed to assertEqual are now handled correctly
- bool x other dtype comparisons are now supported
- uint8 and int8 tensor comparisons now function properly
- rtol for integer comparisons is now supported (default is zero)
- rtol and atol for scalar comparisons are now supported
- complex scalar comparisons are now supported, analogous to complex tensor comparisons
- assertNotEqual is now equivalent to the logical negation of assertEqual
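A short sketch of the comparison rule for complex tensors, using explicit example tolerances (not the framework's defaults):
```python
import torch

# Complex tensors are considered "equal" when the real and imaginary parts
# are each close under isclose, checked independently.
a = torch.tensor([1.0 + 1.0j, 2.0 - 3.0j])
b = a + 1e-7                      # perturb the real parts slightly
rtol, atol = 1.3e-6, 1e-5         # example tolerances for this sketch
close = (torch.isclose(a.real, b.real, rtol=rtol, atol=atol) &
         torch.isclose(a.imag, b.imag, rtol=rtol, atol=atol))
print(bool(close.all()))          # True
```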
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37294
Differential Revision: D21596830
Pulled By: mruberry
fbshipit-source-id: f2576669f7113a06f82581fc71883e6b772de19b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35620
Python 2 has reached end-of-life and is no longer supported by PyTorch.
`self.subTest` can be used directly in Python 3.
Test Plan: CI
Differential Revision: D20842872
Pulled By: dreiss
fbshipit-source-id: 6ad42550c01e6959821ff07df767fc14b58c5a9e
Summary:
Add read/write vectorization to non-persistent softmax kernels only. At this point launch logic has minimal changes, and `ILP=vectorization=2` is always used (the code can handle other values, but `ILP=2` has been the most consistent performer).
Dispatch to persistent / non-persistent kernels is unchanged.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36485
Differential Revision: D21477775
Pulled By: ngimel
fbshipit-source-id: 9ff7fd243695d7bbf4121390085b64db0bbdef35
Summary:
Fix https://github.com/pytorch/pytorch/issues/37680
Makes two changes:
- Add `argmin`, `argmax` and `argsort` to the list of non-differentiable functions to prevent them from generating outputs with requires_grad=True.
- Add a check to make sure we don't add such functions to the codegen by mistake.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37789
Differential Revision: D21389201
Pulled By: albanD
fbshipit-source-id: 6a7617e389e893f6f813d50f02700d32300b1386
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36815
PyTorch does not have a native channel shuffle op.
This diff adds one for both FP and quantized tensors.
The FP implementation is an inefficient one. For quantized tensors there is a native
QNNPACK op for this.
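For reference, channel shuffle can be expressed with reshape + transpose, which is roughly what an unoptimized FP fallback does (illustration only, not the op added by this diff):
```python
import torch

def channel_shuffle_reference(x, groups):
    # NCHW -> (N, groups, C // groups, H, W) -> swap the group axes -> NCHW
    n, c, h, w = x.shape
    return (x.reshape(n, groups, c // groups, h, w)
             .transpose(1, 2)
             .reshape(n, c, h, w))

x = torch.arange(8.0).reshape(1, 8, 1, 1)
print(channel_shuffle_reference(x, groups=2).flatten().tolist())
# [0.0, 4.0, 1.0, 5.0, 2.0, 6.0, 3.0, 7.0]
```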
ghstack-source-id: 103267234
Test Plan:
buck run caffe2/test:quantization --
quantization.test_quantized.TestQuantizedOps.test_channel_shuffle
The x86 implementation in QNNPACK is SSE2, so this may not be the most efficient for x86.
Reviewed By: dreiss
Differential Revision: D21093841
fbshipit-source-id: 5282945f352df43fdffaa8544fe34dba99a5b97e
Summary:
To address one of the problems with RNNs that emerged in https://github.com/pytorch/pytorch/issues/33618, I modified the `remove` methods in `torch.nn.utils.prune` and `torch.nn.utils.weight_norm` to make an explicit call to `setattr`, which, in `rnn.py` directly modifies `_flat_weights` (https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/rnn.py#L96) to include the new element.
This is important so that `_flat_weights` can reflect the presence of the `Parameter` after the (pruning or weight norm) reparametrization is removed. Without this, the weight in `_flat_weights` would remain a tensor, as originally set by the reparametrization.
Simple testing is added, which depends on the current naming scheme for the LSTM module.
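A small usage sketch of the behavior described above (illustrative only):
```python
import torch
from torch import nn
from torch.nn.utils import weight_norm, remove_weight_norm

# After `remove`, the reparametrized weight is re-registered via setattr, so the
# LSTM ends up with a real Parameter again rather than a plain tensor.
lstm = nn.LSTM(4, 8)
lstm = weight_norm(lstm, name="weight_hh_l0")
lstm = remove_weight_norm(lstm, name="weight_hh_l0")
print(isinstance(lstm.weight_hh_l0, nn.Parameter))  # True
out, _ = lstm(torch.randn(3, 1, 4))                 # the module still works
```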
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34170
Differential Revision: D21265965
Pulled By: mickypaganini
fbshipit-source-id: 29de4a6b17052d42ccfe67c8560b7f83c20fd09d
Summary:
Hi everyone,
This is a super small PR to enable `uint8` support for `nearest` up-sampling on `cpu` and `cuda`.
This work enables us to move forward with the support of `uint8` images in `torchvision`.
See impacted issues:
https://github.com/pytorch/vision/issues/1375 and https://github.com/pytorch/vision/issues/1179#issuecomment-558197607
Note: I wanted to add a unit test to ensure we have the expected behavior. I could not locate the `upsampling` unit tests for `nearest`. I can add the test if you point me to the right location.
Thanks
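A quick check of the enabled behavior (illustrative only):
```python
import torch
import torch.nn.functional as F

# Nearest upsampling now accepts uint8 inputs on cpu (and cuda when available).
img = torch.arange(4, dtype=torch.uint8).reshape(1, 1, 2, 2)
up = F.interpolate(img, scale_factor=2, mode="nearest")
print(up.dtype, up.shape)  # torch.uint8 torch.Size([1, 1, 4, 4])
```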
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35029
Reviewed By: cpuhrsch
Differential Revision: D21227144
Pulled By: fmassa
fbshipit-source-id: 33c4b5188dedd8f7f872e9d797e2a9b58ee7315c
Summary:
Resolves https://github.com/pytorch/pytorch/issues/36730 and https://github.com/pytorch/pytorch/issues/36057
Partially resolves: https://github.com/pytorch/pytorch/issues/36671
```
>>> 2j / torch.tensor([4], dtype = torch.complex64)
tensor([(0.0000+0.5000j)], dtype=torch.complex64)
>>> 1 / torch.tensor(3+4j)
tensor((0.1200-0.1600j), dtype=torch.complex64)
```
rdiv is more generally broken for all dtypes because it doesn't promote the types properly
eg.
```
>>> 1 / torch.tensor(2)
tensor(0)
>>> 2j / torch.tensor(4)
tensor(0)
```
so that issue should be fixed in a separate PR
Adding CPU acc types for complex
Added cumsum, cumprod for complex dtypes
Added complex dtypes to get_all_math_dtypes to expand testing for complex dtypes
Old PR - https://github.com/pytorch/pytorch/pull/36747
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37193
Differential Revision: D21229373
Pulled By: anjali411
fbshipit-source-id: 8a086136d8c10dabe62358d276331e3f22bb2342
Summary:
We should have
```C++
for (auto& sub_iter : iter.with_32bit_indexing()) {
launch_prelu_cuda_backward_share_weights_kernel(sub_iter, weight_data);
}
```
But I mistakenly wrote it as
```C++
for (auto& sub_iter : iter.with_32bit_indexing()) {
launch_prelu_cuda_backward_share_weights_kernel(iter, weight_data);
}
```
in my previous PR, which leads to infinite recursion.
I found this bug when working on https://github.com/pytorch/pytorch/pull/34004
I also add a `TORCH_INTERNAL_ASSERT_DEBUG_ONLY` to test for this.
Besides, the caller is already guaranteed contiguous, so we don't need to handle non-contiguous tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36134
Differential Revision: D21187542
Pulled By: VitalyFedyunin
fbshipit-source-id: 0fafdd7b672bf89fcaa2b42e08b7d41ade7e6bcb
Summary:
This pull request extends the fallback implemented in https://github.com/pytorch/pytorch/issues/31383 to not use MIOpen for tensors where number of elements in a tensor exceeds INT_MAX. The PR also enables the corresponding test in TestNN
cc: ezyang jeffdaily
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37110
Differential Revision: D21196336
Pulled By: ezyang
fbshipit-source-id: 25fd80308a0e2f7941c249735674ebc85d3fd39e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35615
Python 2 has reached end-of-life and is no longer supported by PyTorch.
Now we can clean up a lot of cruft that we put in place to support it.
These changes were all done manually, and I skipped anything that seemed
like it would take more than a few seconds, so I think it makes sense to
review it manually as well (though using side-by-side view and ignoring
whitespace change might be helpful).
Test Plan: CI
Differential Revision: D20842886
Pulled By: dreiss
fbshipit-source-id: 8cad4e87c45895e7ce3938a88e61157a79504aed
Summary:
Resolves https://github.com/pytorch/pytorch/issues/36730 and https://github.com/pytorch/pytorch/issues/36057
Partially resolves: https://github.com/pytorch/pytorch/issues/36671
```
>>> 2j / torch.tensor([4], dtype = torch.complex64)
tensor([(0.0000+0.5000j)], dtype=torch.complex64)
>>> 1 / torch.tensor(3+4j)
tensor((0.1200-0.1600j), dtype=torch.complex64)
```
rdiv is more generally broken for all dtypes because it doesn't promote the types properly
eg.
```
>>> 1 / torch.tensor(2)
tensor(0)
>>> 2j / torch.tensor(4)
tensor(0)
```
so that issue should be fixed in a separate PR
Adding CPU acc types for complex
Added cumsum, cumprod for complex dtypes
Added complex dtypes to get_all_math_dtypes to expand testing for complex dtypes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36747
Differential Revision: D21138687
Pulled By: anjali411
fbshipit-source-id: ad3602ccf86c70294a6e71e564cb0d46c393dfab
Summary:
Add support to accept float, byte, and bool tensors for `attn_mask`. No breakage is expected. (A usage sketch follows the list below.)
- If a bool tensor is provided, positions with `True` are not allowed to attend while `False` values will be unchanged.
- if a byte tensor is provided, it will be converted to bool tensor. Positions with non-zero are not allowed to attend while zero values will be unchanged.
- If a float tensor is provided, it will be added to the attention weight.
Note: the behavior of the float mask tensor is slightly different from the first two options because it is added to the attention weight, rather than calling `masked_fill_` function. Also, converting a byte tensor to bool tensor within `multi_head_attention_forward` causes extra overhead. Therefore, a bool mask is recommended here.
For `key_padding_mask`:
- If a bool tensor is provided, the positions with the value of `True` will be ignored while the positions with the value of `False` will be unchanged.
- If a byte tensor is provided, the positions with the value of non-zero will be ignored while the positions with the value of zero will be unchanged.
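Usage sketch for the recommended bool `attn_mask` (illustrative only):
```python
import torch
from torch import nn

# True positions in a bool attn_mask are not allowed to attend.
mha = nn.MultiheadAttention(embed_dim=8, num_heads=2)
q = k = v = torch.randn(5, 1, 8)          # (seq_len, batch, embed_dim)
attn_mask = torch.zeros(5, 5, dtype=torch.bool)
attn_mask[:, -1] = True                   # block attention to the last position
out, weights = mha(q, k, v, attn_mask=attn_mask)
print(weights[..., -1])                   # zero attention weight on the masked position
```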
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33763
Differential Revision: D20925358
Pulled By: zhangguanheng66
fbshipit-source-id: de174056be183cdad0f3de8024ee0a3c5eb364c9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36355
Resolving the issue in https://github.com/pytorch/pytorch/issues/36155 by:
- supporting grouped conv3d in `slow_conv3d` (a small check follows this list)
- adding a fast path in `__convolution` to call `slow_conv3d` when running grouped conv3d on CPU
- bypassing unfolding when kernel_size = 1
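A minimal check of grouped conv3d on CPU, the case addressed above (illustrative only):
```python
import torch
from torch import nn

conv = nn.Conv3d(in_channels=4, out_channels=8, kernel_size=1, groups=2)
x = torch.randn(2, 4, 3, 5, 5, requires_grad=True)
out = conv(x)
out.sum().backward()
print(out.shape, x.grad.shape)  # torch.Size([2, 8, 3, 5, 5]) torch.Size([2, 4, 3, 5, 5])
```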
Test Plan:
Added the following test cases in test_nn.py, testing both forward and
backward:
- test_Conv3d_groups_nobias
- test_Conv3d_groups_wbias
- test_Conv_1x1
Imported from OSS
Differential Revision: D20957073
fbshipit-source-id: 29afd1e6be8c484859eaedd51463954e2fdccc38
Summary:
hardsigmoid_backward is implemented on the XLA side, so the test will not error out but is really slow due to a lot of recompilation. Enable the test on the PyTorch side but skip it on the XLA side so XLA can control when to enable it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36967
Differential Revision: D21149113
Pulled By: ailzhang
fbshipit-source-id: fc337622fafa7be9cff2631de131980ea53adb8d
Summary:
`skipIfRocm` skips the test on ROCm regardless of device type [CPU or GPU]. `skipCUDAIfRocm` skips only on GPU on ROCm and runs the test on CPU.
ezyang iotamudelta
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36968
Differential Revision: D21149721
Pulled By: ezyang
fbshipit-source-id: 361811b0b307f17193ad72ee8bcc7f2c65ce6203
Summary:
In the CUDA version of max_pool3d backward, function `max_pool3d_with_indices_backward_out_frame` is defined with args as `..., oheight, owidth, ...` but called with `..., owidth, oheight, ...`. As a result gradients are not fully calculated along the longer dimension due to insufficient grid size.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36820
Differential Revision: D21120078
Pulled By: ngimel
fbshipit-source-id: d061726647a4a45d45d5c1a00f2f1cf2745726a8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34258
This PR allows both atol and rtol to be specified, uses defaults based on the prior analysis (spreadsheet attached to https://github.com/pytorch/pytorch/pull/32538), but retains the absolute tolerance behavior in cases where precision was previously specified explicitly.
Test Plan: Imported from OSS
Differential Revision: D21110255
Pulled By: nairbv
fbshipit-source-id: 57b3a004c7d5ac1be80ee765f03668b1b13f4a7e
Summary:
This pull request changes the datatype for `test_RNN_cpu_vs_cudnn_no_dropout` on ROCm testing to float.
Currently MIOpen RNN does not support double datatype, so using only double would not run this test using MIOpen. To correctly test PyTorch RNN operator using MIOpen, we would need to test it using float tensors and module.
The changes in this PR addresses the comments in https://github.com/pytorch/pytorch/issues/34615
ezyang iotamudelta
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36772
Differential Revision: D21089533
Pulled By: ezyang
fbshipit-source-id: b5781e4ca270d64c6b949b3f0436e7b4eb870e27
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36736
Fixes: https://github.com/pytorch/pytorch/issues/36499
Changes:
1) Moves some bindings from LegacyNNDefinitions to Activation so all of log_sigmoid lives together
2) Properly handle non-contiguous / incorrectly sized out parameters to log_sigmoid. This is done by copying from a buffer if necessary.
3) Require that the internal buffer (different from 2)) is contiguous. This should always be the case because it's always created internally.
4) Adds a test
Test Plan: Imported from OSS
Differential Revision: D21070934
Pulled By: gchanan
fbshipit-source-id: 94577313c32d1ef04d65c1d6657598304a39fe6e
Summary:
The test case exercised in `test_upsamplingNearest2d_launch_fail` will fail on ROCm: the maximum grid size per dimension on ROCm is 4294967295 (0xffffffff), so the tensor dims in `test_upsamplingNearest2d_launch_fail` actually give correct results there.
This PR adds the test case `test_upsamplingNearest2d_launch_rocm` for the ROCm scenario only; it is essentially the same as `test_upsamplingNearest2d_launch_fail` without the expected-failure decorator.
ezyang iotamudelta
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36624
Differential Revision: D21050330
Pulled By: ezyang
fbshipit-source-id: d7370c97eaab98f382f97052ed39cc168a3bfa71
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36420
Adds a unit test for hardswish backward pass
Test Plan:
Unit test passes on cpu and cuda
Imported from OSS
Differential Revision: D20994100
fbshipit-source-id: 579df709cc2d92fce3b9a0eeb6faeb9fe8d2f641
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36351
Adds CUDA kernels for hardsigmoid, to enable its use in training.
Note: the update to the cpu backward pass is to keep the cpu vs cuda
logic consistent, no change in functionality.
Test Plan:
add CI for the forward pass
run this for the backward pass:
https://gist.github.com/vkuzo/95957d365600f9ad10d25bd20f58cc1a
Imported from OSS
Differential Revision: D20955589
fbshipit-source-id: dc198aa6a58e1a7996e1831f1e479c398ffcbc90
Summary:
soumith ezyang albanD After lots of experiments, I didn't manage to directly print the gradients of Fold/Unfold backward (let me know if I am wrong).
Thus, in my test code, I compare the gradients of Fold/Unfold backward implicitly by comparing the gradients of the operation that follows it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36379
Differential Revision: D21040646
Pulled By: ezyang
fbshipit-source-id: dafdbfe2c7b20efa535402c7f81fce5c681fce2f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36411
This PR removes the PyTorch-specific assertWarns definition and uses the unittest one; it also formats some tests.
Test Plan: Imported from OSS
Differential Revision: D20998159
Pulled By: wanchaol
fbshipit-source-id: 1280ecff2dd293b95a639d13cc7417fc819c2201
Summary:
This PR would fix https://github.com/pytorch/pytorch/issues/35202, fix GPU part of https://github.com/pytorch/pytorch/issues/24823, be related to https://github.com/pytorch/pytorch/issues/24870.
Here is the origin of this problem.
1. Like those in https://github.com/pytorch/pytorch/issues/35202, with large numbers in grid like `grid.min() == -10059144 grid.max()==67680944`; or `nan, inf, 1.0E20` in https://github.com/pytorch/pytorch/issues/24823,
4d39aeec27/aten/src/ATen/native/cuda/GridSampler.cu (L309-L321)
`ix, iy` will be unnormalized to very large numbers, exceed the bound of INT_MAX.
Then, those `ix_nw, iy_nw` variables will be cast to INT_MAX, and some other variables with "+1" will be INT_MIN.
2. However, these INT_MAX, INT_MIN should not big problems, because
4d39aeec27/aten/src/ATen/native/cuda/GridSampler.cu (L358-L362)
4d39aeec27/aten/src/ATen/native/cuda/GridSampler.cuh (L202-L205)
these `within_bounds_2d` functions are supposed to guard the if-statement, prevent the illegal memory access, and leave those output values as zero (padding_modes='zeros').
3. Now here comes the problem, `within_bounds_2d` is set to "inline". We found that those `+1` statement and `>=0` statement may cause compiler to "optimize" the code, that is:
```cpp
int B = something;
int a = something;
int b = a + 1;
bool r = (b >= 0 && b < B);
```
will be compiled into assembly code like
```cpp
int B = something;
int a = something;
bool r1 = (a > -2);
int b = a + 1;
bool r2 = (b < B);
bool r = r1 && r2;
```
This looks nice, but when a = INT_MAX, `a+1` causes undefined behavior. Typically we get b = INT_MIN, and the boolean result from the compiled code will then be true, so `within_bounds_2d` no longer guards us from the illegal memory access.
4. There could be different ways to fix this bug. For example, we could make all of the `ix_nw, iy_nw` values `int64_t`. That would be a potential performance issue, and it doesn't prevent those examples in https://github.com/pytorch/pytorch/issues/24823 with 1E20 in grid.
One minimal fix that I found is to prevent `within_bounds_2d` from being inlined. That way, the compiler won't optimize the `a+1` and `a>=0` code together.
I did a short performance test, just to make sure this forced noinline solution won't cause a regression. The performance script can be found at
a6f8bce522/grid-sample/grid-sample.ipynb.
For this `__attribute__((noinline))` macro, I have tested that on nvcc, and there was no problem. I'm not sure if that also works on clang.
cc csarofeen ptrblck ngimel bnehoran zasdfgbnm SsnL
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35506
Differential Revision: D20799304
Pulled By: ngimel
fbshipit-source-id: fc70289b35039fad954908a990ab0a2f16fbfcb2
Summary:
As described in https://github.com/pytorch/pytorch/issues/33934, the current attribute error in `nn.Module`'s properties are wrong.
```python
from torch import nn

class MyModule(nn.Module):
    @property
    def something(self):
        hey = self.unknown_function()
        return hey

model = MyModule()
print(model.something)
```
This raises `AttributeError: 'MyModule' object has no attribute 'something'` when what we want is `AttributeError: MyModule instance has no attribute 'unknown_function'`.
This fixes this issue and will make properties much easier to debug !
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34324
Differential Revision: D20645563
Pulled By: ezyang
fbshipit-source-id: 130f861851bdbef43803569a5ce9e24d2b942179
Summary:
This adds the `trunc_normal_` function to `torch.nn.init` which allows for modifying tensors in-place to values drawn from a truncated normal distribution. I chose to use the inverse CDF method to implement this. I have included the appropriate code in `test_nn.py` for verifying that the values are from the correct distribution.
Reasons I chose this method:
1. Easily implemented to operate on memory in place, as the other initializers are.
1. No resampling delays
1. This method's main weakness is unlikely to be an issue. While the inverse CDF method can fail to generate the correct distribution when `b < mean` or `mean < a`, I expect users will choose `a` and `b` so that `a < mean < b`. This method is extremely effective in this case.
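Usage sketch of the new initializer (illustrative only):
```python
import torch
from torch import nn

# Fill a tensor in place with values from a normal distribution truncated to [a, b].
w = torch.empty(3, 5)
nn.init.trunc_normal_(w, mean=0.0, std=1.0, a=-2.0, b=2.0)
print(bool(w.min() >= -2.0), bool(w.max() <= 2.0))  # True True
```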
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32397
Differential Revision: D20550996
Pulled By: ezyang
fbshipit-source-id: 298a325043a3fd7d1e24d266e3b9b6cc14f81829
Summary:
This PR would fix https://github.com/pytorch/pytorch/issues/34736. Both code snippet in that issue can now execute normally. More tests are also added.
This PR is a follow-up on https://github.com/pytorch/pytorch/issues/34519, where one variable was mistakenly missed when updating the max_pool2d kernel.
This PR also uses accumulate type of scalar_t in the backward kernel, which resolves the numerical precision issue when stride < kernel_size on fp16.
cc csarofeen ptrblck jjsjann123 VitalyFedyunin ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34934
Differential Revision: D20512062
Pulled By: VitalyFedyunin
fbshipit-source-id: a461ebbb3e3684aa183ae40e38d8f55bb6f4fee1
Summary:
This PR implements channels-last upsampling nearest for 2D/3D.
This is supposed to be faster and avoids converting formats going into
and out of the operator.
Will post benchmarking numbers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34597
Test Plan: python test/test_nn.py TestNN.test_upsamplingNearest3d_channels_last
Differential Revision: D20390583
Pulled By: kimishpatel
fbshipit-source-id: e0162fb97604a261887f38fc957d3f787c80954e
Summary:
…without lapack
LAPACK is needed for `at::svd`, which is called from `pinverse()`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34686
Test Plan: CI + local run
Differential Revision: D20442637
Pulled By: malfet
fbshipit-source-id: b3531ecc1197b0745ddcf50febb7fb4a7700d612
Summary:
This PR would fix https://github.com/pytorch/pytorch/issues/33988 and fix https://github.com/pytorch/pytorch/issues/34083.
Previously, the max_pool2d_nhwc kernels used shared memory whose size was proportional to the tensor size (c \* h \* w). When the tensor size is too large, the kernel launch fails.
This PR follows the guidance in AdaptiveAvgPool2d_nhwc by increasing grid_x with a split in the "C" dimension. With that change, there is a maximum limit on the shared memory size (less than 48 KB) regardless of tensor size.
A benchmark can be found at [here](0b98146089/max-pool2d/max-pool2d.ipynb). TL;DR barely any performance drop is found.
cc csarofeen ptrblck jjsjann123 VitalyFedyunin
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34519
Differential Revision: D20388848
Pulled By: VitalyFedyunin
fbshipit-source-id: 9454f385f9315afaab4a05303305578bbcd80b87
Summary:
This PR enables bfloat16 type for
- Embedding, Index, Sigmoid Ops used in [DLRM](https://github.com/facebookresearch/dlrm)
- Miscellaneous ops like comparison ops, arange op used in unit tests
- Rename types list with the pattern `*_with_bfloat16` in `test_torch.py` to avoid confusion
iotamudelta ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34630
Differential Revision: D20405093
Pulled By: ezyang
fbshipit-source-id: aa9538acf81b3a5a9a46ce5014529707fdf25687
Summary:
This PR enables bfloat16 type for loss criterion ops(and the ops they depend on) and few miscellaneous ops required to train resnet50.
iotamudelta ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34469
Differential Revision: D20348856
Pulled By: ezyang
fbshipit-source-id: 0a8f06c2169cfa3c9cf319120e27150170095f6c
Summary:
This allows us to enable some double-based pdist tests that previously ran into accumulated error from casting down to float.
Addresses https://github.com/pytorch/pytorch/issues/33128
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34103
Differential Revision: D20343279
Pulled By: ezyang
fbshipit-source-id: a2da768259fab34ef326976283b7a15bebbbb979
Summary:
Please merge after https://github.com/pytorch/pytorch/pull/33073
With that PR, we are now trying different algorithms when OOM, so hopefully there will be some algo working at low memory.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34259
Differential Revision: D20310094
Pulled By: ngimel
fbshipit-source-id: bccd8162bd06a0e54ac6f42a7fd9a5b766f92cd7
Summary:
This PR enables bfloat16 type for pooling ops on ROCm. Also adds bfloat16 implementation of atomicAdd since pooling ops use it.
Note: the changes in the lambda function blocks are only indentation, as they are now wrapped inside the `AT_SKIP_BFLOAT16_IF_NOT_ROCM` macro.
iotamudelta ezyang bddppq
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34166
Differential Revision: D20263421
Pulled By: ezyang
fbshipit-source-id: 3f4199ec57522e638ec29f45e22c6ec919b7816d
Summary:
Fixes https://github.com/pytorch/pytorch/issues/23925
This fixes the incorrect gradients returned by `F.grid_sample` at image borders under `"border"` and `"reflection"` padding modes.
At nondifferentiable points, the choice of which gradient to return among its super- or subgradients is rather arbitrary and generally does not affect training. Before this change, however, a bug in the code meant that the gradient returned at the exact borders was not selected from among the super- or subgradients.
The gradient is now set to zero at the borders, which is a defensible choice for both the `"border"` and `"reflection"` padding modes:
* For `"border"` padding, this effectively means that the exact borders of the image are now considered out of bounds, and therefore receive zero gradient.
* For `"reflection"` padding, this effectively treats the exact borders as extrema.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32829
Differential Revision: D20118564
Pulled By: soumith
fbshipit-source-id: ef8571ff585be35ab1b90a922af299f53ab9c095
Summary:
This PR improves performance of EmbeddingBag on cuda by removing 5 kernel launches (2 of those are synchronizing memcopies).
- 2 memcopies check that the values of offsets[0] and offsets[-1] are in the expected range (0 for the former, less than the number of indices for the latter). It seems strange to check only those 2 values: if users provide invalid offsets, invalid values can be anywhere in the array, not only in the first and last element. After this PR, the checks are skipped on cuda; the first value is forced to 0, and if the last value is larger than expected, the cuda kernel will assert. This is less nice than a ValueError, but then again, the kernel could have asserted if other offset values were invalid. On the cpu, the checks are moved from functional.py inside the cpu implementation, and will throw RuntimeError instead of ValueError.
- 3 or 4 initializations (depending on the mode) of the output tensors with .zeros() are unnecessary, because every element of those tensors is written to, so their data can be left uninitialized at the start.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33589
Reviewed By: jianyuh
Differential Revision: D20078011
Pulled By: ngimel
fbshipit-source-id: 2fb2e2080313af64adc5cf1b9fc6ffbdc6efaf16
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33008
Corrects D19373507 to allow valid use cases that fail now. Multiplies batch size by the number of elements in a group to get the correct number of elements over which statistics are computed.
**Details**:
The current implementation disallows GroupNorm from being applied to tensors of shape e.g. `(1, C, 1, 1)`, to prevent cases where statistics are computed over 1 element and thus result in a tensor filled with zeros.
However, in GroupNorm the statistics are calculated across channels. So in the case where one has an input tensor of shape `(1, 256, 1, 1)` for `GroupNorm(32, 256)`, the statistics will be computed over 8 elements and thus be meaningful.
One use case is [Atrous Spatial Pyramid Pooling (ASPPPooling)](791c172a33/torchvision/models/segmentation/deeplabv3.py (L50)), where GroupNorm could be used in place of BatchNorm [here](791c172a33/torchvision/models/segmentation/deeplabv3.py (L55)). However, now this is prohibited and results in failures.
The proposed solution consists of correcting the computation of the number of elements over which statistics are computed: the number of elements per group is factored into the batch size.
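A quick check of the case described above (illustrative only):
```python
import torch
from torch import nn

# With input (1, 256, 1, 1) and 32 groups, statistics are computed over
# 8 channels per group, so this is now allowed and produces meaningful output.
gn = nn.GroupNorm(32, 256)
x = torch.randn(1, 256, 1, 1)
y = gn(x)
print(y.shape, bool(y.abs().sum() > 0))  # torch.Size([1, 256, 1, 1]) True
```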
Test Plan: check that existing tests pass
Reviewed By: fmassa
Differential Revision: D19723407
fbshipit-source-id: c85c244c832e6592e9aedb279d0acc867eef8f0c
Summary:
Although `gpu_kernel_with_index` might look like a quite general helper function at first look, it actually isn't.
The problem is not only 32bit indexing, but something more fundamental: `TensorIterator` reorder dims and shapes, so if you have non-contiguous tensor such as `torch.empty(5, 5).t()` , the index won't be correct. Since the whole point of `TensorIterator` is to manipulate shapes/strides to speedup loops, it is fundamentally impossible to get the correct linear index without tons of efforts.
Currently, the only reason the range factories are not failing on `out=non_contiguous_tensor` is that `has_internal_overlap` is stupid enough to report everything that is not contiguous as `TOO_HARD`.
Since `gpu_kernel_with_index` is not general, we should move it from `Loops.cuh` to `RangeFactories.cu`. And since the kernel is so simple to implement, it makes no sense to use `TensorIterator` which goes through tons of unnecessary checks like `compute_dtypes`.
`torch.range` is not tested for 64bit-indexing, and I will file a new PR to remove it (it was supposed to be removed at 0.5).
Benchmark:
The device is GTX-1650, I don't have a good GPU at home.
Code:
```python
import torch
print(torch.__version__)
for i in range(100):
    torch.randn(1000, device='cuda')
torch.cuda.synchronize()
for i in range(15, 29):
    %timeit torch.arange(2 ** i, device='cuda'); torch.cuda.synchronize()
```
Before:
```
1.5.0a0+c37a9b8
11.9 µs ± 412 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
12.7 µs ± 309 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
19.6 µs ± 209 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
28.9 µs ± 923 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
48.4 µs ± 1.64 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
85.7 µs ± 1.46 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
162 µs ± 1.09 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
312 µs ± 9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
618 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.22 ms ± 9.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.45 ms ± 97.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.9 ms ± 155 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
10.1 ms ± 378 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
After:
```
1.5.0a0+7960d19
11 µs ± 29.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
12.4 µs ± 550 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
18.4 µs ± 230 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
27.6 µs ± 10.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
46.2 µs ± 18.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
83.3 µs ± 5.61 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
158 µs ± 373 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
307 µs ± 1.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
603 µs ± 112 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.2 ms ± 1.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.4 ms ± 23.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.77 ms ± 25.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.51 ms ± 933 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33370
Differential Revision: D19925990
Pulled By: ngimel
fbshipit-source-id: f4a732fe14a5582b35a56618941120d62e82fdce
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32962
As per gchanan's comments on
https://github.com/pytorch/pytorch/pull/30445, I've used
`torch.set_default_dtype` in test_data_parallel instead of specifying
dtype=torch.double everywhere. Also, renamed dtype2prec to dtype2prec_DONTUSE
ghstack-source-id: 98388429
Test Plan: waitforbuildbot
Differential Revision: D19714374
fbshipit-source-id: eb55bbca33881625636ba9ea6dd4cb692f25668e
Summary:
Stacked PRs
* #32958 - Make zip serialization the default
* **#32244 - Fix some bugs with zipfile serialization**
It includes the following changes:
* Split up tests so that we can test both serialization methods
* Loading something within a buffer doesn't work anymore, so those tests are only on the old serialization method (it's possible but introduces a big slowdown since it requires a linear scan of the entire zipfile to find the magic number at the end)
* Call `readinto` on a buffer if possible instead of `read` + a copy
* Disable CRC-32 checks on read (there was some issue where miniz said the CRC was wrong but `zipinfo` and `unzip` said the zip file was fine)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32244
Pulled By: driazati
Reviewed By: eellison
Differential Revision: D19418935
fbshipit-source-id: df140854f52ecd04236225417d625374fd99f573
Summary:
1. Allows the memory_format of both weight & input to dictate the output memory_format.
2. Provides utility function to recursively convert memory_format of Conv2d and
ConvTranspose2d layers. This allows easy model conversion and ensures that lost
memory_format through incompatible layers could be restored at Convolution-like
layer, where significant performance boost is expected on later generation CUDA
devices (a brief illustration follows).
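A brief illustration of the conversion idea (the recursive utility function itself is not shown here):
```python
import torch
from torch import nn

# Module.to(memory_format=...) converts the 4-d parameters of Conv2d layers to
# channels_last; inputs can be converted the same way.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
model = model.to(memory_format=torch.channels_last)
x = torch.randn(1, 3, 16, 16).to(memory_format=torch.channels_last)
out = model(x)
print(model[0].weight.is_contiguous(memory_format=torch.channels_last))  # True
```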
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32482
Differential Revision: D19647903
Pulled By: VitalyFedyunin
fbshipit-source-id: 62c96ff6208ff5e84fae1f55b63af9a010ad199a
Summary:
Should fix https://github.com/pytorch/pytorch/issues/32346, hopefully. Now when the _flat_weights list is updated, `None` elements are appended to it if some weights are missing; subsequent `setattr` calls for the missing weights should repair _flat_weights and make it suitable for use in the backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32939
Differential Revision: D19710990
Pulled By: ngimel
fbshipit-source-id: c978c7519464e94beeffa9bc33b9172854a2f298
Summary:
The `BatchNorm*` part of the issue (see gh-12013) seems to have been fixed in the master branch and these tests would make it concrete.
However I would appreciate comments on https://github.com/pytorch/pytorch/issues/12013#issuecomment-575871264 on whether the current behaviour is satisfactory.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32384
Differential Revision: D19704154
Pulled By: ngimel
fbshipit-source-id: 1bbbbf1ae1215a460b22cf26e6b263e518ecf60b
Summary:
Power and x86 are giving slightly different results when scaling images up using `torch.nn.functional.interpolate` and when using OpenCV's `resize`. This is causing `test_upsampling_not_recompute_scale_factor` to fail on Power, but not x86. This changes the expected value to what OpenCV on Power produces if the test case is running on Power as well.
See https://github.com/pytorch/pytorch/issues/31915
ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32786
Differential Revision: D19672053
Pulled By: ezyang
fbshipit-source-id: 3497f852bdc6d782646773792f9107c857c7b806
Summary:
Make batch norm with empty inputs return zero parameter gradients. Batch norm, group norm, and convolutions now all return zero grads for their parameters, so make tests check that. Fixes some bullet points in https://github.com/pytorch/pytorch/issues/12013 (interpolate is not fixed by this PR; it is being fixed in other PRs).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32820
Differential Revision: D19651470
Pulled By: ngimel
fbshipit-source-id: 96fdd085f9b0e98e91217dd2ac1f30f9c482b8be
Summary:
Should fix https://github.com/pytorch/pytorch/issues/29744 by falling back to native batch norm implementation, if cudnn cannot execute the provided shape.
Shape numbers were verified for cudnn 7.6.5.32 with tensor shapes:
```python
# for spatial bn
x = torch.Size([880801, 256, 5])
x = torch.Size([65535, 256, 5])
x = torch.Size([880801, 64, 4, 4])
x = torch.Size([65535, 64, 4, 4])
# for per-act bn
x = torch.Size([131070, 2048])
x = torch.Size([262136, 2048])
```
for `training()` and `eval()` mode using `torch.float32` and `torch.float16`.
I've increased the shapes used in our current smoke test, but I can also add all use cases of the support matrix, if wanted.
CC ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32763
Differential Revision: D19644328
Pulled By: ngimel
fbshipit-source-id: c2151bf9fe6bac79b8cbc69cff517a4b0b3867aa
Summary:
This PR updates how RNNs handle their "flat weights." In particular, it allows for only some flat weights to be "materialized" when apply is called, and it updates the flattening behavior to only apply if all flat weights are (1) materialized, (2) share a dtype and (3) are acceptable to cuDNN.
One test is modified and another created to test these changes. One practical effect of this change is that weight norm can be successfully applied to a module BEFORE that module is moved to an accelerator. Previously doing so would throw an error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32563
Differential Revision: D19602725
Pulled By: mruberry
fbshipit-source-id: d8f9441d17815c8c9ba15b256d4be36f784a3cf9
Summary:
resubmitting https://github.com/pytorch/pytorch/issues/32612 after a merge gone wrong. Enables convolution with an empty batch or number of channels for all flavors of convolution (grouped convolution, convTranspose). Would make https://github.com/pytorch/pytorch/issues/31658 unnecessary. Also returns zero gradients for the parameters, that's necessary for correct DDP operation.
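A small check of the enabled behavior (illustrative only):
```python
import torch
from torch import nn

# Convolution accepts an empty batch, and parameters receive zero gradients.
conv = nn.Conv2d(3, 8, 3, padding=1)
x = torch.randn(0, 3, 8, 8, requires_grad=True)
out = conv(x)
out.sum().backward()
print(out.shape, conv.weight.grad.abs().sum().item())  # torch.Size([0, 8, 8, 8]) 0.0
```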
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32709
Differential Revision: D19627968
Pulled By: ngimel
fbshipit-source-id: 7359759bd05ff0df0eb658cac55651c607f1b59f
Summary:
This PR updates how RNNs handle their "flat weights." In particular, it allows for only some flat weights to be "materialized" when apply is called, and it updates the flattening behavior to only apply if all flat weights are (1) materialized, (2) share a dtype and (3) are acceptable to cuDNN.
One test is modified and another created to test these changes. One practical effect of this change is that weight norm can be successfully applied to a module BEFORE that module is moved to an accelerator. Previously doing so would throw an error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32563
Differential Revision: D19562258
Pulled By: mruberry
fbshipit-source-id: 4fef006e32cdfd8e3e3d519fc2ab5fc203dd7b36
Summary:
Pull Request resolved: https://github.com/pytorch/glow/pull/4049
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27477
We would like to add the intra-op parallelization support for the EmbeddingBag operator.
This should bring speedup for the DLRM benchmark:
https://github.com/pytorch/pytorch/pull/24385
Benchmark code:
```
from __future__ import absolute_import, division, print_function, unicode_literals
import torch
import time
eb = torch.nn.EmbeddingBag(1000000, 64, mode='sum')
input = torch.LongTensor(1500).random_(0, 1000000)
offsets = torch.zeros(64, dtype=torch.int64)
niter = 10000
s = time.time()
for _ in range(niter):
    out = eb(input, offsets)
time_per_iter = (time.time() - s) / niter
print('time_per_iter', time_per_iter)
print('GB/s', (input.numel() * 64 * 4 + out.numel() * 4) / time_per_iter / 1e9)
```
The following results are single core on Skylake T6:
- Before our change (with the original caffe2::EmbeddingLookup)
time_per_iter 6.313693523406982e-05
GB/s 6.341517821789133
- After our change using the EmbeddingLookupIdx API which takes the offsets instead of lengths.
time_per_iter 5.7627105712890626e-05
GB/s 6.947841559053659
- With Intel's PR: https://github.com/pytorch/pytorch/pull/24385
time_per_iter 7.393271923065185e-05
GB/s 5.415518381664018
For multi-core performance, because Clang doesn't work with OMP, I can only see the single-core performance on SKL T6.
ghstack-source-id: 97124557
Test Plan:
With D16990830:
```
buck run mode/dev //caffe2/caffe2/perfkernels:embedding_bench
```
With D17750961:
```
buck run mode/opt //experimental/jianyuhuang/embeddingbag:eb
buck run mode/opt-lto //experimental/jianyuhuang/embeddingbag:eb
```
OSS test
```
python run_test.py -i nn -- TestNNDeviceTypeCPU.test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu
```
Buck test
```
buck test mode/dev-nosan //caffe2/test:nn -- "test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu"
OMP_NUM_THREADS=3 buck test mode/opt -c pytorch.parallel_backend=tbb //caffe2/test:nn -- "test_EmbeddingBag_per_sample_weights_and_new_offsets" --print-passing-details
```
Generate the AVX2 code for embedding_lookup_idx_avx2.cc:
```
python hp_emblookup_codegen.py --use-offsets
```
Differential Revision: D17768404
fbshipit-source-id: 8dcd15a62d75b737fa97e0eff17f347052675700
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30445
Create distributed and rpc directories under caffe/test for better management
of unit tests.
Differential Revision: D18702786
fbshipit-source-id: e9daeed0cfb846ef68806f6decfcb57c0e0e3606
Summary:
This PR adds bfloat16 support for convolutions on ROCm.
- Integrates MIOpen bfloat16 convolution support into PyTorch
- Enables bfloat16 convolution for non-MIOpen paths, i.e. THCUNN, native HIP kernels
- Enables bfloat16 type for probability distribution functions (this is included in this PR since the conv unit tests use bfloat16 random number generators)
Native cuda kernels for convolution and random functions will be compiled for CUDA as well.
iotamudelta bddppq
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30948
Differential Revision: D19274164
Pulled By: ezyang
fbshipit-source-id: c0888a6ac72a2c5749b1ebb2195ac6f2209996be
Summary:
7zip and cmake are part of the base image, so there is no need to re-install them. Removing the install step can make build/test more stable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30897
Differential Revision: D19232961
Pulled By: mingbowan
fbshipit-source-id: fa3bbd1325839a2a977bf13fdbd97fda43793b8d
Summary:
Earlier cuDNN versions don't support grouped convolution in NHWC well. Legit
configurations in later cuDNN versions might return CUDNN_STATUS_NOT_SUPPORTED.
We fall back to NCHW when the runtime check of the cuDNN version is < 7.6.0, to
keep the logic simple.
Note:
We might update the heuristics, 7.6.0 is very conservative.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31444
Differential Revision: D19232414
Pulled By: VitalyFedyunin
fbshipit-source-id: 4c2d79ed347c49cd388bbe5b2684dbfa233eb2a3
Summary:
Basically the same as https://github.com/pytorch/pytorch/pull/31379, except that I write a separate function `split_batch_dim_to_32bit_out` for the logic. This function could also be used for convolution forward, and I will rebase this PR after https://github.com/pytorch/pytorch/issues/31379 gets merged and then change `raw_cudnn_convolution_forward_out` to use `split_batch_dim_to_32bit_out` here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31510
Differential Revision: D19210563
Pulled By: ngimel
fbshipit-source-id: e20bb82b6360aa2c0e449e127188c93f44e1e9b4
Summary:
Partially fixes https://github.com/pytorch/pytorch/issues/22496
This is just a first step towards the support of 64bit convolution on CUDA. In the forward of convolution, if the total tensor size is larger than 2^31, then we split it on the batch dimension. I want to get some review feedback before moving forward for the same splitting approach for backward.
There are real-world use cases where, even when N=1, the input is still larger than 2^31. For this case, the splitting would be complicated, so I am planning to modify `use_cudnn` to just dispatch to the slow fallback kernel in PyTorch in a later PR.
Update: `later PR` is https://github.com/pytorch/pytorch/pull/31383
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31379
Differential Revision: D19192018
Pulled By: ngimel
fbshipit-source-id: c26ecc56319ac67c4d5302ffed246b8d9b5eb972
Summary:
VitalyFedyunin, this PR is about porting the Hardtanh activation to ATen:
**Test script:**
```
import torch
import torch.nn as nn
import time

torch.manual_seed(0)

def _time():
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.time()

device = "cpu"
m = nn.Hardtanh()
if torch.cuda.is_available():
    device = "cuda"
    m = m.cuda()

# warm up
for n in [100, 10000]:
    input = torch.randn(128, n, requires_grad=True, device=device)
    grad_output = torch.ones(128, n, device=device)
    for i in range(1000):
        output = m(input)
        output.backward(grad_output)

for n in [100, 10000]:
    input = torch.randn(128, n, requires_grad=True, device=device)
    grad_output = torch.ones(128, n, device=device)
    fwd_t = 0
    bwd_t = 0
    for i in range(10000):
        t1 = _time()
        output = m(input)
        t2 = _time()
        output.backward(grad_output)
        t3 = _time()
        fwd_t = fwd_t + (t2 - t1)
        bwd_t = bwd_t + (t3 - t2)
    fwd_avg = fwd_t / 10000 * 1000
    bwd_avg = bwd_t / 10000 * 1000
    print("input size(128, %d) forward time is %.2f (ms); backwad avg time is %.2f (ms)."
          % (n, fwd_avg, bwd_avg))
```
Test Device: CPU: skx-8180, GPU: Tesla P40.
Performance:
Before:
```
GPU:
input size(128, 100) forward time is 0.05 (ms); backwad avg time is 0.11 (ms).
input size(128, 10000) forward time is 0.06 (ms); backwad avg time is 0.17 (ms).
CPU
input size(128, 100) forward time is 0.02 (ms); backwad avg time is 0.06 (ms).
input size(128, 10000) forward time is 0.84 (ms); backwad avg time is 0.44 (ms).
```
After:
```
GPU:
input size(128, 100) forward time is 0.05 (ms); backwad avg time is 0.11 (ms).
input size(128, 10000) forward time is 0.06 (ms); backwad avg time is 0.17 (ms).
CPU
input size(128, 100) forward time is 0.02 (ms); backwad avg time is 0.05 (ms).
input size(128, 10000) forward time is 0.61 (ms); backwad avg time is 0.10 (ms).
```
`OMP_NUM_THREADS=1:`
```
Before:
input size(128, 100) forward time is 0.05 (ms); backwad avg time is 0.07 (ms).
input size(128, 10000) forward time is 5.21 (ms); backwad avg time is 5.25 (ms).
After:
input size(128, 100) forward time is 0.01 (ms); backwad avg time is 0.02 (ms).
input size(128, 10000) forward time is 1.09 (ms); backwad avg time is 1.09 (ms).
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30152
Differential Revision: D18815545
Pulled By: VitalyFedyunin
fbshipit-source-id: d23b6b340a7276457f22dce826bcbe3b341d755f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30825
It didn't verify in the 1-d case that the targets were size 1.
Test Plan: Imported from OSS
Differential Revision: D18833659
Pulled By: gchanan
fbshipit-source-id: 9b0276e7b0423fdaf2ba7cfa34bde541558c61f9
Summary:
Fixes https://github.com/pytorch/pytorch/issues/29187
This introduces a new class `_NormBase` that `_InstanceNorm` and `_BatchNorm` inherit from separately. This means the `isinstance(module, _BatchNorm)` check won't falsely pass for `_InstanceNorm`.
The suggested fix of adding `and not isinstance(module, _InstanceNorm)` works as well, but requires introducing a cyclic dependency between `instancenorm.py` and `batchnorm.py`.
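A quick check of the fixed isinstance behavior (illustrative only):
```python
from torch import nn
from torch.nn.modules.batchnorm import _BatchNorm

print(isinstance(nn.InstanceNorm2d(4), _BatchNorm))  # False
print(isinstance(nn.BatchNorm2d(4), _BatchNorm))     # True
```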
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29985
Differential Revision: D18588104
Pulled By: yf225
fbshipit-source-id: f599da3b902ad9c56836db4d429bfc462ed51338
Summary:
Fix for https://github.com/pytorch/pytorch/issues/29578
The shape check is moved up as much as possible, because backends by and large don't correctly handle empty inputs, so the check needs to be done before backend selection. That also automatically takes care of backward, because the forward for an empty input is automatically differentiable, so no backend-specific backward routines are ever called.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30035
Test Plan: tests for empty inputs are added.
Differential Revision: D18584427
Pulled By: ngimel
fbshipit-source-id: a42918f50eb1f6995921aafa92879cd42dd5e9e1
Summary:
Fixes https://github.com/pytorch/pytorch/issues/6962
The PR implements the handle pool mechanism for cublas as suggested by mcarilli in https://github.com/pytorch/pytorch/issues/6962#issuecomment-530563872.
~~I didn't add any unit test here yet because as mcarilli mentioned:~~
> ~~On my local machine, out of curiosity I also rewrote that test to use gemms instead of convolutions. The race condition seemed rarer, but the test did show that cublas use is not thread safe. I can share the script if you want.~~
~~Please share your script with me mcarilli. And if the race condition is rare, would it still be possible for the CI to detect it?~~
cc: colesbury
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29233
Differential Revision: D18372007
Pulled By: ezyang
fbshipit-source-id: 3492bf13410598e8452e89cf4e3e63e8df9c8c3d
Summary:
This reverts 9a9bb448ee.
It fixes the broken case that led to the previous commit being reverted.
Details about the fix:
modified: aten/src/ATen/native/Convolution.cpp
Called contiguous on the 3D input tensor. This avoids the code path accidentally recognizing the input as channels_last strided, due to the unsqueezing of a permuted 3D tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29361
Differential Revision: D18371964
Pulled By: VitalyFedyunin
fbshipit-source-id: a5985f4687b37e183649fa35b8ccdb50368ebfdf
Summary:
VitalyFedyunin, this PR is about porting L1 loss to ATen:
**Test script:**
```
import torch
import torch.nn as nn
import time

torch.manual_seed(0)

def _time():
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.time()

device = "cpu"
loss = nn.L1Loss(reduction='sum')
if torch.cuda.is_available():
    device = "cuda"
    loss = loss.cuda()

# warm up
for n in [100, 10000]:
    input = torch.randn(128, n, requires_grad=True, device=device)
    target = torch.randn(128, n, device=device)
    for i in range(1000):
        output = loss(input, target)
        output.backward()

# get running time
for n in [100, 10000]:
    fwd_t = 0
    bwd_t = 0
    input = torch.randn(128, n, requires_grad=True, device=device)
    target = torch.randn(128, n, device=device)
    for i in range(10000):
        t1 = _time()
        output = loss(input, target)
        t2 = _time()
        output.backward()
        t3 = _time()
        fwd_t = fwd_t + (t2 - t1)
        bwd_t = bwd_t + (t3 - t2)
    fwd_avg = fwd_t / 10000 * 1000
    bwd_avg = bwd_t / 10000 * 1000
    print("input size(128, %d) forward time is %.2f (ms); backwad avg time is %.2f (ms)."
          % (n, fwd_avg, bwd_avg))
```
Test Device: CPU: skx-8180, GPU: Tesla P100.
**Performance:**
Before:
```
GPU:
reduction=’mean’
input size(128, 100) forward time is 0.31 (ms); backwad avg time is 0.09 (ms).
input size(128, 10000) forward time is 0.33 (ms); backwad avg time is 0.14 (ms).
reduction=’sum’
input size(128, 100) forward time is 0.31 (ms); backwad avg time is 0.10 (ms).
input size(128, 10000) forward time is 0.34 (ms); backwad avg time is 0.14 (ms).
CPU:
reduction=’mean’
input size(128, 100) forward time is 0.06 (ms); backwad avg time is 0.10 (ms).
input size(128, 10000) forward time is 1.92 (ms); backwad avg time is 2.96 (ms).
reduction=’sum’
input size(128, 100) forward time is 0.04 (ms); backwad avg time is 0.09 (ms).
input size(128, 10000) forward time is 1.96 (ms); backwad avg time is 2.79 (ms).
nume_thread = 1:
reduction=’mean’
input size(128, 100) forward time is 0.03 (ms); backwad avg time is 0.05 (ms).
input size(128, 10000) forward time is 1.67 (ms); backwad avg time is 2.50 (ms).
reduction=’sum’:
input size(128, 100) forward time is 0.03 (ms); backwad avg time is 0.05 (ms).
input size(128, 10000) forward time is 1.67 (ms); backwad avg time is 2.51 (ms).
```
After:
```
GPU:
reduction=’mean’
input size(128, 100) forward time is 0.05 (ms); backwad avg time is 0.10 (ms).
input size(128, 10000) forward time is 0.11 (ms); backwad avg time is 0.17 (ms).
reduction=’sum’
input size(128, 100) forward time is 0.05 (ms); backwad avg time is 0.08 (ms).
input size(128, 10000) forward time is 0.11 (ms); backwad avg time is 0.16 (ms).
CPU:
reduction=’mean’
input size(128, 100) forward time is 0.03 (ms); backwad avg time is 0.05 (ms).
input size(128, 10000) forward time is 0.14 (ms); backwad avg time is 0.18 (ms).
reduction=’sum’
input size(128, 100) forward time is 0.03 (ms); backwad avg time is 0.05 (ms).
input size(128, 10000) forward time is 0.15 (ms); backwad avg time is 0.17 (ms).
nume_thread = 1:
reduction=’mean’:
input size(128, 100) forward time is 0.04 (ms); backwad avg time is 0.06 (ms).
input size(128, 10000) forward time is 1.05 (ms); backwad avg time is 1.72 (ms).
reduction=’sum’:
input size(128, 100) forward time is 0.03 (ms); backwad avg time is 0.05 (ms).
input size(128, 10000) forward time is 1.03 (ms); backwad avg time is 1.71 (ms).
```
How do you set the number of threads? Use the following script:
```
num_threads=$1
script=$2
last_core=`expr $num_threads - 1`
echo "using $num_threads OMP threads"
echo "bind cores to 0~$last_core"
export OMP_NUM_THREADS=$num_threads
export KMP_AFFINITY=granularity=fine,compact,1,0
numactl --physcpubind=0-$last_core --membind=0 python $script
```
and run `./run.sh 1 L1loss.py`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26795
Differential Revision: D18140434
Pulled By: VitalyFedyunin
fbshipit-source-id: d0b976ec36797f2e6b4e58fbbac89688d29e736f
Summary:
Added nhwc support for:
1. cudnn_batch_norm & cudnn_batch_norm_backward
2. cudnn_convolution_forward & cudnn_convolution_backward
3. cudnn_convolution_transpose & cudnn_convolution_transpose_backward
Also patches suggest_memory_format for convolution.
suggest_memory_format is ambiguous in two cases:
1. An NCHW tensor where C == 1: we could use the stride of C as a hint to tell the intended memory format.
2. An NCHW tensor where H == W == 1: there is no way to identify the intended memory format from the strides.
Currently we fall back to NCHW whenever we see a contiguous tensor, which avoids the ambiguity for some of these special cases.
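As a rough illustration (a minimal sketch, not code from this PR), the ambiguity shows up directly in the contiguity checks: a plain contiguous NCHW tensor with C == 1 also satisfies the channels-last check, so strides alone cannot reveal the intended format.
```
import torch

# Minimal sketch of the ambiguity: for C == 1 (and likewise H == W == 1),
# a default-contiguous NCHW tensor also passes the channels-last check.
x = torch.randn(8, 1, 16, 16)  # N=8, C=1, H=16, W=16, default NCHW layout
print(x.is_contiguous())                                   # True
print(x.is_contiguous(memory_format=torch.channels_last))  # also True
# The same happens for H == W == 1, e.g. torch.randn(8, 3, 1, 1).
```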
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23861
Differential Revision: D18263434
Pulled By: VitalyFedyunin
fbshipit-source-id: dd9f69576ec12fec879cd87a3d446931371360d9
Summary:
Adds the C++ API `clip_grad_value_` to the `torch::nn::utils` module.
Also fixes an indentation error in a `for` loop in the original test/test_nn.py.
Issue: https://github.com/pytorch/pytorch/issues/25883
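For reference, the new C++ utility parallels the existing Python helper; a minimal usage sketch of the Python counterpart it mirrors:
```
import torch
import torch.nn as nn
from torch.nn.utils import clip_grad_value_

# Clamp every gradient element into [-0.5, 0.5] after backward().
model = nn.Linear(4, 2)
model(torch.randn(8, 4)).sum().backward()
clip_grad_value_(model.parameters(), clip_value=0.5)
print(max(p.grad.abs().max().item() for p in model.parameters()))  # <= 0.5
```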
Reviewer: yf225
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28736
Differential Revision: D18263807
Pulled By: yf225
fbshipit-source-id: 29282450bd2099df16925e1d0edd3d933f6eeb9b
Summary:
This fixes https://github.com/pytorch/pytorch/issues/22526.
Adds a limit on the launch config for grid sizes as well; the previous code requested more blocks than the hardware supports.
A test is added in test_cuda.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28927
Differential Revision: D18241759
Pulled By: soumith
fbshipit-source-id: 8f2535bb0bc4ea7998024b137576a38067668999
Summary:
Initial kernel support added for optimized NHWC tensors.
TODO: currently the backwards kernel produces a tensor with NHWC strides.
Unfortunately autograd restores the grad to contiguous (in either copy or add), which makes real perf tuning annoying to do, since I cannot easily measure end-to-end time in my Python script.
My current kernel is blazing fast compared to the original NCHW kernel in fp16, since I avoided atomicAdd. I'll finish perf tuning after we merge future PRs expanding NHWC support in the core.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24396
Differential Revision: D18115941
Pulled By: VitalyFedyunin
fbshipit-source-id: 57b4922b7bf308430ffe1406681f68629baf8834
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28297
Splits the data parallel tests out of test_nn.py, since it's easier to manage and track these tests separately and failures can be routed to the appropriate POCs.
Test Plan: waitforbuildbot
Differential Revision: D18011663
fbshipit-source-id: 17ebf7c04e7dc7ff4c8d38458daab5b911bed75d
Summary:
This was referenced in the `RNN` docs but wasn't actually assigned
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28058
Pulled By: driazati
Differential Revision: D17945867
fbshipit-source-id: 0f0dc2633183a7e67a12352a2a7ac0545284666a
Summary:
Per title. Several stream fixes have gone in that may make this pass in CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28192
Differential Revision: D17974219
Pulled By: mruberry
fbshipit-source-id: 543d000789c83711a8b4bef169a87635fda7508b
Summary:
Using grad_out for CuDNN CTC loss fixes: https://github.com/pytorch/pytorch/issues/26797, https://github.com/pytorch/pytorch/issues/25833.
We also handle an incompatible cuDNN change that surfaced during testing: as of cuDNN 7.6, the semantics of the CTC loss gradients are different.
This leads us to disable cuDNN CTC for cuDNN < 7.6. To mitigate the impact on users, we convert the parameters for the native implementation if cuDNN isn't applicable (previously this would give an error).
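For context, a minimal CTC loss usage sketch (an assumed example, not the PR's test); with this change, running it on CUDA with cuDNN older than 7.6 falls back to the native kernel instead of erroring out:
```
import torch
import torch.nn as nn

T, N, C, S = 50, 4, 20, 10   # input length, batch, classes, target length
log_probs = torch.randn(T, N, C).log_softmax(2).detach().requires_grad_()
targets = torch.randint(1, C, (N, S), dtype=torch.long)   # avoid the blank index 0
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

loss = nn.CTCLoss()(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```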
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27039
Differential Revision: D17910815
Pulled By: ngimel
fbshipit-source-id: 465b33612d3402f10c355aa7026a7e1ffaef3073
Summary:
The current embedding backwards CUDA kernel is somewhat broken. It effectively ignores padding_idx and also incorrectly drops an index from the input.
This commit fixes that bug and fixes the unit test so that this behavior won't break in the future.
This fixes https://github.com/pytorch/pytorch/issues/26302.
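A small sketch of the padding_idx semantics the fix restores (an assumed example, not the PR's test): rows gathered through the padding index must receive no gradient.
```
import torch
import torch.nn as nn

emb = nn.Embedding(10, 3, padding_idx=0)
idx = torch.tensor([[0, 2, 0, 5]])   # index 0 is the padding index
emb(idx).sum().backward()
print(emb.weight.grad[0])            # all zeros: the padding row gets no gradient
print(emb.weight.grad[2])            # non-zero for a real index
```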
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27731
Differential Revision: D17893803
Pulled By: ngimel
fbshipit-source-id: 4ba02a17ec0e29a7016d65480d4ff0c276550616
Summary:
One fewer legacy decorator cluttering the test suite.
Functions relying on this decorator were updated or, in the case of test_sparse, the test suite was put back on double by default.
Note: this PR is blocked on https://github.com/pytorch/pytorch/issues/27599.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27628
Differential Revision: D17896254
Pulled By: mruberry
fbshipit-source-id: 13d460301f50ef4af7a660372432108164c0de1f
Summary:
Fixes https://github.com/pytorch/pytorch/issues/26698.
With different query/key/value dimensions, `nn.MultiheadAttention` has a DDP incompatibility issue because in that case the `in_proj_weight` attribute is created but never used. This PR fixes that and adds a distributed unit test.
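A sketch of the configuration from the issue (assumed shapes, not the PR's test): with kdim/vdim different from embed_dim, separate projection weights are used, which is the case where the unused in_proj_weight previously confused DDP.
```
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=64, num_heads=4, kdim=32, vdim=32)
q = torch.randn(5, 2, 64)   # (tgt_len, batch, embed_dim)
k = torch.randn(7, 2, 32)   # (src_len, batch, kdim)
v = torch.randn(7, 2, 32)   # (src_len, batch, vdim)
out, attn = mha(q, k, v)
print(out.shape)            # torch.Size([5, 2, 64])
```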
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26826
Differential Revision: D17583807
Pulled By: zhangguanheng66
fbshipit-source-id: c393584c331ed4f57ebaf2d4015ef04589c973f6
Summary:
This PR stops common_utils.py from setting the default tensor type when it is imported. See issue https://github.com/pytorch/pytorch/issues/27355. This has been a frequent source of confusion for test writers.
Many tests relied on this setting (whether they knew it or not), and this PR also updates the test suite to pass without common_utils.py setting the default tensor type. Some larger test files now set the default floating dtype themselves, however. These test files are:
- test_autograd.py
- test_distributions.py
- test_jit.py
- test_nn.py
This is still a significant improvement over today, however. First, these files set the default floating dtype much more explicitly than importing it from common_utils did. Second, the rest of the test suite no longer sets it globally. Third, this PR is a springboard to updating those tests, too. In particular, as tests are made generic they can be moved away from relying on this global setting.
Notable technical changes in this PR are:
- Significant updates to test_torch.py to make it pass without setting the default floating dtype globally.
- The `default_floating_dtype` decorator is now defined in common_utils; a couple of versions of this decorator were previously defined in individual test files (a minimal sketch of the pattern follows this list).
- test_torch-specific parts of common_utils were refactored into test_torch.
- tensor creation methods in common_utils were updated to accept an optional dtype and device.
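A hypothetical sketch of such a helper (the name and context-manager form are assumed for illustration; the actual decorator in common_utils may differ): the global default dtype is set only for the duration of the block, instead of at import time.
```
import contextlib
import torch

@contextlib.contextmanager
def default_floating_dtype(dtype):
    # Set the global default dtype only inside the managed block.
    saved = torch.get_default_dtype()
    torch.set_default_dtype(dtype)
    try:
        yield
    finally:
        torch.set_default_dtype(saved)

with default_floating_dtype(torch.double):
    assert torch.empty(1).dtype == torch.double
# Outside the block the previous default (float32 in a fresh session) is restored.
assert torch.empty(1).dtype == torch.float32
```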
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27444
Differential Revision: D17795235
Pulled By: mruberry
fbshipit-source-id: 7f77271c0c836e69f183ad9057a2c4b29f09d2e1
Summary:
PackedSequence.to(device) incorrectly places one of three tensors on the device and leaves the other two tensors where they are. If these devices are distinct then further operations on PackedSequence will fail. This behavior is inconsistent with Tensor.to and PackedSequence's behavior when .cuda() is called.
Additionally, PackedSequence defines multiple other conversion functions that were independently and inconsistently implemented.
This PR unifies all implementations and makes the PackedSequence.to behavior more consistent with Tensor.to (it is not completely consistent; see the inline comments). test_device_mask in test_nn.py is updated to validate the new functionality.
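A minimal sketch of the unified behavior (an assumed example, only meaningful when a CUDA device is present): .to() should move the data together with the sorting indices, matching Tensor.to semantics.
```
import torch
from torch.nn.utils.rnn import pack_padded_sequence

seq = pack_padded_sequence(torch.randn(5, 3, 8), lengths=[5, 4, 2],
                           enforce_sorted=False)
if torch.cuda.is_available():
    moved = seq.to("cuda")
    print(moved.data.device)            # cuda:0
    print(moved.sorted_indices.device)  # cuda:0 as well after this change
```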
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27245
Differential Revision: D17757850
Pulled By: mruberry
fbshipit-source-id: 58f0bd40f1aa300fb0a91ee743483d645f977dc5
Summary:
test_nn.py will still require significant work to make generic; however, I'm trying to break up the PRs into more manageable chunks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27137
Differential Revision: D17718488
Pulled By: mruberry
fbshipit-source-id: 4d9359414838a1d2a957d7a334f6a5df6cb00aeb
Summary:
- Creates skipCUDAIfNoCudnn, skipCUDAIfCudnnVersionLessThan decorators
- Makes several test_nn.py tests generic
Many tests in test_nn.py test cuDNN. These tests are guarded on various conditionals using TEST_CUDNN and TEST_CUDNN_VERSION imported from common_cuda.py and custom error messages like 'CUDNN not available' and 'needs cudnn.'
This PR suggests using the CUDA base test class instead of common_cuda.py to test cuDNN's availability, at least on generic tests. The CUDA base test class is preferable to common_cuda.py since it only creates a CUDA context if its tests are run. Importing from common_cuda.py, on the other hand, always creates a CUDA context. Using the CUDA base test class is also consistent with how other generic tests are guarded and provides consistent skip messages.
One quirk to this approach is that it makes use of the self argument to the test functions to check for cuDNN availability during a test. See test_rnn_retain_variables. The self argument could also be used to check the device type instead of the more verbose torch.device(device).type == 'cuda'.
An alternative approach to making test_nn.py generic would be to continue to use common_cuda.py imports, try to keep their skip messages consistent, and not worry about creating unnecessary CUDA contexts. This would preclude writing generic tests that can only run on CUDA if cuDNN is available, however, so tests like "_test_RNN_cpu_vs_cudnn" would require additional changes to make into device generic precision tests like "_test_RNN_cpu_vs_xla."
For consistency, simplicity, and ease of use, I recommend we adopt the proposed decorators and make use of the self argument when productive.
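A hypothetical sketch of the proposed decorator shape (names and details assumed; the real implementation lives in the device-generic test framework): the availability check runs lazily inside the test, so merely importing the test module does not create a CUDA context.
```
import functools
import unittest
import torch

def skipCUDAIfNoCudnn(fn):
    # Skip the test on CUDA devices when cuDNN is unavailable; the check
    # happens at test time rather than at import time.
    @functools.wraps(fn)
    def wrapper(self, device, *args, **kwargs):
        if torch.device(device).type == "cuda" and not torch.backends.cudnn.is_available():
            raise unittest.SkipTest("cuDNN not available")
        return fn(self, device, *args, **kwargs)
    return wrapper
```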
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26791
Differential Revision: D17678325
Pulled By: mruberry
fbshipit-source-id: 1794735ede9bc9f36856e72b3804b136ad3e0de2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26290
Fixes #26206
Happily, I also can delete the dead Dense***Tensor cases, since they
are for the defunct THS backend.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D17404368
Pulled By: ezyang
fbshipit-source-id: 79d71ad40c4325c9f52d2825aceb65074d2e20e8
Summary:
- Moves several tests to TestNNDeviceType
- Merges helper base with TestNNDeviceType
<s>- Enables non-default stream for TestNN (like recent updates to TestTorch and TestCUDA)</s>
Reverted non-default stream due to failure of test_variable_sequence_cuda (main.TestNN).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26638
Differential Revision: D17543899
Pulled By: mruberry
fbshipit-source-id: 001fa191f5fe424f2e7adc378b8fb5ee7f264f16
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26599
These fail due to tolerance in equality comparison. Disable them for now.
ghstack-source-id: 90553855
Test Plan: unit tests
Differential Revision: D17517085
fbshipit-source-id: a4d9278e356318719ccd84047404915a97944f52
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26501
Instead of considering only the TensorTypeSet of the first argument, we collect all Tensor and TensorList arguments and union them together before computing the dispatch type id.
XLA companion patch at https://github.com/pytorch/xla/pull/1031
Billing of changes:
* ATenDispatch fallback code (i.e., what gets run if there is no entry for a function in the table) now lives out-of-line in a function `getFallbackOp`. This gave me an opportunity to write a more detailed error message, providing information about what registrations were available. There is a TODO in the fallback code, suggesting that we could automatically redispatch in the event that there is no handler for the key. But this is a bit of a design question, because it's not clear if automatic redispatch would cover up errors in the dispatch table (i.e., there *should* have been something registered at some key, but there wasn't.)
* Collection of Tensor/TensorList arguments is done using the trusty old IterArgs helper class. A minor bit of refactoring I had to do to get here was move the IterArgs functionality in torch/csrc/utils/variadic.h into ATen/core. There's some refactoring due on that file too (it has copies of some C++ helper pieces which already live in c10--you can't actually move the whole thing because it is literally incompatible with other code in the codebase). So instead of calling `type_set()` to get the type set of the dispatch argument, now we just call `at::detail::multi_dispatch_tensor_type_set` on all of the tensor/tensor list arguments.
* The code generator is adjusted to codegen collection of arguments as needed. There is a little bit of a hack in the code generator to turn 'self' arguments into '*this'. I think this may be duplicated with some logic somewhere else but I have to double check.
The new generated code looks like this:
```
inline Tensor & Tensor::copy_(const Tensor & src, bool non_blocking) const {
static auto table = globalATenDispatch().getOpTable("aten::copy_(Tensor(a!) self, Tensor src, bool non_blocking=False) -> Tensor(a!)");
return table->getOp<Tensor & (Tensor &, const Tensor &, bool)>(at::detail::multi_dispatch_tensor_type_set(*this, src))(const_cast<Tensor&>(*this), src, non_blocking);
}
```
The key difference is that previously we wrote `type_set()` as argument to getOp; now it is a call to `multi_dispatch_tensor_type_set` which collects the type ids together.
After turning on multi-dispatch, I had to refactor existing code which previously dispatched one place, but now dispatches somewhere else. The primary component affected by this is sparse.
* Binary operations (add/sub/mul/div/addmm) now dispatch to sparse kernels even if you did add(dense, sparse). So I delete all the sparse handling code from dense kernels, and bulk up the sparse error handling to handle when the first argument is dense. In the case of addmm, I can eliminate the bridge code entirely (well, not quite: more on this below). I also updated the dispatch on sparse to actually point at sparse kernels. Pay special attention to the handling of `div_` by scalar: previously this logic lived in the "dense" `div_` implementation, but there is actually not any sparse kernel we dispatch to. I solved this particular problem by making a redispatch, but another valid approach would have been to add specific dispatches for sparse div on scalar. This codepath is poorly tested because it is only exercised from C++.
* One minor annoyance is that because I now want separate dispatch for dense and sparse, I also need to replicate the `add`, `add_`, `add_out` trifecta on the sparse side. I opted for a compromise here: I wrote a new `add_sparse` trifecta, but reused the implementation between CPU and CUDA. This means that I have to do another dispatch once I get to `add_out`. The alternative would have been to do twice as many copies for CPU and CUDA (thereby eliminating the extra dispatch) but that seemed distinctly not worth it.
* A lot of kernels in sparse assumed that the dispatch argument must be sparse. This is no longer true with dispatch, so I converted the asserts into plain error checking. This also means that we've perturbed the error message in the case of TestSparseOneOff.test_cuda_sparse_cpu_dense_add (I just updated the saved error message)
* `addmm` is a little bit even more special: the bridge code also handled broadcasting. I replicated the broadcasting logic between CPU and CUDA implementations to avoid an extra dispatch.
* `_sparse_addmm` gave me a bit of trouble, because I had forgotten why we had `torch.sparse.addmm` in the first place. But in the end, its changes followed along with the structural changes I made in addmm. I opted for an extra dispatch here for simplicity.
* c10d has some Variable-Tensor confusion in its sparse code. I've worked around it by judiciously inserting "no variable type" guards, but a more correct fix would be to just solve the confusion entirely.
Benchmark:
Apply the following patch to the base commit and this commit:
```
diff --git a/aten/src/ATen/native/Const.cpp b/aten/src/ATen/native/Const.cpp
new file mode 100644
index 0000000000..b66f4d3ece
--- /dev/null
+++ b/aten/src/ATen/native/Const.cpp
@@ -0,0 +1,10 @@
+#include <ATen/ATen.h>
+
+namespace at {
+namespace native {
+
+Tensor _const5(const Tensor& self, const Tensor& second, const Tensor& third, const Tensor& fourth, const Tensor& fifth) {
+ return self;
+}
+
+}} // namespace at::native
diff --git a/aten/src/ATen/native/native_functions.yaml b/aten/src/ATen/native/native_functions.yaml
index b494ed7950..fddae638bb 100644
--- a/aten/src/ATen/native/native_functions.yaml
+++ b/aten/src/ATen/native/native_functions.yaml
@@ -5878,3 +5878,9 @@
dispatch:
CPU: im2col_backward_cpu
CUDA: im2col_backward_cuda
+
+# For benchmarking
+- func: _const5(Tensor self, Tensor second, Tensor third, Tensor fourth, Tensor fifth) -> Tensor
+ variants: function
+ dispatch:
+ CPU: _const5
```
Comparisons with timeit:
One-argument, representative case:
Before:
```
In [6]: %timeit x.reshape(1, 1)
1.46 µs ± 1.38 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [7]: %timeit x.reshape(1, 1)
1.48 µs ± 29.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [8]: %timeit x.reshape(1, 1)
1.52 µs ± 61.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
After:
```
In [3]: %timeit x.reshape(1, 1)
1.42 µs ± 1.34 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit x.reshape(1, 1)
1.43 µs ± 1.01 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit x.reshape(1, 1)
1.42 µs ± 0.982 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
Five-argument, synthetic case (we expect, with enough Tensor arguments, for there to be a slowdown, as we scale `O(n)` with number of arguments, compared to old dispatcher which is `O(1)` with number of arguments):
Before:
```
In [1]: import torch
In [2]: x = torch.zeros(1)
In [3]: %timeit torch._const5(x, x, x, x, x)
949 ns ± 1.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit torch._const5(x, x, x, x, x)
954 ns ± 1.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit torch._const5(x, x, x, x, x)
947 ns ± 0.601 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
After:
```
In [3]: %timeit torch._const5(x, x, x, x, x)
985 ns ± 9.11 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit torch._const5(x, x, x, x, x)
984 ns ± 1.17 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit torch._const5(x, x, x, x, x)
988 ns ± 0.555 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: zou3519
Differential Revision: D17499154
Pulled By: ezyang
fbshipit-source-id: 8ea237c2e935134b0f4f8d6cfd89c6a93037c02c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26468
Instead of considering only the TensorTypeSet of the first argument, we collect all Tensor and TensorList arguments and union them together before computing the dispatch type id.
XLA companion patch at https://github.com/pytorch/xla/pull/1031
Billing of changes:
* ATenDispatch fallback code (i.e., what gets run if there is no entry for a function in the table) now lives out-of-line in a function `getFallbackOp`. This gave me an opportunity to write a more detailed error message, providing information about what registrations were available. There is a TODO in the fallback code, suggesting that we could automatically redispatch in the event that there is no handler for the key. But this is a bit of a design question, because it's not clear if automatic redispatch would cover up errors in the dispatch table (i.e., there *should* have been something registered at some key, but there wasn't.)
* Collection of Tensor/TensorList arguments is done using the trusty old IterArgs helper class. A minor bit of refactoring I had to do to get here was move the IterArgs functionality in torch/csrc/utils/variadic.h into ATen/core. There's some refactoring due on that file too (it has copies of some C++ helper pieces which already live in c10--you can't actually move the whole thing because it is literally incompatible with other code in the codebase). So instead of calling `type_set()` to get the type set of the dispatch argument, now we just call `at::detail::multi_dispatch_tensor_type_set` on all of the tensor/tensor list arguments.
* The code generator is adjusted to codegen collection of arguments as needed. There is a little bit of a hack in the code generator to turn 'self' arguments into '*this'. I think this may be duplicated with some logic somewhere else but I have to double check.
The new generated code looks like this:
```
inline Tensor & Tensor::copy_(const Tensor & src, bool non_blocking) const {
static auto table = globalATenDispatch().getOpTable("aten::copy_(Tensor(a!) self, Tensor src, bool non_blocking=False) -> Tensor(a!)");
return table->getOp<Tensor & (Tensor &, const Tensor &, bool)>(at::detail::multi_dispatch_tensor_type_set(*this, src))(const_cast<Tensor&>(*this), src, non_blocking);
}
```
The key difference is that previously we wrote `type_set()` as argument to getOp; now it is a call to `multi_dispatch_tensor_type_set` which collects the type ids together.
After turning on multi-dispatch, I had to refactor existing code which previously dispatched one place, but now dispatches somewhere else. The primary component affected by this is sparse.
* Binary operations (add/sub/mul/div/addmm) now dispatch to sparse kernels even if you did add(dense, sparse). So I delete all the sparse handling code from dense kernels, and bulk up the sparse error handling to handle when the first argument is dense. In the case of addmm, I can eliminate the bridge code entirely (well, not quite: more on this below). I also updated the dispatch on sparse to actually point at sparse kernels. Pay special attention to the handling of `div_` by scalar: previously this logic lived in the "dense" `div_` implementation, but there is actually not any sparse kernel we dispatch to. I solved this particular problem by making a redispatch, but another valid approach would have been to add specific dispatches for sparse div on scalar. This codepath is poorly tested because it is only exercised from C++.
* One minor annoyance is that because I now want separate dispatch for dense and sparse, I also need to replicate the `add`, `add_`, `add_out` trifecta on the sparse side. I opted for a compromise here: I wrote a new `add_sparse` trifecta, but reused the implementation between CPU and CUDA. This means that I have to do another dispatch once I get to `add_out`. The alternative would have been to do twice as many copies for CPU and CUDA (thereby eliminating the extra dispatch) but that seemed distinctly not worth it.
* A lot of kernels in sparse assumed that the dispatch argument must be sparse. This is no longer true with dispatch, so I converted the asserts into plain error checking. This also means that we've perturbed the error message in the case of TestSparseOneOff.test_cuda_sparse_cpu_dense_add (I just updated the saved error message)
* `addmm` is a little bit even more special: the bridge code also handled broadcasting. I replicated the broadcasting logic between CPU and CUDA implementations to avoid an extra dispatch.
* `_sparse_addmm` gave me a bit of trouble, because I had forgotten why we had `torch.sparse.addmm` in the first place. But in the end, its changes followed along with the structural changes I made in addmm. I opted for an extra dispatch here for simplicity.
* c10d has some Variable-Tensor confusion in its sparse code. I've worked around it by judiciously inserting "no variable type" guards, but a more correct fix would be to just solve the confusion entirely.
Benchmark:
Apply the following patch to the base commit and this commit:
```
diff --git a/aten/src/ATen/native/Const.cpp b/aten/src/ATen/native/Const.cpp
new file mode 100644
index 0000000000..b66f4d3ece
--- /dev/null
+++ b/aten/src/ATen/native/Const.cpp
@@ -0,0 +1,10 @@
+#include <ATen/ATen.h>
+
+namespace at {
+namespace native {
+
+Tensor _const5(const Tensor& self, const Tensor& second, const Tensor& third, const Tensor& fourth, const Tensor& fifth) {
+ return self;
+}
+
+}} // namespace at::native
diff --git a/aten/src/ATen/native/native_functions.yaml b/aten/src/ATen/native/native_functions.yaml
index b494ed7950..fddae638bb 100644
--- a/aten/src/ATen/native/native_functions.yaml
+++ b/aten/src/ATen/native/native_functions.yaml
@@ -5878,3 +5878,9 @@
dispatch:
CPU: im2col_backward_cpu
CUDA: im2col_backward_cuda
+
+# For benchmarking
+- func: _const5(Tensor self, Tensor second, Tensor third, Tensor fourth, Tensor fifth) -> Tensor
+ variants: function
+ dispatch:
+ CPU: _const5
```
Comparisons with timeit:
One-argument, representative case:
Before:
```
In [6]: %timeit x.reshape(1, 1)
1.46 µs ± 1.38 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [7]: %timeit x.reshape(1, 1)
1.48 µs ± 29.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [8]: %timeit x.reshape(1, 1)
1.52 µs ± 61.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
After:
```
In [3]: %timeit x.reshape(1, 1)
1.42 µs ± 1.34 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit x.reshape(1, 1)
1.43 µs ± 1.01 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit x.reshape(1, 1)
1.42 µs ± 0.982 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
Five-argument, synthetic case (we expect, with enough Tensor arguments, for there to be a slowdown, as we scale `O(n)` with number of arguments, compared to old dispatcher which is `O(1)` with number of arguments):
Before:
```
In [1]: import torch
In [2]: x = torch.zeros(1)
In [3]: %timeit torch._const5(x, x, x, x, x)
949 ns ± 1.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit torch._const5(x, x, x, x, x)
954 ns ± 1.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit torch._const5(x, x, x, x, x)
947 ns ± 0.601 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
After:
```
In [3]: %timeit torch._const5(x, x, x, x, x)
985 ns ± 9.11 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit torch._const5(x, x, x, x, x)
984 ns ± 1.17 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit torch._const5(x, x, x, x, x)
988 ns ± 0.555 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: bddppq
Differential Revision: D17481256
Pulled By: ezyang
fbshipit-source-id: b3206936b4ca8938d45ea90fd71422e0d80b5f96
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25653
Instead of considering only the TensorTypeSet of the first argument, we collect all Tensor and TensorList arguments and union them together before computing the dispatch type id.
Billing of changes:
* ATenDispatch fallback code (i.e., what gets run if there is no entry for a function in the table) now lives out-of-line in a function `getFallbackOp`. This gave me an opportunity to write a more detailed error message, providing information about what registrations were available. There is a TODO in the fallback code, suggesting that we could automatically redispatch in the event that there is no handler for the key. But this is a bit of a design question, because it's not clear if automatic redispatch would cover up errors in the dispatch table (i.e., there *should* have been something registered at some key, but there wasn't.)
* Collection of Tensor/TensorList arguments is done using the trusty old IterArgs helper class. A minor bit of refactoring I had to do to get here was move the IterArgs functionality in torch/csrc/utils/variadic.h into ATen/core. There's some refactoring due on that file too (it has copies of some C++ helper pieces which already live in c10--you can't actually move the whole thing because it is literally incompatible with other code in the codebase). So instead of calling `type_set()` to get the type set of the dispatch argument, now we just call `at::detail::multi_dispatch_tensor_type_set` on all of the tensor/tensor list arguments.
* The code generator is adjusted to codegen collection of arguments as needed. There is a little bit of a hack in the code generator to turn 'self' arguments into '*this'. I think this may be duplicated with some logic somewhere else but I have to double check.
After turning on multi-dispatch, I had to refactor existing code which previously dispatched one place, but now dispatches somewhere else. The primary component affected by this is sparse.
* Binary operations (add/sub/mul/div/addmm) now dispatch to sparse kernels even if you did add(dense, sparse). So I delete all the sparse handling code from dense kernels, and bulk up the sparse error handling to handle when the first argument is dense. In the case of addmm, I can eliminate the bridge code entirely (well, not quite: more on this below). I also updated the dispatch on sparse to actually point at sparse kernels. Pay special attention to the handling of `div_` by scalar: previously this logic lived in the "dense" `div_` implementation, but there is actually not any sparse kernel we dispatch to. I solved this particular problem by making a redispatch, but another valid approach would have been to add specific dispatches for sparse div on scalar. This codepath is poorly tested because it is only exercised from C++.
* One minor annoyance is that because I now want separate dispatch for dense and sparse, I also need to replicate the `add`, `add_`, `add_out` trifecta on the sparse side. I opted for a compromise here: I wrote a new `add_sparse` trifecta, but reused the implementation between CPU and CUDA. This means that I have to do another dispatch once I get to `add_out`. The alternative would have been to do twice as many copies for CPU and CUDA (thereby eliminating the extra dispatch) but that seemed distinctly not worth it.
* A lot of kernels in sparse assumed that the dispatch argument must be sparse. This is no longer true with dispatch, so I converted the asserts into plain error checking. This also means that we've perturbed the error message in the case of TestSparseOneOff.test_cuda_sparse_cpu_dense_add (I just updated the saved error message)
* `addmm` is a little bit even more special: the bridge code also handled broadcasting. I replicated the broadcasting logic between CPU and CUDA implementations to avoid an extra dispatch.
* `_sparse_addmm` gave me a bit of trouble, because I had forgotten why we had `torch.sparse.addmm` in the first place. But in the end, its changes followed along with the structural changes I made in addmm. I opted for an extra dispatch here for simplicity.
* c10d has some Variable-Tensor confusion in its sparse code. I've worked around it by judiciously inserting "no variable type" guards, but a more correct fix would be to just solve the confusion entirely.
Benchmark:
Apply the following patch to the base commit and this commit:
```
diff --git a/aten/src/ATen/native/Const.cpp b/aten/src/ATen/native/Const.cpp
new file mode 100644
index 0000000000..b66f4d3ece
--- /dev/null
+++ b/aten/src/ATen/native/Const.cpp
@@ -0,0 +1,10 @@
+#include <ATen/ATen.h>
+
+namespace at {
+namespace native {
+
+Tensor _const5(const Tensor& self, const Tensor& second, const Tensor& third, const Tensor& fourth, const Tensor& fifth) {
+ return self;
+}
+
+}} // namespace at::native
diff --git a/aten/src/ATen/native/native_functions.yaml b/aten/src/ATen/native/native_functions.yaml
index b494ed7950..fddae638bb 100644
--- a/aten/src/ATen/native/native_functions.yaml
+++ b/aten/src/ATen/native/native_functions.yaml
@@ -5878,3 +5878,9 @@
dispatch:
CPU: im2col_backward_cpu
CUDA: im2col_backward_cuda
+
+# For benchmarking
+- func: _const5(Tensor self, Tensor second, Tensor third, Tensor fourth, Tensor fifth) -> Tensor
+ variants: function
+ dispatch:
+ CPU: _const5
```
Comparisons with timeit:
One-argument, representative case:
Before:
```
In [6]: %timeit x.reshape(1, 1)
1.46 µs ± 1.38 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [7]: %timeit x.reshape(1, 1)
1.48 µs ± 29.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [8]: %timeit x.reshape(1, 1)
1.52 µs ± 61.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
After:
```
In [3]: %timeit x.reshape(1, 1)
1.42 µs ± 1.34 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit x.reshape(1, 1)
1.43 µs ± 1.01 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit x.reshape(1, 1)
1.42 µs ± 0.982 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
Five-argument, synthetic case (we expect, with enough Tensor arguments, for there to be a slowdown, as we scale `O(n)` with number of arguments, compared to old dispatcher which is `O(1)` with number of arguments):
Before:
```
In [1]: import torch
In [2]: x = torch.zeros(1)
In [3]: %timeit torch._const5(x, x, x, x, x)
949 ns ± 1.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit torch._const5(x, x, x, x, x)
954 ns ± 1.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit torch._const5(x, x, x, x, x)
947 ns ± 0.601 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
After:
```
In [3]: %timeit torch._const5(x, x, x, x, x)
985 ns ± 9.11 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit torch._const5(x, x, x, x, x)
984 ns ± 1.17 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit torch._const5(x, x, x, x, x)
988 ns ± 0.555 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D17265918
Pulled By: ezyang
fbshipit-source-id: 221efe4e86a40f36abc81e2ebceaa7e251c90b3d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26077
As per #26071, we would like to get rid of the calls to `Variable()` where possible. This diff removes those calls in the test file test_nn.py. The unit tests should all still pass as expected.
ghstack-source-id: 90086624
Test Plan: tests in `test_nn.py` should all pass.
Differential Revision: D17336484
fbshipit-source-id: 43fc7bd0b0be835ae89d06162ce1cbe4e0056d91
Summary:
Enable one unit test that passes now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25956
Differential Revision: D17298150
Pulled By: bddppq
fbshipit-source-id: 8763e71ad7ef80be915fe93a3471b29f27f3f0a4
Summary:
These unit tests pass after landing all the warp size awareness patches.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25963
Differential Revision: D17319124
Pulled By: bddppq
fbshipit-source-id: 22f5d5f1ca9c67e66a7ccf983b2d2f889a74e729
Summary:
Enabled torch.nn.functional.log_softmax and torch.nn.CrossEntropyLoss for the bfloat16 data type.
In order to do that, the following dependencies had to be enabled:
- RNE (round to nearest even)
- AccumulateType
- bfloat16 arithmetic operator overloads
Also, we implement full std::numeric_limits support for the bfloat16 data type.
Background for these dependencies:
- RNE vs. truncate
From the torch.nn.CrossEntropyLoss test with input_size=(128, 1000):
RNE result:
float output: tensor(7.3981, dtype=torch.float32, grad_fn=<NllLossBackward>)
bfloat16 output: tensor(7.3125, dtype=torch.bfloat16, grad_fn=<NllLossBackward>)
truncate result:
float output: tensor(7.3981, dtype=torch.float32, grad_fn=<NllLossBackward>)
bfloat16 output: tensor(5.8750, dtype=torch.bfloat16, grad_fn=<NllLossBackward>)
- scalar_t vs. AccumulateType (the AccumulateType of bfloat16 is float)
AccumulateType is essential for keeping accuracy, especially for reduction-related operations.
We have verified this with both local test cases and a real topology. It turns out that a bfloat16 accumulator causes a huge relative error, even more than 50%, when the number of elements is large.
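A small sketch of the accumulator effect (assumed numbers, for illustration only): a running sum kept in bfloat16 stalls once the accumulator's spacing exceeds the addend, while a float32 accumulator stays accurate.
```
import torch

x = torch.full((5000,), 0.1, dtype=torch.bfloat16)   # true sum is roughly 500

acc = torch.zeros((), dtype=torch.bfloat16)
for v in x:
    acc = acc + v            # pure bfloat16 accumulation rounds away small addends
print(float(acc))            # stalls far below 500

print(float(x.float().sum()))  # ~500 when accumulating in float32
```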
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24457
Differential Revision: D17113018
Pulled By: ezyang
fbshipit-source-id: 8d61297ca118f9b5c6730a01efcf3a3704d2f206
Summary:
Moves `new_criterion_tests` so that it can be used from `test_cpp_api_parity.py`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25333
Differential Revision: D17097188
Pulled By: yf225
fbshipit-source-id: 7f7905cc6799bca8dc6b3c9cc43995313c6bc058
Summary:
Resolves: https://github.com/pytorch/pytorch/issues/20785
Addresses https://github.com/pytorch/pytorch/issues/24470 for `affine_grid`
Subsumes and closes: https://github.com/pytorch/pytorch/pull/24878 and likewise closes: https://github.com/pytorch/pytorch/issues/24821
Adds the `align_corners` option to `grid_sample` and `affine_grid`, paralleling the option that was added to `interpolate` in version 0.4.0.
In short, setting `align_corners` to `False` allows these functions to be resolution agnostic.
This ensures, for example, that a grid generated from a neural net trained to warp 1024x1024 images will also work to warp the same image upsampled/downsampled to other resolutions like 512x512 or 2048x2048 without producing scaling/stretching artifacts.
Refer to the documentation and https://github.com/pytorch/pytorch/issues/20785 for more details.
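As a quick usage sketch (assumed shapes, not code from this PR), the new flag is passed to both functions; with `align_corners=False` an identity transform resamples consistently at any output resolution:
```
import torch
import torch.nn.functional as F

theta = torch.eye(2, 3).unsqueeze(0)        # identity 2D affine transform, batch of 1
img = torch.arange(16.0).reshape(1, 1, 4, 4)

# Generate a sampling grid at a different output resolution and resample.
grid = F.affine_grid(theta, size=(1, 1, 8, 8), align_corners=False)
out = F.grid_sample(img, grid, align_corners=False)
print(out.shape)                            # torch.Size([1, 1, 8, 8])
```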
#### BC-Breaking Changes
- **Important**: BC-Breaking change because of new default for `align_corners`
The old functionality can still be achieved by setting `align_corners=True`, but the default is now set to `align_corners=False`, since this is the more correct setting, and since this matches the default setting of `interpolate`.
- **Should not cause BC issues**: BC-Breaking change for pathological use case
2D affine transforms on 1D coordinates and 3D affine transforms on 2D coordinates (that is, when one of the spatial dimensions has an empty span) are ill-defined, and not an intended use case of `affine_grid`. Whereas before, all grid point components along such a dimension were set arbitrarily to `-1` (that is, before multiplying by the affine matrix), they are now all set instead to `0`, which is a much more consistent and defensible arbitrary choice. A warning is triggered for such cases.
#### Documentation
- Update `affine_grid` documentation to express that it does indeed support 3D affine transforms. This support was already there but not documented.
- Add documentation warnings for BC-breaking changes in `grid_sample` and `affine_grid` (see above).
#### Refactors
- `affine_grid` no longer dispatches to cuDNN under any circumstances.
The decision point for when the cuDNN `affine_grid_generator` is compatible with the native PyTorch version and when it fails is a headache to maintain (see [these conditions](5377478e94/torch/nn/_functions/vision.py (L7-L8))). The native PyTorch kernel is now used in all cases.
- The kernels for `grid_sample` are slightly refactored to make maintenance easier.
#### Tests
Two new tests are added in `test_nn.py`:
- `test_affine_grid_error_checking` for errors and warnings in `affine_grid`
- `test_affine_grid_3D` for testing `affine_grid`'s 3D functionality. The functionality existed prior to this, but wasn't tested.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24929
Differential Revision: D16949064
Pulled By: ailzhang
fbshipit-source-id: b133ce0d47a2a5b3e2140b9d05fb05fca9140926
Summary:
Resolves: https://github.com/pytorch/pytorch/issues/20785
Adds the `align_corners` option to `grid_sample` and `affine_grid`, paralleling the option that was added to `interpolate` in version 0.4.0.
In short, setting `align_corners` to `False` allows these functions to be resolution agnostic.
This ensures, for example, that a grid generated from a neural net trained to warp 1024x1024 images will also work to warp the same image upsampled/downsampled to other resolutions like 512x512 or 2048x2048 without producing scaling/stretching artifacts.
Refer to the documentation and https://github.com/pytorch/pytorch/issues/20785 for more details.
**Important**: BC-Breaking Change because of new default
The old functionality can still be achieved by setting `align_corners=True`, but the default is now set to `align_corners=False`, since this is the more correct setting, and since this matches the default setting of `interpolate`.
The vectorized 2D cpu version of `grid_sampler` is refactored a bit. I don’t suspect that this refactor would affect the runtime much, since it is mostly done in inlined functions, but I may be wrong, and this has to be verified by profiling.
~The tests are not yet updated to reflect the new default. New tests should probably also be added to test both settings of `align_corners`.~ _Tests are now updated._
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23923
Differential Revision: D16887357
Pulled By: ailzhang
fbshipit-source-id: ea09aad7853ef16536e719a898db8ba31595daa5
Summary:
Loop variables such as `device` and `sparse` should actually be used inside the tests that iterate over them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24075
Differential Revision: D16763073
Pulled By: ezyang
fbshipit-source-id: 8735cbc8d9ed695db8489cfc949c895180a7b826