Previously, we introduced new SymInt overloads for every function we wanted. This led to a lot of boilerplate, and also a lot of confusion about how the overloads needed to be implemented.
This PR takes a simpler but more risky approach: just take the original function and change its ints to SymInts.
This is BC-breaking in the following ways:
* The C++ API for registering implementations for aten operators will change from int64_t to SymInt whenever you make this change. Code-generated registrations in PyTorch do not change, as codegen handles the translation automatically, but manual registrations will need to follow the change. Typically, if you now accept a SymInt where you previously only took int64_t, you have to convert it back manually. This will definitely break XLA; see the companion PR https://github.com/pytorch/xla/pull/3914. Note that not all dispatch keys get the automatic translation; all the composite keys and Meta keys are modified to take SymInt directly (because they should handle them directly), and so there are adjustments for this.
This is not BC-breaking in the following ways:
* The user facing C++ API remains compatible. Even if a function changes from int to SymInt, the default C++ binding still takes only ints (e.g., at::empty(IntArrayRef, ...)). To call with SymInts, you must call at::empty_symint instead. This involved adding two more signatures to CppSignatureGroup; in many cases I refactored code to iterate over all signatures in the group instead of hard-coding the two that previously existed.
* This is TorchScript compatible; internally we treat SymInts as ints, so there is no change to what happens at runtime in TorchScript. In particular, it's OK to reference an empty schema by its old type (using int types); as long as you're not doing string equality (which you shouldn't be), these parse to the same underlying type.
Structure of the PR:
* The general strategy of this PR is that, even when you write `SymInt` inside `native_functions.yaml`, sometimes, we will treat it *as if* it were an `int`. This idea pervades the codegen changes, where we have a translation from SymInt to c10::SymInt or int64_t, and this is controlled by a symint kwarg which I added and then audited all call sites to decide which I wanted. Here are some of the major places where we pick one or the other:
* The C++ FunctionSchema representation represents `SymInt` as `int`. There are a few places we do need to know that we actually have a SymInt and we consult `real_type()` to get the real type in this case. In particular:
* When we do schema validation of C++ operator registration, we must compare against the true schema (as the C++ API will provide `c10::SymInt`, and this will only be accepted if the schema is `SymInt`). This is handled with cloneWithRealTypes before we check for schema differences.
* In `toIValue` argument parsing, we parse against the true schema value. For backwards compatibility reasons, I do still accept ints in many places where Layout/SymInt/etc were expected. (Well, accepting int where SymInt is expected is not BC, it's just the right logic!)
* In particular, because SymInt never shows up as type() in FunctionSchema, this means that we no longer need a dedicated Tag::SymInt. This is good, because SymInts never show up in mobile anyway.
* Changes to functorch/aten are mostly about tracking changes to the C++ API registration convention. Additionally, since SymInt overloads no longer exist, registrations for SymInt implementations are deleted. In many cases, the old implementations did not properly support SymInts; I did not add any new functionality with this PR, but I did try to annotate with TODOs where there is work to do. Finally, because the signature of the `native::` API changed from int to SymInt, I needed to find alternative APIs for people who were calling these functions directly. Typically, I insert a new dispatch call when perf doesn't matter, or use the `at::compositeexplicitautograd` namespace to handle other cases.
* The change to `make_boxed_from_unboxed_functor.h` is so that we accept a plain IntList IValue anywhere a SymIntList is expected; these are read-only arguments so covariant typing is OK.
* I changed how the unboxing logic works slightly. Previously, we interpreted the C++ type for Layout/etc. directly as the IntType JIT type, which worked well because the incoming IValue is tagged as an integer. Now, we interpret the C++ type for Layout as its true type, e.g., LayoutType (change to `jit_type.h`), but then we accept an int IValue for it anyway. This makes it symmetric with SymInt, where we interpret the C++ type as SymIntType, and then accept SymInt and int IValues for it.
* I renamed the `empty.names` overload to `empty_names` to make it less confusing (I kept mixing it up with the real empty overload).
* I deleted the `empty.SymInt` overload, which ended up killing a pile of functions. (This was originally a separate PR but the profiler expect test was giving me grief so I folded it in.)
* I deleted the LazyDynamicOpsTest tests. These were failing after these changes, and I couldn't figure out why they used to be passing: they make use of `narrow_copy` which didn't actually support SymInts; they were immediately converted to ints.
* I bashed LTC into working. The patches made here are not the end of the story. The big problem is that SymInt translates into Value, but what if you have a list of SymInt? This cannot be conveniently represented in the IR today, since variadic Values are not supported. To work around this, I translate SymInt[] into plain int[] (this is fine for tests because LTC dynamic shapes never actually worked); but this will need to be fixed for proper LTC SymInt support. The LTC codegen also looked somewhat questionable; I added comments based on my code reading.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83628
Approved by: https://github.com/albanD, https://github.com/bdhirsh
Context: In order to avoid cluttering the `torch.nn` namespace,
the quantized modules namespace is being moved to `torch.ao.nn`.
The list of the `nn.quantized` files that are being migrated:
- [X] `torch.nn.quantized` → `torch.ao.nn.quantized`
- [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional`
- [X] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules`
- [X] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic`
- [X] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference`
- [X] `torch.nn.quantizable` → `torch.ao.nn.quantizable`
- [X] [Current PR] `torch.nn.qat` → `torch.ao.nn.qat`
- [X] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules`
- [X] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic`
- [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic`
- [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules`
- [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat`
- [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized`
- [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules`
- [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic`
The majority of the files are simply moved to the new location.
However, the following files need to be double-checked:
- None
Differential Revision: [D36861197](https://our.internmc.facebook.com/intern/diff/D36861197/)
**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36861197/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78716
Approved by: https://github.com/jerryzh168
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81947
Transformer fastpath multiplexes two arguments, src_mask [seq_len x seq_len] and src_key_padding_mask [batch_size x seq_len], and later deduces the type based on mask shape.
In the event that batch_size == seq_len, any src_mask is wrongly interpreted as a src_key_padding_mask. This is fixed by requiring that a mask_type identifier be supplied whenever batch_size == seq_len.
Additionally, this adds support for src_mask in the masked_softmax CPU path.
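To make the shape ambiguity concrete, here is a small illustrative sketch (the tensors and names are made up for the example, not taken from the fastpath code):
```
import torch

# When batch_size == seq_len, the two mask kinds have the same shape, so the fastpath
# cannot tell them apart from shape alone; mask_type disambiguates them.
batch_size = seq_len = 4

src_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)  # [S, S]
src_key_padding_mask = torch.zeros(batch_size, seq_len, dtype=torch.bool)          # [B, S]

print(src_mask.shape == src_key_padding_mask.shape)  # True only when B == S
```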
Test Plan: existing unit tests + new unit tests (batch_size == seq_len)
Differential Revision: D37932240
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81947
Approved by: https://github.com/zrphercule
To fix https://github.com/pytorch/pytorch/issues/82060
When `input` is not explicitly converted to channels last while `conv` has been, the output should also be in channels last. The root cause is that when the input has an input-channel count of 1, `compute_columns2d` from `aten/src/ATen/native/ConvolutionMM2d.cpp` treats it as channels first.
We do have logic to make sure both input and weight have the same memory format even if they are given differently, like:
```
auto input = self.contiguous(memory_format);
auto weight = weight_.contiguous(memory_format);
```
But for an N1HW input, `.contiguous(MemoryFormat::ChannelsLast)` would not change its strides, and its `suggest_memory_format()` still returns `MemoryFormat::Contiguous`. That's how it went wrong.
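A minimal sketch of the ambiguity for an N1HW input (illustrative only; the exact strides and reported formats can vary by PyTorch version):
```
import torch

# With a single channel, the contiguous and channels-last layouts describe the same
# memory, so the conversion can be a no-op and the suggested format stays ambiguous.
x = torch.randn(2, 1, 4, 4)  # N=2, C=1, H=4, W=4
y = x.contiguous(memory_format=torch.channels_last)

print(x.stride(), y.stride())  # the strides may not change when C == 1
print(y.is_contiguous(), y.is_contiguous(memory_format=torch.channels_last))
```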
Also updated the corresponding test cases; without this patch, the new test case fails on the forward path and hits a runtime error on the backward path.
The old failure log for the forward path is attached:
```
FAIL: test_conv_thnn_nhwc_cpu_float32 (__main__.TestNNDeviceTypeCPU)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/mingfeim/anaconda3/envs/pytorch-test-cpu/lib/python3.7/site-packages/torch/testing/_internal/common_device_type.py", line 377, in instantiated_test
result = test(self, **param_kwargs)
File "/home/mingfeim/anaconda3/envs/pytorch-test-cpu/lib/python3.7/site-packages/torch/testing/_internal/common_device_type.py", line 974, in only_fn
return fn(slf, *args, **kwargs)
File "test/test_nn.py", line 19487, in test_conv_thnn_nhwc
input_format=torch.contiguous_format, weight_format=torch.channels_last)
File "test/test_nn.py", line 19469, in helper
self.assertEqual(out, ref_out, exact_dtype=False)
File "/home/mingfeim/anaconda3/envs/pytorch-test-cpu/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 2376, in assertEqual
msg=(lambda generated_msg: f"{generated_msg} : {msg}") if isinstance(msg, str) and self.longMessage else msg,
File "/home/mingfeim/anaconda3/envs/pytorch-test-cpu/lib/python3.7/site-packages/torch/testing/_comparison.py", line 1093, in assert_equal
raise error_metas[0].to_error(msg)
AssertionError: Tensor-likes are not close!
Mismatched elements: 988 / 1024 (96.5%)
Greatest absolute difference: 42.0 at index (1, 2, 6, 6) (up to 1e-05 allowed)
Greatest relative difference: inf at index (0, 0, 2, 1) (up to 1.3e-06 allowed)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82392
Approved by: https://github.com/jbschlosser
Context: For a while, slow gradcheck CI was skipping nearly all tests, which hid the fact that it should've been failing and timing out (10+ hour runtime for TestGradients). The CI configuration has since been fixed to correct this, revealing the test failures. This PR reenables slow gradcheck CI and makes it pass again.
This PR:
- makes slow and failing tests run in fast gradcheck mode only (see the gradcheck sketch after this list)
- reduces the input size for slow gradcheck only for unary/binary ufuncs (alternatively, skip the test entirely)
- skips entire test files on the slow gradcheck runner if they don't use gradcheck (test_ops, test_meta, test_decomp, test_ops_jit)
- reduces the input size for some ops
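For reference, a minimal sketch of what the two gradcheck modes mean; `fast_mode` is the standard `torch.autograd.gradcheck` flag, and the op under test is arbitrary:
```
import torch

x = torch.randn(3, 3, dtype=torch.double, requires_grad=True)

# Slow mode reconstructs full Jacobians entry by entry; fast mode uses a cheaper
# randomized check, which is what the "fast gradcheck mode only" tests rely on.
torch.autograd.gradcheck(torch.sin, (x,), fast_mode=False)
torch.autograd.gradcheck(torch.sin, (x,), fast_mode=True)
```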
Follow ups:
1. Investigate slow mode failures https://github.com/pytorch/pytorch/issues/80411
2. See if we can re-enable slow gradcheck tests for some of the slow tests by reducing the sizes of their inputs
The following tests fail in slow mode; they now run in fast mode only.
```
test_fn_fwgrad_bwgrad___rmod___cuda_float64
test_fn_fwgrad_bwgrad_linalg_householder_product_cuda_complex128
test_fn_fwgrad_bwgrad__masked_prod_cuda_complex128
test_fn_fwgrad_bwgrad__masked_prod_cuda_float64
test_fn_fwgrad_bwgrad_linalg_matrix_power_cuda_complex128
test_fn_fwgrad_bwgrad_cat_cuda_complex128
test_fn_fwgrad_bwgrad_linalg_lu_factor_ex_cuda_float64
test_fn_fwgrad_bwgrad_copysign_cuda_float64
test_fn_fwgrad_bwgrad_cholesky_inverse_cuda_complex128
test_fn_fwgrad_bwgrad_float_power_cuda_complex128
test_fn_fwgrad_bwgrad_fmod_cuda_float64
test_fn_fwgrad_bwgrad_float_power_cuda_float64
test_fn_fwgrad_bwgrad_linalg_lu_cuda_float64
test_fn_fwgrad_bwgrad_remainder_cuda_float64
test_fn_fwgrad_bwgrad_repeat_cuda_complex128
test_fn_fwgrad_bwgrad_prod_cuda_complex128
test_fn_fwgrad_bwgrad_slice_scatter_cuda_float64
test_fn_fwgrad_bwgrad_tile_cuda_complex128
test_fn_fwgrad_bwgrad_pow_cuda_float64
test_fn_fwgrad_bwgrad_pow_cuda_complex128
test_fn_fwgrad_bwgrad_fft_*
test_fn_fwgrad_bwgrad_zero__cuda_complex128
test_fn_gradgrad_linalg_lu_factor_cuda_float64
test_fn_grad_div_trunc_rounding_cuda_float64
test_fn_grad_div_floor_rounding_cuda_float64
```
Marks the OpInfos for the following ops that run slowly in slow gradcheck as `fast_gradcheck` only (the left column represents runtime in seconds):
```
0 918.722 test_fn_fwgrad_bwgrad_nn_functional_conv_transpose3d_cuda_float64
1 795.042 test_fn_fwgrad_bwgrad_nn_functional_unfold_cuda_complex128
2 583.63 test_fn_fwgrad_bwgrad_nn_functional_max_pool3d_cuda_float64
3 516.946 test_fn_fwgrad_bwgrad_svd_cuda_complex128
4 503.179 test_fn_fwgrad_bwgrad_linalg_svd_cuda_complex128
5 460.985 test_fn_fwgrad_bwgrad_linalg_lu_cuda_complex128
6 401.04 test_fn_fwgrad_bwgrad_linalg_lstsq_grad_oriented_cuda_complex128
7 353.671 test_fn_fwgrad_bwgrad_nn_functional_max_pool2d_cuda_float64
8 321.903 test_fn_fwgrad_bwgrad_nn_functional_gaussian_nll_loss_cuda_float64
9 307.951 test_fn_fwgrad_bwgrad_stft_cuda_complex128
10 266.104 test_fn_fwgrad_bwgrad_svd_lowrank_cuda_float64
11 221.032 test_fn_fwgrad_bwgrad_istft_cuda_complex128
12 183.741 test_fn_fwgrad_bwgrad_lu_unpack_cuda_complex128
13 132.019 test_fn_fwgrad_bwgrad_nn_functional_unfold_cuda_float64
14 125.343 test_fn_fwgrad_bwgrad_nn_functional_pad_constant_cuda_complex128
15 124.2 test_fn_fwgrad_bwgrad_kron_cuda_complex128
16 123.721 test_fn_fwgrad_bwgrad_pca_lowrank_cuda_float64
17 121.074 test_fn_fwgrad_bwgrad_nn_functional_max_unpool3d_cuda_float64
18 119.387 test_fn_fwgrad_bwgrad_rot90_cuda_complex128
19 112.889 test_fn_fwgrad_bwgrad__masked_normalize_cuda_complex128
20 107.541 test_fn_fwgrad_bwgrad_dist_cuda_complex128
21 106.727 test_fn_fwgrad_bwgrad_diff_cuda_complex128
22 104.588 test_fn_fwgrad_bwgrad__masked_cumprod_cuda_complex128
23 100.135 test_fn_fwgrad_bwgrad_nn_functional_feature_alpha_dropout_with_train_cuda_float64
24 88.359 test_fn_fwgrad_bwgrad_mH_cuda_complex128
25 86.214 test_fn_fwgrad_bwgrad_nn_functional_max_unpool2d_cuda_float64
26 83.037 test_fn_fwgrad_bwgrad_nn_functional_bilinear_cuda_float64
27 79.987 test_fn_fwgrad_bwgrad__masked_cumsum_cuda_complex128
28 77.822 test_fn_fwgrad_bwgrad_diag_embed_cuda_complex128
29 76.256 test_fn_fwgrad_bwgrad_mT_cuda_complex128
30 74.039 test_fn_fwgrad_bwgrad_linalg_lu_solve_cuda_complex128
```
```
0 334.142 test_fn_fwgrad_bwgrad_unfold_cuda_complex128
1 312.791 test_fn_fwgrad_bwgrad_linalg_lu_factor_cuda_complex128
2 121.963 test_fn_fwgrad_bwgrad_nn_functional_max_unpool3d_cuda_float64
3 108.085 test_fn_fwgrad_bwgrad_diff_cuda_complex128
4 89.418 test_fn_fwgrad_bwgrad_nn_functional_max_unpool2d_cuda_float64
5 72.231 test_fn_fwgrad_bwgrad___rdiv___cuda_complex128
6 69.433 test_fn_fwgrad_bwgrad___getitem___cuda_complex128
7 68.582 test_fn_fwgrad_bwgrad_ldexp_cuda_complex128
8 68.572 test_fn_fwgrad_bwgrad_linalg_pinv_cuda_complex128
9 67.585 test_fn_fwgrad_bwgrad_nn_functional_glu_cuda_float64
10 66.567 test_fn_fwgrad_bwgrad_lu_cuda_float64
```
```
0 630.13 test_fn_gradgrad_nn_functional_conv2d_cuda_complex128
1 81.086 test_fn_gradgrad_linalg_solve_triangular_cuda_complex128
2 71.332 test_fn_gradgrad_norm_cuda_complex128
3 64.308 test_fn_gradgrad__masked_std_cuda_complex128
4 59.519 test_fn_gradgrad_div_no_rounding_mode_cuda_complex128
5 58.836 test_fn_gradgrad_nn_functional_adaptive_avg_pool3
```
Reduces the sizes of the inputs for:
- diff
- diag_embed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80514
Approved by: https://github.com/albanD
`torch.nn.grad` has its own implementations of gradients for conv1d, conv2d, and conv3d. This PR simplifies them by calling into the unified `aten::convolution_backward` backend instead.
The existing implementation of conv2d_weight is incorrect for some inputs (see issue #51430). This PR fixes the issue.
This PR expands coverage in test_nn to include conv1d_weight, conv2d_weight, and conv3d_weight, which were previously untested. It also expands the cases for conv2d to cover issue #51430.
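As a rough sanity-check sketch (not one of the new unit tests), the `torch.nn.grad` helpers should agree with autograd on a tiny conv2d; the shapes below are arbitrary:
```
import torch
import torch.nn.functional as F

x = torch.randn(1, 2, 5, 5, dtype=torch.double, requires_grad=True)
w = torch.randn(3, 2, 3, 3, dtype=torch.double, requires_grad=True)
out = F.conv2d(x, w, stride=1, padding=1)
grad_out = torch.randn_like(out)

gx, gw = torch.autograd.grad(out, (x, w), grad_out)
gx_manual = torch.nn.grad.conv2d_input(x.shape, w, grad_out, stride=1, padding=1)
gw_manual = torch.nn.grad.conv2d_weight(x, w.shape, grad_out, stride=1, padding=1)

print(torch.allclose(gx, gx_manual), torch.allclose(gw, gw_manual))
```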
Fixes #51430
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81839
Approved by: https://github.com/albanD
Summary: We did not previously have a small test that exercises the MHA fast path; this diff adds one for better testing coverage.
Test Plan: buck build mode/dev-nosan -c fbcode.platform=platform009 -c fbcode.enable_gpu_sections=true caffe2/test:nn && buck-out/dev/gen/caffe2/test/nn\#binary.par -r test_multihead_attn_fast_path_small_test
Differential Revision: D37834319
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81432
Approved by: https://github.com/erichan1
Fixes #69413
After applying parametrization to any `nn.Module`, we lose the ability to create a deepcopy of it, which, e.g., makes it impossible to wrap such a module in an `AveragedModel`.
Specifically, the problem is that `deepcopy` tries to invoke `__getstate__` if the object hasn't implemented its own `__deepcopy__` magic method. But we don't allow serialization of parametrized modules: `__getstate__` raises an error.
My solution is just to create a default `__deepcopy__` method when it doesn't exist yet.
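A small sketch of the scenario; `register_parametrization` is the standard `torch.nn.utils.parametrize` API, and the `Symmetric` parametrization is an arbitrary example:
```
import copy
import torch
from torch import nn
from torch.nn.utils import parametrize

class Symmetric(nn.Module):
    def forward(self, w):
        # arbitrary example parametrization: make the weight symmetric
        return w.triu() + w.triu(1).transpose(-1, -2)

lin = nn.Linear(4, 4)
parametrize.register_parametrization(lin, "weight", Symmetric())

# Previously this raised, because deepcopy fell back to __getstate__, which is
# disallowed for parametrized modules; with a default __deepcopy__ it works.
copied = copy.deepcopy(lin)
print(copied.weight.shape)
```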
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80811
Approved by: https://github.com/pearu, https://github.com/albanD
`test_conv_transposed_large` expects bitwise perfect results in fp16 on CUDA, but this behavior isn't guaranteed by cuDNN (e.g., in the case of FFT algos).
This PR just changes the tolerance on the test to account for these cases.
CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78147
Approved by: https://github.com/ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78298
Also back out "improve LayerNorm bfloat16 performance on CPU".
These layer norm changes seem fine, but they are causing `LayerNorm` to not use AVX2 instructions, which is causing performance on internal models to degrade. More investigation is needed to find the true root cause, but we should unland to mitigate the issue ASAP.
I left `mixed_data_type.h` around since there are some other files depending on it.
Differential Revision: [D36675352](https://our.internmc.facebook.com/intern/diff/D36675352/)
Approved by: https://github.com/tenpercent
Fixes #68172. Generally, this corrects multiple flaky convolution unit test behaviors seen on ROCm.
The MIOpen integration has been forcing benchmark=True when calling `torch._C._set_cudnn_benchmark(False)`, typically called by `torch.backends.cudnn.set_flags(enabled=True, benchmark=False)`. We now add support for MIOpen immediate mode to avoid benchmarking during MIOpen solution selection.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77438
Approved by: https://github.com/ngimel, https://github.com/malfet
Double-header bug fix:
- As reported by jansel, dtypes are still showing up as integers
when the schema is an optional dtype. This is simple enough to
fix and I added a test for it. But while I was at it...
- I noticed that the THPMemoryFormat_new idiom with "unused" name
doesn't actually work: the repr of the returned memory format
object is wrong, and this shows up when we try to log the args/kwargs.
So I fixed memory format to do it properly along with everything
else.
Fixes https://github.com/pytorch/pytorch/issues/77135
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77543
Approved by: https://github.com/albanD, https://github.com/jansel
Summary: This lets users convert nested tensors more easily. Some implementation details might change based on users' needs.
Test Plan: buck test mode/dev caffe2/test:nn -- test_nested_tensor_from_mask
Differential Revision: D36191182
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76942
Approved by: https://github.com/jbschlosser
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76333
The current PyTorch multi-head attention and transformer
implementations are slow. This should speed them up for inference.
ghstack-source-id: 154737857
(Note: this ignores all push blocking failures!)
Test Plan: CI
Reviewed By: cpuhrsch
Differential Revision: D35239925
fbshipit-source-id: 5a7eb8ff79bc6afb4b7d45075ddb2a24a6e2df28
**Previous behavior**: compute inner product, then normalize.
**This patch**: first normalize, then compute inner product. This should be more numerically stable because it avoids losing precision in inner product for inputs with large norms.
By design ensures that cosine similarity is within `[-1.0, +1.0]`, so it should fix [#29442](https://github.com/pytorch/pytorch/issues/29442).
P.S. I had to change tests because this implementation handles division by 0 differently.
This PR computes cosine similarity as follows: <x/max(eps, ||x||), y/max(eps, ||y||)>.
Let f(x,y) = <x,y>/(||x|| * ||y||), then
df/dx = y/(||x|| * ||y||) - (||y||/||x|| * <x,y> * x)/(||x|| * ||y||)^2.
The changed test checks division by zero in backward when x=0 and y != 0.
For this case the non-zero part of the gradient is just y / (||x|| * ||y||).
The previous test evaluates y/(||x|| * ||y||) to y / eps, and this PR to 1/eps * y/||y||.
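A reference sketch of the patched formulation (normalize first with an eps clamp, then take the inner product); the function name is illustrative, not the ATen implementation:
```
import torch
import torch.nn.functional as F

def cosine_similarity_ref(x, y, dim=-1, eps=1e-8):
    # <x / max(eps, ||x||), y / max(eps, ||y||)>
    x_n = x / x.norm(dim=dim, keepdim=True).clamp_min(eps)
    y_n = y / y.norm(dim=dim, keepdim=True).clamp_min(eps)
    return (x_n * y_n).sum(dim=dim)

x = torch.randn(5, 16, dtype=torch.double)
y = torch.randn(5, 16, dtype=torch.double)
print(torch.allclose(cosine_similarity_ref(x, y), F.cosine_similarity(x, y, dim=-1)))
```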
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31378
Approved by: https://github.com/ezyang, https://github.com/albanD
Summary:
In this PR, we try to optimize PReLU op in CPU path, and enable BFloat16 support based on the optimized PReLU.
The original implementation uses parallel_for to accelerate the operation, but vectorization is not used. It can be optimized by using TensorIterator, which provides both parallelization and vectorization.
The difference between PReLU and other activation ops is that PReLU supports a learnable parameter `weight`. When called without arguments, nn.PReLU() uses a single parameter `weight` across all input channels. If called with nn.PReLU(nChannels), a separate `weight` is used for each input channel. So we cannot simply use TensorIterator, because `weight` is different for each input channel.
In order to use TensorIterator, `weight` should be broadcast to the `input` shape. With vectorization and parallel_for, this implementation is much faster than the original one. Another advantage is that the implementation no longer needs separate paths for `share weights` and `multiple weights`.
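A rough functional sketch of the broadcasting idea (pure PyTorch, not the TensorIterator kernel): once `weight` is broadcast to the input shape, the shared-weight and per-channel cases go through the same elementwise expression.
```
import torch
import torch.nn.functional as F

def prelu_ref(x, weight):
    # weight has either 1 element (shared) or x.size(1) elements (one per channel)
    w = weight.view(1, -1, *([1] * (x.dim() - 2))).expand_as(x)
    return torch.where(x >= 0, x, w * x)

x = torch.randn(2, 3, 4, 4)
w = torch.full((3,), 0.25)
print(torch.allclose(prelu_ref(x, w), F.prelu(x, w)))
```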
We compared the performance of the PReLU implementation in public PyTorch and the optimized PReLU in this PR, including fp32/bf16, forward/backward, and `share weights`/`multiple weights` configurations. bf16 in public PyTorch directly reuses `Vectorized<scalar_t>` for `BFloat16`.
Share weights: (benchmark figures omitted)
Multiple weights: (benchmark figures omitted)
cc albanD mruberry jbschlosser walterddr
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63634
Reviewed By: yinghai
Differential Revision: D34031616
Pulled By: frank-wei
fbshipit-source-id: 04e2a0f9e92c658fba7ff56b1010eacb7e8ab44c
(cherry picked from commit ed262b15487557720bb0d498f9f2e8fcdba772d9)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75421
As part of FSDP work, we will be relying on `_register_load_state_dict_pre_hook` to manage some specific logic related to loading state dicts.
This PR adds a test to ensure that _register_load_state_dict_pre_hook can be
used to register hooks on modules that will be used in a nested way, and that
calling load_state_dict on the overall module still calls those hooks
appropriately.
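A hedged sketch of the nested usage being tested; the hook takes `*args` because `_register_load_state_dict_pre_hook` is a private API whose hook signature may vary across versions:
```
import torch
from torch import nn

calls = []

class Inner(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(2, 2)
        # record every time the pre-hook fires for this (nested) module
        self._register_load_state_dict_pre_hook(lambda *args, **kwargs: calls.append("inner"))

outer = nn.Sequential(Inner(), nn.Linear(2, 2))
outer.load_state_dict(outer.state_dict())
print(calls)  # the nested module's hook should have fired exactly once
```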
Differential Revision: [D35434726](https://our.internmc.facebook.com/intern/diff/D35434726/)
Approved by: https://github.com/albanD
Summary: The primary issue for enabling sparsity to work with QAT
convert (unlike normal quantization convert) is that when the
parametrized module undergoes the QAT convert, the parametrizations need
to be maintained. If the parametrizations don't
get transferred during the convert, the sparsifier would lose its
connection to the model. In practice this was handled using the
transfer_parametrizations_and_params function to move the weight and
bias and any associated parametrizations to the new module. This PR also adds
tests for transfer_parametrizations_and_params and type_before_parametrizations
to test_nn.py and also added comments to the test code for
composability.
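A small sketch of the transfer step, assuming the `transfer_parametrizations_and_params` helper in `torch.nn.utils.parametrize` referenced above; the `Scale` parametrization stands in for a sparsifier's mask:
```
import torch
from torch import nn
from torch.nn.utils import parametrize

class Scale(nn.Module):
    def forward(self, w):
        return 0.5 * w  # stand-in for a sparsifier's mask parametrization

old = nn.Linear(4, 4)
parametrize.register_parametrization(old, "weight", Scale())

new = nn.Linear(4, 4)  # e.g., the module produced by the QAT convert
parametrize.transfer_parametrizations_and_params(old, new)
print(parametrize.is_parametrized(new, "weight"))  # the parametrization moved along
```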
Test Plan: python test/test_ao_sparsity.py TestComposability
python test/test_nn.py TestNN
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74848
Approved by: https://github.com/vkuzo, https://github.com/Lezcano
Summary:
Add BFloat16 support for logsigmoid, hardsigmoid, hardshrink, softshrink, hardswish and softplus on CPU, and optimize the performance of softshrink.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63134
Reviewed By: yinghai
Differential Revision: D34897992
Pulled By: frank-wei
fbshipit-source-id: 4c778f5271d6fa54dd78158258941def8d9252f5
(cherry picked from commit decda0e3debf56cc5c4d7faea41b1165a7cabe12)
For a GroupNorm module, if num_channels is not divisible by num_groups, we should report an error when the module is defined rather than at run time.
Example:
```
import torch
m = torch.nn.GroupNorm(5, 6)
x = torch.randn(1, 6, 4, 4)
y = m(x)
```
before:
```
Traceback (most recent call last):
File "group_norm_test.py", line 8, in <module>
y = m(x)
File "/home/xiaobinz/miniconda3/envs/pytorch_mater/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1111, in _call_impl
return forward_call(*input, **kwargs)
File "/home/xiaobinz/miniconda3/envs/pytorch_mater/lib/python3.7/site-packages/torch/nn/modules/normalization.py", line 271, in forward
input, self.num_groups, self.weight, self.bias, self.eps)
File "/home/xiaobinz/miniconda3/envs/pytorch_mater/lib/python3.7/site-packages/torch/nn/functional.py", line 2500, in group_norm
return torch.group_norm(input, num_groups, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: Expected number of channels in input to be divisible by num_groups, but got input of shape [1, 6, 4, 4] and num_groups=5
```
after:
```
Traceback (most recent call last):
File "group_norm_test.py", line 6, in <module>
m = torch.nn.GroupNorm(5, 6)
File "/home/xiaobinz/miniconda3/envs/pytorch_test/lib/python3.7/site-packages/torch/nn/modules/normalization.py", line 251, in __init__
raise ValueError('num_channels must be divisible by num_groups')
```
This PR also updates the documentation of num_groups.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74293
Approved by: https://github.com/jbschlosser
Fixes #71415
I have implemented the changes that replicate what @to-mi did in this [PR](https://github.com/pytorch/pytorch/pull/65986#issue-1012959443) for the 3D case:
> Fixes #64977
>
> Avoids creating a tensor for and calculating `input` gradient if it's not needed in the backward pass of `grid_sample` (2d case, native CPU & CUDA kernels). Especially the tensor creation seemed time consuming (see #64977).
>
> Brief description of the changes:
>
> * I have tried to go with rather minimal changes. It would probably be possible to make a more elegant version with a bit larger refactoring (or possibly with better understanding of PyTorch internals and C++ functionalities).
>
> * Changed the `native_functions.yaml` and `derivatives.yaml` so that the gradient input mask is passed to the functions.
>
> * Changed the CPU kernels:
> (1) added `bool input_requires_grad` template parameter to the `backward` function,
> (2) added if branches based on it to remove `input` gradient computations if it's not requested,
> (3) feed in `TensorAccessor<scalar_t, 3>* gInp_slice_ptr` instead of `TensorAccessor<scalar_t, 3>& gInp_slice` so that I can pass a `nullptr` in case gradient for `input` is not requested. (A bit inelegant perhaps, but allows to keep one signature for `backward` function and not require breaking it to smaller pieces. Perhaps there's a more elegant way to achieve this?)
>
> * Changed CUDA kernel:
> (1) added ~`bool input_requires_grad` template parameter~ `const bool input_requires_grad` argument to the `backward` function,
> (2) added if branches based on it to remove `input` gradient computations if it's not requested,
> (3) feed in `TensorInfo<scalar_t, index_t>()` instead of `getTensorInfo<scalar_t, index_t>(grad_input)` in case gradient for `input` is not requested.
>
> * Modified tests in `test/test_nn.py` so that they run also cases with no `input` gradient needed.
>
> * Have not touched the CPU fallback kernel.
Note: the changes number (3) are N/A in this case.
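A minimal sketch of the case this optimizes, shown here for the 3D (volumetric) path with made-up shapes: only `grid` requires grad, so the `input` gradient never needs to be allocated or computed.
```
import torch
import torch.nn.functional as F

inp = torch.randn(1, 2, 8, 8, 8)                             # input: no grad needed
grid = (torch.rand(1, 4, 4, 4, 3) * 2 - 1).requires_grad_()  # only the grid needs grad

out = F.grid_sample(inp, grid, mode="bilinear", align_corners=False)
out.sum().backward()
print(inp.grad is None, grid.grad.shape)  # True, torch.Size([1, 4, 4, 4, 3])
```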
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71759
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72941
Simple test for MHA; we use cosine similarity as the metric since scaling generates mismatches. CUDA is validated; a CPU fix will follow (we can land this with the onlyCUDA flag and remove it once CPU is also done).
Test Plan:
For cuda:
buck build mode/opt -c fbcode.enable_gpu_sections=true caffe2/test:nn && buck-out/gen/caffe2/test/nn\#binary.par -r test_native_multihead_attention_cuda_float32 2>&1 | pastry
Reviewed By: swolchok
Differential Revision: D33906921
fbshipit-source-id: ad447401eb7002f22ed533d620a6b544524b3f58
(cherry picked from commit 45b778da27)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72944
Doesn't make sense to develop it in core right now.
ghstack-source-id: 149456040
Test Plan:
CI
run MHA benchmark in benchmark_transformers.py to make sure it doesn't crash
Reviewed By: zrphercule
Differential Revision: D34283104
fbshipit-source-id: 4f0c7a6bc066f938ceac891320d4cf4c3f8a9cd6
(cherry picked from commit b9df65e97c)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72671
The existing kernel did not handle cases where D % 4 != 0 or dim_per_head % 4 != 0. Now we have a non-vectorized kernel for these cases.
ghstack-source-id: 149201477
Test Plan: Updated test_nn to cover these cases.
Reviewed By: zrphercule, ngimel
Differential Revision: D34119371
fbshipit-source-id: 4e9b4d9b636224ef2c433593f6f236df040de782
(cherry picked from commit f5393878e4)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72464
We had some trouble getting this component (and this test!) right, so let's test it.
ghstack-source-id: 149201478
Test Plan: new test passes
Reviewed By: zrphercule
Differential Revision: D33992477
fbshipit-source-id: cc377eed5d4a4412b42bdabf360601c6e52947cf
(cherry picked from commit 9832867b12)
Summary:
https://github.com/pytorch/pytorch/issues/71521 attempted to fix an issue where the `test_conv_large` test was producing `NaN` values after the backward pass, yielding a bogus comparison between the result and the expected result. While tweaking the initialization of the conv layer seemed to fix this behavior, it was actually just masking the real issue, which was that `grad_weight` is not guaranteed to be initialized in `raw_cudnn_convolution_backward_weight_out` when the backward operation is split.
Specifically, the `grad_weight` tensor is expected to be directly written to by a `cudnn` kernel (which does occur in most cases) so it does not need to be initialized, but splitting introduces an intermediate `grad_weight_` tensor that holds the intermediate gradients and then accumulates into `grad_weight` without initializing it first. This PR tweaks this behavior so that now accumulation is done with a zero'd tensor, and also adds the change of doing the accumulation in an accumulation dtype. The hacky workaround masking the issue is also reverted, with the safeguard against comparing `NaN` values (using the reference tensor for scale computation) kept in place.
CC ngimel ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72157
Reviewed By: malfet
Differential Revision: D34147547
Pulled By: ngimel
fbshipit-source-id: 056c19f727eeef96347db557528272e24eae4223
(cherry picked from commit 24c7f77a81)
Summary:
The only difference with plain list/dict now is that nn.Parameters are
handled specially and registered as parameters properly.
test_nn and parametrization work locally.
Will see in CI if DP is fixed as well.
Tentative fix for https://github.com/pytorch/pytorch/issues/36035
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70499
Reviewed By: jbschlosser, alexeib
Differential Revision: D34005332
Pulled By: albanD
fbshipit-source-id: 7e76b0873d0fec345cb537e2a6ecba0258e662b9
(cherry picked from commit dc1e6f8d86)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/71720
This PR removes the old warnings for `recompute_scale_factor` and `align_corners`.
Looking at this, I realize that the tests I modified don't really catch whether or not a warning is created for `recompute_scale_factor`. If desired, I can add a couple lines into the tests there to pass a floating point in the `scale_factors` kwarg, along with `recompute_scale_factor=None`.
Let me know how this looks, thanks so much!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72093
Reviewed By: mruberry
Differential Revision: D33917615
Pulled By: albanD
fbshipit-source-id: e822f0a15b813ecf312cdc6ed0b693e7f1d1ca89
(cherry picked from commit c14852b85c)
Summary:
Pull Request resolved: https://github.com/pytorch/torchrec/pull/39
Pull Request resolved: https://github.com/facebookresearch/torchrec/pull/6
This makes it so that shared parameters get their own entry in `named_parameters`.
More broadly, this makes it so that
```
params_and_buffers = {**dict(mod.named_parameters(remove_duplicate=False)), **dict(mod.named_buffers(remove_duplicate=False))}
_stateless.functional_call(mod, params_and_buffers, args, kwargs)
```
is identical to calling the original module's forwards pass.
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71542
Reviewed By: jbschlosser, albanD
Differential Revision: D33716716
Pulled By: Chillee
fbshipit-source-id: ff1ed9980bd1a3f7ebaf695ee5e401202b543213
(cherry picked from commit d6e3ad3cd0)
Summary:
Hi,
The PR fixes https://github.com/pytorch/pytorch/issues/71096. It aims to scan all the test files and replace `ALL_TENSORTYPES` and `ALL_TENSORTYPES2` with `get_all_fp_dtypes`.
I'm looking forward to your viewpoints!
Thanks!
cc: janeyx99 kshitij12345
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71153
Reviewed By: jbschlosser, mruberry
Differential Revision: D33533346
Pulled By: anjali411
fbshipit-source-id: 75e79ca2756c1ddaf0e7e0289257fca183a570b3
(cherry picked from commit da54b54dc5)
Summary:
This PR twiddles the parameters of the conv layer in `test_conv_large` to better avoid NaN values. Previously, this test would cause a NaN to be computed for `scale` (propagated from `.mean()` on the `.grad` tensor). This NaN would then be propagated to the scaled gradients via division, resulting in a bogus `assertEqual` check as `NaN == NaN` is by default true. (This behavior was observed on V100 and A100).
To improve visibility of failures in the event of NaNs in `grad1`, scale is now computed from `grad2`.
Interestingly enough, we discovered this issue when trying out some less common setups that broke this test; it turns out those breakages were cases where there were no NaN values (leading to an actual `assertEqual` check that would fail for `float16`).
CC ptrblck ngimel puririshi98
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71521
Reviewed By: anjali411
Differential Revision: D33776705
Pulled By: ngimel
fbshipit-source-id: a1ec4792cba04c6322b22ef5b80ce08579ea4cf6
(cherry picked from commit d207bd9b87)
Summary:
We found a discrepancy between CPU & CUDA when using RNN modules: input shapes containing 0s cause an invalid configuration argument error on CUDA (kernel grid size is 0), while a valid tensor is returned in the CPU case.
A reproducer:
```
import torch
x = torch.zeros((5, 0, 3)).cuda()
gru = torch.nn.GRU(input_size=3, hidden_size=4).to("cuda")
gru(x)
```
Run with `CUDA_LAUNCH_BLOCKING=1` set.
cc ngimel albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71696
Reviewed By: mikaylagawarecki
Differential Revision: D33743674
Pulled By: ngimel
fbshipit-source-id: e9334175d10969fdf1f9c63985910d944bbd26e7
(cherry picked from commit 70838ba69b)
Summary:
Helps fix a part of https://github.com/pytorch/pytorch/issues/69865
The first commit just migrates everything as is.
The second commit uses the "device" variable instead of passing "cuda" everywhere
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70872
Reviewed By: jbschlosser
Differential Revision: D33455941
Pulled By: janeyx99
fbshipit-source-id: 9d9ec8c95f1714c40d55800e652ccd69b0c314dc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69727
Still need to test the backward ones. We would need to update gradgradcheck to check forward over backward.
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D33031728
Pulled By: soulitzer
fbshipit-source-id: 86c59df5d2196b5c8dbbb1efed9321e02ab46d30
Summary:
Fixes https://github.com/pytorch/pytorch/issues/68476
We implemented all of the following `dict` methods for `ParameterDict`
- `get `
- `setdefault`
- `popitem`
- `fromkeys`
- `copy`
- `__or__`
- `__ior__`
- `__reversed__`
- `__ror__`
The behavior of these new methods matches the expected behavior of python `dict` as defined by the language itself: https://docs.python.org/3/library/stdtypes.html#typesmapping
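A short usage sketch of a few of the added methods, assuming they follow the built-in `dict` semantics as stated above:
```
import torch
from torch import nn

pd = nn.ParameterDict({"w": nn.Parameter(torch.randn(3))})

pd.setdefault("b", nn.Parameter(torch.zeros(3)))            # inserts only if "b" is missing
print(pd.get("missing") is None)                            # dict-style get, default None
merged = pd | nn.ParameterDict({"v": nn.Parameter(torch.ones(2))})  # __or__
print(sorted(merged.keys()), list(reversed(pd)))            # __reversed__ iterates keys
```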
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69403
Reviewed By: albanD
Differential Revision: D33187111
Pulled By: jbschlosser
fbshipit-source-id: ecaa493837dbc9d8566ddbb113b898997e2debcb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69272
In transformer encoder and MHA, masked_softmax's mask is a 2D tensor (B, D), where input is a 4D tensor (B, H, D, D).
This mask could simply be broadcast to (B, H, D, D) like the input and then go through a regular masked_softmax; however, that would introduce a non-contiguous mask and consume more memory.
In this diff, we keep the mask's shape unchanged and compute the corresponding mask element for the input in each CUDA thread.
This new layout is not supported on CPU yet.
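A plain-PyTorch sketch of the layout semantics described above (illustrative only, not the CUDA kernel): the (B, D) key-padding mask is logically expanded to the (B, H, D, D) input before the softmax over the last dimension.
```
import torch

B, H, D = 2, 4, 5
scores = torch.randn(B, H, D, D)                        # input to masked_softmax
key_padding_mask = torch.zeros(B, D, dtype=torch.bool)  # True marks padded keys
key_padding_mask[0, -1] = True

masked = scores.masked_fill(key_padding_mask.view(B, 1, 1, D), float("-inf"))
probs = masked.softmax(dim=-1)
print(probs[0, :, :, -1].abs().sum())  # the padded key receives zero attention
```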
Test Plan: buck build mode/opt -c fbcode.enable_gpu_sections=true caffe2/test:nn && buck-out/gen/caffe2/test/nn\#binary.par -r test_masked_softmax
Reviewed By: ngimel
Differential Revision: D32605557
fbshipit-source-id: ef37f86981fdb2fb264d776f0e581841de5d68d2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69268
This diff enables native masked softmax on CUDA and also expands our current warp_softmax to accept masking.
The mask in this masked softmax has to be the same shape as the input, and has to be contiguous.
In a follow-up diff, I will include the encoder mask layout, where the input is BHDD and the mask is BD.
Test Plan: buck build mode/opt -c fbcode.enable_gpu_sections=true caffe2/test:nn && buck-out/gen/caffe2/test/nn\#binary.par -r test_masked_softmax
Reviewed By: ngimel
Differential Revision: D32338419
fbshipit-source-id: 48c3fde793ad4535725d9dae712db42e2bdb8a49
Summary:
Towards [convolution consolidation](https://fb.quip.com/tpDsAYtO15PO).
Introduces the general `convolution_backward` function that uses the factored-out backend routing logic from the forward function.
Some notes:
* `finput` is now recomputed in the backward pass for the slow 2d / 3d kernels instead of being saved from the forward pass. The logic for this is based on the forward computation and is present in the `compute_finput2d` / `compute_finput3d` functions in `ConvUtils.h`.
* Using structured kernels for `convolution_backward` requires extra copying since the backend-specific backward functions return tensors. Porting to structured is left as future work.
* The tests that check the routing logic have been renamed from `test_conv_backend_selection` -> `test_conv_backend` and now also include gradcheck validation using an `autograd.Function` hooking up `convolution` to `convolution_backward`. This was done to ensure that gradcheck passes for the same set of inputs / backends.
The forward pass routing is done as shown in this flowchart (probably need to download it for it to be readable since it's ridiculous); the flowchart images are omitted here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65219
Reviewed By: mruberry
Differential Revision: D32611368
Pulled By: jbschlosser
fbshipit-source-id: 26d759b7c908ab8f19ecce627acea7bd3d5f59ba
Summary:
Adds native_dropout to have a reasonable target for TorchScript autodiff. native_dropout has scale and train as arguments in its signature; this makes native_dropout more consistent with other operators and removes conditionals in the autodiff definition.
cc gmagogsfm
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63937
Reviewed By: mruberry
Differential Revision: D32477657
Pulled By: ngimel
fbshipit-source-id: d37b137a37acafa50990f60c77f5cea2818454e4
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53647
With this, if a test forgets to add `dtypes` while using `dtypesIf`, the following error is raised:
```
AssertionError: dtypes is mandatory when using dtypesIf however 'test_exponential_no_zero' didn't specify it
```
**Tested Locally**
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68186
Reviewed By: VitalyFedyunin
Differential Revision: D32468581
Pulled By: mruberry
fbshipit-source-id: 805e0855f988b77a5d8d4cd52b31426c04c2200b
Summary:
This PR introduces a new function `_select_conv_backend` that returns a `ConvBackend` enum representing the selected backend for a given set of convolution inputs and params.
The function and enum are exposed to python for testing purposes through `torch/csrc/Module.cpp` (please let me know if there's a better place to do this).
A new set of tests validates that the correct backend is selected for several sets of inputs + params. Some backends aren't tested yet:
* nnpack (for mobile)
* xnnpack (for mobile)
* winograd 3x3 (for mobile)
Some flowcharts for reference: (flowchart images omitted)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67790
Reviewed By: zou3519
Differential Revision: D32280878
Pulled By: jbschlosser
fbshipit-source-id: 0ce55174f470f65c9b5345b9980cf12251f3abbb
Summary:
This PR makes several changes:
- Changed function `bool cudnn_conv_use_channels_last(...)` to `at::MemoryFormat cudnn_conv_suggest_memory_format(...)`
- Removed `resize_` in cudnn convolution code. Added a new overloading method `TensorDescriptor::set` that also passes the desired memory format of the tensor.
- Disabled the usage of double + channels_last on cuDNN Conv-Relu and Conv-Bias-Relu. Call `.contiguous(memory_format)` before passing data to cuDNN functions.
- Disabled the usage of cuDNN fused Conv-Bias-Relu in cuDNN < 8.0 version due to a CUDNN_STATUS_NOT_SUPPORTED error. Instead, use the native fallback path.
- Let Conv-Bias-Relu code respect the global `allow_tf32` flag.
From the cuDNN documentation, double + NHWC is generally not supported.
Close https://github.com/pytorch/pytorch/pull/66968
Fix https://github.com/pytorch/pytorch/issues/55301
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65594
Reviewed By: jbschlosser, malfet
Differential Revision: D32175766
Pulled By: ngimel
fbshipit-source-id: 7ba079c9f7c46fc56f8bfef05bad0854acf380d7
Summary:
Partially fixes https://github.com/pytorch/pytorch/issues/66066
This PR:
- cleans up op-specific testing from test_autograd. test_autograd should be reserved for testing generic autograd functionality
- tests related to an operator are better colocated
- see the tracker for details
What to think about when moving tests to their correct test suite:
- naming, make sure it's not too generic
- how the test is parametrized, sometimes we need to add/remove a device/dtype parameter
- can this be merged with existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67413
Reviewed By: jbschlosser, albanD
Differential Revision: D32031480
Pulled By: soulitzer
fbshipit-source-id: 8e13da1e58a38d5cecbfdfd4fe2b4fe6f816897f
Summary:
Fix https://github.com/pytorch/pytorch/issues/67239
The CUDA kernels for `adaptive_max_pool2d` (forward and backward) were written for contiguous output. If outputs are non-contiguous, first create a contiguous copy and let the kernel write output to the contiguous memory space. Then copy the output from contiguous memory space to the original non-contiguous memory space.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67697
Reviewed By: ejguan
Differential Revision: D32112443
Pulled By: ngimel
fbshipit-source-id: 0e3bf06d042200c651a79d13b75484526fde11fe
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66879
This adds a quantized implementation for bilinear gridsample. Bicubic interpolation cannot be supported as easily since we rely on the linearity of quantization to operate on the raw values, i.e.
f(q(a), q(b)) = q(f(a, b)) where f is the linear interpolation function.
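A quick numeric sketch of the stated identity for linear interpolation under an affine quantizer q(x) = x / scale + zero_point (rounding ignored); the values below are arbitrary:
```
# The zero-point terms combine as (1 - t) * z + t * z = z, so interpolating
# quantized values equals quantizing the interpolated value.
scale, zero_point, t = 0.1, 128, 0.3
q = lambda x: x / scale + zero_point

a, b = 1.7, -0.4
lhs = (1 - t) * q(a) + t * q(b)   # interpolate in the quantized domain
rhs = q((1 - t) * a + t * b)      # quantize the interpolated value
print(abs(lhs - rhs) < 1e-9)
```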
ghstack-source-id: 141321116
Test Plan: test_quantization
Reviewed By: kimishpatel
Differential Revision: D31656893
fbshipit-source-id: d0bc31da8ce93daf031a142decebf4a155943f0f
Summary:
Removes the 3D special case logic in `_convolution_double_backward()` that never worked.
The logic was never called previously since `convolution()` expands input / weight from 3D -> 4D before passing them to backends; backend-specific backward calls thus save the 4D version to pass to `_convolution_double_backward()`.
The new general `convolution_backward()` saves the original 3D input / weight, uncovering the bug.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67283
Reviewed By: anjali411
Differential Revision: D32021100
Pulled By: jbschlosser
fbshipit-source-id: 0916bcaa77ef49545848b344d6385b33bacf473d