Commit Graph

1598 Commits

Author SHA1 Message Date
Eddie Yan
2bc82163eb Reduce memory usage requirement of test_warp_softmax_64bit_indexing in test_nn.py (re-open of #85037) (#85373)
CC @ngimel @xwang233 @ptrblck

Adds a fix for `get_tolerances`; tested locally on a DGX Volta.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85373
Approved by: https://github.com/ngimel
2022-09-22 07:34:47 +00:00
Mikayla Gawarecki
77f1f98479 Re-introduce torch.Tensor.to_padded_tensor (#85293)
Differential Revision: [D39629004](https://our.internmc.facebook.com/intern/diff/D39629004)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85293
Approved by: https://github.com/cpuhrsch
2022-09-21 18:45:56 +00:00
PyTorch MergeBot
53fdd60635 Revert "Reduce memory usage requirement of test_warp_softmax_64bit_indexing in test_nn.py (#85037)"
This reverts commit 66a9cba221.

Reverted https://github.com/pytorch/pytorch/pull/85037 on behalf of https://github.com/clee2000 due to broke test_warp_softmax_64bit_indexing_cuda_float32 and test_warp_softmax_64bit_indexing_cuda_float16 on rocm https://github.com/pytorch/pytorch/actions/runs/3085764744/jobs/4989643817
2022-09-20 00:13:41 +00:00
eqy
66a9cba221 Reduce memory usage requirement of test_warp_softmax_64bit_indexing in test_nn.py (#85037)
For reference: #84944

CC @xwang233 @ngimel @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85037
Approved by: https://github.com/ngimel, https://github.com/pmeier
2022-09-19 21:31:08 +00:00
Elias Ellison
f37069aac7 Re-enable fixed dynamo tests (#84969)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84969
Approved by: https://github.com/bdhirsh, https://github.com/ezyang
2022-09-16 15:36:52 +00:00
Michael Melesse
b6d6a78c12 [ROCM] test_batchnorm_cudnn_nhwc (#84603)
This PR enables test_batchnorm_cudnn_nhwc. It is a follow-up to https://github.com/pytorch/pytorch/pull/82512
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84603
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet
2022-09-14 15:50:14 +00:00
Mikayla Gawarecki
e217b30b0f Add torch.nested namespace (#84102)
First step towards #83775
- only `to_padded_tensor` is moved to the nested namespace for now
- following the schema used for `special`, `fft`, `linalg` and other namespaces, nested functions are registered in native_functions.yaml as `nested_{function_name}` and are bound to the desired Python name in
`torch/nested/__init__.py`, and the desired C++ name in `torch/csrc/api/include/torch/nested.h`.

~~**Question**: should we keep the documentation for `Tensor.to_padded_tensor` or can this deleted since it is shared by `torch.nested.to_padded_tensor`?~~

[generated nested docs](https://docs-preview.pytorch.org/84102/nested.html?highlight=nested#module-torch.nested)
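
A minimal usage sketch of the new namespace (the `torch.nested.nested_tensor` constructor shown here is an assumption from later releases; this PR only moves `to_padded_tensor` into the namespace):

```python
import torch

# nested_tensor constructor assumed for illustration; this PR only
# adds to_padded_tensor under torch.nested.
nt = torch.nested.nested_tensor([torch.randn(2, 3), torch.randn(4, 3)])
padded = torch.nested.to_padded_tensor(nt, padding=0.0)
print(padded.shape)  # torch.Size([2, 4, 3]): ragged dim padded to max length
```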

Differential Revision: [D39361148](https://our.internmc.facebook.com/intern/diff/D39361148)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84102
Approved by: https://github.com/drisspg
2022-09-12 16:31:05 +00:00
Kshiteej K
6d6e04d6cc [test_nn] move dropout tests to test/nn/test_dropout.py (#84165)
Ref https://github.com/pytorch/pytorch/issues/63085
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84165
Approved by: https://github.com/albanD
2022-09-03 07:21:48 +00:00
Elias Ellison
f701cb04fb Test Dynamo CI w Fake Tensors (#84282)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84282
Approved by: https://github.com/anijain2305
2022-09-01 00:15:05 +00:00
lezcano
b106a04d76 Fix the edge case when y = 0 in kl_div (#82714)
Brought up in https://github.com/pytorch/pytorch/pull/80334#issuecomment-1193600883

We also prepare its OpInfo to fix https://github.com/pytorch/pytorch/issues/80488
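
A small sketch of the edge case: with `y = 0`, the `y * log(y/x)` term should contribute 0 (since `y * log(y) -> 0` as `y -> 0`) rather than propagating NaN from `0 * (-inf)`:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([0.5, 0.5]).log()  # kl_div expects log-probabilities as input
y = torch.tensor([0.0, 1.0])        # target with a zero entry
# The zero-target term contributes 0 instead of NaN after the fix:
print(F.kl_div(x, y, reduction="sum"))  # tensor(0.6931)
```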
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82714
Approved by: https://github.com/albanD
2022-08-30 18:18:25 +00:00
Edward Z. Yang
ad44670fa1 Back out "Revert D38984222: Don't introduce new overload for SymInt (#83628)" (#84173)
Also Back out "Revert D39075159: [acc_tensor] Use SymIntArrayRef for overloaded empty.memory_format's signature"

Original commit changeset: dab4a9dba4fa
Original commit changeset: dcaf16c037a9

Original Phabricator Diff: D38984222
Original Phabricator Diff: D39075159

Also update Metal registrations for C++ registration changes.

Also update NNPI registration to account for tightened schema checking

Differential Revision: [D39084762](https://our.internmc.facebook.com/intern/diff/D39084762/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39084762/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84173
Approved by: https://github.com/Krovatkin
2022-08-29 18:01:07 +00:00
soulitzer
7088a98fba conv2d: require bias to have the same dtype as input and weight on cpu (#83686)
Fixes https://github.com/pytorch/pytorch/issues/83505

BC-breaking message:
- Previously we only required input and weight to have the same dtype on CPU (when the input is non-complex). After this change, bias is also expected to have that same dtype. This change was necessary to improve the error message for certain combinations of inputs, and the behavior now matches that of convolution on CUDA.
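
A hedged sketch of the new behavior (shapes are illustrative):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)              # float32 input
w = torch.randn(4, 3, 3, 3)              # float32 weight
b = torch.randn(4, dtype=torch.float64)  # bias with a mismatched dtype

try:
    F.conv2d(x, w, b)  # now raises on CPU, matching the CUDA behavior
except RuntimeError as e:
    print(e)
```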

<details>
<summary>
Old plan
</summary>
Previously convolution (at least for slow_conv2d) did not perform type promotion, i.e. the output of `conv(int, int, float)` is an int, and that leads to the autograd assert.

This PR adds type promotion handling at the `at::native::conv2d` level (this is a composite op). We also need to correct or remove many tests that assume conv errors out when input dtypes are mixed.

Pros:
- Doing type promotion at this level avoids the complex path from having any special handling for mixed dtypes, and can potentially let mixed-dtype inputs dispatch to faster kernels that are only capable of handling floats.

Cons:
- Doing type promotion at this level risks introducing extra overhead when we would've dispatched to a kernel capable of handling mixed types anyway. I don't know if any such kernels exist at all, though - it is possible that inputs with any non-float arguments are dispatched to the slow path.

If this approach is OK, we can proceed with the other convolutions as well.
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83686
Approved by: https://github.com/ngimel
2022-08-29 16:41:17 +00:00
Natalia Gimelshein
0ac2986d33 Fixes softmax indexing for large tensors (#84182)
Fixes #84144

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84182
Approved by: https://github.com/janeyx99
2022-08-29 04:29:09 +00:00
PyTorch MergeBot
c7edcd6968 Revert "Don't introduce new overload for SymInt (#83628)"
This reverts commit 9790d90e4b.

Reverted https://github.com/pytorch/pytorch/pull/83628 on behalf of https://github.com/malfet due to Breaks internal builds, see D39076487
2022-08-27 01:23:17 +00:00
Animesh Jain
6a58603956 Update Dynamo pin (#83829)
As title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83829
Approved by: https://github.com/ezyang
2022-08-26 20:49:43 +00:00
Edward Z. Yang
9790d90e4b Don't introduce new overload for SymInt (#83628)
Previously, we introduced new SymInt overloads for every function we wanted.  This led to a lot of boilerplate, and also a lot of confusion about how the overloads needed to be implemented.

This PR takes a simpler but more risky approach: just take the original function and change its ints to SymInts.

This is BC-breaking in the following ways:

* The C++ API for registering implementations for aten operators will change from int64_t to SymInt whenever you make this change. Code-generated registrations in PyTorch do not change, as codegen handles the translation automatically, but manual registrations will need to follow the change. Typically, if you now accept a SymInt where you previously only took int64_t, you have to convert it back manually. This will definitely break XLA; see the companion PR https://github.com/pytorch/xla/pull/3914. Note that not all dispatch keys get the automatic translation; all the composite keys and Meta keys are modified to take SymInt directly (because they should handle them directly), and so there are adjustments for this.

This is not BC-breaking in the following ways:

* The user-facing C++ API remains compatible. Even if a function changes from int to SymInt, the default C++ binding still takes only ints (e.g., `at::empty(IntArrayRef, ...)`). To call with SymInts, you must call `at::empty_symint` instead. This involved adding two more signatures to CppSignatureGroup; in many cases I refactored code to iterate over all signatures in the group instead of hard-coding the two that previously existed.
* This is TorchScript compatible; internally we treat SymInts as ints, so there is no change to what happens at runtime in TorchScript. In particular, it's OK to reference an empty schema by its old type (using int types); as long as you're not doing string equality (which you shouldn't be), these parse to the same underlying type.

Structure of the PR:

* The general strategy of this PR is that, even when you write `SymInt` inside `native_functions.yaml`, sometimes, we will treat it *as if* it were an `int`. This idea pervades the codegen changes, where we have a translation from SymInt to c10::SymInt or int64_t, and this is controlled by a symint kwarg which I added and then audited all call sites to decide which I wanted. Here are some of the major places where we pick one or the other:
  * The C++ FunctionSchema representation represents `SymInt` as `int`. There are a few places we do need to know that we actually have a SymInt and we consult `real_type()` to get the real type in this case. In particular:
    * When we do schema validation of C++ operator registration, we must compare against the true schema (as the C++ API will provide `c10::SymInt`, and this will only be accepted if the schema is `SymInt`). This is handled with cloneWithRealTypes before we check for schema differences.
    * In `toIValue` argument parsing, we parse against the true schema value. For backwards compatibility reasons, I do still accept ints in many places where Layout/SymInt/etc were expected. (Well, accepting int where SymInt is expected is not BC, it's just the right logic!)
  * In particular, because SymInt never shows up as type() in FunctionSchema, this means that we no longer need a dedicated Tag::SymInt. This is good, because SymInts never show up in mobile anyway.
* Changes to functorch/aten are mostly about tracking changes to the C++ API registration convention. Additionally, since SymInt overloads no longer exist, registrations for SymInt implementations are deleted. In many cases, the old implementations did not properly support SymInts; I did not add any new functionality with this PR, but I did try to annotate with TODOs where there is work to do. Finally, because the signature of the `native::` API changed from int to SymInt, I need to find alternative APIs for people who were directly calling these functions to call. Typically, I insert a new dispatch call when perf doesn't matter, or use the `at::compositeexplicitautograd` namespace to handle other cases.
* The change to `make_boxed_from_unboxed_functor.h` is so that we accept a plain IntList IValue anywhere a SymIntList is expected; these are read-only arguments so covariant typing is OK.
* I change how unboxing logic works slightly. Previously, we interpret the C++ type for Layout/etc directly as IntType JIT type, which works well because the incoming IValue is tagged as an integer. Now, we interpret the C++ type for Layout as its true type, e.g., LayoutType (change to `jit_type.h`), but then we accept an int IValue for it anyway. This makes it symmetric with SymInt, where we interpret the C++ type as SymIntType, and then accept SymInt and int IValues for it.
* I renamed the `empty.names` overload to `empty_names` to make it less confusing (I kept mixing it up with the real empty overload)
* I deleted the `empty.SymInt` overload, which ended up killing a pile of functions. (This was originally a separate PR but the profiler expect test was giving me grief so I folded it in.)
* I deleted the LazyDynamicOpsTest tests. These were failing after these changes, and I couldn't figure out why they used to be passing: they make use of `narrow_copy` which didn't actually support SymInts; they were immediately converted to ints.
* I bashed LTC into working. The patches made here are not the end of the story. The big problem is that SymInt translates into Value, but what if you have a list of SymInt? This cannot be conveniently represented in the IR today, since variadic Values are not supported. To work around this, I translate SymInt[] into plain int[] (this is fine for tests because LTC dynamic shapes never actually worked); but this will need to be fixed for proper LTC SymInt support. The LTC codegen also looked somewhat questionable; I added comments based on my code reading.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83628
Approved by: https://github.com/albanD, https://github.com/bdhirsh
2022-08-26 01:35:40 +00:00
zaf
2f04ba2c7c [quant][ao_migration] torch.nn.qat → torch.ao.nn.qat (#78716)
Context: In order to avoid cluttering the `torch.nn` namespace,
the quantized modules namespace is moved to `torch.ao.nn`.

The list of the `nn.quantized` files that are being migrated:

- [X] `torch.nn.quantized` → `torch.ao.nn.quantized`
    - [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional`
    - [X] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules`
    - [X] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic`
    - [X] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference`
- [X] `torch.nn.quantizable` → `torch.ao.nn.quantizable`
- [X] [Current PR] `torch.nn.qat` → `torch.ao.nn.qat`
    - [X] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules`
    - [X] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic`
- [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic`
    - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules`
    - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat`
    - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized`
        - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules`
        - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic`

The majority of the files are just moved to the new location.
However, specific files need to be double-checked:

- None

Differential Revision: [D36861197](https://our.internmc.facebook.com/intern/diff/D36861197/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36861197/)!

Differential Revision: [D36861197](https://our.internmc.facebook.com/intern/diff/D36861197)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78716
Approved by: https://github.com/jerryzh168
2022-08-25 16:50:38 +00:00
XiaobingSuper
a013597b32 fix oneDNN channels_last path issue (#83653)
Fixes #82060 (the N > 1 case falls into the oneDNN path) and #80837. Both issues arise because PyTorch's framework side and ideep define channels last differently; this PR closes that gap by having ideep use the format flag given by the framework side.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83653
Approved by: https://github.com/mingfeima, https://github.com/malfet
2022-08-25 03:58:11 +00:00
PyTorch MergeBot
a7edf71360 Revert "Don't introduce new overload for SymInt (#83628)"
This reverts commit 8fae7027b3.

Reverted https://github.com/pytorch/pytorch/pull/83628 on behalf of https://github.com/malfet due to breaking internal builds, see https://www.internalfb.com/diff/D38984222
2022-08-25 00:49:40 +00:00
kshitij12345
7a8152530d move pooling test from test_nn to test/nn/test_pooling (#83915)
Ref #63085

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83915
Approved by: https://github.com/albanD
2022-08-24 16:17:50 +00:00
Ishan-Rajgarhia
7fdc2f70c6 Task: T129772171 remove assertEqualIgnoreTypes from test/test_nn.py (#83870)
See https://github.com/pytorch/pytorch/issues/38095
Replaced assertEqualIgnoreType with assertEqual
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83870
Approved by: https://github.com/kit1980
2022-08-24 02:45:52 +00:00
Edward Z. Yang
8fae7027b3 Don't introduce new overload for SymInt (#83628)
Previously, we introduced new SymInt overloads for every function we wanted.  This led to a lot of boilerplate, and also a lot of confusion about how the overloads needed to be implemented.

This PR takes a simpler but more risky approach: just take the original function and change its ints to SymInts.

This is BC-breaking in the following ways:

* The C++ API for registering implementations for aten operators will change from int64_t to SymInt whenever you make this change. Code-generated registrations in PyTorch do not change, as codegen handles the translation automatically, but manual registrations will need to follow the change. Typically, if you now accept a SymInt where you previously only took int64_t, you have to convert it back manually. This will definitely break XLA; see the companion PR https://github.com/pytorch/xla/pull/3914. Note that not all dispatch keys get the automatic translation; all the composite keys and Meta keys are modified to take SymInt directly (because they should handle them directly), and so there are adjustments for this.

This is not BC-breaking in the following ways:

* The user-facing C++ API remains compatible. Even if a function changes from int to SymInt, the default C++ binding still takes only ints (e.g., `at::empty(IntArrayRef, ...)`). To call with SymInts, you must call `at::empty_symint` instead. This involved adding two more signatures to CppSignatureGroup; in many cases I refactored code to iterate over all signatures in the group instead of hard-coding the two that previously existed.
* This is TorchScript compatible; internally we treat SymInts as ints, so there is no change to what happens at runtime in TorchScript. In particular, it's OK to reference an empty schema by its old type (using int types); as long as you're not doing string equality (which you shouldn't be), these parse to the same underlying type.

Structure of the PR:

* The general strategy of this PR is that, even when you write `SymInt` inside `native_functions.yaml`, sometimes, we will treat it *as if* it were an `int`. This idea pervades the codegen changes, where we have a translation from SymInt to c10::SymInt or int64_t, and this is controlled by a symint kwarg which I added and then audited all call sites to decide which I wanted. Here are some of the major places where we pick one or the other:
  * The C++ FunctionSchema representation represents `SymInt` as `int`. There are a few places we do need to know that we actually have a SymInt and we consult `real_type()` to get the real type in this case. In particular:
    * When we do schema validation of C++ operator registration, we must compare against the true schema (as the C++ API will provide `c10::SymInt`, and this will only be accepted if the schema is `SymInt`). This is handled with cloneWithRealTypes before we check for schema differences.
    * In `toIValue` argument parsing, we parse against the true schema value. For backwards compatibility reasons, I do still accept ints in many places where Layout/SymInt/etc were expected. (Well, accepting int where SymInt is expected is not BC, it's just the right logic!)
  * In particular, because SymInt never shows up as type() in FunctionSchema, this means that we no longer need a dedicated Tag::SymInt. This is good, because SymInts never show up in mobile anyway.
* Changes to functorch/aten are mostly about tracking changes to the C++ API registration convention. Additionally, since SymInt overloads no longer exist, registrations for SymInt implementations are deleted. In many cases, the old implementations did not properly support SymInts; I did not add any new functionality with this PR, but I did try to annotate with TODOs where there is work to do. Finally, because the signature of the `native::` API changed from int to SymInt, I need to find alternative APIs for people who were directly calling these functions to call. Typically, I insert a new dispatch call when perf doesn't matter, or use the `at::compositeexplicitautograd` namespace to handle other cases.
* The change to `make_boxed_from_unboxed_functor.h` is so that we accept a plain IntList IValue anywhere a SymIntList is expected; these are read-only arguments so covariant typing is OK.
* I change how unboxing logic works slightly. Previously, we interpret the C++ type for Layout/etc directly as IntType JIT type, which works well because the incoming IValue is tagged as an integer. Now, we interpret the C++ type for Layout as its true type, e.g., LayoutType (change to `jit_type.h`), but then we accept an int IValue for it anyway. This makes it symmetric with SymInt, where we interpret the C++ type as SymIntType, and then accept SymInt and int IValues for it.
* I renamed the `empty.names` overload to `empty_names` to make it less confusing (I kept mixing it up with the real empty overload)
* I deleted the `empty.SymInt` overload, which ended up killing a pile of functions. (This was originally a separate PR but the profiler expect test was giving me grief so I folded it in.)
* I deleted the LazyDynamicOpsTest tests. These were failing after these changes, and I couldn't figure out why they used to be passing: they make use of `narrow_copy` which didn't actually support SymInts; they were immediately converted to ints.
* I bashed LTC into working. The patches made here are not the end of the story. The big problem is that SymInt translates into Value, but what if you have a list of SymInt? This cannot be conveniently represented in the IR today, since variadic Values are not supported. To work around this, I translate SymInt[] into plain int[] (this is fine for tests because LTC dynamic shapes never actually worked); but this will need to be fixed for proper LTC SymInt support. The LTC codegen also looked somewhat questionable; I added comments based on my code reading.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83628
Approved by: https://github.com/albanD, https://github.com/bdhirsh
2022-08-23 22:04:07 +00:00
Khushi Agrawal
9095030239 [fix] edge case in MaxPool1d and add ErrorInputs (#83553)
Fixes #83224

cc @kshitij12345 @albanD!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83553
Approved by: https://github.com/albanD
2022-08-23 19:23:39 +00:00
Kshiteej K
dd67d52b57 [nn] split rnn_utils test from test_nn.py (#83675)
Ref: https://github.com/pytorch/pytorch/issues/63085
Proposed folder structure
```
-> test
  -> nn
    -> test_conv.py
    -> test_pooling.py
    -> .....
```

This PR moves test-related RNN utilities to a separate file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83675
Approved by: https://github.com/albanD
2022-08-23 08:34:39 +00:00
XiaobingSuper
658f958bc4 fix upsample bf16 issue for channels last path by using high precision to compute index (#83847)
Given the following case:
```
import torch
a = torch.ones(1, 3, 320, 480).bfloat16().to(memory_format=torch.channels_last)
out_bf16 = torch.nn.functional.interpolate(a, size = (640, 960), scale_factor = None, mode = 'bilinear', align_corners = False, recompute_scale_factor= None, antialias = False)
out_fp32= torch.nn.functional.interpolate(a.float(), size = (640, 960), scale_factor = None, mode = 'bilinear', align_corners = False, recompute_scale_factor= None, antialias = False)
print(out_bf16[0, 2, :, :])
print(out_fp32[0, 2, :, :])
```
the boundary of the bfloat16 output gets wrong values:
```
tensor([[1.0000e+00, 1.0000e+00, 1.0000e+00,  ..., 1.0000e+00, 1.0000e+00,
         1.0000e+00],
        [1.0000e+00, 1.0000e+00, 1.0000e+00,  ..., 1.0000e+00, 1.0000e+00,
         1.0000e+00],
        [1.0000e+00, 1.0000e+00, 1.0000e+00,  ..., 1.0000e+00, 1.0000e+00,
         1.0000e+00],
        ...,
        [1.0000e+00, 1.0000e+00, 1.0000e+00,  ..., 1.0000e+00, 1.0000e+00,
         1.0000e+00],
        [1.0000e+00, 1.0000e+00, 1.0000e+00,  ..., 1.0000e+00, 1.0000e+00,
         1.0000e+00],
        [0.0000e+00, 0.0000e+00, 1.8367e-40,  ..., 0.0000e+00, 0.0000e+00,
         0.0000e+00]], dtype=torch.bfloat16)
tensor([[1., 1., 1.,  ..., 1., 1., 1.],
        [1., 1., 1.,  ..., 1., 1., 1.],
        [1., 1., 1.,  ..., 1., 1., 1.],
        ...,
        [1., 1., 1.,  ..., 1., 1., 1.],
        [1., 1., 1.,  ..., 1., 1., 1.],
        [1., 1., 1.,  ..., 1., 1., 1.]])

```

The expected behavior is that the bfloat16 output values should also be one. The root cause is that we use low precision to compute the index (see
fcb124406b/aten/src/ATen/native/UpSample.h (L448)); we should do this computation in high precision, as the GPU path does:
fcb124406b/aten/src/ATen/native/cuda/UpSample.cuh (L123)
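
An illustrative sketch of the precision loss (not the actual kernel code): computing `scale * dst_index` in bfloat16 rounds the fractional source index away, while float32 keeps it exact:

```python
import torch

# Illustration only: a bf16 product of scale * dst_index rounds the index.
scale, dst_index = 0.5, 639
bad = torch.tensor(scale, dtype=torch.bfloat16) * dst_index
good = torch.tensor(scale, dtype=torch.float32) * dst_index
print(bad.item(), good.item())  # 320.0 vs 319.5
```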

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83847
Approved by: https://github.com/frank-wei
2022-08-23 00:53:37 +00:00
PyTorch MergeBot
4cbb1986fe Revert "[quant][ao_migration] torch.nn.qat → torch.ao.nn.qat (#78716)"
This reverts commit 7cd2fa1d38.

Reverted https://github.com/pytorch/pytorch/pull/78716 on behalf of https://github.com/janeyx99 due to sorry, reverting so https://github.com/pytorch/pytorch/pull/78713 could be cleanly reverted
2022-08-22 07:23:24 +00:00
zaf
7cd2fa1d38 [quant][ao_migration] torch.nn.qat → torch.ao.nn.qat (#78716)
Context: In order to avoid cluttering the `torch.nn` namespace,
the quantized modules namespace is moved to `torch.ao.nn`.

The list of the `nn.quantized` files that are being migrated:

- [X] `torch.nn.quantized` → `torch.ao.nn.quantized`
    - [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional`
    - [X] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules`
    - [X] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic`
    - [X] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference`
- [X] `torch.nn.quantizable` → `torch.ao.nn.quantizable`
- [X] [Current PR] `torch.nn.qat` → `torch.ao.nn.qat`
    - [X] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules`
    - [X] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic`
- [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic`
    - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules`
    - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat`
    - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized`
        - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules`
        - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic`

The majority of the files are just moved to the new location.
However, specific files need to be double-checked:

- None

Differential Revision: [D36861197](https://our.internmc.facebook.com/intern/diff/D36861197/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36861197/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78716
Approved by: https://github.com/jerryzh168
2022-08-22 05:33:23 +00:00
Rui Zhu
e0f2eba93d Move odd num_head in TransformerEncoder to slow_path (#83483)
Summary: an odd nhead is not supported by masked softmax, so we route that case to the old slow path

Test Plan: CI

Differential Revision: D38720086

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83483
Approved by: https://github.com/erichan1
2022-08-20 10:02:08 +00:00
Jeff Daily
d52d2bd5a9 [ROCm] MIOpen fused convolution relu (#82002)
Adds MIOpen fused convolution relu for fp32 and contiguous memory format.  Adds fallbacks for conv + z + bias + relu, fp16, and channels last until MIOpen adds these features.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82002
Approved by: https://github.com/ngimel, https://github.com/malfet
2022-08-16 20:49:33 +00:00
Nicolas Macchioni
b236352036 Add mask identifier for multiplexed src_mask/src_key_padding_mask in BT (#81947)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81947

Transformer fastpath multiplexes two arguments, src_mask [seq_len x seq_len] and src_key_padding_mask [batch_size x seq_len], and later deduces the type based on mask shape.

In the event that batch_size == seq_len, any src_mask is wrongly interpreted as a src_key_padding_mask. This is fixed by requiring that a mask_type identifier be supplied whenever batch_size == seq_len.

Additionally, added support for src_mask in masked_softmax CPU path.
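
An illustration of the ambiguous shapes (the internal mask_type plumbing is not exposed here):

```python
import torch
import torch.nn as nn

# src_mask is [seq_len, seq_len]; src_key_padding_mask is [batch, seq_len].
# With batch_size == seq_len the two shapes coincide, which is the
# ambiguity this diff resolves with an explicit mask_type identifier.
layer = nn.TransformerEncoderLayer(d_model=8, nhead=2, batch_first=True)
src = torch.randn(4, 4, 8)                      # batch_size == seq_len == 4
src_mask = torch.zeros(4, 4, dtype=torch.bool)  # attention mask
out = layer(src, src_mask=src_mask)
```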

Test Plan: existing unit tests + new unit tests (batch_size == seq_len)

Differential Revision: D37932240

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81947
Approved by: https://github.com/zrphercule
2022-08-09 23:42:16 +00:00
Sergii Dymchenko
7390ae837c Resolve TODO for GroupNorm numerical issues (#82423)
Looks like the numerical issues are resolved now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82423
Approved by: https://github.com/ngimel
2022-08-03 19:42:26 +00:00
Jiayi Sun
15a284b09e optimize softmax backward and logsoftmax backward (#80114)
Currently, if we run softmax_backward/logsoftmax_backward along a dim that is not the last one, the calculation falls back to a [scalar version](32593ef2dd/aten/src/ATen/native/SoftMax.cpp (L220-L287)). We found that the calculation can actually be vectorized along the inner_size dim.

Changes we made:

Use vectorized softmax_backward_kernel/log_softmax_backward_kernel instead of host_softmax_backward when not along the last dim.
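
A minimal sketch of the code path this targets:

```python
import torch

# Backward of (log_)softmax along a non-last dim previously fell back to
# the scalar version; it is now vectorized along the inner_size dim.
x = torch.randn(32, 128, 128, requires_grad=True)
torch.softmax(x, dim=1).sum().backward()
torch.log_softmax(x, dim=1).sum().backward()
```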

We collected benchmark data for softmax_backward and logsoftmax_backward for the BFloat16 and Float32 data types using the operator_benchmark tool of PyTorch on an Intel(R) Xeon(R) Platinum 8260L CPU @ 2.40GHz.
Number of cores: 24 (1 socket)
[softmax_benchmark_32593ef.log](https://github.com/pytorch/pytorch/files/8962956/softmax_benchmark_32593ef.log)
[softmax_benchmark_the_pr.log](https://github.com/pytorch/pytorch/files/8962958/softmax_benchmark_the_pr.log)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80114
Approved by: https://github.com/frank-wei
2022-08-03 00:36:28 +00:00
mingfeima
b019a41674 fix bug for thnn_conv2d when input's C is 1 and weight is channels last (#82392)
To fix https://github.com/pytorch/pytorch/issues/82060

When `input` is not explicitly converted to channels last while the `conv` weight has been, the output should still be in channels last. The root cause is that when the input has an input-channel count of 1, `compute_columns2d` from `aten/src/ATen/native/ConvolutionMM2d.cpp` treats it as channels first.

We do have logic to make sure both input and weight have the same memory format even if they are given differently, like:
```
auto input = self.contiguous(memory_format);
auto weight = weight_.contiguous(memory_format);
```

But for an N1HW input, `.contiguous(MemoryFormat::ChannelsLast)` would not change its strides, and its `suggest_memory_format()` still returns `MemoryFormat::Contiguous`. That's how it went wrong.
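
A short sketch of that ambiguity:

```python
import torch

# For an N1HW tensor, .contiguous(channels_last) leaves the strides
# unchanged, and the tensor still reads as contiguous in the default
# memory format, which is what misled compute_columns2d.
x = torch.randn(2, 1, 4, 4)
y = x.contiguous(memory_format=torch.channels_last)
print(y.stride())         # unchanged: (16, 16, 4, 1)
print(y.is_contiguous())  # True: still suggests MemoryFormat::Contiguous
```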

Also updated the corresponding test cases; without this patch, the new test case would fail in the forward path and hit a runtime error in the backward path.

The old failure log for the forward path:
```
FAIL: test_conv_thnn_nhwc_cpu_float32 (__main__.TestNNDeviceTypeCPU)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/mingfeim/anaconda3/envs/pytorch-test-cpu/lib/python3.7/site-packages/torch/testing/_internal/common_device_type.py", line 377, in instantiated_test
    result = test(self, **param_kwargs)
  File "/home/mingfeim/anaconda3/envs/pytorch-test-cpu/lib/python3.7/site-packages/torch/testing/_internal/common_device_type.py", line 974, in only_fn
    return fn(slf, *args, **kwargs)
  File "test/test_nn.py", line 19487, in test_conv_thnn_nhwc
    input_format=torch.contiguous_format, weight_format=torch.channels_last)
  File "test/test_nn.py", line 19469, in helper
    self.assertEqual(out, ref_out, exact_dtype=False)
  File "/home/mingfeim/anaconda3/envs/pytorch-test-cpu/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 2376, in assertEqual
    msg=(lambda generated_msg: f"{generated_msg} : {msg}") if isinstance(msg, str) and self.longMessage else msg,
  File "/home/mingfeim/anaconda3/envs/pytorch-test-cpu/lib/python3.7/site-packages/torch/testing/_comparison.py", line 1093, in assert_equal
    raise error_metas[0].to_error(msg)
AssertionError: Tensor-likes are not close!

Mismatched elements: 988 / 1024 (96.5%)
Greatest absolute difference: 42.0 at index (1, 2, 6, 6) (up to 1e-05 allowed)
Greatest relative difference: inf at index (0, 0, 2, 1) (up to 1.3e-06 allowed)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82392
Approved by: https://github.com/jbschlosser
2022-07-28 14:20:52 +00:00
Khushi Agrawal
050aec1805 [nn] add pop to sequential and ModuleList (#81601)
Follows #71329

cc @kshitij12345!
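
A minimal sketch of the new method (list-style removal by index):

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2))
head = model.pop(2)  # removes and returns the module at the given index
print(head, len(model))  # the popped Linear, then 2
```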
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81601
Approved by: https://github.com/albanD
2022-07-25 19:32:32 +00:00
Ansh Radhakrishnan
110cd724fc [nn] Add support for +=, * and *= operations for nn.Sequential objects (#81279)
Fixes #71329
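
A minimal sketch of the added operators:

```python
import torch.nn as nn

a = nn.Sequential(nn.Linear(4, 4))
a += nn.Sequential(nn.ReLU())  # in-place concatenation
b = a * 2                      # repetition: Linear, ReLU, Linear, ReLU
a *= 2                         # in-place repetition
print(len(a), len(b))          # 4 4
```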

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81279
Approved by: https://github.com/albanD
2022-07-25 15:48:47 +00:00
soulitzer
f595467e5c Reenable slow gradcheck and make it pass (#80514)
Context: For a while, slow gradcheck CI was skipping nearly all tests, which hid the fact that it should've been failing and timing out (10+h runtime for TestGradients). The CI configuration has since been fixed to correct this, revealing the test failures. This PR re-enables slow gradcheck CI and makes it pass again.

This PR:
- makes slow and failing tests run in fast gradcheck mode only
- reduces the input size for slow gradcheck only for unary/binary ufuncs (alternatively, skips the test entirely)
- skips entire test files on the slow gradcheck runner if they don't use gradcheck (test_ops, test_meta, test_decomp, test_ops_jit)
- reduces the input sizes for some ops

Follow ups:
1. Investigate slow mode failures https://github.com/pytorch/pytorch/issues/80411
2. See if we can re-enable slow gradcheck tests for some of the slow tests by reducing the sizes of their inputs
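
For context, a minimal sketch of the two gradcheck modes referred to throughout, using the `fast_mode` flag of `torch.autograd.gradcheck`:

```python
import torch
from torch.autograd import gradcheck

x = torch.randn(3, 3, dtype=torch.double, requires_grad=True)
# Slow mode checks every entry of the Jacobian; fast mode checks a
# random projection of it, which is what the tests below fall back to.
assert gradcheck(torch.sin, (x,), fast_mode=False)  # slow mode
assert gradcheck(torch.sin, (x,), fast_mode=True)   # fast mode
```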

The following are failing in slow mode, they are now running in fast mode only.
```
test_fn_fwgrad_bwgrad___rmod___cuda_float64
test_fn_fwgrad_bwgrad_linalg_householder_product_cuda_complex128
test_fn_fwgrad_bwgrad__masked_prod_cuda_complex128
test_fn_fwgrad_bwgrad__masked_prod_cuda_float64
test_fn_fwgrad_bwgrad_linalg_matrix_power_cuda_complex128
test_fn_fwgrad_bwgrad_cat_cuda_complex128
test_fn_fwgrad_bwgrad_linalg_lu_factor_ex_cuda_float64
test_fn_fwgrad_bwgrad_copysign_cuda_float64
test_fn_fwgrad_bwgrad_cholesky_inverse_cuda_complex128
test_fn_fwgrad_bwgrad_float_power_cuda_complex128
test_fn_fwgrad_bwgrad_fmod_cuda_float64
test_fn_fwgrad_bwgrad_float_power_cuda_float64
test_fn_fwgrad_bwgrad_linalg_lu_cuda_float64
test_fn_fwgrad_bwgrad_remainder_cuda_float64
test_fn_fwgrad_bwgrad_repeat_cuda_complex128
test_fn_fwgrad_bwgrad_prod_cuda_complex128
test_fn_fwgrad_bwgrad_slice_scatter_cuda_float64
test_fn_fwgrad_bwgrad_tile_cuda_complex128
test_fn_fwgrad_bwgrad_pow_cuda_float64
test_fn_fwgrad_bwgrad_pow_cuda_complex128
test_fn_fwgrad_bwgrad_fft_*
test_fn_fwgrad_bwgrad_zero__cuda_complex128
test_fn_gradgrad_linalg_lu_factor_cuda_float64
test_fn_grad_div_trunc_rounding_cuda_float64
test_fn_grad_div_floor_rounding_cuda_float64
```

Marks the OpInfos for the following ops that run slowly in slow gradcheck as `fast_gradcheck` only (the left column represents runtime in seconds):
```
0  918.722  test_fn_fwgrad_bwgrad_nn_functional_conv_transpose3d_cuda_float64
1  795.042  test_fn_fwgrad_bwgrad_nn_functional_unfold_cuda_complex128
2  583.63  test_fn_fwgrad_bwgrad_nn_functional_max_pool3d_cuda_float64
3  516.946  test_fn_fwgrad_bwgrad_svd_cuda_complex128
4  503.179  test_fn_fwgrad_bwgrad_linalg_svd_cuda_complex128
5  460.985  test_fn_fwgrad_bwgrad_linalg_lu_cuda_complex128
6  401.04  test_fn_fwgrad_bwgrad_linalg_lstsq_grad_oriented_cuda_complex128
7  353.671  test_fn_fwgrad_bwgrad_nn_functional_max_pool2d_cuda_float64
8  321.903  test_fn_fwgrad_bwgrad_nn_functional_gaussian_nll_loss_cuda_float64
9  307.951  test_fn_fwgrad_bwgrad_stft_cuda_complex128
10  266.104  test_fn_fwgrad_bwgrad_svd_lowrank_cuda_float64
11  221.032  test_fn_fwgrad_bwgrad_istft_cuda_complex128
12  183.741  test_fn_fwgrad_bwgrad_lu_unpack_cuda_complex128
13  132.019  test_fn_fwgrad_bwgrad_nn_functional_unfold_cuda_float64
14  125.343  test_fn_fwgrad_bwgrad_nn_functional_pad_constant_cuda_complex128
15  124.2  test_fn_fwgrad_bwgrad_kron_cuda_complex128
16  123.721  test_fn_fwgrad_bwgrad_pca_lowrank_cuda_float64
17  121.074  test_fn_fwgrad_bwgrad_nn_functional_max_unpool3d_cuda_float64
18  119.387  test_fn_fwgrad_bwgrad_rot90_cuda_complex128
19  112.889  test_fn_fwgrad_bwgrad__masked_normalize_cuda_complex128
20  107.541  test_fn_fwgrad_bwgrad_dist_cuda_complex128
21  106.727  test_fn_fwgrad_bwgrad_diff_cuda_complex128
22  104.588  test_fn_fwgrad_bwgrad__masked_cumprod_cuda_complex128
23  100.135  test_fn_fwgrad_bwgrad_nn_functional_feature_alpha_dropout_with_train_cuda_float64
24  88.359  test_fn_fwgrad_bwgrad_mH_cuda_complex128
25  86.214  test_fn_fwgrad_bwgrad_nn_functional_max_unpool2d_cuda_float64
26  83.037  test_fn_fwgrad_bwgrad_nn_functional_bilinear_cuda_float64
27  79.987  test_fn_fwgrad_bwgrad__masked_cumsum_cuda_complex128
28  77.822  test_fn_fwgrad_bwgrad_diag_embed_cuda_complex128
29  76.256  test_fn_fwgrad_bwgrad_mT_cuda_complex128
30  74.039  test_fn_fwgrad_bwgrad_linalg_lu_solve_cuda_complex128
```
```
0  334.142  test_fn_fwgrad_bwgrad_unfold_cuda_complex128
1  312.791  test_fn_fwgrad_bwgrad_linalg_lu_factor_cuda_complex128
2  121.963  test_fn_fwgrad_bwgrad_nn_functional_max_unpool3d_cuda_float64
3  108.085  test_fn_fwgrad_bwgrad_diff_cuda_complex128
4  89.418  test_fn_fwgrad_bwgrad_nn_functional_max_unpool2d_cuda_float64
5  72.231  test_fn_fwgrad_bwgrad___rdiv___cuda_complex128
6  69.433  test_fn_fwgrad_bwgrad___getitem___cuda_complex128
7  68.582  test_fn_fwgrad_bwgrad_ldexp_cuda_complex128
8  68.572  test_fn_fwgrad_bwgrad_linalg_pinv_cuda_complex128
9  67.585  test_fn_fwgrad_bwgrad_nn_functional_glu_cuda_float64
10  66.567  test_fn_fwgrad_bwgrad_lu_cuda_float64
```
```
0  630.13  test_fn_gradgrad_nn_functional_conv2d_cuda_complex128
1  81.086  test_fn_gradgrad_linalg_solve_triangular_cuda_complex128
2  71.332  test_fn_gradgrad_norm_cuda_complex128
3  64.308  test_fn_gradgrad__masked_std_cuda_complex128
4  59.519  test_fn_gradgrad_div_no_rounding_mode_cuda_complex128
5  58.836  test_fn_gradgrad_nn_functional_adaptive_avg_pool3
```

Reduces the sizes of the inputs for:
- diff
- diag_embed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80514
Approved by: https://github.com/albanD
2022-07-22 02:05:37 +00:00
Saketh Are
445ee5620e Simplify torch.nn.grad by calling into aten::convolution_backward (#81839)
`torch.nn.grad` has its own implementations of gradients for conv1d, conv2d, and conv3d. This PR simplifies them by calling into the unified `aten::convolution_backward` backend instead.

The existing implementation of conv2d_weight is incorrect for some inputs (see issue #51430). This PR fixes the issue.

This PR expands coverage in test_nn to include conv1d_weight, conv2d_weight, and conv3d_weight, which were previously untested. It also expands the cases for conv2d to cover issue #51430.

Fixes #51430
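
A small sketch of the API in question (shapes chosen for illustration):

```python
import torch
from torch.nn import grad

x = torch.randn(1, 3, 8, 8)
grad_out = torch.randn(1, 4, 6, 6)  # grad of a 3x3 conv output on an 8x8 input
# Now computed via the unified aten::convolution_backward:
gw = grad.conv2d_weight(x, (4, 3, 3, 3), grad_out)
print(gw.shape)  # torch.Size([4, 3, 3, 3])
```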

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81839
Approved by: https://github.com/albanD
2022-07-21 19:34:27 +00:00
Khushi Agrawal
dced803339 [nn] add insert method to sequential class (#81402)
Follows #71329

cc @kshitij12345
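
A minimal sketch of the new method:

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))
model.insert(1, nn.ReLU())  # list-style insertion at the given index
print(model)  # Linear -> ReLU -> Linear
```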
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81402
Approved by: https://github.com/albanD
2022-07-20 14:45:52 +00:00
Khushi Agrawal
2c0b11b43b [nn] implement extend method to sequential class (#81179)
Follows #71329

cc @kshitij12345 :)
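
A minimal sketch of the new method (assuming it accepts any iterable of modules, as `list.extend` does):

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 4))
model.extend(nn.Sequential(nn.ReLU(), nn.Linear(4, 2)))  # appends each module
print(len(model))  # 3
```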
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81179
Approved by: https://github.com/albanD
2022-07-20 05:33:41 +00:00
PyTorch MergeBot
f82b19f15b Revert "Disable use_mkldnn when input is not contiguous for oneDNN (#80864)"
This reverts commit 4655c3bace.

Reverted https://github.com/pytorch/pytorch/pull/80864 on behalf of https://github.com/janeyx99 due to a perf regression https://github.com/pytorch/benchmark/issues/1040
2022-07-19 18:58:52 +00:00
yanbing-j
4655c3bace Disable use_mkldnn when input is not contiguous for oneDNN (#80864)
Fixes [#80837](https://github.com/pytorch/pytorch/issues/80837).
This PR disables use_mkldnn when the input is not contiguous, per the oneDNN requirement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80864
Approved by: https://github.com/malfet
2022-07-17 14:58:26 +00:00
Rui Zhu
b22166fd62 Add a small fastpath test for native mha (#81432)
Summary: We didn't have a small fast-path passing test for MHA before; this diff adds one for better testing

Test Plan: buck build mode/dev-nosan -c fbcode.platform=platform009 -c fbcode.enable_gpu_sections=true caffe2/test:nn && buck-out/dev/gen/caffe2/test/nn\#binary.par -r test_multihead_attn_fast_path_small_test

Differential Revision: D37834319

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81432
Approved by: https://github.com/erichan1
2022-07-15 23:54:40 +00:00
Eric Han
23088fcfdf disable src mask for transformer and multiheadattention fastpath (#81277)
Disable the fastpath if src_mask is passed to TransformerEncoderLayer or MultiheadAttention.
- Refactored test_transformerencoder from test_nn.py to test_transformers.py. Added a src_mask test there.
- Added a specific src_mask test in test_transformers.py

Fixes https://github.com/pytorch/pytorch/issues/81129

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81277
Approved by: https://github.com/zrphercule
2022-07-15 20:55:17 +00:00
n.zhuravlev
7af0200a46 Add deepcopy functionality to parametrized modules (#80811)
Fixes #69413

After applying parametrization to any `nn.Module`, we lose the ability to create a deepcopy of it, which e.g. makes it impossible to wrap a module in an `AveragedModel`.
Specifically, the problem is that `deepcopy` tries to invoke `__getstate__` if the object hasn't implemented its own `__deepcopy__` magic method. But we don't allow serialization of parametrized modules: `__getstate__` raises an error.
My solution is just to create a default `__deepcopy__` method when it doesn't exist yet.
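
A minimal repro sketch of what this enables:

```python
import copy
import torch.nn as nn
import torch.nn.utils.parametrize as parametrize

lin = nn.Linear(4, 4)
# Any parametrization will do; Identity leaves the weight unchanged.
parametrize.register_parametrization(lin, "weight", nn.Identity())
# Previously this raised, because __getstate__ errors for parametrized modules:
lin_copy = copy.deepcopy(lin)
```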

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80811
Approved by: https://github.com/pearu, https://github.com/albanD
2022-07-15 09:06:45 +00:00
Khushi Agrawal
3da8c909da [nn] add + operator for torch.nn.Sequential to concatenate (#81170)
Fixes #78512

#### TODO
- [x] add tests

cc @kshitij12345!
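
A minimal sketch of the new operator:

```python
import torch.nn as nn

combined = nn.Sequential(nn.Linear(4, 4)) + nn.Sequential(nn.ReLU())
print(len(combined))  # 2: Linear followed by ReLU
```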
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81170
Approved by: https://github.com/albanD
2022-07-11 17:49:58 +00:00
eqy
3b78c5682b Don't implicitly convert to channels-first in MaxPool3D on CUDA (#80748)
MaxPool3D currently converts inputs implicitly to channels-first (via `.contiguous()`), which may yield unexpected regressions in workloads that expect a full channels-last path. This PR preserves the channels-last format in MaxPool3D while attempting to avoid seriously regressing performance.

Currently, the typical case (kernel size == 2 == stride) looks good, but larger kernel sizes (> 4) or the unusual case of stride 1 can sometimes be slower than converting to channels-first before doing MaxPool3D.

Additionally, this PR adds a test for 64-bit-indexing backwards, as testing of these changes uncovered an IMA (illegal memory access) for large tensors when doing the backward pass with MaxPool3D.
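
A short sketch of the preserved layout (assuming a CUDA device is available):

```python
import torch
import torch.nn.functional as F

x = torch.randn(16, 64, 16, 16, 16, device="cuda")
x = x.to(memory_format=torch.channels_last_3d)
out = F.max_pool3d(x, kernel_size=2, stride=2)
# The output now stays channels-last instead of being converted implicitly:
print(out.is_contiguous(memory_format=torch.channels_last_3d))  # True
```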

Performance comparison on A6000:

```
[------------------------------------- max_pool3d ---------------------------------------------------------]
                                          |  channels_last=False  |   curr ch_last=True |  new ch_last=True
1 threads: ---------------------------------------------------------------------------- ---------------------
      [64, 256, 32, 32, 32] 4x4 stride 4  |        20093.5        |       34823.4       |       20640.0
      [64, 256, 32, 32, 32] 4x4 stride 2  |        28623.7        |       42625.6       |       27935.5
      [64, 256, 32, 32, 32] 4x4 stride 1  |        68177.5        |       79147.2       |       85604.8
      [64, 256, 32, 32, 32] 2x2 stride 4  |        17237.7        |       32071.3       |       16641.6
      [64, 256, 32, 32, 32] 2x2 stride 2  |        25252.5        |       39993.2       |       25054.8
      [64, 256, 32, 32, 32] 2x2 stride 1  |        43185.2        |       58164.6       |       48416.9
      [64, 256, 16, 16, 16] 4x4 stride 4  |         3017.7        |        3952.4       |        2593.8
      [64, 256, 16, 16, 16] 4x4 stride 2  |         4581.5        |        5384.3       |        3294.3
      [64, 256, 16, 16, 16] 4x4 stride 1  |        11334.1        |       11534.7       |        8651.1
      [64, 256, 16, 16, 16] 2x2 stride 4  |         2346.9        |        3304.6       |        2098.8
      [64, 256, 16, 16, 16] 2x2 stride 2  |         3550.8        |        4526.5       |        3143.6
      [64, 256, 16, 16, 16] 2x2 stride 1  |         6898.1        |        7816.0       |        5820.8
      [64, 256, 4, 4, 4] 4x4 stride 4     |          191.5        |         176.3       |          77.5
      [64, 256, 4, 4, 4] 4x4 stride 2     |          191.8        |         176.8       |          94.1
      [64, 256, 4, 4, 4] 4x4 stride 1     |          191.3        |         176.4       |          97.3
      [64, 256, 4, 4, 4] 2x2 stride 4     |           96.4        |         114.4       |          93.6
      [64, 256, 4, 4, 4] 2x2 stride 2     |          172.1        |         178.6       |          93.7
      [64, 256, 4, 4, 4] 2x2 stride 1     |          263.0        |         279.4       |          92.4
      [64, 64, 32, 32, 32] 4x4 stride 4   |         5033.2        |        7208.3       |        5167.5
      [64, 64, 32, 32, 32] 4x4 stride 2   |         7216.1        |        9218.7       |        6637.1
      [64, 64, 32, 32, 32] 4x4 stride 1   |        17192.1        |       18392.9       |       20489.0
      [64, 64, 32, 32, 32] 2x2 stride 4   |         4318.0        |        6511.2       |        4193.1
      [64, 64, 32, 32, 32] 2x2 stride 2   |         6324.4        |        8657.7       |        6263.6
      [64, 64, 32, 32, 32] 2x2 stride 1   |        10855.0        |       13040.2       |       12055.9
      [64, 64, 16, 16, 16] 4x4 stride 4   |          764.1        |         975.6       |         671.3
      [64, 64, 16, 16, 16] 4x4 stride 2   |         1163.1        |        1333.4       |         833.6
      [64, 64, 16, 16, 16] 4x4 stride 1   |         2890.0        |        2898.5       |        2209.8
      [64, 64, 16, 16, 16] 2x2 stride 4   |          593.5        |         811.2       |         536.3
      [64, 64, 16, 16, 16] 2x2 stride 2   |          895.9        |        1112.3       |         794.5
      [64, 64, 16, 16, 16] 2x2 stride 1   |         1742.5        |        1968.0       |        1475.2
      [64, 64, 4, 4, 4] 4x4 stride 4      |          101.1        |         112.2       |          93.4
      [64, 64, 4, 4, 4] 4x4 stride 2      |           96.7        |         114.6       |          92.5
      [64, 64, 4, 4, 4] 4x4 stride 1      |           98.9        |         111.9       |          96.5
      [64, 64, 4, 4, 4] 2x2 stride 4      |          100.1        |         107.1       |          94.2
      [64, 64, 4, 4, 4] 2x2 stride 2      |           96.6        |         108.0       |          94.5
      [64, 64, 4, 4, 4] 2x2 stride 1      |           96.7        |         107.9       |          95.2
      [64, 3, 32, 32, 32] 4x4 stride 4    |          250.1        |         326.6       |         278.0
      [64, 3, 32, 32, 32] 4x4 stride 2    |          350.4        |         414.0       |         323.2
      [64, 3, 32, 32, 32] 4x4 stride 1    |          825.6        |         846.9       |         982.5
      [64, 3, 32, 32, 32] 2x2 stride 4    |          213.3        |         289.8       |         219.9
      [64, 3, 32, 32, 32] 2x2 stride 2    |          308.2        |         384.9       |         305.9
      [64, 3, 32, 32, 32] 2x2 stride 1    |          523.5        |         594.7       |         589.9
      [64, 3, 16, 16, 16] 4x4 stride 4    |          103.8        |         116.7       |          93.0
      [64, 3, 16, 16, 16] 4x4 stride 2    |          100.9        |         108.3       |          93.3
      [64, 3, 16, 16, 16] 4x4 stride 1    |          139.4        |         140.7       |         104.8
      [64, 3, 16, 16, 16] 2x2 stride 4    |           97.5        |         114.7       |          92.7
      [64, 3, 16, 16, 16] 2x2 stride 2    |           97.4        |         108.8       |          91.7
      [64, 3, 16, 16, 16] 2x2 stride 1    |           99.9        |         108.0       |          94.1
      [64, 3, 4, 4, 4] 4x4 stride 4       |           97.2        |         110.2       |          94.7
      [64, 3, 4, 4, 4] 4x4 stride 2       |          105.7        |         107.4       |          92.8
      [64, 3, 4, 4, 4] 4x4 stride 1       |           98.0        |         110.0       |          93.7
      [64, 3, 4, 4, 4] 2x2 stride 4       |           98.3        |         116.7       |          93.0
      [64, 3, 4, 4, 4] 2x2 stride 2       |           98.6        |         107.5       |          92.8
      [64, 3, 4, 4, 4] 2x2 stride 1       |          100.6        |         110.3       |          94.0
      [16, 256, 32, 32, 32] 4x4 stride 4  |         5034.2        |        8838.0       |        5165.9
      [16, 256, 32, 32, 32] 4x4 stride 2  |         7236.3        |       10869.9       |        7038.2
      [16, 256, 32, 32, 32] 4x4 stride 1  |        17385.4        |       21401.6       |       21900.7
      [16, 256, 32, 32, 32] 2x2 stride 4  |         4318.7        |        8101.2       |        4172.9
      [16, 256, 32, 32, 32] 2x2 stride 2  |         6324.0        |       10147.5       |        6279.7
      [16, 256, 32, 32, 32] 2x2 stride 1  |        10899.7        |       14826.0       |       12256.3
      [16, 256, 16, 16, 16] 4x4 stride 4  |          765.4        |        1012.7       |         675.6
      [16, 256, 16, 16, 16] 4x4 stride 2  |         1162.8        |        1376.9       |         843.4
      [16, 256, 16, 16, 16] 4x4 stride 1  |         2928.9        |        2969.8       |        2222.5
      [16, 256, 16, 16, 16] 2x2 stride 4  |          593.5        |         845.8       |         534.2
      [16, 256, 16, 16, 16] 2x2 stride 2  |          896.9        |        1152.2       |         796.9
      [16, 256, 16, 16, 16] 2x2 stride 1  |         1750.2        |        2009.4       |        1481.8
      [16, 256, 4, 4, 4] 4x4 stride 4     |           96.6        |         107.1       |          92.7
      [16, 256, 4, 4, 4] 4x4 stride 2     |           97.9        |         114.9       |          93.8
      [16, 256, 4, 4, 4] 4x4 stride 1     |           98.2        |         115.6       |          94.0
      [16, 256, 4, 4, 4] 2x2 stride 4     |           97.0        |         106.7       |          93.8
      [16, 256, 4, 4, 4] 2x2 stride 2     |           96.8        |         108.1       |          93.3
      [16, 256, 4, 4, 4] 2x2 stride 1     |           95.8        |         120.9       |          95.7
      [16, 64, 32, 32, 32] 4x4 stride 4   |         1266.4        |        1815.4       |        1312.3
      [16, 64, 32, 32, 32] 4x4 stride 2   |         1818.5        |        2328.0       |        1678.9
      [16, 64, 32, 32, 32] 4x4 stride 1   |         4352.9        |        4649.3       |        5204.6
      [16, 64, 32, 32, 32] 2x2 stride 4   |         1090.0        |        1631.2       |        1060.8
      [16, 64, 32, 32, 32] 2x2 stride 2   |         1589.4        |        2141.1       |        1576.4
      [16, 64, 32, 32, 32] 2x2 stride 1   |         2733.5        |        3286.0       |        3041.6
      [16, 64, 16, 16, 16] 4x4 stride 4   |          201.7        |         259.6       |         175.0
      [16, 64, 16, 16, 16] 4x4 stride 2   |          301.0        |         350.1       |         226.3
      [16, 64, 16, 16, 16] 4x4 stride 1   |          740.1        |         748.7       |         570.6
      [16, 64, 16, 16, 16] 2x2 stride 4   |          156.0        |         214.8       |         140.8
      [16, 64, 16, 16, 16] 2x2 stride 2   |          232.3        |         292.3       |         208.7
      [16, 64, 16, 16, 16] 2x2 stride 1   |          449.1        |         504.0       |         382.1
      [16, 64, 4, 4, 4] 4x4 stride 4      |           97.5        |         111.4       |          94.5
      [16, 64, 4, 4, 4] 4x4 stride 2      |           98.8        |         111.9       |          94.4
      [16, 64, 4, 4, 4] 4x4 stride 1      |           98.2        |         112.0       |          95.2
      [16, 64, 4, 4, 4] 2x2 stride 4      |           99.7        |         111.0       |          94.0
      [16, 64, 4, 4, 4] 2x2 stride 2      |          100.3        |         110.0       |          93.2
      [16, 64, 4, 4, 4] 2x2 stride 1      |           97.5        |         107.6       |          93.5
      [16, 3, 32, 32, 32] 4x4 stride 4    |          100.5        |         117.1       |          95.7
      [16, 3, 32, 32, 32] 4x4 stride 2    |           97.5        |         121.3       |          92.5
      [16, 3, 32, 32, 32] 4x4 stride 1    |          216.0        |         227.4       |         258.4
      [16, 3, 32, 32, 32] 2x2 stride 4    |           97.1        |         109.0       |          91.9
      [16, 3, 32, 32, 32] 2x2 stride 2    |           95.8        |         108.5       |          92.9
      [16, 3, 32, 32, 32] 2x2 stride 1    |          139.4        |         161.2       |         157.8
      [16, 3, 16, 16, 16] 4x4 stride 4    |           96.4        |         113.6       |          91.9
      [16, 3, 16, 16, 16] 4x4 stride 2    |           97.4        |         108.1       |          93.5
      [16, 3, 16, 16, 16] 4x4 stride 1    |           99.0        |         107.5       |          92.1
      [16, 3, 16, 16, 16] 2x2 stride 4    |           96.9        |         118.1       |          93.4
      [16, 3, 16, 16, 16] 2x2 stride 2    |           97.3        |         106.7       |          95.8
      [16, 3, 16, 16, 16] 2x2 stride 1    |           98.8        |         109.2       |          93.8
      [16, 3, 4, 4, 4] 4x4 stride 4       |           97.8        |         108.0       |          94.2
      [16, 3, 4, 4, 4] 4x4 stride 2       |           92.7        |         108.0       |          93.9
      [16, 3, 4, 4, 4] 4x4 stride 1       |           97.8        |         107.6       |          93.5
      [16, 3, 4, 4, 4] 2x2 stride 4       |          100.3        |         107.7       |          94.3
      [16, 3, 4, 4, 4] 2x2 stride 2       |           97.2        |         107.5       |          96.1
      [16, 3, 4, 4, 4] 2x2 stride 1       |           98.1        |         111.1       |          93.8

Times are in microseconds (us).
```

Performance comparison on V100:
(these times have been updated after working around some noisy measurements in my setup)
```
[------------------------------------- max_pool3d ---------------------------------------------------------]
                                          |  channels_last=False  |  curr ch_last=True |  new ch_last=True
1 threads: -------------------------------------------------------------------------------------------------
      [64, 256, 32, 32, 32] 4x4 stride 4  |        15810.7        |       33807.7      |        16452.9
      [64, 256, 32, 32, 32] 4x4 stride 2  |        24422.7        |       42515.3      |        27700.3
      [64, 256, 32, 32, 32] 4x4 stride 1  |        71756.0        |       89916.5      |       106464.0
      [64, 256, 32, 32, 32] 2x2 stride 4  |        12102.9        |       30210.4      |        11319.8
      [64, 256, 32, 32, 32] 2x2 stride 2  |        19101.7        |       37210.8      |        20373.3
      [64, 256, 32, 32, 32] 2x2 stride 1  |        41418.0        |       59650.5      |        53009.2
      [64, 256, 16, 16, 16] 4x4 stride 4  |         2362.0        |        4210.3      |         2114.0
      [64, 256, 16, 16, 16] 4x4 stride 2  |         4102.4        |        5897.4      |         3179.7
      [64, 256, 16, 16, 16] 4x4 stride 1  |        11339.3        |       13116.6      |        10032.6
      [64, 256, 16, 16, 16] 2x2 stride 4  |         1709.7        |        3506.7      |         1423.6
      [64, 256, 16, 16, 16] 2x2 stride 2  |         2966.6        |        4760.8      |         2499.3
      [64, 256, 16, 16, 16] 2x2 stride 1  |         6998.4        |        8797.3      |         6152.0
      [64, 256, 4, 4, 4] 4x4 stride 4     |          173.0        |         176.3      |          127.9
      [64, 256, 4, 4, 4] 4x4 stride 2     |          149.1        |         176.3      |          125.5
      [64, 256, 4, 4, 4] 4x4 stride 1     |          150.0        |         177.2      |          125.6
      [64, 256, 4, 4, 4] 2x2 stride 4     |          158.0        |         192.7      |          127.9
      [64, 256, 4, 4, 4] 2x2 stride 2     |          169.7        |         199.2      |          125.3
      [64, 256, 4, 4, 4] 2x2 stride 1     |          289.6        |         318.2      |          116.5
      [64, 64, 32, 32, 32] 4x4 stride 4   |         3914.4        |        6993.3      |         4141.4
      [64, 64, 32, 32, 32] 4x4 stride 2   |         6107.4        |        9186.4      |         6378.5
      [64, 64, 32, 32, 32] 4x4 stride 1   |        17920.0        |       20993.5      |        23891.1
      [64, 64, 32, 32, 32] 2x2 stride 4   |         3029.7        |        6112.6      |         2895.6
      [64, 64, 32, 32, 32] 2x2 stride 2   |         4787.8        |        7870.6      |         4724.8
      [64, 64, 32, 32, 32] 2x2 stride 1   |        10366.4        |       13446.4      |        12603.8
      [64, 64, 16, 16, 16] 4x4 stride 4   |          605.8        |         962.9      |          499.7
      [64, 64, 16, 16, 16] 4x4 stride 2   |         1037.0        |        1394.8      |          791.6
      [64, 64, 16, 16, 16] 4x4 stride 1   |         2835.4        |        3191.8      |         2484.3
      [64, 64, 16, 16, 16] 2x2 stride 4   |          438.6        |         795.7      |          368.6
      [64, 64, 16, 16, 16] 2x2 stride 2   |          749.1        |        1108.0      |          612.0
      [64, 64, 16, 16, 16] 2x2 stride 1   |         1756.4        |        2112.2      |         1538.5
      [64, 64, 4, 4, 4] 4x4 stride 4      |          132.6        |         163.9      |          115.4
      [64, 64, 4, 4, 4] 4x4 stride 2      |          129.3        |         153.7      |          117.8
      [64, 64, 4, 4, 4] 4x4 stride 1      |          128.0        |         153.8      |          117.6
      [64, 64, 4, 4, 4] 2x2 stride 4      |          128.2        |         154.1      |          117.5
      [64, 64, 4, 4, 4] 2x2 stride 2      |          130.5        |         157.3      |          117.6
      [64, 64, 4, 4, 4] 2x2 stride 1      |          128.8        |         156.4      |          120.6
      [64, 3, 32, 32, 32] 4x4 stride 4    |          200.4        |         261.0      |          228.8
      [64, 3, 32, 32, 32] 4x4 stride 2    |          305.3        |         366.5      |          344.4
      [64, 3, 32, 32, 32] 4x4 stride 1    |          860.9        |         922.1      |         1136.0
      [64, 3, 32, 32, 32] 2x2 stride 4    |          157.0        |         216.9      |          158.1
      [64, 3, 32, 32, 32] 2x2 stride 2    |          240.5        |         300.9      |          247.7
      [64, 3, 32, 32, 32] 2x2 stride 1    |          503.5        |         565.1      |          609.8
      [64, 3, 16, 16, 16] 4x4 stride 4    |          136.0        |         159.0      |          120.3
      [64, 3, 16, 16, 16] 4x4 stride 2    |          131.2        |         156.9      |          120.0
      [64, 3, 16, 16, 16] 4x4 stride 1    |          146.6        |         158.5      |          123.8
      [64, 3, 16, 16, 16] 2x2 stride 4    |          133.8        |         158.4      |          117.1
      [64, 3, 16, 16, 16] 2x2 stride 2    |          132.1        |         160.8      |          117.9
      [64, 3, 16, 16, 16] 2x2 stride 1    |          133.7        |         174.4      |          118.0
      [64, 3, 4, 4, 4] 4x4 stride 4       |          156.8        |         166.2      |          119.4
      [64, 3, 4, 4, 4] 4x4 stride 2       |          126.8        |         150.4      |          118.2
      [64, 3, 4, 4, 4] 4x4 stride 1       |          125.2        |         151.7      |          117.8
      [64, 3, 4, 4, 4] 2x2 stride 4       |          127.3        |         152.7      |          116.2
      [64, 3, 4, 4, 4] 2x2 stride 2       |          128.6        |         153.3      |          114.6
      [64, 3, 4, 4, 4] 2x2 stride 1       |          128.6        |         153.5      |          114.7
      [16, 256, 32, 32, 32] 4x4 stride 4  |         3921.7        |        8445.7      |         4064.7
      [16, 256, 32, 32, 32] 4x4 stride 2  |         6111.7        |       10630.0      |         6944.4
      [16, 256, 32, 32, 32] 4x4 stride 1  |        17938.9        |       22896.8      |        26648.7
      [16, 256, 32, 32, 32] 2x2 stride 4  |         3029.6        |        7552.7      |         2840.9
      [16, 256, 32, 32, 32] 2x2 stride 2  |         4788.0        |        9322.1      |         5110.5
      [16, 256, 32, 32, 32] 2x2 stride 1  |        10363.7        |       14885.9      |        13213.6
      [16, 256, 16, 16, 16] 4x4 stride 4  |          606.0        |        1059.1      |          535.9
      [16, 256, 16, 16, 16] 4x4 stride 2  |         1037.5        |        1491.5      |          822.3
      [16, 256, 16, 16, 16] 4x4 stride 1  |         2835.4        |        3306.8      |         2522.8
      [16, 256, 16, 16, 16] 2x2 stride 4  |          438.6        |         892.3      |          369.0
      [16, 256, 16, 16, 16] 2x2 stride 2  |          749.2        |        1203.7      |          638.7
      [16, 256, 16, 16, 16] 2x2 stride 1  |         1756.1        |        2212.5      |         1547.0
      [16, 256, 4, 4, 4] 4x4 stride 4     |          159.6        |         187.6      |          117.6
      [16, 256, 4, 4, 4] 4x4 stride 2     |          161.1        |         185.5      |          117.3
      [16, 256, 4, 4, 4] 4x4 stride 1     |          160.0        |         148.1      |          117.8
      [16, 256, 4, 4, 4] 2x2 stride 4     |          123.9        |         148.3      |          117.6
      [16, 256, 4, 4, 4] 2x2 stride 2     |          126.0        |         151.7      |          117.4
      [16, 256, 4, 4, 4] 2x2 stride 1     |          127.1        |         152.3      |          117.9
      [16, 64, 32, 32, 32] 4x4 stride 4   |          983.5        |        1756.7      |         1067.8
      [16, 64, 32, 32, 32] 4x4 stride 2   |         1542.4        |        2315.2      |         1621.5
      [16, 64, 32, 32, 32] 4x4 stride 1   |         4498.7        |        5273.4      |         6006.7
      [16, 64, 32, 32, 32] 2x2 stride 4   |          767.2        |        1543.4      |          736.7
      [16, 64, 32, 32, 32] 2x2 stride 2   |         1207.8        |        1981.5      |         1197.0
      [16, 64, 32, 32, 32] 2x2 stride 1   |         2603.3        |        3367.5      |         3161.9
      [16, 64, 16, 16, 16] 4x4 stride 4   |          169.5        |         264.6      |          142.8
      [16, 64, 16, 16, 16] 4x4 stride 2   |          274.6        |         368.9      |          216.8
      [16, 64, 16, 16, 16] 4x4 stride 1   |          723.3        |         820.4      |          643.2
      [16, 64, 16, 16, 16] 2x2 stride 4   |          131.4        |         216.0      |          116.1
      [16, 64, 16, 16, 16] 2x2 stride 2   |          199.9        |         295.0      |          166.8
```
CC @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80748
Approved by: https://github.com/ngimel
2022-07-08 04:26:01 +00:00
Michael Gschwind
25449292a0 Run mask test with and without nested tensor (#81008)
Summary: Run mask test with and without nested tensor

Test Plan: sandcastle

Differential Revision: D37665532

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81008
Approved by: https://github.com/malfet
2022-07-07 23:54:37 +00:00
Animesh Jain
1d90d6ee60 Setup for running PyTorch tests with TorchDynamo and skips for known failing tests (#80106)
@ezyang I am going to keep adding more skips in this PR for now. And once we have the CI running, I will replace with the appropriate decorators.

cc @mlazos , we should add those tests in test_ops.py in this PR as well

cc @jansel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80106
Approved by: https://github.com/ezyang, https://github.com/jansel
2022-07-07 18:57:33 +00:00
albanD
c8d64ba5ec Allow register float16 weight_norm on cpu and speed up test (#80600)
Fixes https://github.com/pytorch/pytorch/issues/80599

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80600
Approved by: https://github.com/malfet
2022-06-30 13:50:39 +00:00
otaj
db52e4b7d9 Bugfix/weakref (#80139)
Fixes #78580

I'm back! :)

cc @albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80139
Approved by: https://github.com/albanD
2022-06-28 14:51:42 +00:00
Rohit Goswami
72e40d2bc7 BUG: Evade segfault by throwing a RuntimeError for nn.ChannelShuffle and empty input tensors (#77029)
Fixes #76616.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77029
Approved by: https://github.com/kshitij12345, https://github.com/jbschlosser
2022-06-23 21:14:02 +00:00
Michael Gschwind
bcc02769be Check for contiguous well-formed mask (#79927)
Summary: Check for contiguous well-formed mask

Test Plan: sandcastle, github CI

Reviewed By: frank-wei

Differential Revision: D37301243

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79927
Approved by: https://github.com/jbschlosser
2022-06-23 15:41:04 +00:00
Alex Hedges
cb2b7b1e57 Fix code that triggers BytesWarning (#79868)
Fixes #74812.

I have fixed the multiple instances in the repository that trigger
`BytesWarning`, and I have enabled the `-bb` option when tests are run
to prevent regressions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79868
Approved by: https://github.com/janeyx99
2022-06-21 01:12:21 +00:00
Joel Benjamin Schlosser
5953fd9133 Revert behavior of Dropout2d on 3D inputs to 1D channel-wise dropout behavior & warn
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79549

Approved by: https://github.com/ngimel, https://github.com/albanD
2022-06-15 14:56:43 +00:00
Joel Benjamin Schlosser
2d73c8e6e0 Add Dropout1d module
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79545

Approved by: https://github.com/ngimel, https://github.com/albanD
2022-06-15 14:39:07 +00:00
Kurt Mohler
4cfd09d7bc Reland: Add index value checking to MaxUnpool2d and MaxUnpool3d (#78280)
Relanding #70545
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78280
Approved by: https://github.com/jbschlosser
2022-06-03 20:09:07 +00:00
samdow
b7cb4eae6b Fix embedding jvp support by making embedding_renorm ignore forward mode AD (#78560)
On functorch, we started seeing [embedding forward mode fail](https://github.com/pytorch/functorch/pull/816). Looking into it, we found that [forward-mode support was recently enabled for embedding](369d9f4137), and that forward mode combined with embedding's [max_norm doesn't work with gradcheck](https://github.com/pytorch/pytorch/blob/master/torch/testing/_internal/common_methods_invocations.py#L8877-L8881), so the combination was never checked.

What was happening is that `embedding_renorm` was setting `torch.no_grad()`, which only turns off backward-mode AD, so functorch's jvp tests were still running forward-mode AD during the `embedding_renorm` call. This change makes sure forward mode is also disabled during the `embedding_renorm` call.
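
Below is a minimal sketch (my own illustration, not code from this PR; it reflects the behavior described above at the time of this change, which may differ in later releases) of why `torch.no_grad()` alone was insufficient: forward-mode tangents on dual tensors are propagated independently of backward-mode grad recording.

```python
import torch
import torch.autograd.forward_ad as fwAD

x = torch.randn(3)
with fwAD.dual_level():
    dual = fwAD.make_dual(x, torch.ones(3))
    with torch.no_grad():      # suppresses backward-mode graph recording...
        y = dual * 2
    _, tangent = fwAD.unpack_dual(y)
    print(tangent)             # ...but the forward-mode tangent was still propagated
```
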
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78560
Approved by: https://github.com/soulitzer, https://github.com/albanD
2022-06-03 19:14:51 +00:00
Eddie Yan
14b0e9e75f [cuDNN] Don't enforce bitwise exact results in test_conv_transposed_large_cuda (#78147)
`test_conv_transposed_large` expects bitwise perfect results in fp16 on CUDA, but this behavior isn't guaranteed by cuDNN (e.g., in the case of FFT algos).

This PR just changes the tolerance on the test to account for these cases.
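
For illustration, a hypothetical sketch (mine, not the test diff) of the kind of check involved: compare fp16 results with an explicit tolerance instead of requiring bitwise-identical outputs, since cuDNN may pick algorithms (e.g. FFT-based) with different rounding.

```python
import torch

ref = torch.randn(8, 4, 16, 16, dtype=torch.half)
out = ref + 1e-3  # stand-in for a result produced by a different cuDNN algorithm
# tolerance-based comparison instead of exact equality
torch.testing.assert_close(out, ref, atol=5e-3, rtol=1e-3)
```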

CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78147
Approved by: https://github.com/ngimel
2022-06-03 19:03:24 +00:00
Eddie Yan
b740a99b9e [cuDNN][TF32] Threshold adjustments for TF32 on >=sm80 (#78437)
CC @ptrblck @mcarilli

Change to transformer multilayer test can potentially be swapped in favor of an rtol change? (see also: #75612).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78437
Approved by: https://github.com/ngimel
2022-06-03 01:02:56 +00:00
PyTorch MergeBot
d578197747 Revert "Fix embedding jvp support by making embedding_renorm ignore forward mode AD (#78560)"
This reverts commit ce7c7bb2a9.

Reverted https://github.com/pytorch/pytorch/pull/78560 on behalf of https://github.com/malfet due to broke XLA (on CI and trunk), see ce7c7bb2a9
2022-06-02 17:40:34 +00:00
samdow
ce7c7bb2a9 Fix embedding jvp support by making embedding_renorm ignore forward mode AD (#78560)
On functorch, we started seeing [embedding forward mode fail](https://github.com/pytorch/functorch/pull/816). Looking into it, we found that [forward-mode support was recently enabled for embedding](369d9f4137), and that forward mode combined with embedding's [max_norm doesn't work with gradcheck](https://github.com/pytorch/pytorch/blob/master/torch/testing/_internal/common_methods_invocations.py#L8877-L8881), so the combination was never checked.

What was happening is that `embedding_renorm` was setting `torch.no_grad()`, which only turns off backward-mode AD, so functorch's jvp tests were still running forward-mode AD during the `embedding_renorm` call. This change makes sure forward mode is also disabled during the `embedding_renorm` call.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78560
Approved by: https://github.com/soulitzer, https://github.com/albanD
2022-06-02 13:40:21 +00:00
Edward Z. Yang
c20969c40c Fix ParameterList printing meta tensor
Fixes https://github.com/pytorch/pytorch/issues/78250

There are actually two bugs.  First, the crash is caused
by TensorOptions::backend incorrectly being marked noexcept when
it can fail.  Second, ParameterList is using torch.tensortype
for no good reason; we can just print the dtype instead.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/78529

Approved by: https://github.com/albanD
2022-06-01 00:46:52 +00:00
mikeiovine
d6db5ea50d Back out "add mixed data type mode for LayerNorm forward path"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78298

Also back out "improve LayerNorm bfloat16 performance on CPU".

These layer norm changes seem fine, but they cause `LayerNorm` to stop using AVX2 instructions, which degrades performance on internal models. More investigation is needed to find the true root cause, but we should unland to mitigate the issue ASAP.

I left `mixed_data_type.h` around since there are some other files depending on it.

Differential Revision: [D36675352](https://our.internmc.facebook.com/intern/diff/D36675352/)

Approved by: https://github.com/tenpercent
2022-05-26 02:54:13 +00:00
PyTorch MergeBot
c50089712c Revert "Add index value checking to MaxUnpool2d and MaxUnpool3d (#70545)"
This reverts commit 53ef66bb59.

Reverted https://github.com/pytorch/pytorch/pull/70545 on behalf of https://github.com/malfet due to as it broke cuda-10.2 test on trunk, see 53ef66bb59
2022-05-23 23:58:43 +00:00
Kurt Mohler
53ef66bb59 Add index value checking to MaxUnpool2d and MaxUnpool3d (#70545)
Fixes #68727

cc @mruberry @jbschlosser @walterddr @kshitij12345 @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70545
Approved by: https://github.com/ngimel
2022-05-23 21:08:25 +00:00
yuguo68
c186250d95 raise error when groups is not positive in Conv modules
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77919

Approved by: https://github.com/jbschlosser
2022-05-23 20:35:00 +00:00
Jeff Daily
9aed30d3ad [ROCm] support benchmark flag for MIOpen (#77438)
Fixes #68172.  Generally, this corrects multiple instances of flaky convolution unit test behavior seen on ROCm.

The MIOpen integration has been forcing benchmark=True when calling `torch._C._set_cudnn_benchmark(False)`, typically called by `torch.backends.cudnn.set_flags(enabled=True, benchmark=False)`.  We now add support for MIOpen immediate mode to avoid benchmarking during MIOpen solution selection.
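
For reference, a short usage sketch of the flag in question (standard PyTorch API; the ROCm behavior it now controls is as described above):

```python
import torch

# With this PR, MIOpen honors benchmark=False by using immediate mode for
# solution selection instead of always benchmarking.
torch.backends.cudnn.benchmark = False
```
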
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77438
Approved by: https://github.com/ngimel, https://github.com/malfet
2022-05-23 17:10:24 +00:00
zrphercule
734a97a7c8 Revert "Revert "Switch to use nested tensor by-default in Transformer… (#77924)
…Encoder (#77217)""

This reverts commit 0d6fa91d1b.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77924
Approved by: https://github.com/atalman
2022-05-20 11:44:03 +00:00
George Qi
f9db8b72ac MHA forward pass bug fix
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77761

Approved by: https://github.com/jbschlosser
2022-05-19 01:21:24 +00:00
Joel Benjamin Schlosser
8881d7ac6c Support no-batch-dim for CrossEntropyLoss with prob target
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77653

Approved by: https://github.com/albanD
2022-05-18 19:51:09 +00:00
Nikita Vedeneev
a760dc2687 binary_cross_entropy: double backward wrt target (#77416)
As per title. Part of an effort to make `binary_cross_entropy` differentiable in all of its arguments.
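
A small sketch (my example, assuming a build that includes this change) of the capability the title refers to: the first-order gradient with respect to the target is itself differentiable.

```python
import torch
import torch.nn.functional as F

inp = torch.rand(3, requires_grad=True)
tgt = torch.rand(3, requires_grad=True)
loss = F.binary_cross_entropy(inp, tgt)
# first-order gradient w.r.t. the target, kept differentiable
(g_tgt,) = torch.autograd.grad(loss, tgt, create_graph=True)
# second backward through the target-gradient (here w.r.t. the input)
(gg_inp,) = torch.autograd.grad(g_tgt.sum(), inp)
print(gg_inp)
```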

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77416
Approved by: https://github.com/soulitzer
2022-05-18 10:29:27 +00:00
Rui Zhu
4e2f5507d0 Add support for TxT mask layout for masked_softmax in BetterTransformer (#77607)
Summary: Expand mask to BxHxDxD when mask is DxD layout
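
A minimal sketch (assumed shapes and mask polarity, not the kernel code) of the expansion described above: a single TxT mask is broadcast to the per-batch, per-head B x H x T x T layout before masked softmax.

```python
import torch

B, H, T = 2, 4, 5
mask = torch.rand(T, T) > 0.5                        # one T x T mask shared by all
expanded = mask.view(1, 1, T, T).expand(B, H, T, T)  # broadcast to B x H x T x T
scores = torch.randn(B, H, T, T)
# rows that are fully masked would produce NaNs; real kernels special-case that
attn = torch.softmax(scores.masked_fill(expanded, float("-inf")), dim=-1)
```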

Test Plan: buck build mode/opt -c fbcode.platform=platform009 -c fbcode.enable_gpu_sections=true caffe2/test:nn && buck-out/opt/gen/caffe2/test/nn\#binary.par -r masked_softmax_DxD

Differential Revision: D36428170

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77607
Approved by: https://github.com/cpuhrsch
2022-05-18 01:31:05 +00:00
PyTorch MergeBot
d8b80edade Revert "Use weakref.proxy when saving module to internal dictionaries to not increase refcount (#76435)"
This reverts commit 1aa3cbb83b.

Reverted https://github.com/pytorch/pytorch/pull/76435 on behalf of https://github.com/jbschlosser
2022-05-17 17:51:26 +00:00
mingfeima
c003494754 add channels last support for PixelShuffle and PixelUnshuffle
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50573

Approved by: https://github.com/VitalyFedyunin
2022-05-17 17:33:49 +00:00
Edward Z. Yang
b5bc954a71 Fix optional dtype/layout/memory_format pycall; fix memory format
Double-header bug fix:

- As reported by jansel, dtypes are still showing up as integers
  when the schema is an optional dtype.  This is simple enough to
  fix and I added a test for it.  But while I was at it...

- I noticed that the THPMemoryFormat_new idiom with "unused" name
  doesn't actually work, the repr of the returned memory format
  object is wrong and this shows up when we try to log the args/kwargs.
  So I fixed memory format to do it properly along with everything
  else.

Fixes https://github.com/pytorch/pytorch/issues/77135

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77543

Approved by: https://github.com/albanD, https://github.com/jansel
2022-05-16 16:46:08 +00:00
mingfeima
8c50414233 add BFloat16 support for BatchNorm on CPU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77496

Approved by: https://github.com/frank-wei
2022-05-16 16:31:18 +00:00
mingfeima
6fa20bdfe8 add native kernel for weight_norm on CPU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73845

Approved by: https://github.com/frank-wei
2022-05-16 06:36:24 +00:00
PyTorch MergeBot
93a969221d Revert "add BFloat16 support for BatchNorm on CPU"
This reverts commit 7c8911ca7a.

Reverted https://github.com/pytorch/pytorch/pull/74410 on behalf of https://github.com/albanD
2022-05-14 14:28:58 +00:00
mingfeima
7c8911ca7a add BFloat16 support for BatchNorm on CPU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74410

Approved by: https://github.com/frank-wei
2022-05-14 07:49:00 +00:00
Rohan Varma
a275491c6f [Reland] load_state_dict post hook (#77392)
Reland of https://github.com/pytorch/pytorch/pull/76823 with fixes to call `__setstate__` for softmax/softmin/logsoftmax as per discussion with @albanD and @jbschlosser. Original description:

Implements `register_load_state_dict_post_hook` API as discussed in https://github.com/pytorch/pytorch/issues/75287.

Unittests cover:
- Ensuring hooks are called with the correct module
- Hook is called with `IncompatibleKeys` field
- If hook modifies this, load_state_dict returns the modified result
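
A small usage sketch (my own example, not from the PR) of the API described above; the hook receives the module and the `IncompatibleKeys` result and may modify it in place:

```python
import torch
import torch.nn as nn

model = nn.Linear(2, 2)

def post_hook(module, incompatible_keys):
    # Treat the missing bias as expected; clearing the list suppresses the
    # strict-mode error for it.
    incompatible_keys.missing_keys.clear()

model.register_load_state_dict_post_hook(post_hook)
model.load_state_dict({"weight": torch.zeros(2, 2)}, strict=True)  # no error
```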

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77392
Approved by: https://github.com/jbschlosser
2022-05-14 06:06:23 +00:00
mingfeima
59b56ba785 improve group_norm channels last performance on CPU
add channels_last_3d memory format support

add BFloat16 support on CPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69067

Approved by: https://github.com/VitalyFedyunin
2022-05-14 03:13:02 +00:00
Kulin Seth
e011a8e18b Enable PyTorch operations on MPS Backend. (#77343)
Add PyTorch operations to MPS backend.

- https://github.com/pytorch/pytorch/issues/77394
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77343
Approved by: https://github.com/albanD
2022-05-13 18:28:53 +00:00
mingfeima
2b7943c47c fix torchvision failing test case test_classification_model on slow_conv2d
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77347

Approved by: https://github.com/datumbox, https://github.com/frank-wei
2022-05-13 08:04:08 +00:00
PyTorch MergeBot
d92b0a51aa Revert "Load state dict post hook"
This reverts commit 56bed0dcfe.

Reverted https://github.com/pytorch/pytorch/pull/76823 on behalf of https://github.com/rohan-varma
2022-05-12 21:00:49 +00:00
ecao
37c6017831 Add BFloat16 support for GLU, and randperm operators on CPU (#61944)
add BFloat16 support for GLU and randperm operators on CPU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61944
Approved by: https://github.com/frank-wei
2022-05-12 17:41:57 +00:00
yanbing-j
4f82f439d1 Enable BFloat16 ELU, SELU and CELU in CPU path (#62546)
Enable BFloat16 ELU, SELU and CELU in the CPU path. SELU and CELU will call the ELU implementation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62546
Approved by: https://github.com/frank-wei
2022-05-12 16:56:57 +00:00
mingfeima
3b56efd4e1 add mixed data type mode for LayerNorm forward path
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73844

Approved by: https://github.com/frank-wei
2022-05-12 03:35:06 +00:00
otaj
1aa3cbb83b Use weakref.proxy when saving module to internal dictionaries to not increase refcount (#76435)
Fixes #76434
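
An illustrative sketch (my own, not the nn.Module code) of the technique in the title: storing a `weakref.proxy` in a registry avoids keeping the referenced object alive.

```python
import weakref

class Registry:
    def __init__(self):
        self._modules = {}

    def add(self, name, module):
        # Store a proxy so the registry does not increase module's refcount.
        self._modules[name] = weakref.proxy(module)

class Module:
    pass

m = Module()
registry = Registry()
registry.add("m", m)
del m  # the proxy did not prevent collection; using it now raises ReferenceError
```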

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76435
Approved by: https://github.com/jbschlosser
2022-05-11 18:40:59 +00:00
mingfeima
3d0e6f169c add channels last support for slow_conv_dilated2d
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70665

Approved by: https://github.com/VitalyFedyunin
2022-05-11 15:28:50 +00:00
Rui Zhu
533b44a280 Add _native nested_tensor_from_mask (#76942)
Summary: Lets users convert to nested tensors more easily. Some implementation details might change based on user needs.

Test Plan: buck test mode/dev caffe2/test:nn -- test_nested_tensor_from_mask

Differential Revision: D36191182

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76942
Approved by: https://github.com/jbschlosser
2022-05-11 05:19:36 +00:00
mingfeima
3d561ee926 add channels last support for thnn_conv2d (non-dilated)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68101

Approved by: https://github.com/VitalyFedyunin
2022-05-11 00:09:45 +00:00
neverix
87e543da9b Add load_state_dict error message for non-dicts (#77197)
Fixes #76886
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77197
Approved by: https://github.com/jbschlosser
2022-05-10 22:11:51 +00:00
Aidyn-A
a127c584a0 Fix max pool forward nhwc (#76597)
Fixes issue #76432.

Added dilation to loops in CUDA kernel.

cc @ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76597
Approved by: https://github.com/ngimel
2022-05-10 17:39:48 +00:00
mingfeima
8d4e069e66 add BFloat16 support for UpSample on CPU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76935

Approved by: https://github.com/frank-wei
2022-05-10 16:56:41 +00:00
Scott Wolchok
e5915a2216 [PyTorch] Don't enter MHA fast path when bias & query dtypes don't match
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76879

The fast path does not support this: transform_bias_rescale_qkv will try to grab bias.data_ptr() assuming the dtypes are the same. (Also, I have no idea how this happens.)
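
A hypothetical sketch (function name and shapes are mine) of the guard this change adds, under the constraint described above:

```python
import torch

def can_use_fast_path(query: torch.Tensor, in_proj_bias: torch.Tensor) -> bool:
    # The fused path reads the bias through a raw pointer of the query's dtype,
    # so mismatched dtypes must take the slow path.
    return in_proj_bias.dtype == query.dtype

q = torch.randn(2, 3)                 # float32 query
b = torch.randn(3, dtype=torch.half)  # mismatched bias dtype
assert not can_use_fast_path(q, b)
```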

Differential Revision: [D36156872](https://our.internmc.facebook.com/intern/diff/D36156872/)

Approved by: https://github.com/cpuhrsch
2022-05-09 18:21:04 +00:00
Rohan Varma
56bed0dcfe Load state dict post hook
Implements `register_load_state_dict_post_hook` API as discussed in https://github.com/pytorch/pytorch/issues/75287.

Unittests cover:
- Ensuring hooks are called with the correct module
- Hook is called with `IncompatibleKeys` field
- If hook modifies this, load_state_dict returns the modified result

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76823
Approved by: https://github.com/albanD
2022-05-05 19:27:05 +00:00
lkct
b8776e143f Fix false DeprecationWarning in Module.state_dict
Fixes #75404

TODO:
- [x] add tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75507
Approved by: https://github.com/jbschlosser
2022-05-04 20:08:23 +00:00
Nikita Shulga
b074bffa41 Revert D28836788: add BFloat16 support for UpSample on CPU
Test Plan: revert-hammer

Differential Revision:
D28836788 (1399d83bc0)

Original commit changeset: 63dc45e5bb91

Original Phabricator Diff: D28836788 (1399d83bc0)

fbshipit-source-id: 92733af87cba87aed800473ff44ca6d7af037da9
(cherry picked from commit 1c9fc492503b768a343723e4cf347b30bf5dcfc2)
2022-05-02 23:13:39 +00:00
mingfeima
1399d83bc0 add BFloat16 support for UpSample on CPU (#58297)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58297

Test Plan: Imported from OSS

Reviewed By: mikaylagawarecki

Differential Revision: D28836788

Pulled By: VitalyFedyunin

fbshipit-source-id: 63dc45e5bb91964d5ff1110262228718289435d1
(cherry picked from commit 8a37d607d6a89ccb50364cf54a6f26ca8d05cab9)
2022-05-02 22:33:26 +00:00
Scott Wolchok
e816e17655 [PyTorch] Add native fast path for transformer encoder inference (#76333)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76333

The current PyTorch multi-head attention and transformer
implementations are slow. This should speed them up for inference.
ghstack-source-id: 154737857

(Note: this ignores all push blocking failures!)

Test Plan: CI

Reviewed By: cpuhrsch

Differential Revision: D35239925

fbshipit-source-id: 5a7eb8ff79bc6afb4b7d45075ddb2a24a6e2df28
2022-04-26 12:58:03 -04:00
Jon Janzen
2387efd356 Revert "[PyTorch] Add native fast path for transformer encoder inference"
This reverts commit b369b89f23.

This has internal changes and should not have been landed via mergebot.

Ref: https://github.com/pytorch/pytorch/pull/75809#issuecomment-1108717166
2022-04-25 11:40:02 -04:00
Scott Wolchok
b369b89f23 [PyTorch] Add native fast path for transformer encoder inference
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75809

The current PyTorch multi-head attention and transformer
implementations are slow. This should speed them up for inference.

Differential Revision: [D35239925](https://our.internmc.facebook.com/intern/diff/D35239925/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D35239925/)!

Approved by: https://github.com/ezyang
2022-04-25 06:11:36 +00:00
Peter Bell
cb37e7a080 Remove F.pad python implementation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73433

Approved by: https://github.com/albanD, https://github.com/jbschlosser
2022-04-23 00:13:20 +00:00
Joel Benjamin Schlosser
041e6e750a Fix to support no-batch-dim inputs in ConvTransposeNd._output_padding
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76151

Approved by: https://github.com/albanD
2022-04-22 19:25:09 +00:00
Nikita Vedeneev
9e137ee583 more numerically stable cosine_similarity
**Previous behavior**: compute inner product, then normalize.
**This patch**: first normalize, then compute inner product. This should be more numerically stable because it avoids losing precision in inner product for inputs with large norms.
By design, this ensures that the cosine similarity is within `[-1.0, +1.0]`, so it should fix [#29442](https://github.com/pytorch/pytorch/issues/29442).

P.S. I had to change tests because this implementation handles division by 0 differently.
This PR computes cosine similarity as follows: <x/max(eps, ||x||), y/max(eps, ||y||)>.
Let f(x,y) = <x,y>/(||x|| * ||y||), then
df/dx = y/(||x|| * ||y||) - (||y||/||x|| * <x,y> * x)/(||x|| * ||y||)^2.
The changed test checks division by zero in backward when x=0 and y != 0.
For this case the non-zero part of the gradient is just y / (||x|| * ||y||).
The previous test evaluates y/(||x|| * ||y||) to y / eps, and this PR to 1/eps * y/||y||.
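
A minimal Python sketch (reference semantics only, not the ATen kernel) of the normalize-then-dot formulation described above, with the `max(eps, ||.||)` clamping of the norms:

```python
import torch

def stable_cosine_similarity(x, y, dim=-1, eps=1e-8):
    # normalize first (clamp_min implements max(eps, ||.||)), then take the dot
    x = x / x.norm(dim=dim, keepdim=True).clamp_min(eps)
    y = y / y.norm(dim=dim, keepdim=True).clamp_min(eps)
    return (x * y).sum(dim)

a = torch.randn(4, 8) * 1e18  # large norms that would be risky to multiply
b = torch.randn(4, 8) * 1e18
out = stable_cosine_similarity(a, b)
assert out.abs().le(1.0 + 1e-6).all()  # bounded in [-1, 1] by construction
```
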
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31378
Approved by: https://github.com/ezyang, https://github.com/albanD
2022-04-22 09:28:50 +00:00
arindamroy-eng
7478ce187a ROCm: unskip more tests for ROCm 5.0
Re-enabling more tests that now pass on ROCm 5.0.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75353
Approved by: https://github.com/ezyang
2022-04-19 19:45:55 +00:00
George Qi
f5517761aa add operator header
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71502

Approved by: https://github.com/zrphercule, https://github.com/cpuhrsch
2022-04-19 15:23:25 +00:00
yanbing-j
dc2e630341 Optimize PReLU (float32) and enable PReLU BFloat16 support in CPU path (#63634)
Summary:
In this PR, we optimize the PReLU op in the CPU path, and enable BFloat16 support based on the optimized PReLU.

The original implementation uses parallel_for to parallelize the operation, but no vectorization is used. It can be optimized with TensorIterator, which provides both parallelization and vectorization.

The difference between PReLU and other activation ops is that PReLU supports a learnable parameter `weight`. When called without arguments, nn.PReLU() uses a single parameter `weight` across all input channels. If called with nn.PReLU(nChannels), a separate `weight` is used for each input channel. So we cannot use TensorIterator directly, because `weight` differs per input channel.

To use TensorIterator, `weight` should be broadcast to the `input` shape. With vectorization and parallel_for, this implementation is much faster than the original one. Another advantage is that the implementation no longer needs separate `share weights` and `multiple weights` paths.
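
A minimal sketch (reference semantics only, not the optimized kernel) of the broadcasting idea described above:

```python
import torch
import torch.nn.functional as F

def prelu_ref(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    if weight.numel() > 1:
        # broadcast the per-channel weight to the input's shape
        weight = weight.view(1, -1, *([1] * (x.dim() - 2))).expand_as(x)
    # a single vectorizable elementwise expression covers both cases
    return torch.where(x >= 0, x, weight * x)

x = torch.randn(2, 3, 4, 4)
w = torch.full((3,), 0.25)
assert torch.allclose(prelu_ref(x, w), F.prelu(x, w))
```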

We compare the performance of the PReLU implementation in public PyTorch against the optimized PReLU in this PR, covering fp32/bf16, forward/backward, and share-weights/multiple-weights configurations. bf16 in public PyTorch directly reuses `Vectorized<scalar_t>` for `BFloat16`.

Share weights:
![image](https://user-images.githubusercontent.com/61222868/130403002-ef271bee-0cae-460b-b796-46853599c210.png)

![image](https://user-images.githubusercontent.com/61222868/130403028-96753102-bea3-44c2-8656-2526469e0627.png)

Multiple weights:
![image](https://user-images.githubusercontent.com/61222868/130403059-a3418eb2-9546-471f-b057-15bc0e46f0d0.png)

![image](https://user-images.githubusercontent.com/61222868/130403070-8c620db9-f354-4ddd-b5d5-4557e10ea77a.png)

cc albanD mruberry jbschlosser walterddr

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63634

Reviewed By: yinghai

Differential Revision: D34031616

Pulled By: frank-wei

fbshipit-source-id: 04e2a0f9e92c658fba7ff56b1010eacb7e8ab44c
(cherry picked from commit ed262b15487557720bb0d498f9f2e8fcdba772d9)
2022-04-15 21:46:24 +00:00
PyTorch MergeBot
e8ed042043 Revert "Optimize PReLU (float32) and enable PReLU BFloat16 support in CPU path"
This reverts commit 263c4c2a95.

Reverted https://github.com/pytorch/pytorch/pull/63634 on behalf of https://github.com/seemethere
2022-04-15 21:41:51 +00:00
yanbing-j
263c4c2a95 Optimize PReLU (float32) and enable PReLU BFloat16 support in CPU path
In this PR, we optimize the PReLU op in the CPU path, and enable BFloat16 support based on the optimized PReLU.

The original implementation uses parallel_for to parallelize the operation, but no vectorization is used. It can be optimized with TensorIterator, which provides both parallelization and vectorization.

The difference between PReLU and other activation ops is that PReLU supports a learnable parameter `weight`. When called without arguments, nn.PReLU() uses a single parameter `weight` across all input channels. If called with nn.PReLU(nChannels), a separate `weight` is used for each input channel. So we cannot use TensorIterator directly, because `weight` differs per input channel.

To use TensorIterator, `weight` should be broadcast to the `input` shape. With vectorization and parallel_for, this implementation is much faster than the original one. Another advantage is that the implementation no longer needs separate `share weights` and `multiple weights` paths.

We compare the performance of the PReLU implementation in public PyTorch against the optimized PReLU in this PR, covering fp32/bf16, forward/backward, and share-weights/multiple-weights configurations. bf16 in public PyTorch directly reuses `Vectorized<scalar_t>` for `BFloat16`.

Share weights:
![image](https://user-images.githubusercontent.com/61222868/130403002-ef271bee-0cae-460b-b796-46853599c210.png)

![image](https://user-images.githubusercontent.com/61222868/130403028-96753102-bea3-44c2-8656-2526469e0627.png)

Multiple weights:
![image](https://user-images.githubusercontent.com/61222868/130403059-a3418eb2-9546-471f-b057-15bc0e46f0d0.png)

![image](https://user-images.githubusercontent.com/61222868/130403070-8c620db9-f354-4ddd-b5d5-4557e10ea77a.png)

cc @albanD @mruberry @jbschlosser @walterddr
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63634
Approved by: https://github.com/frank-wei, https://github.com/seemethere
2022-04-15 20:34:58 +00:00
Scott Wolchok
56f801e788 [PyTorch] Add test for all-masked case for native softmax
It returns all NaNs. The CUDA implementation required a fix for this.

Differential Revision: [D35327730](https://our.internmc.facebook.com/intern/diff/D35327730/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75803

Approved by: https://github.com/ngimel
2022-04-14 21:30:57 +00:00
Scott Wolchok
d4c527e738 [PyTorch] Run test_transformerencoderlayer_gelu on CUDA
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75347

Preparing to add native fast path; need to test on CUDA!

Differential Revision: [D35327729](https://our.internmc.facebook.com/intern/diff/D35327729/)

Approved by: https://github.com/ngimel
2022-04-14 21:30:57 +00:00
Scott Wolchok
96cf8a450a [PyTorch] Run test_transformerencoderlayer on CUDA
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75346

Preparing to add native fast path; need to test on CUDA!

Differential Revision: [D35327731](https://our.internmc.facebook.com/intern/diff/D35327731/)

Approved by: https://github.com/ngimel
2022-04-14 21:30:56 +00:00
Rohan Varma
a4126e5936 Add test to make sure submodule hooks fire
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75421

As part of FSDP work, we will be relying on `_register_load_state_dict_pre_hook` to manage some specific logic related to loading state dicts.

This PR adds a test to ensure that _register_load_state_dict_pre_hook can be
used to register hooks on modules that will be used in a nested way, and then
calling load_state_dict on the overall module still calls those hooks
appropriately.

Differential Revision: [D35434726](https://our.internmc.facebook.com/intern/diff/D35434726/)

Approved by: https://github.com/albanD
2022-04-12 20:11:38 +00:00
HDCharles
25ee52570e [ao][sparsity] composability for sparsity and QAT convert
Summary: The primary issue for enabling sparsity to work with QAT
convert (unlike normal quantization convert) is that when the
parametrized module undergoes the QAT convert, the parametrizations need
to be maintained. If the parametrizations don't
get transferred during the convert, the sparsifier would lose its
connection to the model. In practice this was handled using the
transfer_parametrizations_and_params function to move the weight and
bias and any associated parametrizations to the new module. This PR also adds
tests for transfer_parametrizations_and_params and type_before_parametrizations
to test_nn.py, and adds comments to the test code for
composability.

Test Plan: python test/test_ao_sparsity.py TestComposability
python test/test_nn.py TestNN

Pull Request resolved: https://github.com/pytorch/pytorch/pull/74848

Approved by: https://github.com/vkuzo, https://github.com/Lezcano
2022-04-11 16:32:08 +00:00
kshitij12345
e177d2cc44 [complex] conv3d
Reference: https://github.com/pytorch/pytorch/issues/71108

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75581
Approved by: https://github.com/anjali411
2022-04-10 19:37:10 +00:00
soulitzer
b10d151745 Ensure convolution_backward respects output_mask
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75298

Approved by: https://github.com/albanD
2022-04-08 19:27:41 +00:00
Kshiteej K
fe799374de [complex] conv2d
Reference: https://github.com/pytorch/pytorch/issues/71108
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75412
Approved by: https://github.com/anjali411
2022-04-08 16:26:39 +00:00
kshitij12345
706b9e8b8d [reland] [complex] conv1d
Reland : #75013

Reference: #71108
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75310
Approved by: https://github.com/anjali411
2022-04-06 17:12:41 +00:00
Jianyu Huang
b8a4708ac0 [pt] Add half precision support for nn.EmbeddingBag (CPU) (#74844)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74844

- Use FBGEMM/perf kernel implementation for the fast path.
- Use FP32 accumulation for FP16 weight embeddings (`index_select_add` and `index_select_scale_add`).
- Add the unit test coverage.
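
A tiny sketch (mine, not the FBGEMM path) of the FP32-accumulation point above: upcast fp16 embedding rows before summing, then downcast the pooled result.

```python
import torch

weight = torch.randn(10, 4, dtype=torch.half)
indices = torch.tensor([1, 3, 3, 7])
# Gather fp16 rows, accumulate in fp32, and downcast the pooled result.
pooled = weight.index_select(0, indices).float().sum(0).half()
print(pooled)
```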

Test Plan:
```
buck run mode/opt //ai_codesign/nonprod/jianyuhuang/pytorch_examples:eb
Parsing buck files: finished in 0.6 sec
Downloaded 0/2 artifacts, 0.00 bytes, 100.0% cache miss (for updated rules)
Building: finished in 01:52.1 min (100%) 12247/12247 jobs, 2/12247 updated
  Total time: 01:52.8 min
BUILD SUCCEEDED
tensor([[ 0.1282, -0.0244,  1.0996],
        [-1.2285, -0.8643,  2.6621]], dtype=torch.float16,
       grad_fn=<EmbeddingBagBackward0>)
tensor([[[-0.1643,  0.1266, -0.4851],
         [ 0.0710,  0.5024,  0.2798],
         [ 0.4797,  0.5991, -0.0188],
         [ 0.8843,  1.2090,  1.6494]],

        [[ 0.4797,  0.5991, -0.0188],
         [ 0.0662, -0.4121,  1.5752],
         [ 0.0710,  0.5024,  0.2798],
         [-0.8242,  0.2668, -0.6177]]], dtype=torch.float16,
       grad_fn=<EmbeddingBackward0>)
```

```
$ buck run mode/opt //caffe2/test:nn -- -r test_embedding_bag_half 2>&1 | tee output.log
PARSING BUCK FILES: FINISHED IN 0.8s
CREATING ACTION GRAPH: FINISHED IN 0.0s
test_embedding_bag_half_cpu_int32_int32 (test_nn.TestNNDeviceTypeCPU) ... ok
test_embedding_bag_half_cpu_int32_int64 (test_nn.TestNNDeviceTypeCPU) ... ok
test_embedding_bag_half_cpu_int64_int32 (test_nn.TestNNDeviceTypeCPU) ... ok
test_embedding_bag_half_cpu_int64_int64 (test_nn.TestNNDeviceTypeCPU) ... ok
test_embedding_bag_half_cuda_int32_int32 (test_nn.TestNNDeviceTypeCUDA) ... ok
test_embedding_bag_half_cuda_int32_int64 (test_nn.TestNNDeviceTypeCUDA) ... ok
test_embedding_bag_half_cuda_int64_int32 (test_nn.TestNNDeviceTypeCUDA) ... ok
test_embedding_bag_half_cuda_int64_int64 (test_nn.TestNNDeviceTypeCUDA) ... ok

----------------------------------------------------------------------
Ran 8 tests in 44.621s

OK
```

```
TORCH_SHOW_CPP_STACKTRACES=1  buck run mode/opt //caffe2/test:nn -- -r test_EmbeddingBag_per_sample_weights_and_new_offsets 2>&1 | tee output.log
```

Reviewed By: jasonjk-park

Differential Revision: D35190299

fbshipit-source-id: d1daa6e837660259b92a1f316b09f38e509ee077
(cherry picked from commit 86f575f2c4cd407c13d4b2eaeea94b59f74642af)
2022-04-06 01:27:46 +00:00
PyTorch MergeBot
862f67454f Revert "[complex] conv1d"
This reverts commit b64e7dee51.

Reverted https://github.com/pytorch/pytorch/pull/75013 on behalf of https://github.com/mruberry
2022-04-05 20:18:50 +00:00
kshitij12345
b64e7dee51 [complex] conv1d
Reference: https://github.com/pytorch/pytorch/issues/71108
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75013
Approved by: https://github.com/anjali411
2022-04-05 17:29:59 +00:00
CaoE
77c7a50d46 Add BFloat16 support for logsigmoid, hardsigmoid, hardshrink, softshrink, hardswish and softplus on CPU (#63134)
Summary:
Add BFloat16 support for logsigmoid, hardsigmoid, hardshrink, softshrink, hardswish and softplus on CPU, and optimize the performance of softshrink.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63134

Reviewed By: yinghai

Differential Revision: D34897992

Pulled By: frank-wei

fbshipit-source-id: 4c778f5271d6fa54dd78158258941def8d9252f5
(cherry picked from commit decda0e3debf56cc5c4d7faea41b1165a7cabe12)
2022-04-04 20:31:22 +00:00
Nikita Karetnikov
936a65056e Use the same checks in all grid_sampler functions
Fixes #73187.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75164
Approved by: https://github.com/albanD
2022-04-04 15:21:44 +00:00
CaoE
c5872e6d6d Add BFloat16 support for smooth_l1_loss on CPU (#62558)
Summary:
Add BFloat16 support for smooth_l1_loss on CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62558

Reviewed By: H-Huang

Differential Revision: D34897859

Pulled By: frank-wei

fbshipit-source-id: a52138c89852642db78f5f3083d05873f3cdec3a
(cherry picked from commit 71908ee3de7ca0580a073350353ce6f234a8c6ff)
2022-04-04 06:05:46 +00:00
Scott Wolchok
8f4f1638bb [PyTorch] Flip polarity of masked_softmax mask (#78)
Summary:
X-link: https://github.com/pytorch/pytorch-canary/pull/78

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75039

It didn't match torch.nn.MultiheadAttention. Now it does.
ghstack-source-id: 152815449

Test Plan: updated tests

Reviewed By: zrphercule

Differential Revision: D34929186

fbshipit-source-id: 1eaee615bafd5a6f058f1faefa54f8f4aa01c92e
(cherry picked from commit 00eea72a06fb924112f1036c8f0c5ed08eb0d02c)
2022-04-02 00:17:49 +00:00
Kyle Chen
f888dc5842 [ROCm] re-enable test_Conv2d_groups_nobias tests
fixes:
https://github.com/pytorch/pytorch/pull/59158
https://github.com/pytorch/pytorch/pull/58701
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75008
Approved by: https://github.com/albanD
2022-04-01 21:42:23 +00:00
Alban Desmaison
0ce02ea52d Revert D35284563: Use the same checks in all grid_sampler functions
Test Plan: revert-hammer

Differential Revision:
D35284563 (835cc66e5d)

Original commit changeset: 1477c506b875

Original Phabricator Diff: D35284563 (835cc66e5d)

fbshipit-source-id: 7260f4dfda23bd60200e5ba2c5bf3e4f833c2646
(cherry picked from commit fbe082905ef678e7dd70dbc9520dca644383ce01)
2022-04-01 16:45:46 +00:00
Nikita Karetnikov
835cc66e5d Use the same checks in all grid_sampler functions (#74635)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74635

Fixes #73187.

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D35284563

Pulled By: albanD

fbshipit-source-id: 1477c506b8755d864ca902ee140bee7bdb0069b0
(cherry picked from commit dcbd5242baaae11f9e323d99a9596e5b88e86bd7)
2022-04-01 14:26:16 +00:00
Nikita Shulga
bfac65dfe5 [testing] Update dispatch macros (#74977)
This PR is reland of #74289 
Co-authored-by: Khushi Agrawal <khushiagrawal411@gmail.com>
2022-03-30 14:13:21 -07:00
PyTorch MergeBot
2e4152b118 Revert "[testing] Update dispatch macros"
This reverts commit eed19a0f38.

Reverted https://github.com/pytorch/pytorch/pull/74289 on behalf of https://github.com/malfet
2022-03-30 19:52:37 +00:00
kshitij12345
273c2f0124 EmbeddingBagCUDA: remove oob check for perf
Fixes: https://github.com/pytorch/pytorch/issues/74751
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74767
Approved by: https://github.com/ngimel, https://github.com/xuzhao9
2022-03-30 16:55:49 +00:00
Khushi Agrawal
eed19a0f38 [testing] Update dispatch macros
Hi,
This PR is the follow-up of #71561 (the previous PR had a couple of merge conflicts and was reverted; this PR resolves that).
Please take a look. Thanks!

cc: @pmeier @mruberry @kshitij12345
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74289
Approved by: https://github.com/pmeier, https://github.com/mruberry
2022-03-30 16:10:16 +00:00
Michael
3157b1abd7 CrossEntropyLoss triggers floating point exception
Adds a cast to float to fix a floating point exception.

Fixes #73165

Pull Request resolved: https://github.com/pytorch/pytorch/pull/73837
Approved by: https://github.com/jbschlosser
2022-03-22 14:20:30 +00:00
XiaobingSuper
4a0f6e6c53 report an error if num_channels is not divisible by num_groups for nn.GroupNorm
For a GroupNorm module, if num_channels is not divisible by num_groups, we should report the error when the module is defined rather than when it is run.

example:
```
import torch
m = torch.nn.GroupNorm(5, 6)
x = torch.randn(1, 6, 4, 4)
y = m(x)
```
before:

```
Traceback (most recent call last):
  File "group_norm_test.py", line 8, in <module>
    y = m(x)
  File "/home/xiaobinz/miniconda3/envs/pytorch_mater/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1111, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xiaobinz/miniconda3/envs/pytorch_mater/lib/python3.7/site-packages/torch/nn/modules/normalization.py", line 271, in forward
    input, self.num_groups, self.weight, self.bias, self.eps)
  File "/home/xiaobinz/miniconda3/envs/pytorch_mater/lib/python3.7/site-packages/torch/nn/functional.py", line 2500, in group_norm
    return torch.group_norm(input, num_groups, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: Expected number of channels in input to be divisible by num_groups, but got input of shape [1, 6, 4, 4] and num_groups=5
```

after:

```
Traceback (most recent call last):
  File "group_norm_test.py", line 6, in <module>
    m = torch.nn.GroupNorm(5, 6)
  File "/home/xiaobinz/miniconda3/envs/pytorch_test/lib/python3.7/site-packages/torch/nn/modules/normalization.py", line 251, in __init__
    raise ValueError('num_channels must be divisible by num_groups')
```

This PR also updates the doc of num_groups.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/74293
Approved by: https://github.com/jbschlosser
2022-03-17 13:40:47 +00:00
Nikita Shulga
ef066f0832 Revert D34856571: [pytorch][PR] Replace get_all_ type macros with the ATen dispatch macros.
Test Plan: revert-hammer

Differential Revision:
D34856571 (3ded7b1da3)

Original commit changeset: 0dca038bcad5

Original Phabricator Diff: D34856571 (3ded7b1da3)

fbshipit-source-id: 594553fa0b710d78beba59d5d2b646f1f1270386
(cherry picked from commit 8090eb9b12dcf452a9e7dc01792a66fb91b563b6)
2022-03-15 22:07:11 +00:00
Khushi Agrawal
3ded7b1da3 Replace get_all_ type macros with the ATen dispatch macros. (#71561)
Summary:
Hi, Team!
The PR is motivated by https://github.com/pytorch/pytorch/pull/71153#discussion_r782446738. It aims to replace the `get_all_*` type macros with the ATen dispatch macros.
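
For illustration, one flavor of the replacement (these are internal test helpers; the names are taken from the listings below, and the exact dtype coverage of each grouping may differ):

```python
import torch
from torch.testing._internal.common_dtype import all_types_and_complex_and

# before: get_all_dtypes()
# after: an ATen-style dispatch grouping, extended with the extra dtypes
dtypes = all_types_and_complex_and(torch.half, torch.bfloat16, torch.bool)
print(dtypes)
```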

The files it iterates over are: (Thanks, Lezcano, for the idea!!)

<details>
<summary>

`test/test_autograd.py`</summary>

<p>

```python
43:from torch.testing._internal.common_dtype import get_all_dtypes
8506:        floating_dt = [dt for dt in get_all_dtypes() if dt.is_floating_point]
```

</p>
</details>

<details>
<summary>

`test/test_binary_ufuncs.py`</summary>

<p>

```python
26:    all_types_and_complex_and, integral_types_and, get_all_dtypes, get_all_int_dtypes, get_all_math_dtypes,
27:    get_all_complex_dtypes, get_all_fp_dtypes,
935:    dtypes(*get_all_dtypes(include_bool=False, include_complex=False))
1035:    dtypes(*get_all_dtypes(
1488:    dtypes(*(get_all_dtypes(include_bool=False, include_bfloat16=False)))
1879:    dtypes(*product(get_all_dtypes(include_complex=False), get_all_dtypes(include_complex=False)))
1887:    dtypes(*(get_all_int_dtypes() + [torch.bool]))
1913:    dtypes(*(get_all_fp_dtypes()))
1941:    dtypes(*(get_all_fp_dtypes()))
1977:    dtypes(*product(get_all_complex_dtypes(), get_all_dtypes()))
2019:    dtypes(*product(get_all_fp_dtypes(), get_all_fp_dtypes()))
2048:    dtypes(*get_all_dtypes())
2110:    dtypes(*product(get_all_dtypes(include_complex=False),
2111:                     get_all_dtypes(include_complex=False)))
2128:            types = [torch.bool, torch.bfloat16] + get_all_int_dtypes()
2173:        if dtypes[1] in get_all_fp_dtypes():
2178:    dtypes(*product(get_all_fp_dtypes(),
2179:                     get_all_fp_dtypes()))
2260:    dtypesIfCUDA(*set(get_all_math_dtypes('cuda')) - {torch.complex64, torch.complex128})
2261:    dtypes(*set(get_all_math_dtypes('cpu')) - {torch.complex64, torch.complex128})
2273:    dtypesIfCUDA(*set(get_all_math_dtypes('cuda')) - {torch.complex64, torch.complex128})
2274:    dtypes(*set(get_all_math_dtypes('cpu')) - {torch.complex64, torch.complex128})
2307:    dtypes(*get_all_math_dtypes('cpu'))
2319:    dtypes(*get_all_fp_dtypes(include_bfloat16=False))
2331:    dtypes(*get_all_int_dtypes())
2356:    dtypes(*get_all_dtypes(include_bfloat16=False, include_bool=False, include_complex=False))
2393:        if dtype in get_all_int_dtypes():
2614:    dtypes(*get_all_dtypes())
2624:    dtypes(*tuple(itertools.combinations_with_replacement(get_all_dtypes(), 2)))
2806:    dtypes(*list(product(get_all_dtypes(include_complex=False),
2807:                          get_all_dtypes(include_complex=False))))
2866:    dtypes(*list(product(get_all_complex_dtypes(),
2867:                          get_all_complex_dtypes())))
2902:    dtypes(*product(get_all_dtypes(), get_all_dtypes()))
2906:    dtypes(*product(get_all_dtypes(), get_all_dtypes()))
2910:    dtypes(*product(get_all_dtypes(), get_all_dtypes()))
3019:        dtypes = [torch.float, torch.double] + get_all_complex_dtypes()
3221:    dtypes(*get_all_dtypes(include_complex=False))
3407:    dtypes(*list(product(get_all_dtypes(include_bool=False),
3408:                          get_all_dtypes(include_bool=False))))
3504:    dtypes(*product(get_all_dtypes(include_complex=False, include_bfloat16=False),
3505:                     get_all_dtypes(include_complex=False, include_bfloat16=False)))
3516:            if x.dtype in get_all_int_dtypes() + [torch.bool]:
3643:    dtypes(*product(get_all_dtypes(include_complex=False,
3645:                     get_all_dtypes(include_complex=False,
```

</p>
</details>

<details>
<summary>

`test/test_complex.py`</summary>

<p>

```python
6:from torch.testing._internal.common_dtype import get_all_complex_dtypes
11:    dtypes(*get_all_complex_dtypes())
```

</p>
</details>

<details>
<summary>

`test/test_foreach.py`</summary>

<p>

```python
18:    get_all_dtypes, get_all_int_dtypes, get_all_complex_dtypes, get_all_fp_dtypes,
142:            if dtype in get_all_int_dtypes():
179:            disable_fastpath = op.ref == torch.div and dtype in get_all_int_dtypes() + [torch.bool]
201:            disable_fastpath = op.ref == torch.div and dtype in get_all_int_dtypes() + [torch.bool]
205:                disable_fastpath |= dtype in get_all_int_dtypes() + [torch.bool]
211:                disable_fastpath |= dtype not in get_all_complex_dtypes()
241:                bool_int_div = op.ref == torch.div and dtype in get_all_int_dtypes() + [torch.bool]
246:                    disable_fastpath |= dtype in get_all_int_dtypes() + [torch.bool]
248:                    disable_fastpath |= dtype not in get_all_complex_dtypes()
250:                    disable_fastpath |= True and dtype not in get_all_complex_dtypes()
307:        disable_fastpath = dtype in get_all_int_dtypes() + [torch.bool]
365:        if opinfo.name == "_foreach_abs" and dtype in get_all_complex_dtypes():
376:    ops(foreach_unary_op_db, dtypes=get_all_dtypes())
393:         dtypes=get_all_dtypes(include_half=True, include_bfloat16=True, include_complex=False))
401:    ops(foreach_minmax_op_db, dtypes=get_all_fp_dtypes(include_bfloat16=True, include_half=True))
426:            if ord in (1, 2) and dtype in torch.testing.get_all_fp_dtypes():
439:    dtypes(*get_all_dtypes())
449:    ops(foreach_binary_op_db, dtypes=get_all_dtypes())
481:    ops(foreach_binary_op_db, dtypes=get_all_dtypes())
536:            if dtype in get_all_int_dtypes() + [torch.bool] and foreach_op == torch._foreach_div:
545:    ops(foreach_binary_op_db, dtypes=get_all_dtypes())
637:    ops(foreach_pointwise_op_db, allowed_dtypes=get_all_fp_dtypes(include_half=False, include_bfloat16=False))
```

</p>
</details>

<details>
<summary>

`test/test_linalg.py`</summary>

<p>

```python
29:    all_types, floating_types, floating_and_complex_types, get_all_dtypes, get_all_int_dtypes, get_all_complex_dtypes,
30:    get_all_fp_dtypes,
111:    dtypes(*(get_all_dtypes()))
794:        float_and_complex_dtypes = get_all_fp_dtypes() + get_all_complex_dtypes()
807:    dtypes(*(get_all_int_dtypes()))
828:    dtypes(*(get_all_fp_dtypes() + get_all_complex_dtypes()))
841:        if dtype in get_all_complex_dtypes():
844:    dtypes(*itertools.product(get_all_dtypes(),
845:                               get_all_dtypes()))
855:        for dtypes0, dtypes1, dtypes2 in product(get_all_dtypes(), repeat=3):
5607:                  *get_all_fp_dtypes(include_half=not CUDA9, include_bfloat16=(CUDA11OrLater and SM53OrLater)))
5608:    dtypes(*(set(get_all_dtypes()) - {torch.half, torch.bool}))
5644:    dtypes(*(get_all_complex_dtypes() + get_all_fp_dtypes()))
6255:    dtypesIfCUDA(*get_all_complex_dtypes(),
6256:                  *get_all_fp_dtypes(include_bfloat16=(TEST_WITH_ROCM or (CUDA11OrLater and SM53OrLater)),
6292:    dtypesIfCUDA(*get_all_fp_dtypes(include_bfloat16=(TEST_WITH_ROCM or (CUDA11OrLater and SM53OrLater))))
6323:    dtypesIfCUDA(*get_all_complex_dtypes(),
6324:                  *get_all_fp_dtypes(include_bfloat16=(TEST_WITH_ROCM or (CUDA11OrLater and SM53OrLater))))
6325:    dtypes(*get_all_complex_dtypes(), *get_all_fp_dtypes())
6358:    dtypesIfCUDA(*([torch.float, torch.double] + get_all_complex_dtypes()))
6556:    dtypes(*get_all_fp_dtypes(), *get_all_complex_dtypes())
6668:    dtypes(*get_all_fp_dtypes(), *get_all_complex_dtypes())
6741:    dtypes(*get_all_fp_dtypes(), *get_all_complex_dtypes())
```

</p>
</details>

<details>
<summary>

`test/test_nn.py`</summary>

<p>

```python
37:from torch.testing._internal.common_dtype import integral_types, get_all_fp_dtypes, get_all_math_dtypes
50:    onlyNativeDeviceTypes, deviceCountAtLeast, largeTensorTest, expectedFailureMeta, skipMeta, get_all_device_types, \
8862:                for device in get_all_device_types():
9629:            for dt1 in get_all_math_dtypes(device):
9630:                for dt2 in get_all_math_dtypes(device):
9631:                    for dt3 in get_all_math_dtypes(device):
9648:            for input_dtype in get_all_math_dtypes(device):
9664:            for input_dtype in get_all_math_dtypes(device):
13015:    dtypes(*get_all_fp_dtypes(include_bfloat16=AMPERE_OR_ROCM))
13034:    dtypes(*get_all_fp_dtypes(include_bfloat16=AMPERE_OR_ROCM))
13159:    dtypes(*get_all_fp_dtypes(include_bfloat16=AMPERE_OR_ROCM))
17400:    dtypesIfCUDA(*get_all_fp_dtypes(include_bfloat16=AMPERE_OR_ROCM))
17768:    dtypesIfCUDA(*get_all_fp_dtypes())
17773:    dtypesIfCUDA(*get_all_fp_dtypes())
17778:    dtypesIfCUDA(*get_all_fp_dtypes())
17783:    dtypesIfCUDA(*get_all_fp_dtypes())
17788:    dtypesIfCUDA(*get_all_fp_dtypes())
17793:    dtypesIfCUDA(*get_all_fp_dtypes())
17798:    dtypesIfCUDA(*get_all_fp_dtypes())
17963:    dtypesIfCUDA(*get_all_fp_dtypes())
17977:    dtypesIfCUDA(*get_all_fp_dtypes())
18684:    def test_cross_entropy_loss_prob_target_all_reductions(self, device):
```

</p>
</details>

<details>
<summary>

`test/test_numpy_interop.py`</summary>

<p>

```python
12:from torch.testing._internal.common_dtype import get_all_dtypes
399:    dtypes(*get_all_dtypes())
```

</p>
</details>

<details>
<summary>

`test/test_ops.py`</summary>

<p>

```python
12:from torch.testing._internal.common_dtype import floating_and_complex_types_and, get_all_dtypes
86:        for dtype in get_all_dtypes():
```

</p>
</details>

<details>
<summary>

`test/test_reductions.py`</summary>

<p>

```python
16:    get_all_dtypes, get_all_math_dtypes, get_all_int_dtypes, get_all_complex_dtypes, get_all_fp_dtypes,
360:         allowed_dtypes=get_all_dtypes(include_bfloat16=False))
366:         allowed_dtypes=get_all_dtypes(include_bfloat16=False))
394:         allowed_dtypes=get_all_dtypes(include_bfloat16=False))
750:        for dtype in [dtype for dtype in get_all_math_dtypes('cpu') if dtype != torch.float16]:
1404:    dtypes(*get_all_dtypes(include_bool=False, include_complex=False))
1457:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False) +
1458:              get_all_complex_dtypes()))
1465:            return dtype in get_all_int_dtypes()
1494:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False)))
1501:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False)))
1507:    dtypes(*(get_all_complex_dtypes()))
1514:        dtypes = list(get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False))
1523:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False)))
1531:        if dtype in get_all_fp_dtypes():
1608:    dtypes(*(get_all_dtypes(include_half=True, include_bfloat16=False,
1837:    dtypes(*get_all_dtypes(include_bool=False, include_complex=False))
1855:    dtypes(*(set(get_all_dtypes(include_bool=False, include_complex=False)) - {torch.uint8}))
3219:        for dtype in get_all_dtypes(include_half=True, include_bfloat16=False,
```

</p>
</details>

<details>
<summary>

`test/test_serialization.py`</summary>

<p>

```python
26:from torch.testing._internal.common_dtype import get_all_dtypes
586:        for device, dtype in product(devices, get_all_dtypes()):
589:            for other_dtype in get_all_dtypes():
```

</p>
</details>

<details>
<summary>

`test/test_shape_ops.py`</summary>

<p>

```python
18:from torch.testing._internal.common_dtype import get_all_dtypes
230:    dtypes(*get_all_dtypes(include_complex=False, include_bool=False, include_half=False,
232:    dtypesIfCUDA(*get_all_dtypes(include_complex=False, include_bool=False, include_bfloat16=False))
344:    dtypes(*get_all_dtypes())
443:    dtypes(*get_all_dtypes())
461:    dtypes(*get_all_dtypes())
570:    dtypes(*get_all_dtypes(include_complex=False))
```

</p>
</details>

<details>
<summary>

`test/test_sort_and_select.py`</summary>

<p>

```python
12:    all_types, all_types_and, floating_types_and, get_all_dtypes, get_all_int_dtypes, get_all_fp_dtypes,
136:    dtypes(*set(get_all_dtypes()) - {torch.bool, torch.complex64, torch.complex128})
231:    dtypes(*set(get_all_dtypes()) - {torch.bool, torch.complex64, torch.complex128})
296:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes()))
647:    dtypesIfCUDA(*get_all_fp_dtypes())
678:    dtypesIfCUDA(*(get_all_dtypes(include_complex=False,
682:    dtypes(*(get_all_dtypes(include_complex=False, include_bool=False, include_half=False, include_bfloat16=False)))
739:    dtypesIfCPU(*set(get_all_dtypes()) - {torch.complex64, torch.complex128})
740:    dtypes(*set(get_all_dtypes()) - {torch.bfloat16, torch.complex64, torch.complex128})
799:    dtypesIfCPU(*set(get_all_dtypes()) - {torch.complex64, torch.complex128})
800:    dtypes(*set(get_all_dtypes()) - {torch.bfloat16, torch.complex64, torch.complex128})
```

</p>
</details>

<details>
<summary>

`test/test_sparse.py`</summary>

<p>

```python
20:from torch.testing import get_all_complex_dtypes, get_all_fp_dtypes
29:    floating_and_complex_types, floating_and_complex_types_and, get_all_dtypes, get_all_int_dtypes,
1963:            return dtype in get_all_int_dtypes()
1994:    dtypes(*get_all_dtypes(include_bool=False, include_half=False,
2103:            return dtype in get_all_int_dtypes()
2138:    dtypes(*get_all_dtypes(include_bool=False, include_half=False,
2626:        all_sparse_dtypes = get_all_dtypes(include_complex=True)
2633:        all_sparse_dtypes = get_all_dtypes(include_complex=True)
3230:    dtypes(*get_all_complex_dtypes(),
3231:            *get_all_fp_dtypes(include_half=False, include_bfloat16=False))
3234:                  *get_all_fp_dtypes(
```

</p>
</details>

<details>
<summary>

`test/test_sparse_csr.py`</summary>

<p>

```python
7:from torch.testing import get_all_complex_dtypes, get_all_fp_dtypes, floating_and_complex_types, make_tensor
17:from torch.testing._internal.common_dtype import floating_types, get_all_dtypes
120:    dtypes(*get_all_dtypes())
133:    dtypes(*get_all_dtypes())
150:    dtypes(*get_all_dtypes())
180:    dtypes(*get_all_dtypes())
201:    dtypes(*get_all_dtypes())
210:    dtypes(*get_all_dtypes())
225:    dtypes(*get_all_dtypes())
244:    dtypes(*get_all_dtypes())
263:    dtypes(*get_all_dtypes())
285:    dtypes(*get_all_dtypes())
411:    dtypes(*get_all_dtypes())
482:    dtypes(*get_all_dtypes())
502:    dtypes(*get_all_dtypes())
562:    dtypes(*get_all_dtypes())
588:    dtypesIfCUDA(*get_all_complex_dtypes(),
589:                  *get_all_fp_dtypes(include_half=SM53OrLater, include_bfloat16=SM80OrLater))
745:    dtypesIfCUDA(*get_all_complex_dtypes(),
746:                  *get_all_fp_dtypes(include_half=SM53OrLater and TEST_CUSPARSE_GENERIC,
765:    dtypesIfCUDA(*get_all_complex_dtypes(),
766:                  *get_all_fp_dtypes(include_half=SM53OrLater and TEST_CUSPARSE_GENERIC,
801:                  *torch.testing.get_all_fp_dtypes(include_bfloat16=SM80OrLater,
841:                  *torch.testing.get_all_fp_dtypes(include_bfloat16=SM80OrLater,
1182:    dtypes(*get_all_dtypes())
1276:    dtypes(*get_all_dtypes(include_bool=False, include_half=False, include_bfloat16=False))
1286:    dtypes(*get_all_dtypes())
```

</p>
</details>

<details>
<summary>

`test/test_tensor_creation_ops.py`</summary>

<p>

```python
21:    onlyCUDA, skipCPUIf, dtypesIfCUDA, skipMeta, get_all_device_types)
23:    get_all_dtypes, get_all_math_dtypes, get_all_int_dtypes, get_all_fp_dtypes, get_all_complex_dtypes
150:        for dt in get_all_dtypes():
160:        for dt in get_all_dtypes():
314:        dtypes = [dtype for dtype in get_all_dtypes() if dtype != torch.bfloat16]
1012:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False) +
1013:              get_all_complex_dtypes()))
1032:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False) +
1033:              get_all_complex_dtypes()))
1050:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False) +
1051:              get_all_complex_dtypes()))
1745:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes()))
1779:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes()))
1868:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes()))
1926:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes()))
1954:            do_test_empty_full(self, get_all_math_dtypes('cpu'), torch.strided, torch_device)
1956:            do_test_empty_full(self, get_all_math_dtypes('cpu'), torch.strided, None)
1957:            do_test_empty_full(self, get_all_math_dtypes('cpu'), torch.strided, torch_device)
2538:        for device in get_all_device_types():
2645:        for dtype in get_all_dtypes():
2678:    dtypes(*(get_all_fp_dtypes(include_half=False, include_bfloat16=False) +
2679:              get_all_complex_dtypes()))
2716:    dtypes(*get_all_fp_dtypes(include_half=False, include_bfloat16=False))
2827:            for dt in get_all_dtypes():
2913:    dtypes(*get_all_dtypes(include_bool=False, include_half=False))
2914:    dtypesIfCUDA(*get_all_dtypes(include_bool=False, include_half=True))
3028:    dtypes(*(get_all_fp_dtypes() + get_all_complex_dtypes()))
3033:    dtypes(*(get_all_fp_dtypes() + get_all_complex_dtypes()))
3074:    dtypes(*get_all_dtypes(include_bool=False, include_half=False, include_complex=False))
3075:    dtypesIfCUDA(*((get_all_int_dtypes() + [torch.float32, torch.float16, torch.bfloat16])
3077:                    else get_all_dtypes(include_bool=False, include_half=True, include_complex=False)))
3873:    dtypes(*get_all_dtypes())
3884:    dtypes(*get_all_dtypes(include_bool=False))
3916:            for other in get_all_dtypes():
3922:    dtypes(*get_all_dtypes())
3932:    dtypes(*get_all_dtypes(include_bool=False))
3955:    dtypes(*get_all_dtypes(include_bool=False))
3961:    dtypes(*get_all_dtypes(include_bool=False))
3965:    dtypes(*get_all_dtypes())
```

</p>
</details>

<details>
<summary>

`test/test_testing.py`</summary>

<p>

```python
25:from torch.testing._internal.common_dtype import get_all_dtypes
31:    dtypes(*(get_all_dtypes(include_half=True, include_bfloat16=False,
```

</p>
</details>

<details>
<summary>

`test/test_torch.py`</summary>

<p>

```python
51:    expectedAlertNondeterministic, get_all_device_types, skipXLA)
57:    get_all_fp_dtypes, get_all_int_dtypes, get_all_math_dtypes, get_all_dtypes, get_all_complex_dtypes
296:            for d in get_all_device_types():
323:            for device in get_all_device_types():
324:                for dt1 in get_all_dtypes():
325:                    for dt2 in get_all_dtypes():
343:            all_dtypes = get_all_dtypes()
350:            all_dtypes = get_all_dtypes()
781:            for dtype in get_all_dtypes():
986:            for device in get_all_device_types():
1017:            for device in get_all_device_types():
1018:                for dtype in get_all_math_dtypes(device):
2792:            for device in get_all_device_types():
3186:    dtypes(*get_all_dtypes())
3195:        for error_dtype in get_all_dtypes():
3203:    dtypes(*get_all_dtypes())
3212:        for error_dtype in get_all_dtypes():
4539:    dtypes(*get_all_fp_dtypes())
4545:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes()))
4577:    dtypes(*get_all_fp_dtypes(include_half=False, include_bfloat16=False))
4578:    dtypesIfCPU(*(get_all_fp_dtypes(include_half=False, include_bfloat16=True)))
4579:    dtypesIfCUDA(*(get_all_fp_dtypes(include_bfloat16=False)))
4599:    dtypes(*(get_all_fp_dtypes(include_half=False, include_bfloat16=False)))
4600:    dtypesIfCPU(*(get_all_dtypes(include_half=False, include_bfloat16=False, include_complex=False)))
4601:    dtypesIfCUDA(*(get_all_dtypes(include_bfloat16=False, include_complex=False)))
4613:        for p_dtype in get_all_fp_dtypes(include_half=device.startswith('cuda'), include_bfloat16=False):
4628:    dtypes(*(get_all_fp_dtypes(include_half=False, include_bfloat16=False)))
4629:    dtypesIfCUDA(*(get_all_fp_dtypes(include_bfloat16=False)))
4640:    dtypes(*get_all_fp_dtypes())
4723:    dtypes(*get_all_fp_dtypes())
4735:    dtypes(*get_all_fp_dtypes(include_bfloat16=False))
4736:    dtypesIfCUDA(*get_all_fp_dtypes())
4747:    dtypes(*get_all_fp_dtypes())
4761:    dtypes(*get_all_fp_dtypes())
4771:    dtypes(*get_all_fp_dtypes())
4792:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes()))
5302:    dtypes(*get_all_dtypes(include_bfloat16=False))
5322:    dtypes(*get_all_dtypes(include_half=False, include_bfloat16=False))
5323:    dtypesIfCPU(*get_all_dtypes(include_bfloat16=False))
5324:    dtypesIfCUDA(*get_all_dtypes(include_bfloat16=False))
5591:        for dt in get_all_dtypes():
5611:        for dt in get_all_dtypes():
5678:        for dt in get_all_dtypes():
5696:    dtypesIfCUDA(*set(get_all_math_dtypes('cuda')))
5697:    dtypes(*set(get_all_math_dtypes('cpu')))
5746:    dtypes(*get_all_dtypes())
5780:    dtypes(*get_all_dtypes())
5885:    dtypes(*get_all_dtypes())
5902:    dtypes(*get_all_dtypes())
5945:    dtypes(*get_all_dtypes())
5979:    dtypes(*get_all_dtypes(include_bool=False))
6049:    dtypes(*get_all_dtypes(include_bool=False))
6092:    dtypes(*(get_all_fp_dtypes(include_bfloat16=False, include_half=False) +
6093:              get_all_complex_dtypes()))
6094:    dtypesIfCPU(*get_all_dtypes())
6095:    dtypesIfCUDA(*get_all_dtypes())
6122:    dtypes(*(get_all_fp_dtypes(include_bfloat16=False, include_half=False) +
6123:              get_all_complex_dtypes()))
6124:    dtypesIfCPU(*get_all_dtypes())
6125:    dtypesIfCUDA(*get_all_dtypes())
6163:    dtypes(*(get_all_fp_dtypes(include_bfloat16=False, include_half=False) +
6164:              get_all_complex_dtypes()))
6165:    dtypesIfCPU(*get_all_dtypes())
6166:    dtypesIfCUDA(*get_all_dtypes())
6190:    dtypes(*(get_all_complex_dtypes() +
6191:              get_all_int_dtypes()))
6238:    dtypes(*get_all_dtypes())
6323:    dtypes(*get_all_dtypes())
6389:    dtypes(*product(get_all_dtypes(), (torch.uint8, torch.bool)))
6699:    dtypesIfCUDA(*set(get_all_math_dtypes('cuda')))
6700:    dtypes(*set(get_all_math_dtypes('cpu')))
7452:    dtypes(*get_all_dtypes(include_bool=False))
7461:    dtypes(*get_all_dtypes(include_bool=False))
7477:    dtypes(*get_all_dtypes(include_bool=False))
7496:    dtypes(*get_all_dtypes(include_bool=False))
7538:    dtypes(*get_all_dtypes(include_bool=False))
8162:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes() +
8163:              get_all_complex_dtypes()))
8175:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes() +
8176:              get_all_complex_dtypes()))
```

</p>
</details>

<details>
<summary>

`test/test_type_promotion.py`</summary>

<p>

```python
14:    get_all_dtypes, get_all_math_dtypes, get_all_int_dtypes, get_all_fp_dtypes
187:        for dtype in get_all_dtypes():
262:        dtypes1 = get_all_math_dtypes('cuda')
263:        dtypes2 = get_all_math_dtypes(device)
339:    dtypes(*itertools.product(get_all_dtypes(), get_all_dtypes()))
468:            for dt1 in get_all_math_dtypes(device):
469:                for dt2 in get_all_math_dtypes(device):
519:            for dt1 in get_all_math_dtypes(device):
520:                for dt2 in get_all_math_dtypes(device):
528:        for dt in get_all_math_dtypes(device):
561:        for dtype in get_all_dtypes():
766:                                          dtypes=get_all_math_dtypes(device))
771:                                          dtypes=get_all_math_dtypes(device))
782:                                          dtypes=get_all_math_dtypes(device))
879:        dtypes = get_all_dtypes(include_bfloat16=False)
898:        dtypes = get_all_dtypes(include_bfloat16=False, include_bool=False)
965:    dtypesIfCUDA(*itertools.product(get_all_dtypes(include_bfloat16=False, include_complex=False),
966:                                     get_all_dtypes(include_bfloat16=False, include_complex=False)))
967:    dtypes(*itertools.product(get_all_dtypes(include_half=False, include_bfloat16=False,
969:                               get_all_dtypes(include_half=False, include_bfloat16=False,
976:            return dtype in get_all_int_dtypes() + [torch.bool]
979:            return dtype in get_all_fp_dtypes(include_half=True, include_bfloat16=False)
```

</p>
</details>

<details>
<summary>

`test/test_unary_ufuncs.py`</summary>

<p>

```python
24:    floating_types_and, all_types_and_complex_and, floating_and_complex_types_and, get_all_dtypes, get_all_math_dtypes,
25:    get_all_int_dtypes, get_all_fp_dtypes, get_all_complex_dtypes
517:    dtypes(*(get_all_int_dtypes() + [torch.bool] +
518:              get_all_fp_dtypes(include_bfloat16=False)))
596:    dtypes(*get_all_fp_dtypes(include_half=True, include_bfloat16=False))
611:        invalid_input_dtypes = get_all_int_dtypes() + \
612:            get_all_complex_dtypes() + \
619:        for dtype in get_all_fp_dtypes(include_half=True, include_bfloat16=False):
1048:    dtypes(*get_all_math_dtypes('cpu'))
1182:    dtypesIfCUDA(*get_all_fp_dtypes())
1190:    dtypesIfCUDA(*get_all_fp_dtypes())
1205:    dtypesIfCUDA(*get_all_fp_dtypes())
1215:    dtypesIfCUDA(*get_all_fp_dtypes())
1307:    dtypes(*(get_all_dtypes(include_bool=False)))
1349:    dtypes(*(get_all_fp_dtypes(include_half=False) +
1350:              get_all_complex_dtypes()))
1351:    dtypesIfCUDA(*(get_all_fp_dtypes(include_half=True) +
1352:                    get_all_complex_dtypes()))
```

</p>
</details>

<details>
<summary>

`test/test_view_ops.py`</summary>

<p>

```python
19:    get_all_dtypes, get_all_int_dtypes, get_all_fp_dtypes, get_all_complex_dtypes
124:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes()))
131:    dtypes(*get_all_dtypes(include_bfloat16=False))
213:            for view_dtype in [*get_all_fp_dtypes(), *get_all_complex_dtypes()]:
220:    dtypes(*get_all_dtypes())
224:        for view_dtype in get_all_dtypes():
305:    dtypes(*get_all_complex_dtypes(include_complex32=True))
343:    dtypes(*get_all_dtypes())
354:    dtypes(*get_all_dtypes())
364:    dtypes(*get_all_dtypes())
374:    dtypes(*get_all_dtypes())
384:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes()))
395:    dtypes(*get_all_complex_dtypes())
426:    dtypes(*get_all_complex_dtypes())
451:    dtypes(*product(get_all_complex_dtypes(), get_all_dtypes()))
1263:    dtypes(*(torch.testing.get_all_dtypes()))
1279:    dtypes(*(torch.testing.get_all_dtypes()))
1405:    dtypes(*(get_all_int_dtypes() + get_all_fp_dtypes(include_bfloat16=False) +
1406:              get_all_complex_dtypes()))
1471:    dtypes(*get_all_dtypes(include_bfloat16=False))
1574:    dtypes(*get_all_dtypes())
1601:    dtypes(*get_all_dtypes(include_bfloat16=False))
1632:    dtypes(*get_all_dtypes(include_bfloat16=False))
1711:        for dt in get_all_dtypes():
1717:        for dt in get_all_dtypes():
1724:        for dt in get_all_dtypes():
```

</p>
</details>

I'm looking forward to your feedback. Thanks :)

cc: mruberry kshitij12345 anjali411

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71561

Reviewed By: samdow

Differential Revision: D34856571

Pulled By: mruberry

fbshipit-source-id: 0dca038bcad5cf69906245c496d2e61ac3876335
(cherry picked from commit b058f67b4313143efa714ab105f36e74083131b9)
2022-03-15 20:31:41 +00:00
Emilio Castillo
3186e366d1 Support 0s in out_size of FractionalMaxPoolNd
Fixes #73624

The CUDA implementation was correct :); only the CPU path had an out-of-bounds memory access.
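
A minimal sketch of the newly supported case, assuming the fix makes a zero entry in `output_size` yield an empty output tensor (shapes are illustrative):

```
import torch
import torch.nn as nn

# output_size may now contain 0; the matching output dimension is simply empty
pool = nn.FractionalMaxPool2d(kernel_size=2, output_size=(0, 4))
out = pool(torch.randn(1, 3, 16, 16))
print(out.shape)  # torch.Size([1, 3, 0, 4])
```
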
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73634
Approved by: jbschlosser
2022-03-03 15:39:44 +00:00
soulitzer
e6afa4f771 batch_norm_jvp: improve error message when running_{mean,var} have forward grad defined (#73655)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73655

Fixes: https://github.com/pytorch/pytorch/issues/73541
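
A hypothetical repro sketch of the configuration that now gets the clearer error, assuming the unsupported case is a running stat carrying a forward-mode tangent:

```
import torch
import torch.autograd.forward_ad as fwAD
import torch.nn.functional as F

x = torch.randn(4, 3)
running_mean, running_var = torch.zeros(3), torch.ones(3)

with fwAD.dual_level():
    # attaching a tangent to running_mean is what triggers the improved message
    dual_mean = fwAD.make_dual(running_mean, torch.ones(3))
    dual_x = fwAD.make_dual(x, torch.randn_like(x))
    F.batch_norm(dual_x, dual_mean, running_var, training=True)  # raises
```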

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D34586758

Pulled By: soulitzer

fbshipit-source-id: 689dba3ac159e50b596381c27e23ef1fd8122a40
(cherry picked from commit 81ea860fbe3c217b0100730f4b74e8d5f9bf1b61)
2022-03-02 21:31:29 +00:00
François Lecomte
1dd3f950ba Optimize grid sample 3d
Fixes #71415
I have implemented the changes that replicate what @to-mi did in this [PR](https://github.com/pytorch/pytorch/pull/65986#issue-1012959443) for the 3D case:

> Fixes #64977
>
> Avoids creating a tensor for and calculating `input` gradient if it's not needed in the backward pass of `grid_sample` (2d case, native CPU & CUDA kernels). Especially the tensor creation seemed time consuming (see #64977).
>
> Brief description of the changes:
>
>     * I have tried to go with rather minimal changes. It would probably be possible to make a more elegant version with a bit larger refactoring (or possibly with better understanding of PyTorch internals and C++ functionalities).
>
>     * Changed the `native_functions.yaml` and `derivatives.yaml` so that the gradient input mask is passed to the functions.
>
>     * Changed the CPU kernels:
>       (1) added `bool input_requires_grad` template parameter to the `backward` function,
>       (2) added if branches based on it to remove `input` gradient computations if it's not requested,
>       (3) feed in `TensorAccessor<scalar_t, 3>* gInp_slice_ptr` instead of `TensorAccessor<scalar_t, 3>& gInp_slice` so that I can pass a `nullptr` in case gradient for `input` is not requested. (A bit inelegant perhaps, but allows to keep one signature for `backward` function and not require breaking it to smaller pieces. Perhaps there's a more elegant way to achieve this?)
>
>     * Changed CUDA kernel:
>       (1) added ~`bool input_requires_grad` template parameter~ `const bool input_requires_grad` argument to the `backward` function,
>       (2) added if branches based on it to remove `input` gradient computations if it's not requested,
>       (3) feed in `TensorInfo<scalar_t, index_t>()` instead of `getTensorInfo<scalar_t, index_t>(grad_input)` in case gradient for `input` is not requested.
>
>     * Modified tests in `test/test_nn.py` so that they run also cases with no `input` gradient needed.
>
>     * Have not touched the CPU fallback kernel.

Note: the changes numbered (3) above are N/A in this case.
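
A minimal sketch of the case this optimizes (shapes are illustrative): when only the grid requires grad, the backward pass can skip allocating and computing the `input` gradient entirely:

```
import torch
import torch.nn.functional as F

inp = torch.randn(1, 2, 4, 6, 8)                     # volumetric input, no grad needed
grid = torch.randn(1, 3, 5, 7, 3, requires_grad=True)
out = F.grid_sample(inp, grid, align_corners=False)  # shape (1, 2, 3, 5, 7)
out.sum().backward()                                 # only grid.grad is computed
print(inp.grad is None, grid.grad.shape)             # True torch.Size([1, 3, 5, 7, 3])
```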

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71759
2022-02-23 19:25:17 +00:00
Rui Zhu
f41db99a56 Add simple correctness check for native MHA (#72941)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72941

A simple correctness test for MHA, using cosine similarity as the metric since scaling introduces mismatches. CUDA is validated; a CPU fix follows (we can land this with the onlyCUDA flag and remove it once CPU is also done).
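
A hypothetical sketch of the metric (the helper name and threshold are illustrative, not the test's actual code):

```
import torch
import torch.nn.functional as F

def assert_close_cosine(actual, expected, min_sim=0.99):
    # compare flattened outputs by cosine similarity, which tolerates a
    # uniform scaling mismatch between the two implementations
    sim = F.cosine_similarity(actual.flatten(), expected.flatten(), dim=0)
    assert sim.item() >= min_sim, f"cosine similarity too low: {sim.item():.4f}"
```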

Test Plan:
For cuda:
buck build mode/opt -c fbcode.enable_gpu_sections=true caffe2/test:nn && buck-out/gen/caffe2/test/nn\#binary.par -r test_native_multihead_attention_cuda_float32 2>&1 | pastry

Reviewed By: swolchok

Differential Revision: D33906921

fbshipit-source-id: ad447401eb7002f22ed533d620a6b544524b3f58
(cherry picked from commit 45b778da27)
2022-02-19 00:31:45 +00:00
Scott Wolchok
79a216ce57 Move native MHA code out of PyTorch core (#72944)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72944

Doesn't make sense to develop it in core right now.
ghstack-source-id: 149456040

Test Plan:
CI

run MHA benchmark in benchmark_transformers.py to make sure it doesn't crash

Reviewed By: zrphercule

Differential Revision: D34283104

fbshipit-source-id: 4f0c7a6bc066f938ceac891320d4cf4c3f8a9cd6
(cherry picked from commit b9df65e97c)
2022-02-18 21:34:06 +00:00
zsef123
e0e1e0b114 Fix empty tensor handling in RReLU (#70496)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/70489

Added handling for the `numel == 0` case.
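
A minimal sketch of the guarded case, assuming the fix lets empty inputs pass through:

```
import torch
import torch.nn as nn

m = nn.RReLU(0.1, 0.3)
out = m(torch.empty(0, 3))  # empty input no longer crashes
print(out.shape)            # torch.Size([0, 3])
```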

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70496

Reviewed By: zou3519, cpuhrsch

Differential Revision: D34286069

Pulled By: jbschlosser

fbshipit-source-id: a63797fe1ea05e5a192bc8e43327949b36ceebf2
(cherry picked from commit b410abe85e)
2022-02-17 14:37:47 +00:00
Scott Wolchok
ae8198121c [PyTorch] Handle non-vectorizable parameters for native MHA CUDA rescale kernel (#72671)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72671

The existing kernel did not handle cases where D % 4 != 0 or dim_per_head % 4 != 0. Now we have a non-vectorized kernel for these cases.
ghstack-source-id: 149201477

Test Plan: Updated test_nn to cover these cases.

Reviewed By: zrphercule, ngimel

Differential Revision: D34119371

fbshipit-source-id: 4e9b4d9b636224ef2c433593f6f236df040de782
(cherry picked from commit f5393878e4)
2022-02-16 18:33:31 +00:00
Scott Wolchok
ad623fdecf [PyTorch] MHA: add test for transform_bias_rescale_qkv (#72464)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72464

We had some trouble getting this component (and this test!) right, so let's test it.
ghstack-source-id: 149201478

Test Plan: new test passes

Reviewed By: zrphercule

Differential Revision: D33992477

fbshipit-source-id: cc377eed5d4a4412b42bdabf360601c6e52947cf
(cherry picked from commit 9832867b12)
2022-02-16 18:11:56 +00:00
Eddie Yan
e7985e3c60 Properly initialize grad_weight in raw_cudnn_convolution_backward_weight_out (#72157)
Summary:
https://github.com/pytorch/pytorch/issues/71521 attempted to fix an issue where the `test_conv_large` test was producing `NaN` values after the backward pass, yielding a bogus comparison between the result and the expected result. While tweaking the initialization of the conv layer seemed to fix this behavior, it was actually just masking the real issue, which was that `grad_weight` is not guaranteed to be initialized in `raw_cudnn_convolution_backward_weight_out` when the backward operation is split.

Specifically, the `grad_weight` tensor is expected to be written directly by a `cudnn` kernel (which does occur in most cases), so it does not need to be initialized; but splitting introduces an intermediate `grad_weight_` tensor that holds the partial gradients and then accumulates into `grad_weight` without initializing it first. This PR changes that behavior so that accumulation is done into a zeroed tensor, and also performs the accumulation in a wider accumulation dtype. The hacky workaround masking the issue is reverted, while the safeguard against comparing `NaN` values (using the reference tensor for scale computation) is kept in place.
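
A minimal Python sketch of the fixed accumulation pattern (the function name and dtypes are illustrative; the real code is C++):

```
import torch

def accumulate_split_grads(partials, acc_dtype=torch.float32):
    # accumulate partial gradients into a zero-initialized buffer in a
    # wider accumulation dtype, then cast back to the gradient dtype
    acc = torch.zeros_like(partials[0], dtype=acc_dtype)
    for g in partials:
        acc += g.to(acc_dtype)
    return acc.to(partials[0].dtype)
```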

CC ngimel ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72157

Reviewed By: malfet

Differential Revision: D34147547

Pulled By: ngimel

fbshipit-source-id: 056c19f727eeef96347db557528272e24eae4223
(cherry picked from commit 24c7f77a81)
2022-02-14 17:26:37 +00:00
Ryan Spring
4f8b986e28 Implement Tanh Gelu Approximation (#61439)
Summary:
1. Implements https://github.com/pytorch/pytorch/issues/39853
2. Adds an `approximate` string flag to Gelu
3. Enables Tanh Gelu approximation
4. Adds double backward support for Gelu
5. Enable Tanh Gelu in NvFuser

```
import torch

def gelu(x, approximate: str = 'none'):
    if approximate == 'tanh':
        # sqrt(2/pi) = 0.7978845608028654
        return 0.5 * x * (1.0 + torch.tanh(0.7978845608028654 * (x + 0.044715 * torch.pow(x, 3.0))))
    else:
        return x * torch.special.ndtr(x)  # normcdf: the standard normal CDF
```

Linking XLA PR - https://github.com/pytorch/xla/pull/3039

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61439

Reviewed By: VitalyFedyunin

Differential Revision: D33894937

Pulled By: jbschlosser

fbshipit-source-id: b65e8fb6ea66168af8f34f45ed50e92737a33851
(cherry picked from commit 6e986f91a9)
2022-02-14 03:40:32 +00:00
kshitij12345
a2e545e6c5 pad_sequence: fix regression - support tensor (#72436)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/71365

Based on https://github.com/pytorch/pytorch/pull/72343

Thanks jbschlosser
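
A minimal sketch of the restored behavior, assuming the regression was that a tensor argument (iterated along dim 0) stopped being accepted:

```
import torch
from torch.nn.utils.rnn import pad_sequence

padded = pad_sequence([torch.ones(3), torch.ones(2)], batch_first=True)  # list input

# a tensor of equal-length sequences is accepted again
padded2 = pad_sequence(torch.randn(4, 5), batch_first=True)
print(padded.shape, padded2.shape)  # torch.Size([2, 3]) torch.Size([4, 5])
```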

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72436

Reviewed By: bdhirsh

Differential Revision: D34117724

Pulled By: jbschlosser

fbshipit-source-id: e5d6599d0791025e18ab36ae16c417a91554bf64
(cherry picked from commit ffe8a0e41b)
2022-02-10 22:36:33 +00:00
Alban Desmaison
7035738b50 Change ParameterList and ParameterDict to be able to contain any kind of objects (#70499)
Summary:
The only difference from a plain list/dict now is that nn.Parameters are
handled specially and registered as parameters properly.
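
A minimal sketch, assuming the new behavior lets arbitrary values coexist with registered parameters:

```
import torch
import torch.nn as nn

pd = nn.ParameterDict({
    "weight": nn.Parameter(torch.randn(3)),  # registered as a parameter
    "scale": 2.0,                            # plain object, merely stored
})
print([name for name, _ in pd.named_parameters()])  # ['weight']
```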

test_nn and parametrization tests pass locally.
Will see in CI if DP is fixed as well.

Tentative fix for https://github.com/pytorch/pytorch/issues/36035

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70499

Reviewed By: jbschlosser, alexeib

Differential Revision: D34005332

Pulled By: albanD

fbshipit-source-id: 7e76b0873d0fec345cb537e2a6ecba0258e662b9
(cherry picked from commit dc1e6f8d86)
2022-02-09 18:52:29 +00:00
Horace He
7cdbbfaee2 Revert D33716716: [pytorch][PR] Added remove_duplicate parameter to nn.Module
Test Plan: revert-hammer

Differential Revision:
D33716716 (7e8217549f)

Original commit changeset: ff1ed9980bd1

Original Phabricator Diff: D33716716 (7e8217549f)

fbshipit-source-id: 91c3d9acc5bc731da716dd0d2485431f85f861c9
(cherry picked from commit c81d193bf0)
2022-02-03 09:04:29 +00:00
kshitij12345
02f6226bff [fix] Dropout2d-3d no-batch-dim (#69885)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/69801

TODO:
* [x] Update C++ API
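
A minimal sketch of the no-batch-dim case this fixes, assuming a 3D input to `Dropout2d` is now treated as unbatched `(C, H, W)`:

```
import torch
import torch.nn as nn

m = nn.Dropout2d(p=0.5)
x = torch.randn(3, 8, 8)  # unbatched (C, H, W) input
out = m(x)                # same shape; whole channels are zeroed together
print(out.shape)          # torch.Size([3, 8, 8])
```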

cc albanD mruberry jbschlosser walterddr kshitij12345

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69885

Reviewed By: mruberry

Differential Revision: D33175470

Pulled By: jbschlosser

fbshipit-source-id: c9d7d9e0f59ba290a0157725c338a345f3d58b9f
(cherry picked from commit 7e4271a156)
2022-02-02 16:40:32 +00:00
kshitij12345
aa5dab02b2 [fix] EmbeddingBag segfault for out-of-bounds idx (#71904)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/71094

Added checks for out-of-bound indices
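
A minimal sketch of the guarded case, assuming the new checks raise a clear error on out-of-range indices:

```
import torch
import torch.nn as nn

bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=4)
idx = torch.tensor([0, 3, 42])  # 42 is out of bounds
offsets = torch.tensor([0])
bag(idx, offsets)               # now errors out instead of segfaulting
```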

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71904

Reviewed By: jbschlosser, VitalyFedyunin

Differential Revision: D33893387

Pulled By: ngimel

fbshipit-source-id: 0ba7038bd7e10c6fa6700646a0fe755b73db0ec9
(cherry picked from commit 4d6ae2e3f4)
2022-02-02 00:04:26 +00:00
pejato
b8a4ee5e35 Clean up old warnings in F.interpolate (#72093)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/71720

This PR removes the old warnings for `recompute_scale_factor` and `align_corners`.

Looking at this, I realize that the tests I modified don't really catch whether or not a warning is created for `recompute_scale_factor`. If desired, I can add a couple of lines to the tests there to pass a floating-point value in the `scale_factors` kwarg, along with `recompute_scale_factor=None`.

Let me know how this looks, thanks so much!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72093

Reviewed By: mruberry

Differential Revision: D33917615

Pulled By: albanD

fbshipit-source-id: e822f0a15b813ecf312cdc6ed0b693e7f1d1ca89
(cherry picked from commit c14852b85c)
2022-02-01 21:18:29 +00:00
Horace He
7e8217549f Added remove_duplicate parameter to nn.Module (#39)
Summary:
Pull Request resolved: https://github.com/pytorch/torchrec/pull/39

Pull Request resolved: https://github.com/facebookresearch/torchrec/pull/6

This makes it so that shared parameters get their own entry in `named_parameters`.
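
A minimal sketch of the effect on shared parameters (the module choice is illustrative):

```
import torch.nn as nn

lin = nn.Linear(2, 2)
tied = nn.Sequential(lin, lin)  # the same module appears twice
print(len(dict(tied.named_parameters())))                        # 2 (duplicates removed)
print(len(list(tied.named_parameters(remove_duplicate=False))))  # 4
```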

More broadly, this makes it so that
```
params_and_buffers = {**dict(mod.named_parameters(remove_duplicate=False)),
                      **dict(mod.named_buffers(remove_duplicate=False))}
_stateless.functional_call(mod, params_and_buffers, args, kwargs)
```
is identical to calling the original module's forward pass.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71542

Reviewed By: jbschlosser, albanD

Differential Revision: D33716716

Pulled By: Chillee

fbshipit-source-id: ff1ed9980bd1a3f7ebaf695ee5e401202b543213
(cherry picked from commit d6e3ad3cd0)
2022-02-01 18:34:58 +00:00
Nikita Shulga
74c44ba9d6 Revert D33850228: [pytorch][PR] Implement Tanh Gelu Approximation
Test Plan: revert-hammer

Differential Revision:
D33850228 (23d03025dc)

Original commit changeset: 3cc33fb298e4

Original Phabricator Diff: D33850228 (23d03025dc)

fbshipit-source-id: 9436e7df73c2b2e2011f321674f24973316d3692
(cherry picked from commit c9efb58223)
2022-01-31 17:44:19 +00:00
Ryan Spring
23d03025dc Implement Tanh Gelu Approximation (#61439)
Summary:
1. Implements https://github.com/pytorch/pytorch/issues/39853
2. Adds an `approximate` string flag to Gelu
3. Enables Tanh Gelu approximation
4. Adds double backward support for Gelu
5. Enable Tanh Gelu in NvFuser

```
import torch

def gelu(x, approximate: str = 'none'):
    if approximate == 'tanh':
        # sqrt(2/pi) = 0.7978845608028654
        return 0.5 * x * (1.0 + torch.tanh(0.7978845608028654 * (x + 0.044715 * torch.pow(x, 3.0))))
    else:
        return x * torch.special.ndtr(x)  # normcdf: the standard normal CDF
```

Linking XLA PR - https://github.com/pytorch/xla/pull/3039

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61439

Reviewed By: cpuhrsch

Differential Revision: D33850228

Pulled By: jbschlosser

fbshipit-source-id: 3cc33fb298e480d7ecc5c67716da019d60c6ab33
(cherry picked from commit 3a53b3e94f)
2022-01-31 17:07:45 +00:00
Jake Tae
ca61292465 Add append method for nn.Sequential (#71326)
Summary:
Partially addresses https://github.com/pytorch/pytorch/issues/71249, and potentially supersedes https://github.com/pytorch/pytorch/pull/20274.
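
A minimal usage sketch:

```
import torch.nn as nn

seq = nn.Sequential(nn.Linear(4, 8), nn.ReLU())
seq.append(nn.Linear(8, 2))  # append a module in place, list-style
print(len(seq))              # 3
```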

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71326

Reviewed By: cpuhrsch

Differential Revision: D33855047

Pulled By: jbschlosser

fbshipit-source-id: a3a682e206f93b4c52bc3405e2f7b26aea6635ea
(cherry picked from commit c0b27bbf2a)
2022-01-31 16:54:12 +00:00
Joel Schlosser
cb823d9f07 Revert D33744717: [pytorch][PR] Implement Tanh Gelu Approximation
Test Plan: revert-hammer

Differential Revision:
D33744717 (f499ab9cef)

Original commit changeset: d64532a562ed

Original Phabricator Diff: D33744717 (f499ab9cef)

fbshipit-source-id: 396c3f63de5865f894dbc353d0790a01a624be93
(cherry picked from commit e9fb2d1db1)
2022-01-28 18:35:01 +00:00
Ryan Spring
f499ab9cef Implement Tanh Gelu Approximation (#61439)
Summary:
1. Implements https://github.com/pytorch/pytorch/issues/39853
2. Adds an `approximate` string flag to Gelu
3. Enables Tanh Gelu approximation
4. Adds double backward support for Gelu
5. Enable Tanh Gelu in NvFuser

```
import torch

def gelu(x, approximate: str = 'none'):
    if approximate == 'tanh':
        # sqrt(2/pi) = 0.7978845608028654
        return 0.5 * x * (1.0 + torch.tanh(0.7978845608028654 * (x + 0.044715 * torch.pow(x, 3.0))))
    else:
        return x * torch.special.ndtr(x)  # normcdf: the standard normal CDF
```

Linking XLA PR - https://github.com/pytorch/xla/pull/3039

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61439

Reviewed By: mikaylagawarecki

Differential Revision: D33744717

Pulled By: jbschlosser

fbshipit-source-id: d64532a562ed53247bb4fa52bb16722634d5c187
(cherry picked from commit 4713dd9cca)
2022-01-28 16:59:09 +00:00
vfdev-5
eeda31fa08 Added antialias flag to interpolate (CUDA, bilinear and bicubic) (#70930)
Summary:
Description:
- Added antialias flag to interpolate (CUDA)
  - forward and backward for bicubic mode
  - added tests

Previous PR for CPU bilinear, https://github.com/pytorch/pytorch/pull/65142
Previous PR for CPU bicubic, https://github.com/pytorch/pytorch/pull/68819
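
A minimal usage sketch of the new flag (requires a CUDA device to exercise the path added here):

```
import torch
import torch.nn.functional as F

x = torch.rand(1, 3, 906, 438, device="cuda")
# antialiased downsampling tracks the PIL reference closely (see tables below)
y = F.interpolate(x, size=(320, 196), mode="bilinear",
                  align_corners=False, antialias=True)
```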

### Benchmarks

<details>
<summary>
Bilinear forward pass, PIL, PTH CPU and PTH CUDA
</summary>

Code: https://gist.github.com/vfdev-5/b173761a567f2283b3c649c3c0574112

```

Torch version: 1.11.0a0+gitd032369
Torch config: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201402
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - CPU capability usage: AVX2
  - CUDA Runtime 11.1
  - NVCC architecture flags: -gencode;arch=compute_61,code=sm_61
  - CuDNN 8.0.5
  - Build settings: BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_PYTORCH_QNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.0, USE_CUDA=1, USE_CUDNN=1, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=0, USE_OPENMP=ON, USE_ROCM=OFF,

Num threads: 8
[----------------------------------- Downsampling (bilinear): torch.Size([1, 3, 906, 438]) -> (320, 196) -----------------------------------]
                                                  |  Reference, PIL 8.4.0, mode: RGB  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |               2851.2              |            874.1          |            57.1
      channels_last non-contiguous torch.float32  |               2856.1              |           1155.8          |           130.6

Times are in microseconds (us).

[----------------------------------- Downsampling (bilinear): torch.Size([1, 3, 906, 438]) -> (460, 220) -----------------------------------]
                                                  |  Reference, PIL 8.4.0, mode: RGB  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |               3705.9              |           1005.8          |            66.3
      channels_last non-contiguous torch.float32  |               3742.9              |           1332.8          |           143.5

Times are in microseconds (us).

[------------------------------------ Downsampling (bilinear): torch.Size([1, 3, 906, 438]) -> (120, 96) -----------------------------------]
                                                  |  Reference, PIL 8.4.0, mode: RGB  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |               1768.0              |           725.2           |            77.9
      channels_last non-contiguous torch.float32  |               1753.7              |           942.5           |           144.0

Times are in microseconds (us).

[----------------------------------- Downsampling (bilinear): torch.Size([1, 3, 906, 438]) -> (1200, 196) ----------------------------------]
                                                  |  Reference, PIL 8.4.0, mode: RGB  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |               9522.6              |           2593.8          |           157.8
      channels_last non-contiguous torch.float32  |               9513.5              |           3622.7          |           241.5

Times are in microseconds (us).

[----------------------------------- Downsampling (bilinear): torch.Size([1, 3, 906, 438]) -> (120, 1200) ----------------------------------]
                                                  |  Reference, PIL 8.4.0, mode: RGB  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |               2240.1              |           565.5           |            93.3
      channels_last non-contiguous torch.float32  |               2244.2              |           972.7           |           170.8

Times are in microseconds (us).

[------------------------- Downsampling (bilinear): torch.Size([1, 1, 906, 438]) -> (320, 196) --------------------------]
                                 |  Reference, PIL 8.4.0, mode: F  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ---------------------------------------------------------------------------------------------------------------
       contiguous torch.float32  |              1441.3             |           386.1           |            22.3

Times are in microseconds (us).

[------------------------- Downsampling (bilinear): torch.Size([1, 1, 906, 438]) -> (460, 220) --------------------------]
                                 |  Reference, PIL 8.4.0, mode: F  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ---------------------------------------------------------------------------------------------------------------
       contiguous torch.float32  |              1815.2             |           376.8           |            27.8

Times are in microseconds (us).

[-------------------------- Downsampling (bilinear): torch.Size([1, 1, 906, 438]) -> (120, 96) --------------------------]
                                 |  Reference, PIL 8.4.0, mode: F  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ---------------------------------------------------------------------------------------------------------------
       contiguous torch.float32  |              962.3              |           400.0           |            29.4

Times are in microseconds (us).

[------------------------- Downsampling (bilinear): torch.Size([1, 1, 906, 438]) -> (1200, 196) -------------------------]
                                 |  Reference, PIL 8.4.0, mode: F  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ---------------------------------------------------------------------------------------------------------------
       contiguous torch.float32  |              4749.7             |           910.1           |            63.7

Times are in microseconds (us).

[------------------------- Downsampling (bilinear): torch.Size([1, 1, 906, 438]) -> (120, 1200) -------------------------]
                                 |  Reference, PIL 8.4.0, mode: F  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ---------------------------------------------------------------------------------------------------------------
       contiguous torch.float32  |              1098.1             |           272.0           |            36.4

Times are in microseconds (us).

```

</details>

<details>
<summary>
Bicubic forward pass, PIL, PTH CPU and PTH CUDA
</summary>

Code: https://gist.github.com/vfdev-5/b173761a567f2283b3c649c3c0574112

```

Torch version: 1.11.0a0+gitd032369
Torch config: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201402
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - CPU capability usage: AVX2
  - CUDA Runtime 11.1
  - NVCC architecture flags: -gencode;arch=compute_61,code=sm_61
  - CuDNN 8.0.5
  - Build settings: BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_PYTORCH_QNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.0, USE_CUDA=1, USE_CUDNN=1, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=0, USE_OPENMP=ON, USE_ROCM=OFF,

Num threads: 8
[------------------------------------ Downsampling (bicubic): torch.Size([1, 3, 906, 438]) -> (320, 196) -----------------------------------]
                                                  |  Reference, PIL 8.4.0, mode: RGB  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |               4522.4              |           1406.7          |           170.3
      channels_last non-contiguous torch.float32  |               4530.0              |           1435.4          |           242.2

Times are in microseconds (us).

[------------------------------------ Downsampling (bicubic): torch.Size([1, 3, 906, 438]) -> (460, 220) -----------------------------------]
                                                  |  Reference, PIL 8.4.0, mode: RGB  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |               5726.4              |           1628.6          |           164.0
      channels_last non-contiguous torch.float32  |               5722.6              |           1665.6          |           234.7

Times are in microseconds (us).

[------------------------------------ Downsampling (bicubic): torch.Size([1, 3, 906, 438]) -> (120, 96) ------------------------------------]
                                                  |  Reference, PIL 8.4.0, mode: RGB  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |               2909.1              |           1461.5          |           276.9
      channels_last non-contiguous torch.float32  |               2892.9              |           1458.7          |           345.1

Times are in microseconds (us).

[----------------------------------- Downsampling (bicubic): torch.Size([1, 3, 906, 438]) -> (1200, 196) -----------------------------------]
                                                  |  Reference, PIL 8.4.0, mode: RGB  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |              14699.2              |           4283.9          |           407.1
      channels_last non-contiguous torch.float32  |              14711.3              |           4321.1          |           477.0

Times are in microseconds (us).

[----------------------------------- Downsampling (bicubic): torch.Size([1, 3, 906, 438]) -> (120, 1200) -----------------------------------]
                                                  |  Reference, PIL 8.4.0, mode: RGB  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |               3467.0              |           980.0           |           339.2
      channels_last non-contiguous torch.float32  |               3465.2              |           982.3           |           407.8

Times are in microseconds (us).

[-------------------------- Downsampling (bicubic): torch.Size([1, 1, 906, 438]) -> (320, 196) --------------------------]
                                 |  Reference, PIL 8.4.0, mode: F  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ---------------------------------------------------------------------------------------------------------------
       contiguous torch.float32  |              2396.7             |           877.8           |            68.1

Times are in microseconds (us).

[-------------------------- Downsampling (bicubic): torch.Size([1, 1, 906, 438]) -> (460, 220) --------------------------]
                                 |  Reference, PIL 8.4.0, mode: F  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ---------------------------------------------------------------------------------------------------------------
       contiguous torch.float32  |              3068.2             |           777.3           |            64.7

Times are in microseconds (us).

[-------------------------- Downsampling (bicubic): torch.Size([1, 1, 906, 438]) -> (120, 96) ---------------------------]
                                 |  Reference, PIL 8.4.0, mode: F  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ---------------------------------------------------------------------------------------------------------------
       contiguous torch.float32  |              1540.2             |           829.3           |           100.4

Times are in microseconds (us).

[------------------------- Downsampling (bicubic): torch.Size([1, 1, 906, 438]) -> (1200, 196) --------------------------]
                                 |  Reference, PIL 8.4.0, mode: F  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ---------------------------------------------------------------------------------------------------------------
       contiguous torch.float32  |              7919.5             |           1467.8          |           151.6

Times are in microseconds (us).

[------------------------- Downsampling (bicubic): torch.Size([1, 1, 906, 438]) -> (120, 1200) --------------------------]
                                 |  Reference, PIL 8.4.0, mode: F  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ---------------------------------------------------------------------------------------------------------------
       contiguous torch.float32  |              1695.7             |           631.2           |           117.7

Times are in microseconds (us).

```

</details>

<details>
<summary>
Bilinear backward pass, PTH CPU and PTH CUDA
</summary>

Code: https://gist.github.com/vfdev-5/b173761a567f2283b3c649c3c0574112

```
- Measure only backward op

Torch version: 1.11.0a0+gitd032369
Torch config: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201402
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - CPU capability usage: AVX2
  - CUDA Runtime 11.1
  - NVCC architecture flags: -gencode;arch=compute_61,code=sm_61
  - CuDNN 8.0.5
  - Build settings: BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_PYTORCH_QNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.0, USE_CUDA=1, USE_CUDNN=1, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=0, USE_OPENMP=ON, USE_ROCM=OFF,

Num threads: 8
[------------- Downsampling backward (bilinear): torch.Size([1, 3, 906, 438]) -> (320, 196) ------------]
                                                  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |           4686.8          |           215.7
      channels_last non-contiguous torch.float32  |           5101.1          |           220.5

Times are in microseconds (us).

[------------- Downsampling backward (bilinear): torch.Size([1, 3, 906, 438]) -> (460, 220) ------------]
                                                  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |           6011.2          |           204.4
      channels_last non-contiguous torch.float32  |           6396.0          |           210.0

Times are in microseconds (us).

[------------- Downsampling backward (bilinear): torch.Size([1, 3, 906, 438]) -> (120, 96) -------------]
                                                  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |           2035.6          |           250.2
      channels_last non-contiguous torch.float32  |           1589.6          |           252.5

Times are in microseconds (us).

[------------ Downsampling backward (bilinear): torch.Size([1, 3, 906, 438]) -> (1200, 196) ------------]
                                                  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |          11392.5          |           256.5
      channels_last non-contiguous torch.float32  |          11640.2          |           263.9

Times are in microseconds (us).

[------------ Downsampling backward (bilinear): torch.Size([1, 3, 906, 438]) -> (120, 1200) ------------]
                                                  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |          11769.6          |           465.9
      channels_last non-contiguous torch.float32  |          12407.0          |           474.4

Times are in microseconds (us).

[---- Downsampling backward (bilinear): torch.Size([1, 1, 906, 438]) -> (320, 196) ----]
                                 |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: -----------------------------------------------------------------------------
       contiguous torch.float32  |           3931.0          |           133.3

Times are in microseconds (us).

[---- Downsampling backward (bilinear): torch.Size([1, 1, 906, 438]) -> (460, 220) ----]
                                 |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: -----------------------------------------------------------------------------
       contiguous torch.float32  |           5594.8          |           133.9

Times are in microseconds (us).

[---- Downsampling backward (bilinear): torch.Size([1, 1, 906, 438]) -> (120, 96) -----]
                                 |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: -----------------------------------------------------------------------------
       contiguous torch.float32  |           1272.6          |           133.0

Times are in microseconds (us).

[--- Downsampling backward (bilinear): torch.Size([1, 1, 906, 438]) -> (1200, 196) ----]
                                 |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: -----------------------------------------------------------------------------
       contiguous torch.float32  |          10618.1          |           134.0

Times are in microseconds (us).

[--- Downsampling backward (bilinear): torch.Size([1, 1, 906, 438]) -> (120, 1200) ----]
                                 |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: -----------------------------------------------------------------------------
       contiguous torch.float32  |          11082.2          |           154.6

Times are in microseconds (us).

```

</details>

<details>
<summary>
Bicubic backward pass, PTH CPU and PTH CUDA
</summary>

Code: https://gist.github.com/vfdev-5/b173761a567f2283b3c649c3c0574112

```
- Measure only backward op

Torch version: 1.11.0a0+gitd032369
Torch config: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201402
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - CPU capability usage: AVX2
  - CUDA Runtime 11.1
  - NVCC architecture flags: -gencode;arch=compute_61,code=sm_61
  - CuDNN 8.0.5
  - Build settings: BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_PYTORCH_QNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.0, USE_CUDA=1, USE_CUDNN=1, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=0, USE_OPENMP=ON, USE_ROCM=OFF,

Num threads: 8
[------------- Downsampling backward (bicubic): torch.Size([1, 3, 906, 438]) -> (320, 196) -------------]
                                                  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |           6791.2          |           618.9
      channels_last non-contiguous torch.float32  |           7125.2          |           622.9

Times are in microseconds (us).

[------------- Downsampling backward (bicubic): torch.Size([1, 3, 906, 438]) -> (460, 220) -------------]
                                                  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |           8806.2          |           600.3
      channels_last non-contiguous torch.float32  |           9167.6          |           607.5

Times are in microseconds (us).

[-------------- Downsampling backward (bicubic): torch.Size([1, 3, 906, 438]) -> (120, 96) -------------]
                                                  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |           3683.6          |           693.8
      channels_last non-contiguous torch.float32  |           3617.4          |           695.0

Times are in microseconds (us).

[------------- Downsampling backward (bicubic): torch.Size([1, 3, 906, 438]) -> (1200, 196) ------------]
                                                  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |          17548.2          |           779.4
      channels_last non-contiguous torch.float32  |          17966.2          |           786.5

Times are in microseconds (us).

[------------- Downsampling backward (bicubic): torch.Size([1, 3, 906, 438]) -> (120, 1200) ------------]
                                                  |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: ----------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |            28.4           |            1.6
      channels_last non-contiguous torch.float32  |            28.4           |            1.6

Times are in milliseconds (ms).

[---- Downsampling backward (bicubic): torch.Size([1, 1, 906, 438]) -> (320, 196) -----]
                                 |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: -----------------------------------------------------------------------------
       contiguous torch.float32  |           6266.1          |           208.5

Times are in microseconds (us).

[---- Downsampling backward (bicubic): torch.Size([1, 1, 906, 438]) -> (460, 220) -----]
                                 |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: -----------------------------------------------------------------------------
       contiguous torch.float32  |           8218.3          |           200.8

Times are in microseconds (us).

[----- Downsampling backward (bicubic): torch.Size([1, 1, 906, 438]) -> (120, 96) -----]
                                 |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: -----------------------------------------------------------------------------
       contiguous torch.float32  |           3458.9          |           231.9

Times are in microseconds (us).

[---- Downsampling backward (bicubic): torch.Size([1, 1, 906, 438]) -> (1200, 196) ----]
                                 |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: -----------------------------------------------------------------------------
       contiguous torch.float32  |          15729.3          |           261.6

Times are in microseconds (us).

[---- Downsampling backward (bicubic): torch.Size([1, 1, 906, 438]) -> (120, 1200) ----]
                                 |  1.11.0a0+gitd032369 cpu  |  1.11.0a0+gitd032369 cuda
8 threads: -----------------------------------------------------------------------------
       contiguous torch.float32  |          26279.8          |           547.0

Times are in microseconds (us).

```

</details>

Code is moved from torchvision: https://github.com/pytorch/vision/pull/4211 and optimized

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70930

Reviewed By: zou3519

Differential Revision: D33817902

Pulled By: jbschlosser

fbshipit-source-id: d63a620f8972ff36b63841f0bc6c820466f58f69
(cherry picked from commit d358cfdb7d)
2022-01-27 20:43:08 +00:00
Khushi Agrawal
dfcbe059ec Obliviate ALL_TENSORTYPES and ALL_TENSORTYPES2. (#71153)
Summary:
Hi,
The PR fixes https://github.com/pytorch/pytorch/issues/71096. It scans all the test files and replaces `ALL_TENSORTYPES` and `ALL_TENSORTYPES2` with `get_all_fp_dtypes`.

I'm looking forward to your viewpoints!

Thanks!

cc: janeyx99 kshitij12345

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71153

Reviewed By: jbschlosser, mruberry

Differential Revision: D33533346

Pulled By: anjali411

fbshipit-source-id: 75e79ca2756c1ddaf0e7e0289257fca183a570b3
(cherry picked from commit da54b54dc5)
2022-01-26 03:25:02 +00:00
eqy
166d4e4201 Change test_conv_large parameter initialization (#71521)
Summary:
This PR twiddles the parameters of the conv layer in `test_conv_large` to better avoid NaN values. Previously, this test would cause a NaN to be computed for `scale` (propagated from `.mean()` on the `.grad` tensor). This NaN would then be propagated to the scaled gradients via division, resulting in a bogus `assertEqual` check as `NaN == NaN` is by default true. (This behavior was observed on V100 and A100).
To improve visibility of failures in the event of NaNs in `grad1`, scale is now computed from `grad2`.

Interestingly enough, we discovered this issue when trying out some less common setups that broke this test; it turns out those breakages were cases where there were no NaN values (leading to an actual `assertEqual` check that would fail for `float16`).
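
To make the hazard concrete, here is a minimal sketch (tensor values are made up, not the test's actual data):

```python
import torch

# a NaN in the gradient poisons the mean-derived scale; dividing by a NaN
# scale makes both sides of the comparison all-NaN, so an equal_nan-style
# assertEqual passes vacuously
grad = torch.tensor([1.0, float("nan")])
scale = grad.abs().mean()              # nan
ref, actual = grad / scale, grad / scale
print(torch.isnan(ref).all())          # tensor(True) -- nothing real is compared
```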

CC ptrblck ngimel puririshi98

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71521

Reviewed By: anjali411

Differential Revision: D33776705

Pulled By: ngimel

fbshipit-source-id: a1ec4792cba04c6322b22ef5b80ce08579ea4cf6
(cherry picked from commit d207bd9b87)
2022-01-26 02:32:15 +00:00
Emilio Castillo
6848e0dae5 Fix RNN modules with input shapes containing 0 in CUDA (#71696)
Summary:
We found a discrepancy between cpu & CUDA when using RNN modules where input shapes containing 0s would cause an invalid configuration argument error in CUDA (kernel grid size is 0), while returning a valid tensor in CPU cases.

A reproducer:

```
import torch

x = torch.zeros((5, 0, 3)).cuda()
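# seq_len=5, batch=0, input_size=3 -- the empty batch dimension triggers the error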
gru = torch.nn.GRU(input_size=3, hidden_size=4).to("cuda")
gru(x)
```

Run with `CUDA_LAUNCH_BLOCKING=1` set.

cc ngimel albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71696

Reviewed By: mikaylagawarecki

Differential Revision: D33743674

Pulled By: ngimel

fbshipit-source-id: e9334175d10969fdf1f9c63985910d944bbd26e7
(cherry picked from commit 70838ba69b)
2022-01-25 18:32:13 +00:00
kshitij12345
0a2cdd18f3 nice error msg from load_state_dict for non-tensor value (#70596)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/67549
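
A minimal sketch of the failure mode this improves (the exact error text is not reproduced here):

```python
import torch.nn as nn

m = nn.Linear(2, 2)
state = m.state_dict()
state["weight"] = 1.0   # a plain float where a tensor is expected
# with this PR the raised error names the offending key and the received type
m.load_state_dict(state)
```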

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70596

Reviewed By: anjali411

Differential Revision: D33710750

Pulled By: jbschlosser

fbshipit-source-id: 870b5fafffcd005fd4fcd62f865542739c133805
(cherry picked from commit da374fbc58)
2022-01-21 22:02:13 +00:00
mingfeima
84b1c9798c add BFloat16 support for AvgPool2d on CPU (#66927)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66927
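
As a usage note, a minimal sketch of what this enables (shapes are arbitrary):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 8, 8).to(torch.bfloat16)  # CPU tensor
y = nn.AvgPool2d(kernel_size=2)(x)              # now runs natively in bfloat16
print(y.dtype)                                  # torch.bfloat16
```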

Test Plan: Imported from OSS

Reviewed By: mikaylagawarecki

Differential Revision: D33353198

Pulled By: VitalyFedyunin

fbshipit-source-id: 1aeaa4bb90ac99210b8f6051c09d6995d06ce3a1
2022-01-14 07:59:10 -08:00
mingfeima
910c01020e add BFloat16 support for AdaptiveMaxPool2d on CPU (#66929)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66929

Test Plan: Imported from OSS

Reviewed By: mikaylagawarecki

Differential Revision: D33353199

Pulled By: VitalyFedyunin

fbshipit-source-id: d402d5deb7ca766259ca42118ddc16625e134c4c
2022-01-13 20:00:42 -08:00
Jake Tae
eac3decf93 ModuleList concatenation (#70887)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/70441.
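
A minimal sketch of the concatenation this adds (module choices are arbitrary):

```python
import torch.nn as nn

a = nn.ModuleList([nn.Linear(2, 2)])
b = nn.ModuleList([nn.ReLU()])
combined = a + b        # ModuleList + ModuleList -> new ModuleList
print(len(combined))    # 2
```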

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70887

Reviewed By: ejguan

Differential Revision: D33555431

Pulled By: albanD

fbshipit-source-id: ce42459ee46a611e98e89f02686acbac16b6b668
2022-01-13 15:31:07 -08:00
mingfeima
385773cb77 add BFloat16 support for MaxPool2d on CPU (#56903)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56903

Test Plan: Imported from OSS

Reviewed By: mikaylagawarecki

Differential Revision: D28836791

Pulled By: VitalyFedyunin

fbshipit-source-id: e03d55cc30dfa3628f096938fbad34b1031948af
2022-01-12 14:20:20 -08:00
Ilya Persky
a8612cd72a Skip failing tests in test_nn if compiled without LAPACK (#70913)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/70912

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70913

Reviewed By: mruberry

Differential Revision: D33534840

Pulled By: albanD

fbshipit-source-id: 0facf5682140ecd7a78edb34b9cd997f9319e084
2022-01-11 12:21:18 -08:00
George Qi
d7db5fb462 ctc loss no batch dim support (#70092)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70092

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D33280068

Pulled By: george-qi

fbshipit-source-id: 3278fb2d745a396fe27c00fb5f40df0e7f584f81
2022-01-07 14:33:22 -08:00
Bin Bao
f135438d3b Dispatch to at::convolution instead of at::_convolution in _convolution_double_backward (#70661)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70661

Dispatching to at::convolution can make Lazy Tensor trace the right convolution op.

Test Plan: pytest test/test_nn.py -k test_conv_double_backward_strided_with_3D_input_and_weight

Reviewed By: wconstab, jbschlosser

Differential Revision: D33428780

Pulled By: desertfire

fbshipit-source-id: 899e4135588ea33fff23d16103c25d9bcd3f902c
2022-01-07 07:53:46 -08:00
Joel Schlosser
e6befbe85c Add flag to optionally average output attention weights across heads (#70055)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/47583
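
For illustration, a sketch of the flag in use (the keyword name follows the merged API; shapes are arbitrary):

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)
x = torch.randn(1, 5, 8)
# keep per-head weights instead of the head-averaged default
_, weights = mha(x, x, x, average_attn_weights=False)
print(weights.shape)    # torch.Size([1, 2, 5, 5]) rather than ([1, 5, 5])
```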

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70055

Reviewed By: bhosmer

Differential Revision: D33457866

Pulled By: jbschlosser

fbshipit-source-id: 17746b3668b0148c1e1ed8333227b7c42f1e3bf5
2022-01-06 17:32:37 -08:00
Joel Schlosser
7b8f73dd32 No-batch-dim support for ConvNd (#70506)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70506

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D33355034

Pulled By: jbschlosser

fbshipit-source-id: 5a42645299b1d82cee7d461826acca1c5b35a71c
2022-01-06 16:53:50 -08:00
Jake Tae
b7742b437a Allow RNN hidden_size to be 0 (#70556)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/56767.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70556

Reviewed By: ngimel

Differential Revision: D33455156

Pulled By: jbschlosser

fbshipit-source-id: 5dc57b09d7beb6ae81dfabc318e87c109bb4e6ae
2022-01-06 14:18:36 -08:00
Jane Xu
c00d33033c Remove repeat test for types in test nn (#70872)
Summary:
Helps fix a part of https://github.com/pytorch/pytorch/issues/69865

The first commit just migrates everything as is.

The second commit uses the "device" variable instead of passing "cuda" everywhere

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70872

Reviewed By: jbschlosser

Differential Revision: D33455941

Pulled By: janeyx99

fbshipit-source-id: 9d9ec8c95f1714c40d55800e652ccd69b0c314dc
2022-01-06 09:20:02 -08:00
soulitzer
3051aabd0e Add forward AD formulas for convolution and some others (#69956)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69956

Test Plan: Imported from OSS

Reviewed By: albanD, bdhirsh

Differential Revision: D33235974

Pulled By: soulitzer

fbshipit-source-id: ea60d687edc5d62d92f3fd3cb6640421d32c908c
2022-01-06 08:39:51 -08:00
Joel Schlosser
b60b1b100f Set cuDNN deterministic flag for test_conv_double_backward_cuda (#69941)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/69833

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69941

Reviewed By: george-qi

Differential Revision: D33430727

Pulled By: jbschlosser

fbshipit-source-id: 4a250bd0e5460ee631730afe0ab68ba72f37d292
2022-01-05 10:05:56 -08:00
kshitij12345
7bfaa230be [nn] adaptive_avg_pool{1/2/3}d : Error on negative output_size (#70488)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/70232

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70488

Reviewed By: H-Huang

Differential Revision: D33367289

Pulled By: jbschlosser

fbshipit-source-id: 6b7b89d72c4e1e049ad6a0addb22a261c28ddb4c
2021-12-30 14:42:11 -08:00
mingfeima
401a6b682b add BFloat16 support for AdaptiveAvgPool2d on CPU (#56902)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56902

Test Plan: Imported from OSS

Reviewed By: mikaylagawarecki

Differential Revision: D28836789

Pulled By: VitalyFedyunin

fbshipit-source-id: caac5e5b15190b8010bbfbc6920aa44032208ee7
2021-12-30 11:58:37 -08:00
vfdev
d2abf3f981 Added antialias flag to interpolate (CPU only, bicubic) (#68819)
Summary:
Description:
- Added antialias flag to interpolate (CPU only)
  - forward and backward for bicubic mode
  - added tests

Previous PR for bilinear, https://github.com/pytorch/pytorch/pull/65142
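
A quick usage sketch matching the benchmark setup below:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 906, 438)
# antialiased bicubic downsampling on CPU
y = F.interpolate(x, size=(320, 196), mode="bicubic", antialias=True)
```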

### Benchmarks

<details>
<summary>
Forward pass, CPU. PTH interpolation vs PIL
</summary>

Cases:
- PTH RGB 3 Channels, float32 vs PIL RGB uint8 (apples vs pears)
- PTH 1 Channel, float32 vs PIL 1 Channel Float

Code: https://gist.github.com/vfdev-5/b173761a567f2283b3c649c3c0574112

```
Torch config: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201402
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - CPU capability usage: AVX2
  - CUDA Runtime 11.1
  - NVCC architecture flags: -gencode;arch=compute_61,code=sm_61
  - CuDNN 8.0.5
  - Build settings: BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_PYTORCH_QNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.0, USE_CUDA=1, USE_CUDNN=1, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=0, USE_OPENMP=ON, USE_ROCM=OFF,

Num threads: 1
[------------------- Downsampling (bicubic): torch.Size([1, 3, 906, 438]) -> (320, 196) -------------------]
                                                  |  Reference, PIL 8.4.0, mode: RGB  |  1.11.0a0+gitb0bdf58
1 threads: -------------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |                4.5                |          5.2
      channels_last non-contiguous torch.float32  |                4.5                |          5.3

Times are in milliseconds (ms).

[------------------- Downsampling (bicubic): torch.Size([1, 3, 906, 438]) -> (460, 220) -------------------]
                                                  |  Reference, PIL 8.4.0, mode: RGB  |  1.11.0a0+gitb0bdf58
1 threads: -------------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |                5.7                |          6.4
      channels_last non-contiguous torch.float32  |                5.7                |          6.4

Times are in milliseconds (ms).

[------------------- Downsampling (bicubic): torch.Size([1, 3, 906, 438]) -> (120, 96) --------------------]
                                                  |  Reference, PIL 8.4.0, mode: RGB  |  1.11.0a0+gitb0bdf58
1 threads: -------------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |                3.0                |          4.0
      channels_last non-contiguous torch.float32  |                2.9                |          4.1

Times are in milliseconds (ms).

[------------------ Downsampling (bicubic): torch.Size([1, 3, 906, 438]) -> (1200, 196) -------------------]
                                                  |  Reference, PIL 8.4.0, mode: RGB  |  1.11.0a0+gitb0bdf58
1 threads: -------------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |                14.7               |          17.1
      channels_last non-contiguous torch.float32  |                14.8               |          17.2

Times are in milliseconds (ms).

[------------------ Downsampling (bicubic): torch.Size([1, 3, 906, 438]) -> (120, 1200) -------------------]
                                                  |  Reference, PIL 8.4.0, mode: RGB  |  1.11.0a0+gitb0bdf58
1 threads: -------------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |                3.5                |          3.9
      channels_last non-contiguous torch.float32  |                3.5                |          3.9

Times are in milliseconds (ms).

[---------- Downsampling (bicubic): torch.Size([1, 1, 906, 438]) -> (320, 196) ---------]
                                 |  Reference, PIL 8.4.0, mode: F  |  1.11.0a0+gitb0bdf58
1 threads: ------------------------------------------------------------------------------
       contiguous torch.float32  |               2.4               |          1.8

Times are in milliseconds (ms).

[---------- Downsampling (bicubic): torch.Size([1, 1, 906, 438]) -> (460, 220) ---------]
                                 |  Reference, PIL 8.4.0, mode: F  |  1.11.0a0+gitb0bdf58
1 threads: ------------------------------------------------------------------------------
       contiguous torch.float32  |               3.1               |          2.2

Times are in milliseconds (ms).

[---------- Downsampling (bicubic): torch.Size([1, 1, 906, 438]) -> (120, 96) ----------]
                                 |  Reference, PIL 8.4.0, mode: F  |  1.11.0a0+gitb0bdf58
1 threads: ------------------------------------------------------------------------------
       contiguous torch.float32  |               1.6               |          1.4

Times are in milliseconds (ms).

[--------- Downsampling (bicubic): torch.Size([1, 1, 906, 438]) -> (1200, 196) ---------]
                                 |  Reference, PIL 8.4.0, mode: F  |  1.11.0a0+gitb0bdf58
1 threads: ------------------------------------------------------------------------------
       contiguous torch.float32  |               7.9               |          5.7

Times are in milliseconds (ms).

[--------- Downsampling (bicubic): torch.Size([1, 1, 906, 438]) -> (120, 1200) ---------]
                                 |  Reference, PIL 8.4.0, mode: F  |  1.11.0a0+gitb0bdf58
1 threads: ------------------------------------------------------------------------------
       contiguous torch.float32  |               1.7               |          1.3

Times are in milliseconds (ms).

```

</details>

Code is moved from torchvision: https://github.com/pytorch/vision/pull/3810 and https://github.com/pytorch/vision/pull/4208

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68819

Reviewed By: mikaylagawarecki

Differential Revision: D33339117

Pulled By: jbschlosser

fbshipit-source-id: 6a0443bbba5439f52c7dbc1be819b75634cf67c4
2021-12-29 14:04:43 -08:00
George Qi
8af39b7668 AdaptiveLogSoftmaxWithLoss no_batch_dim support (#69054)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69054

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D33200166

Pulled By: george-qi

fbshipit-source-id: 9d953744351a25f372418d2a64e8402356d1e9b7
2021-12-29 10:25:26 -08:00
soulitzer
3116d87024 Add forward AD formulas for {adaptive_,fractional_,}max_pool{2,3}d_{backward,} (#69884)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69884

Also fixes: https://github.com/pytorch/pytorch/issues/69322, https://github.com/pytorch/pytorch/issues/69325

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D33093039

Pulled By: soulitzer

fbshipit-source-id: b9a522a00f4e9e85974888de5058de07280f8f66
2021-12-23 15:51:09 -08:00
soulitzer
5651e1e3ad Add auto_linear formulas and some others (#69727)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69727

Still need to test the backward ones. We would need to update gradgradcheck to check forward over backward.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D33031728

Pulled By: soulitzer

fbshipit-source-id: 86c59df5d2196b5c8dbbb1efed9321e02ab46d30
2021-12-20 12:15:25 -08:00
Albert Liang
0d06616c47 Add dict methods to ParameterDict (#69403)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/68476

We implemented all of the following `dict` methods for `ParameterDict`
- `get `
- `setdefault`
- `popitem`
- `fromkeys`
- `copy`
- `__or__`
- `__ior__`
- `__reversed__`
- `__ror__`

The behavior of these new methods matches the expected behavior of python `dict` as defined by the language itself: https://docs.python.org/3/library/stdtypes.html#typesmapping

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69403

Reviewed By: albanD

Differential Revision: D33187111

Pulled By: jbschlosser

fbshipit-source-id: ecaa493837dbc9d8566ddbb113b898997e2debcb
2021-12-17 10:15:47 -08:00
Rui Zhu
46ace4ac33 Add support for masked_softmax when softmax_elements > 1024 & corresponding unit tests (#69924)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69924

Test Plan: buck build mode/opt -c fbcode.enable_gpu_sections=true caffe2/test:nn && buck-out/gen/caffe2/test/nn\#binary.par -r test_masked_softmax

Reviewed By: ngimel

Differential Revision: D32819181

fbshipit-source-id: 6838a11d3554ec8e1bd48f1c2c7b1ee3a4680995
2021-12-15 16:44:15 -08:00
kshitij12345
e8d5c7cf7f [nn] mha : no-batch-dim support (python) (#67176)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/60585

* [x] Update docs
* [x] Tests for shape checking

Tests take roughly 20s on the system that I use. Below are the timings for the 20 slowest tests.

```
pytest test/test_modules.py -k _multih --durations=20
============================================================================================== test session starts ===============================================================================================
platform linux -- Python 3.10.0, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
rootdir: /home/kshiteej/Pytorch/pytorch_no_batch_mha, configfile: pytest.ini
plugins: hypothesis-6.23.2, repeat-0.9.1
collected 372 items / 336 deselected / 36 selected

test/test_modules.py ..............ssssssss..............                                                                                                                                                  [100%]

================================================================================================ warnings summary ================================================================================================
../../.conda/envs/pytorch-cuda-dev/lib/python3.10/site-packages/torch/backends/cudnn/__init__.py:73
test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_MultiheadAttention_cuda_float32
  /home/kshiteej/.conda/envs/pytorch-cuda-dev/lib/python3.10/site-packages/torch/backends/cudnn/__init__.py:73: UserWarning: PyTorch was compiled without cuDNN/MIOpen support. To use cuDNN/MIOpen, rebuild PyTorch making sure the library is visible to the build system.
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/warnings.html
============================================================================================== slowest 20 durations ==============================================================================================
8.66s call     test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_MultiheadAttention_cuda_float64
2.02s call     test/test_modules.py::TestModuleCPU::test_gradgrad_nn_MultiheadAttention_cpu_float64
1.89s call     test/test_modules.py::TestModuleCUDA::test_grad_nn_MultiheadAttention_cuda_float64
1.01s call     test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_MultiheadAttention_cuda_float32
0.51s call     test/test_modules.py::TestModuleCPU::test_grad_nn_MultiheadAttention_cpu_float64
0.46s call     test/test_modules.py::TestModuleCUDA::test_forward_nn_MultiheadAttention_cuda_float32
0.45s call     test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_MultiheadAttention_cuda_float64
0.44s call     test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_MultiheadAttention_cuda_float32
0.21s call     test/test_modules.py::TestModuleCUDA::test_pickle_nn_MultiheadAttention_cuda_float64
0.21s call     test/test_modules.py::TestModuleCUDA::test_pickle_nn_MultiheadAttention_cuda_float32
0.18s call     test/test_modules.py::TestModuleCUDA::test_forward_nn_MultiheadAttention_cuda_float64
0.17s call     test/test_modules.py::TestModuleCPU::test_non_contiguous_tensors_nn_MultiheadAttention_cpu_float32
0.16s call     test/test_modules.py::TestModuleCPU::test_non_contiguous_tensors_nn_MultiheadAttention_cpu_float64
0.11s call     test/test_modules.py::TestModuleCUDA::test_factory_kwargs_nn_MultiheadAttention_cuda_float64
0.08s call     test/test_modules.py::TestModuleCPU::test_pickle_nn_MultiheadAttention_cpu_float32
0.08s call     test/test_modules.py::TestModuleCPU::test_pickle_nn_MultiheadAttention_cpu_float64
0.06s call     test/test_modules.py::TestModuleCUDA::test_repr_nn_MultiheadAttention_cuda_float64
0.06s call     test/test_modules.py::TestModuleCUDA::test_repr_nn_MultiheadAttention_cuda_float32
0.06s call     test/test_modules.py::TestModuleCPU::test_forward_nn_MultiheadAttention_cpu_float32
0.06s call     test/test_modules.py::TestModuleCPU::test_forward_nn_MultiheadAttention_cpu_float64
============================================================================================ short test summary info =============================================================================================
=========================================================================== 28 passed, 8 skipped, 336 deselected, 2 warnings in 19.71s ===========================================================================
```

cc albanD mruberry jbschlosser walterddr

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67176

Reviewed By: dagitses

Differential Revision: D33094285

Pulled By: jbschlosser

fbshipit-source-id: 0dd08261b8a457bf8bad5c7f3f6ded14b0beaf0d
2021-12-14 13:21:21 -08:00
Rui Zhu
1a299d8f1b Add support for transformer layout of masked_softmax (#69272)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69272

In the transformer encoder and MHA, masked_softmax's mask is a 2D tensor (B, D), while the input is a 4D tensor (B, H, D, D).
The mask could simply be broadcast to (B, H, D, D) like the input and then fed to a regular masked_softmax, but that would produce a non-contiguous mask and consume more memory.
In this diff, we keep the mask's shape unchanged and compute the corresponding mask element for each input element in each CUDA thread.

This new layout is not yet supported on CPU.
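
A rough sketch of the per-thread index arithmetic, as Python pseudocode (variable names are illustrative):

```python
# for an input element at flat offset `idx` of a contiguous (B, H, D, D)
# tensor, the matching element of the 2D (B, D) mask is recovered without
# ever materializing the broadcast
def mask_index(idx, B, H, D):
    b = idx // (H * D * D)   # which batch this element belongs to
    col = idx % D            # which key position (last dim) it corresponds to
    return b, col            # read mask[b, col]
```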

Test Plan: buck build mode/opt -c fbcode.enable_gpu_sections=true caffe2/test:nn && buck-out/gen/caffe2/test/nn\#binary.par -r test_masked_softmax

Reviewed By: ngimel

Differential Revision: D32605557

fbshipit-source-id: ef37f86981fdb2fb264d776f0e581841de5d68d2
2021-12-14 10:51:58 -08:00
Joel Schlosser
fc37e5b3ed Hook up general convolution to convolution_backward (#69584)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69584

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D32936380

Pulled By: jbschlosser

fbshipit-source-id: c6fdd88db33bd1a9d0eabea47ae09a4d5b170e92
2021-12-12 17:30:01 -08:00
Joel Schlosser
f0e98dcbd3 General convolution_backward function (#69044)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69044

Test Plan: Imported from OSS

Reviewed By: zou3519, albanD, H-Huang

Differential Revision: D32708818

Pulled By: jbschlosser

fbshipit-source-id: e563baa3197811d8d51553fc83718ace2f8d1b7a
2021-12-12 15:53:38 -08:00
Rui Zhu
aab67c6dff Add native masked_softmax (#69268)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69268

This diff enables native masked softmax on CUDA and expands the current warp_softmax to accept masking.
The mask in this masked softmax has to have the same shape as the input and must be contiguous.

In a follow-up diff that I will submit later, I will include the encoder mask layout, where the input is (B, H, D, D) and the mask is (B, D).
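
Semantically, the fused kernel computes the equivalent of the following reference sketch (it assumes `True` mask entries mean "exclude from the softmax"):

```python
import torch

def masked_softmax_reference(x, mask, dim=-1):
    # mask has the same shape as x and must be contiguous for the native path;
    # masked positions are pushed to -inf before the softmax
    return torch.softmax(x.masked_fill(mask, float("-inf")), dim=dim)
```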

Test Plan: buck build mode/opt -c fbcode.enable_gpu_sections=true caffe2/test:nn && buck-out/gen/caffe2/test/nn\#binary.par -r test_masked_softmax

Reviewed By: ngimel

Differential Revision: D32338419

fbshipit-source-id: 48c3fde793ad4535725d9dae712db42e2bdb8a49
2021-12-09 23:29:45 -08:00
kshitij12345
7407e3d6fd [fix] cross_entropy : fix weight with ignore_index and label_smoothing (#69511)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/69339
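
The combination being fixed can be exercised like this (shapes and values are arbitrary):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)
target = torch.tensor([1, 3, -100, 7])   # -100 entries are ignored
loss = F.cross_entropy(logits, target,
                       weight=torch.rand(10),
                       ignore_index=-100,
                       label_smoothing=0.1)
```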

cc albanD mruberry jbschlosser walterddr

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69511

Reviewed By: mrshenli

Differential Revision: D32951935

Pulled By: jbschlosser

fbshipit-source-id: 482eae851861a32f96bd6231dd3448fb6d44a015
2021-12-08 12:08:33 -08:00
jjsjann123
3c1e2ff9eb fixing layer_norm cuda bug (#69210)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/69208

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69210

Reviewed By: H-Huang

Differential Revision: D32764811

Pulled By: ngimel

fbshipit-source-id: fb4201fe5f2284fcb22e36bc1029eef4a21b09bf
2021-12-01 15:46:47 -08:00
Kurt Mohler
d507fd63f3 Check that block height and width are positive in nn.Fold (#69048)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/68875

cc albanD mruberry jbschlosser walterddr

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69048

Reviewed By: samdow

Differential Revision: D32729307

Pulled By: jbschlosser

fbshipit-source-id: 162cafb005873012d900d86997d07640967038c0
2021-12-01 10:08:47 -08:00
Omkar Salpekar
8e343ba5db Revert D32611368: [pytorch][PR] Initial version of general convolution_backward
Test Plan: revert-hammer

Differential Revision:
D32611368 (445b31abff)

Original commit changeset: 26d759b7c908

fbshipit-source-id: e91f45f0f31150e60d657a3964b7e42027beff58
2021-11-23 13:39:36 -08:00
Joel Schlosser
445b31abff Initial version of general convolution_backward (#65219)
Summary:
Towards [convolution consolidation](https://fb.quip.com/tpDsAYtO15PO).

Introduces the general `convolution_backward` function that uses the factored-out backend routing logic from the forward function.

Some notes:
* `finput` is now recomputed in the backward pass for the slow 2d / 3d kernels instead of being saved from the forward pass. The logic for this is based on the forward computation and is present in the `compute_finput2d` / `compute_finput3d` functions in `ConvUtils.h`.
* Using structured kernels for `convolution_backward` requires extra copying since the backend-specific backward functions return tensors. Porting to structured is left as future work.
* The tests that check the routing logic have been renamed from `test_conv_backend_selection` -> `test_conv_backend` and now also include gradcheck validation using an `autograd.Function` hooking up `convolution` to `convolution_backward`. This was done to ensure that gradcheck passes for the same set of inputs / backends.

The forward pass routing is done as shown in this flowchart (probably need to download it for it to be readable since it's ridiculous):
![conv_routing_graph md](https://user-images.githubusercontent.com/75754324/137186002-5bca75ca-f911-4e61-8245-ec07af841506.png)

![conv_nogroup_routing_graph md](https://user-images.githubusercontent.com/75754324/139731619-9d0d436e-cce3-4bc3-8eaf-d469f667f0d7.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65219

Reviewed By: mruberry

Differential Revision: D32611368

Pulled By: jbschlosser

fbshipit-source-id: 26d759b7c908ab8f19ecce627acea7bd3d5f59ba
2021-11-23 08:19:45 -08:00
soulitzer
7bb401a4c9 Add forward AD support for miscellanous operators (#67820)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67820

Original PR here: https://github.com/pytorch/pytorch/pull/67040

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D32314423

Pulled By: soulitzer

fbshipit-source-id: ecd898dc903692cab084f6922a1d86986f957b1b
2021-11-19 14:31:06 -08:00
jiej
ca92111758 Add native_dropout (#63937)
Summary:
Adds native_dropout to have a reasonable target for torchscript in autodiff. native_dropout has scale and train as arguments in its signature; this makes native_dropout more consistent with other operators and removes conditionals in the autodiff definition.
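
For reference, the scale in question is the usual inverted-dropout factor; a semantic sketch, not the native kernel:

```python
import torch

def dropout_reference(x, p, train):
    # the output is rescaled by 1 / (1 - p) so its expectation matches x
    if not train or p == 0.0:
        return x, torch.ones_like(x, dtype=torch.bool)
    mask = torch.rand_like(x) >= p
    return x * mask / (1.0 - p), mask
```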

cc gmagogsfm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63937

Reviewed By: mruberry

Differential Revision: D32477657

Pulled By: ngimel

fbshipit-source-id: d37b137a37acafa50990f60c77f5cea2818454e4
2021-11-18 19:41:10 -08:00
kshitij12345
d5d2096dab [testing] make @dtypes mandatory when using @dtypesIf (#68186)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53647

With this change, if a test forgets to add `dtypes` while using `dtypesIf`, the following error is raised:
```
AssertionError: dtypes is mandatory when using dtypesIf however 'test_exponential_no_zero' didn't specify it
```

**Tested Locally**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68186

Reviewed By: VitalyFedyunin

Differential Revision: D32468581

Pulled By: mruberry

fbshipit-source-id: 805e0855f988b77a5d8d4cd52b31426c04c2200b
2021-11-18 08:29:31 -08:00
vfdev-5
3da2e09c9b Added antialias flag to interpolate (CPU only, bilinear) (#65142)
Summary:
Description:
- Added antialias flag to interpolate (CPU only)
  - forward and backward for bilinear mode
  - added tests

### Benchmarks

<details>
<summary>
Forward pass, CPU. PTH interpolation vs PIL
</summary>

Cases:
- PTH RGB 3 Channels, float32 vs PIL RGB uint8 (apples vs pears)
- PTH 1 Channel, float32 vs PIL 1 Channel Float

Code: https://gist.github.com/vfdev-5/b173761a567f2283b3c649c3c0574112

```
# OMP_NUM_THREADS=1 python bench_interp_aa_vs_pillow.py

Torch config: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201402
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - CPU capability usage: AVX2
  - CUDA Runtime 11.1
  - NVCC architecture flags: -gencode;arch=compute_75,code=sm_75
  - CuDNN 8.0.5
  - Build settings: BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_PYTORCH_QNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.0, USE_CUDA=1, USE_CUDNN=1, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=0, USE_OPENMP=ON,

Num threads: 1
[------------------------ Downsampling: torch.Size([1, 3, 906, 438]) -> (320, 196) ------------------------]
                                                  |  Reference, PIL 8.3.2, mode: RGB  |  1.10.0a0+git1e87d91
1 threads: -------------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |                2.9                |          3.1
      channels_last non-contiguous torch.float32  |                2.6                |          3.6

Times are in milliseconds (ms).

[------------------------ Downsampling: torch.Size([1, 3, 906, 438]) -> (460, 220) ------------------------]
                                                  |  Reference, PIL 8.3.2, mode: RGB  |  1.10.0a0+git1e87d91
1 threads: -------------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |                3.4                |          4.0
      channels_last non-contiguous torch.float32  |                3.4                |          4.8

Times are in milliseconds (ms).

[------------------------ Downsampling: torch.Size([1, 3, 906, 438]) -> (120, 96) -------------------------]
                                                  |  Reference, PIL 8.3.2, mode: RGB  |  1.10.0a0+git1e87d91
1 threads: -------------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |                1.6                |          1.8
      channels_last non-contiguous torch.float32  |                1.6                |          1.9

Times are in milliseconds (ms).

[----------------------- Downsampling: torch.Size([1, 3, 906, 438]) -> (1200, 196) ------------------------]
                                                  |  Reference, PIL 8.3.2, mode: RGB  |  1.10.0a0+git1e87d91
1 threads: -------------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |                9.0                |          11.3
      channels_last non-contiguous torch.float32  |                8.9                |          12.5

Times are in milliseconds (ms).

[----------------------- Downsampling: torch.Size([1, 3, 906, 438]) -> (120, 1200) ------------------------]
                                                  |  Reference, PIL 8.3.2, mode: RGB  |  1.10.0a0+git1e87d91
1 threads: -------------------------------------------------------------------------------------------------
      channels_first contiguous torch.float32     |                2.1                |          1.8
      channels_last non-contiguous torch.float32  |                2.1                |          3.4

Times are in milliseconds (ms).

[--------------- Downsampling: torch.Size([1, 1, 906, 438]) -> (320, 196) --------------]
                                 |  Reference, PIL 8.3.2, mode: F  |  1.10.0a0+git1e87d91
1 threads: ------------------------------------------------------------------------------
       contiguous torch.float32  |               1.2               |          1.0

Times are in milliseconds (ms).

[--------------- Downsampling: torch.Size([1, 1, 906, 438]) -> (460, 220) --------------]
                                 |  Reference, PIL 8.3.2, mode: F  |  1.10.0a0+git1e87d91
1 threads: ------------------------------------------------------------------------------
       contiguous torch.float32  |               1.4               |          1.3

Times are in milliseconds (ms).

[--------------- Downsampling: torch.Size([1, 1, 906, 438]) -> (120, 96) ---------------]
                                 |  Reference, PIL 8.3.2, mode: F  |  1.10.0a0+git1e87d91
1 threads: ------------------------------------------------------------------------------
       contiguous torch.float32  |              719.9              |         599.9

Times are in microseconds (us).

[-------------- Downsampling: torch.Size([1, 1, 906, 438]) -> (1200, 196) --------------]
                                 |  Reference, PIL 8.3.2, mode: F  |  1.10.0a0+git1e87d91
1 threads: ------------------------------------------------------------------------------
       contiguous torch.float32  |               3.7               |          3.5

Times are in milliseconds (ms).

[-------------- Downsampling: torch.Size([1, 1, 906, 438]) -> (120, 1200) --------------]
                                 |  Reference, PIL 8.3.2, mode: F  |  1.10.0a0+git1e87d91
1 threads: ------------------------------------------------------------------------------
       contiguous torch.float32  |              834.4              |         605.7

Times are in microseconds (us).

```

</details>

Code is moved from torchvision: https://github.com/pytorch/vision/pull/4208

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65142

Reviewed By: mrshenli

Differential Revision: D32432405

Pulled By: jbschlosser

fbshipit-source-id: b66c548347f257c522c36105868532e8bc1d4c6d
2021-11-17 09:10:15 -08:00
vfdev-5
6adbe044e3 Added nearest-exact interpolation mode (#64501)
Summary:
Added "nearest-exact" interpolation mode to fix the issues: https://github.com/pytorch/pytorch/issues/34808 and https://github.com/pytorch/pytorch/issues/62237.

Description:

As we cannot fix the "nearest" mode without a large impact on already-trained models, [it was suggested](https://github.com/pytorch/pytorch/pull/64501#pullrequestreview-749771815) to introduce a new mode instead of fixing the existing "nearest" mode.

- New mode "nearest-exact" performs index computation for nearest interpolation to match scikit-image, pillow, TF2 and while "nearest" mode still match opencv INTER_NEAREST, which appears to be buggy, see https://ppwwyyxx.com/blog/2021/Where-are-Pixels/#Libraries.

"nearest":
```
input_index_f32 = output_index * scale
input_index = floor(input_index_f32)
```

"nearest-exact"
```
input_index_f32 = (output_index + 0.5) * scale - 0.5
input_index = round(input_index_f32)
```
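
A concrete case where the two mappings diverge (downscaling a length-5 axis to length 2, so scale = 2.5):

```python
import math

scale = 5 / 2  # 2.5
for out_idx in range(2):
    nearest = math.floor(out_idx * scale)                 # -> 0, 2
    nearest_exact = round((out_idx + 0.5) * scale - 0.5)  # -> 1, 3
```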

Comparisons with other libs: https://gist.github.com/vfdev-5/a5bd5b1477b1c82a87a0f9e25c727664

PyTorch version | 1.9.0 "nearest" | this PR "nearest" | this PR "nearest-exact"
---|---|---|---
Resize option: | | |
OpenCV INTER_NEAREST result mismatches | 0 | 0 | 10
OpenCV INTER_NEAREST_EXACT result mismatches | 9 | 9 | 9
Scikit-Image result mismatches | 10 | 10 | 0
Pillow result mismatches | 10 | 10 | 7
TensorFlow result mismatches | 10 | 10 | 0
Rescale option: | | |
size mismatches (https://github.com/pytorch/pytorch/issues/62396) | 10 | 10 | 10
OpenCV INTER_NEAREST result mismatches | 3 | 3 | 5
OpenCV INTER_NEAREST_EXACT result mismatches | 3 | 3 | 4
Scikit-Image result mismatches | 4 | 4 | 0
Scipy result mismatches | 4 | 4 | 0
TensorFlow (no such option) | - | - | -

Versions:
```
skimage: 0.19.0.dev0
opencv: 4.5.4-dev
scipy: 1.7.2
Pillow: 8.4.0
TensorFlow: 2.7.0
```

Implementations in other libs:

- Pillow:
  - ee079ae67e/src/libImaging/Geometry.c (L889-L899)
  - ee079ae67e/src/libImaging/Geometry.c (L11)
  - `a[2] == 0`

- Scikit-Image :
  - dev v0.19.0 uses scipy ndi.zoom:
    - 38fae50c3f/skimage/transform/_warps.py (L180-L188)
    - 47bb6febaa/scipy/ndimage/src/ni_interpolation.c (L775-L779)
    - 47bb6febaa/scipy/ndimage/src/ni_interpolation.c (L479)

Additionally:
- Updated upsampling tests

cc ezyang gchanan albanD mruberry jbschlosser walterddr fmassa heitorschueroff ppwwyyxx

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64501

Reviewed By: anjali411

Differential Revision: D32361901

Pulled By: jbschlosser

fbshipit-source-id: df906f4d25a2b2180e1942ffbab2cc14600aeed2
2021-11-15 14:28:19 -08:00
yanbing-j
12026124cc Avoid the view for mkldnn case in 1D convolution (#68166)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/68034

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68166

Reviewed By: mrshenli

Differential Revision: D32432444

Pulled By: jbschlosser

fbshipit-source-id: fc4e626d497d9e4597628a18eb89b94518bb3b33
2021-11-15 11:56:45 -08:00
eqy
a1ace029e2 Add host-side memory requirement for test_softmax_64bit_indexing (#67922)
Summary:
https://github.com/pytorch/pytorch/issues/67910
The original `largeTensorTest` decorator didn't account for the additional host-side memory requirements.
Thanks crcrpar for raising the issue, CC ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67922

Reviewed By: malfet

Differential Revision: D32308602

Pulled By: mruberry

fbshipit-source-id: 97b7d2c39fe63c1a8269402f72186026a89f6b4c
2021-11-11 09:24:15 -08:00
Dani El-Ayyass
f171c78c04 add unpack_sequence and unpad_sequence functions (#66550)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/66549

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66550

Reviewed By: malfet

Differential Revision: D32299193

Pulled By: jbschlosser

fbshipit-source-id: 96c92d73d3d40b7424778b2365e0c8bb1ae56cfb
2021-11-10 15:15:08 -08:00
Joel Schlosser
9a2db6f091 Factor backend routing logic out of convolution forward (#67790)
Summary:
This PR introduces a new function `_select_conv_backend` that returns a `ConvBackend` enum representing the selected backend for a given set of convolution inputs and params.

The function and enum are exposed to python for testing purposes through `torch/csrc/Module.cpp` (please let me know if there's a better place to do this).

A new set of tests validates that the correct backend is selected for several sets of inputs + params. Some backends aren't tested yet:
* nnpack (for mobile)
* xnnpack (for mobile)
* winograd 3x3 (for mobile)

Some flowcharts for reference:
![conv_routing_graph md](https://user-images.githubusercontent.com/75754324/140828957-1135b400-38c0-4c9f-87ef-4f33ceebeeae.png)
![conv_nogroup_routing_graph md](https://user-images.githubusercontent.com/75754324/140828977-ed223a4e-aa86-49f1-9925-c0f6b9ab36af.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67790

Reviewed By: zou3519

Differential Revision: D32280878

Pulled By: jbschlosser

fbshipit-source-id: 0ce55174f470f65c9b5345b9980cf12251f3abbb
2021-11-10 07:53:55 -08:00
Xiao Wang
f6a4c80a5a Refactor cuDNN Convolution memory format and Conv-Bias-Relu code (#65594)
Summary:
This PR makes several changes:

- Changed function `bool cudnn_conv_use_channels_last(...)` to `at::MemoryFormat cudnn_conv_suggest_memory_format(...)`
- Removed `resize_` in cudnn convolution code. Added a new overloading method `TensorDescriptor::set` that also passes the desired memory format of the tensor.
- Disabled the usage of double + channels_last on cuDNN Conv-Relu and Conv-Bias-Relu. Call `.contiguous(memory_format)` before passing data to cuDNN functions.
- Disabled the usage of cuDNN fused Conv-Bias-Relu in cuDNN < 8.0 version due to a CUDNN_STATUS_NOT_SUPPORTED error. Instead, use the native fallback path.
- Let Conv-Bias-Relu code respect the global `allow_tf32` flag.

From the cuDNN documentation, double + NHWC is generally not supported.

Close https://github.com/pytorch/pytorch/pull/66968

Fix https://github.com/pytorch/pytorch/issues/55301

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65594

Reviewed By: jbschlosser, malfet

Differential Revision: D32175766

Pulled By: ngimel

fbshipit-source-id: 7ba079c9f7c46fc56f8bfef05bad0854acf380d7
2021-11-05 11:50:55 -07:00
Alban Desmaison
bb8978f605 Revert D32175963: Converting hardswish to structured kernels with metatensor support
Test Plan: revert-hammer

Differential Revision:
D32175963 (57335a9ee3)

Original commit changeset: f4d749c6aeaf

fbshipit-source-id: 6d68a96cf872c2d7b518c061875b9336bca0043a
2021-11-05 07:04:40 -07:00
John Clow
57335a9ee3 Converting hardswish to structured kernels with metatensor support (#66899)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66899

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D32175963

Pulled By: Gamrix

fbshipit-source-id: f4d749c6aeaf064084be72361607ea4f3f6bc91d
2021-11-04 19:02:00 -07:00
soulitzer
83e8612d11 Clean up test autograd (#67413)
Summary:
Partially fixes https://github.com/pytorch/pytorch/issues/66066

This PR:
 - cleans up op-specific testing from test_autograd. test_autograd should be reserved for testing generic autograd functionality
 - tests related to an operator are better colocated
 - see the tracker for details

What to think about when moving tests to their correct test suite:
 - naming, make sure its not too generic
 - how the test is parametrized, sometimes we need to add/remove a device/dtype parameter
 - can this be merged with existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67413

Reviewed By: jbschlosser, albanD

Differential Revision: D32031480

Pulled By: soulitzer

fbshipit-source-id: 8e13da1e58a38d5cecbfdfd4fe2b4fe6f816897f
2021-11-03 15:26:09 -07:00
Xiao Wang
31cf3d6aad Fix adaptive_max_pool2d for channels-last on CUDA (#67697)
Summary:
Fix https://github.com/pytorch/pytorch/issues/67239

The CUDA kernels for `adaptive_max_pool2d` (forward and backward) were written for contiguous output. If outputs are non-contiguous, first create a contiguous copy and let the kernel write output to the contiguous memory space. Then copy the output from contiguous memory space to the original non-contiguous memory space.
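
A sketch of that staging pattern (the kernel launch is a hypothetical stand-in, not the real CUDA code):

```python
import torch

def run_contiguous_kernel(t):
    # stand-in for the CUDA kernel, which assumes t.is_contiguous()
    t.fill_(1.0)

out = torch.empty(2, 3, 8, 8, device="cuda").to(memory_format=torch.channels_last)
staging = out if out.is_contiguous() else out.contiguous()  # contiguous scratch
run_contiguous_kernel(staging)
if staging is not out:
    out.copy_(staging)   # copy results back to the channels-last tensor
```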

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67697

Reviewed By: ejguan

Differential Revision: D32112443

Pulled By: ngimel

fbshipit-source-id: 0e3bf06d042200c651a79d13b75484526fde11fe
2021-11-03 09:47:29 -07:00
John Shen
234bd6dc56 [quantized] Add bilinear quantized grid_sample (#66879)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66879

This adds a quantized implementation of bilinear grid_sample. Bicubic interpolation cannot be supported as easily, since we rely on the linearity of quantization to operate on the raw values, i.e.

f(q(a), q(b)) = q(f(a, b)) where f is the linear interpolation function.
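
A quick numeric check of that property (values are illustrative; q is an affine quantizer, so any linear f commutes with it up to rounding):

```python
s, z = 0.1, 3
q = lambda x: round(x / s) + z            # affine quantization
f = lambda u, v: 0.75 * u + 0.25 * v      # linear interpolation with t = 0.25
a, b = 1.0, 2.0
print(f(q(a), q(b)), q(f(a, b)))          # 15.5 and 15: equal up to rounding
```
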
ghstack-source-id: 141321116

Test Plan: test_quantization

Reviewed By: kimishpatel

Differential Revision: D31656893

fbshipit-source-id: d0bc31da8ce93daf031a142decebf4a155943f0f
2021-11-01 14:44:26 -07:00
kshitij12345
885a8e53ba replace onlyOnCPUAndCUDA with onlyNativeDeviceTypes (#65201)
Summary:
Reference https://github.com/pytorch/pytorch/issues/53849

Replace `onlyOnCPUAndCUDA` with `onlyNativeDeviceTypes`, which includes `cpu`, `cuda`, and `meta`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65201

Reviewed By: mrshenli

Differential Revision: D31299718

Pulled By: mruberry

fbshipit-source-id: 2d8356450c035d6a314209ab51b2c237583920fd
2021-11-01 09:22:34 -07:00
Joel Schlosser
16d937b0df Fix strided _conv_double_backward() with 3D input / weight (#67283)
Summary:
Removes the 3D special case logic in `_convolution_double_backward()` that never worked.

The logic was never called previously since `convolution()` expands input / weight from 3D -> 4D before passing them to backends; backend-specific backward calls thus save the 4D version to pass to `_convolution_double_backward()`.

The new general `convolution_backward()` saves the original 3D input / weight, uncovering the bug.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67283

Reviewed By: anjali411

Differential Revision: D32021100

Pulled By: jbschlosser

fbshipit-source-id: 0916bcaa77ef49545848b344d6385b33bacf473d
2021-10-29 09:48:53 -07:00
Sameer Deshmukh
edd4d246c3 Accept 0-dim channel inputs in convolution layer (#66256)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/56998 .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66256

Reviewed By: mrshenli

Differential Revision: D31859428

Pulled By: jbschlosser

fbshipit-source-id: 034b6c1ce35aac50eabfa09bbcd8b1e3c8b171bd
2021-10-25 08:12:29 -07:00
Eddie Yan
d9c4b3feab Do rowwise moments computation in float for half LayerNorm (#66920)
Summary:
https://github.com/pytorch/pytorch/issues/66707
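
The numerical change, as a reference sketch (assumed semantics; eps and shapes are illustrative):

```python
import torch

def layer_norm_half_reference(x_half, eps=1e-5):
    x = x_half.float()                                 # accumulate moments in fp32
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    return ((x - mean) / torch.sqrt(var + eps)).half()
```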

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66920

Reviewed By: mrshenli

Differential Revision: D31850612

Pulled By: ngimel

fbshipit-source-id: a95a33567285dcf9ee28d33f503cead3268960f9
2021-10-22 09:50:42 -07:00
vfdev
62ca5a81c0 Exposed recompute_scale_factor into nn.Upsample (#66419)
Summary:
Description:
- Exposed recompute_scale_factor in nn.Upsample so that the recompute_scale_factor=True option can be used

Context: https://github.com/pytorch/pytorch/pull/64501#discussion_r710205190
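
A minimal usage sketch:

```python
import torch.nn as nn

# the output size is derived from scale_factor, and the interpolation scale is
# then re-derived from that output size
up = nn.Upsample(scale_factor=2.5, mode="bilinear", recompute_scale_factor=True)
```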

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66419

Reviewed By: gchanan

Differential Revision: D31731276

Pulled By: jbschlosser

fbshipit-source-id: 2118489e6f5bc1142f2a64323f4cfd095a9f3c42
2021-10-20 07:59:25 -07:00
Jane Xu
9eab6da887 [skip ci] Set test owner for nn tests (#66850)
Summary:
Action following https://github.com/pytorch/pytorch/issues/66232

cc albanD mruberry jbschlosser walterddr

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66850

Reviewed By: albanD

Differential Revision: D31761712

Pulled By: janeyx99

fbshipit-source-id: 7272154cac77e2ce38370775a9e8d41252e13166
2021-10-19 08:26:50 -07:00
lezcano
0974215c4d Prefer mT and mH over transpose(-2, -1) and transpose(-2, -1).conj() (#64181)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64181

This PR replaces all the calls to:
- `transpose(-2, -1)` or `transpose(-1, -2)` by `mT()` in C++ and `mT` in Python
- `conj().transpose(-2, -1)` or `transpose(-2, -1).conj()` or `conj().transpose(-1, -2)` or `transpose(-1, -2).conj()` by `mH()` in C++ and `mH` in Python.

It also simplifies two pieces of code, and fixes one bug where a pair
of parentheses were missing in the function `make_symmetric_matrices`.
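
For example, these are the equivalences the replacement relies on:

```python
import torch

x = torch.randn(2, 3, 4, dtype=torch.complex64)
assert torch.equal(x.mT, x.transpose(-2, -1))
assert torch.equal(x.mH, x.transpose(-2, -1).conj())
```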

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D31692896

Pulled By: anjali411

fbshipit-source-id: e9112c42343663d442dc5bd53ff2b492094b434a
2021-10-18 13:02:25 -07:00
Tomi Peltola
713e025c9f Add no-input-grad-needed cases to test_grid_sample (#66071)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66071

Test Plan: Imported from OSS

Reviewed By: dagitses

Differential Revision: D31431801

Pulled By: albanD

fbshipit-source-id: 57a94ed9e97e402aa8193d69355e57b6309c64f7
2021-10-13 10:56:47 -07:00
Natalia Gimelshein
4a50b6c490 fix cosine similarity dimensionality check (#66191)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/66086

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66191

Reviewed By: dagitses, malfet

Differential Revision: D31436997

Pulled By: ngimel

fbshipit-source-id: 363556eea4e1696d928ae08320d298451c286b10
2021-10-06 15:44:51 -07:00
lezcano
ca76e193a3 Fix nll_backward for negative weights (#64572)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64572

Fixes https://github.com/pytorch/pytorch/issues/64256
It also fixes an inconsistent treatment of the case `reduction = "mean"`
when the whole target is equal to `ignore_index`. It now returns `NaN`
in this case, consistently with what it returns when computing the mean
over an empty tensor.

We add tests for all these cases.
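
A sketch of the fully-ignored edge case that is now tested (shapes and values are illustrative):

```python
import torch
import torch.nn.functional as F

log_probs = torch.log_softmax(torch.randn(3, 5), dim=1)
target = torch.zeros(3, dtype=torch.long)  # every entry equals ignore_index

# The whole target is ignored, so the mean reduction is 0/0 -> NaN,
# consistent with the mean over an empty tensor.
print(F.nll_loss(log_probs, target, ignore_index=0, reduction="mean"))  # nan
```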

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D31116297

Pulled By: albanD

fbshipit-source-id: cc44e79205f5eeabf1efd7d32fe61e26ba701b52
2021-10-01 19:41:51 -07:00
Philip Meier
aebde1bc2b deprecate device getter from torch.testing namespace (#63844)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63844

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D31141433

Pulled By: mruberry

fbshipit-source-id: a29331278ab99a19e225e2cb357458e3db4f9732
2021-09-29 02:40:52 -07:00
kshitij12345
a012216b96 [nn] Fold : no batch dim (#64909)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/64907
Reference: https://github.com/pytorch/pytorch/issues/60585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64909

Reviewed By: cpuhrsch, heitorschueroff

Differential Revision: D30991087

Pulled By: jbschlosser

fbshipit-source-id: 91a37e0b1d51472935ff2308719dfaca931513f3
2021-09-23 08:37:32 -07:00
Rodrigo Berriel
b80bdcc73b Add register_module alias to nn.Module (#65174)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/60397. I'm not sure how aliases are supposed to be implemented, but this is the most basic/direct way, IMO. As a side-effect, this implementation results in a "duplicate" doc entry, inheriting the one from `add_module`:

![monkey-patch](https://user-images.githubusercontent.com/7027770/133693137-8408d8e7-1f4f-436b-b176-57dda9bc3a32.png)

An alternative implementation could be:

```python
def register_module(self, name: str, module: Optional['Module']) -> None:
    r"""Alias for :func:`add_module`."""
    self.add_module(name, module)
```

which results in this documentation:

![image](https://user-images.githubusercontent.com/7027770/133693249-d969a71a-be44-489d-9633-4f38b44ab887.png)
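
A minimal usage sketch of the alias:

```python
import torch.nn as nn

net = nn.Module()
net.register_module("fc", nn.Linear(4, 2))  # same effect as net.add_module("fc", ...)
assert net.fc is net.get_submodule("fc")
```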

Questions:
1. Should I replicate the tests? There are two for `add_module`: [test_add_module_raises_error_if_attr_exists](873255c6d9/test/test_nn.py (L1420-L1434)) and [test_add_module](873255c6d9/test/test_nn.py (L1837-L1855)).
2. This PR only adds `register_module` to `nn.Module`. There is an `add_module` in [`_RemoteModule`](https://github.com/pytorch/pytorch/blob/master/torch/distributed/nn/api/remote_module.py#L311-L312), which raises `NotSupported`, and there is another one in [`ConcreteModuleTypeBuilder`](873255c6d9/torch/_C/__init__.pyi.in (L468)), which means something else, I think. Should I do anything about them?

cc ngimel SsnL

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65174

Reviewed By: soulitzer

Differential Revision: D31089717

Pulled By: jbschlosser

fbshipit-source-id: abd8d14a434fd8c7efa0bd8c242df56da33491e9
2021-09-22 16:37:28 -07:00
kshitij12345
9c23f6eb7d [nn] TripletMarginLoss and PairwiseDistance : no batch dim (#64882)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/60585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64882

Reviewed By: malfet

Differential Revision: D31055577

Pulled By: jbschlosser

fbshipit-source-id: 2f0a5a08619b672026b48a78bc7d83a6dccba0bf
2021-09-21 07:29:48 -07:00
Alban Desmaison
d37c02be08 Allow parametrization to be nested (#65167)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/65163

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65167

Reviewed By: jbschlosser

Differential Revision: D31002318

Pulled By: albanD

fbshipit-source-id: b1f1c6c9efa9e83af9789ed13efc133f777f418e
2021-09-17 07:29:01 -07:00
kshitij12345
01e92f2a56 [nn] no batch dim support: CosineEmbeddingLoss (#64590)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/60585

TODO
* [x] Add tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64590

Reviewed By: H-Huang

Differential Revision: D30900775

Pulled By: jbschlosser

fbshipit-source-id: d24e72787017e79afbf8f04a94901a290485b81a
2021-09-13 10:45:33 -07:00
Aswin John Mathews
63b180beed ROCm MIOpen NHWC Convolution support (#63617)
Summary:
- Added 2D-Convolution NHWC support
  - on ROCm 4.3, with `PYTORCH_MIOPEN_SUGGEST_NHWC=1` flag
  - May need to force MIOpen to search for solutions (see examples below for flags)

**PYTORCH_MIOPEN_SUGGEST_NHWC Environment Flag**
MIOpen does not officially support NHWC yet, although convolution support has been added to tip-of-tree of MIOpen. This flag is intended to be a short-lived flag to explicitly turn on NHWC support until ROCm officially supports NHWC and performance is verified.

**Examples**
1. Example usage 1 : Run test on ROCm4.3
`PYTORCH_TEST_WITH_ROCM=1 PYTORCH_MIOPEN_SUGGEST_NHWC=1 MIOPEN_FIND_ENFORCE=4 MIOPEN_DEBUG_CONV_GEMM=0 MIOPEN_FIND_MODE=1 pytest test_nn.py -v -k "test_conv_cudnn_nhwc" `
2. Example usage 2: Run the following with `PYTORCH_MIOPEN_SUGGEST_NHWC=1` on ROCm4.3.
```
#!/usr/bin/env python3
import torch
model = torch.nn.Conv2d(8, 4, 3).cuda().half()
model = model.to(memory_format=torch.channels_last)
input = torch.randint(1, 10, (2, 8, 4, 4), dtype=torch.float32, requires_grad=True)
input = input.to(device="cuda", memory_format=torch.channels_last, dtype=torch.float16)

# should print True for is_contiguous(channels_last), and strides must match NHWC format
print(input.is_contiguous(memory_format=torch.channels_last), input.shape, input.stride() )

out = model(input)

# should print True for is_contiguous(channels_last), and strides must match NHWC format
print("Contiguous channel last :", out.is_contiguous(memory_format=torch.channels_last), " out shape :",  out.shape, "out stride :", out.stride() )
```

See https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html for more examples.

cc jeffdaily sunway513 jithunnair-amd ROCmSupport

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63617

Reviewed By: saketh-are

Differential Revision: D30730800

Pulled By: ezyang

fbshipit-source-id: 61906a0f30be8299e6547d312ae6ac91cc7c3238
2021-09-10 08:06:32 -07:00
Sameer Deshmukh
7205ca0210 Change MaxUnpool to accept tensors with 0-dim batch sizes. (#64082)
Summary:
Part of the fix for https://github.com/pytorch/pytorch/issues/38115.

Changes the `MaxUnpool` module to work with 0-dimensions batch sizes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64082

Reviewed By: mrshenli

Differential Revision: D30793907

Pulled By: jbschlosser

fbshipit-source-id: d21aa665be5aa18f592b39ef7b4e3cbc632e21ed
2021-09-08 08:41:09 -07:00
Philip Meier
26b7ff5aea deprecate dtype getters from torch.testing namespace (#63554)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63554

Following https://github.com/pytorch/pytorch/pull/61840#issuecomment-884087809, this deprecates all the dtype getters publicly exposed in the `torch.testing` namespace. The reason for this twofold:

1. If someone is not familiar with the C++ dispatch macros PyTorch uses, the names are misleading. For example `torch.testing.floating_types()` will only give you `float32` and `float64` skipping `float16` and `bfloat16`.
2. The dtype getters provide very minimal functionality that can be easily emulated by downstream libraries.

We thought about [providing a replacement](https://gist.github.com/pmeier/3dfd2e105842ad0de4505068a1a0270a), but ultimately decided against it. The major problem is BC: by keeping it, either the namespace gets messy again after a new dtype is added, or we need to somehow version the return values of the getters.
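
For downstream libraries, the suggested emulation is simply to pin down the dtypes they actually support; a sketch (the constant names are hypothetical):

```python
import torch

FLOATING_DTYPES = (torch.float16, torch.bfloat16, torch.float32, torch.float64)
INTEGRAL_DTYPES = (torch.uint8, torch.int8, torch.int16, torch.int32, torch.int64)
```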

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D30662206

Pulled By: mruberry

fbshipit-source-id: a2bdb10ab02ae665df1b5b76e8afa9af043bbf56
2021-09-07 08:58:51 -07:00
Richard Zou
535526b95c Restore LayerNorm numerics test (#64385)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64385

It was deleted in https://github.com/pytorch/pytorch/pull/63276.

The numerics test was meant to check LayerNorm behavior on large inputs,
but we deleted it without realizing that.

Test Plan: - wait for tests.

Reviewed By: ngimel

Differential Revision: D30702950

Pulled By: zou3519

fbshipit-source-id: a480e26c45ec38fb628938b70416cdb22d976a46
2021-09-01 15:32:49 -07:00
Kushashwa Ravi Shrimali
d5bfdd3dac OpInfo for nn.functional.layer_norm (#63276)
Summary:
Please see https://github.com/facebookresearch/functorch/issues/78 and https://github.com/pytorch/pytorch/issues/54261.

Note:

* This PR also adds a reference test inspired by existing tests in `test_nn.py`.

cc: mruberry zou3519

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63276

Reviewed By: ejguan

Differential Revision: D30452483

Pulled By: zou3519

fbshipit-source-id: 2578d01ca34e031668a41bd284db60c31ae1fba8
2021-09-01 09:31:45 -07:00
Kushashwa Ravi Shrimali
ca8dd296ee Add OpInfo for nn.functional.cosine_similarity (#62959)
Summary:
Please see https://github.com/facebookresearch/functorch/issues/78 and https://github.com/pytorch/pytorch/issues/54261.

Notes:

* Some redundant tests from `test_nn.py` have been removed. I'm unsure whether the precision checks can be removed as well.
* Broadcasting is also checked in the OpInfo for `cosine_similarity`.

cc: mruberry zou3519 Chillee

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62959

Reviewed By: heitorschueroff

Differential Revision: D30520176

Pulled By: zou3519

fbshipit-source-id: 14e902eb4bcce875edab28a1669a2ea021052b9b
2021-08-31 10:31:36 -07:00
CaoE
cb7cf823b3 add BFloat16 support for fold and unfold on CPU (#62880)
Summary:
Add BFloat16 support for fold and unfold operators on CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62880

Reviewed By: iramazanli

Differential Revision: D30576387

Pulled By: zou3519

fbshipit-source-id: c48f6e56702bfea34448db1b3a1634c49c5d8ec8
2021-08-30 19:14:10 -07:00
lezcano
f3e329cbec Implements the orthogonal parametrization (#62089)
Summary:
Implements an orthogonal / unitary parametrisation.

It does pass the tests and I have trained a couple of models with this implementation, so I believe it should be somewhat correct. Now, the implementation is very subtle. I'm tagging nikitaved and IvanYashchuk as reviewers in case they have comments / they see some room for optimisation of the code, in particular of the `forward` function.

Fixes https://github.com/pytorch/pytorch/issues/42243
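
A usage sketch of the resulting API (assuming it is exposed as `torch.nn.utils.parametrizations.orthogonal`):

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

layer = orthogonal(nn.Linear(5, 5))  # parametrizes layer.weight
Q = layer.weight
print(torch.allclose(Q.T @ Q, torch.eye(5), atol=1e-5))  # True
```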

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62089

Reviewed By: ezyang

Differential Revision: D30639063

Pulled By: albanD

fbshipit-source-id: 988664f333ac7a75ce71ba44c8d77b986dff2fe6
2021-08-30 13:12:07 -07:00
Peter Bell
5b0dfd0f8a Fix bad use of channels last kernel in sync batch norm backward (#64100)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/64039

There are two distinct problems here.
1. If `grad_output` is channels last but `input` is not, then `input` would be read as if it were channels last, i.e. the wrong values would be read.
2. `use_channels_last_kernels` doesn't guarantee that `suggest_memory_format` will actually return channels last, so use `empty_like` instead so the strides always match.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64100

Reviewed By: mruberry

Differential Revision: D30622127

Pulled By: ngimel

fbshipit-source-id: e28cc57215596817f1432fcdd6c49d69acfedcf2
2021-08-30 12:16:30 -07:00
Thomas J. Fan
d3bcba5f85 ENH Adds label_smoothing to cross entropy loss (#63122)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/7455

Partially resolves pytorch/vision#4281
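
A minimal usage sketch; with `label_smoothing=0.1` and 10 classes, each wrong class receives 0.1/10 of the target mass and the true class receives 0.9 + 0.1/10:

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 10)
target = torch.randint(10, (4,))
loss = nn.CrossEntropyLoss(label_smoothing=0.1)(logits, target)
```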

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63122

Reviewed By: iramazanli

Differential Revision: D30586076

Pulled By: jbschlosser

fbshipit-source-id: 06afc3aa1f8b9edb07fe9ed68c58968ad1926924
2021-08-29 23:33:04 -07:00
mingfeima
c5ed31e4a7 add channel last support for MaxUnpool2d (#49984)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49984

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D26007051

Pulled By: VitalyFedyunin

fbshipit-source-id: 6c54751ade4092e03c1651aaa60380f7d6e92f6b
2021-08-29 18:37:10 -07:00
BBuf
6ab3a21098 fix resize bug (#61166)
Summary:
I think the original intention here was for this branch to take effect only in the align_corners case (because output_size = 1 makes the divisor 0), but it affects the non-align_corners case too. For example:

```python
import numpy as np
import torch

input = torch.tensor(
    np.arange(1, 5, dtype=np.float32).reshape((1, 1, 2, 2)))  # float, as bilinear requires
m = torch.nn.Upsample(scale_factor=0.5, mode="bilinear")
of_out = m(input)
```

The result we expect is [[[[2.5]]]], but PyTorch returns [[[[1.0]]]], which differs from OpenCV and PIL; this PR fixes that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61166

Reviewed By: malfet

Differential Revision: D30543178

Pulled By: heitorschueroff

fbshipit-source-id: 21a4035483981986b0ae4a401ef0efbc565ccaf1
2021-08-27 10:49:31 -07:00
Philip Meier
57d4c6cf42 replace self.assertTrue(torch.allclose(..)) with self.assertEqual(…) (#63637)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/63565

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63637

Reviewed By: malfet

Differential Revision: D30541266

Pulled By: mruberry

fbshipit-source-id: ab461949782c6908a589ea098fcfcf5c3e081ee6
2021-08-25 16:47:40 -07:00
mingfeima
b0782f0f32 add BFloat16 support for bernoulli and Dropout on CPU (#56372)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56372

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D28836792

Pulled By: VitalyFedyunin

fbshipit-source-id: ede951d172a59276e11383fd767778ab959b5a6b
2021-08-25 12:01:27 -07:00
Joel Schlosser
544af391b5 Allow arbitrary objects in state_dicts (#62976)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/62094

Introduces functionality for adding arbitrary objects to module state_dicts. To take advantage of this, the following functions can be defined on a module:
* `get_extra_state(self) -> dict` - Returns a dict defining any extra state this module wants to save
* `set_extra_state(self, state)` - Subsumes the given state within the module

Under the hood, a sub-dictionary is stored in the state_dict under the key `_extra_state` for each module that requires extra state.
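
A minimal sketch of the new hooks:

```python
import torch.nn as nn

class MyModule(nn.Module):
    def get_extra_state(self):
        return {"version": 1}  # any picklable object

    def set_extra_state(self, state):
        assert state["version"] == 1

m = MyModule()
print(m.state_dict())  # OrderedDict([('_extra_state', {'version': 1})])
```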

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62976

Reviewed By: heitorschueroff

Differential Revision: D30518657

Pulled By: jbschlosser

fbshipit-source-id: 5fb35ab8e3d36f35e3e96dcd4498f8c917d1f386
2021-08-24 19:06:14 -07:00
soulitzer
5be17ec1fc Do not modify saved variables in-place for spectral norm during power iteration (#62293)
Summary:
Interestingly enough, the original code did have a mechanism that aims to prevent this very issue,
but it performed the clone AFTER modifying u and v in-place.
That doesn't work, though, because we can later use the cloned u and v in operations that save tensors for backward, and the next time we execute forward we modify those same cloned u and v in-place.
So if the idea is that we want to avoid modifying a saved variable in-place, we should clone it BEFORE the in-place operation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62293

Reviewed By: bdhirsh

Differential Revision: D30489750

Pulled By: soulitzer

fbshipit-source-id: cbe8dea885aef97adda8481f7a822e5bd91f7889
2021-08-24 13:08:59 -07:00
mingfeima
d3be02d100 fix batchnorm2d issue when input is non contiguous (#63392)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63392

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D30476317

Pulled By: VitalyFedyunin

fbshipit-source-id: 03055a0aec21cf2c029b6f32315da2b09cb722d0
2021-08-24 08:24:01 -07:00
mingfeima
5b7cdc5a3d add channels last for GroupNorm (#49821)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49821

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D26007053

Pulled By: VitalyFedyunin

fbshipit-source-id: 34a48d5d3b66a159febf3c3d96748fbaba1b9e31
2021-08-23 22:54:59 -07:00
Jeff Daily
a8de0d83fe empty caching allocator before test_avg_pool2d large subtest (#63528)
Summary:
Otherwise, unrecoverable OOM occurs on MI25.  Fixes broken ROCm CI test1.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63528

Reviewed By: malfet, zhouzhuojie

Differential Revision: D30459151

Pulled By: walterddr

fbshipit-source-id: 63e205c4f486fcbdd514cfb0ed8e38584f894585
2021-08-20 14:01:45 -07:00
Philip Meier
99203580a9 Updates internal assert_allclose callsites in favor of assert_close (#61841)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61841

Redo of #60863.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D30408145

Pulled By: mruberry

fbshipit-source-id: 0b34ebc7f23ba38ecd89640b61d8aca59b7eab58
2021-08-19 12:50:41 -07:00
kshitij12345
3ce67efea2 [opinfo] nn.functional.pad (#62814)
Summary:
Reference: https://github.com/facebookresearch/functorch/issues/78

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62814

Reviewed By: VitalyFedyunin

Differential Revision: D30307492

Pulled By: zou3519

fbshipit-source-id: 4f6062eb4a3c91ed1795df1f82846afa0abafcdc
2021-08-16 13:29:34 -07:00
leslie-fang-intel
385b082854 add subtraction of max and test case (#63132)
Summary:
As discussed in https://github.com/pytorch/pytorch/pull/62897, the BF16/non-last-dim softmax path was missing the subtraction of the max value, which causes overflow in the `exp()` calculation when input values are large, such as `1000.0`.
To avoid this issue, we add the max subtraction and the corresponding test cases in this PR.

Note: without the max subtraction (e.g. after accidental reverts or changes), the test case fails with the following error:
```
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=0.05 and atol=0.05, found 103984 element(s) (out of 126720) whose difference(s) exceeded the margin of error (including 103984 nan comparisons). The greatest difference was nan (0.0 vs. nan), which occurred at index (0, 0, 0, 1).
```
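
For illustration, the standard max-subtraction trick the kernel now applies (a sketch, not the actual vectorized kernel):

```python
import torch

def stable_softmax(x, dim):
    # softmax(x) == softmax(x - max(x)); shifting keeps exp() from
    # overflowing for large inputs such as 1000.0.
    shifted = x - x.max(dim=dim, keepdim=True).values
    e = shifted.exp()
    return e / e.sum(dim=dim, keepdim=True)

x = torch.full((2, 3), 1000.0, dtype=torch.bfloat16)
print(stable_softmax(x, dim=0))  # finite values, no NaNs
```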

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63132

Reviewed By: VitalyFedyunin

Differential Revision: D30280792

Pulled By: cpuhrsch

fbshipit-source-id: 722821debf983bbb4fec878975fa8a4da0d1d866
2021-08-13 20:50:49 -07:00
Sameer Deshmukh
809e1e7457 Allow TransformerEncoder and TransformerDecoder to accept 0-dim batch sized tensors. (#62800)
Summary:
This issue fixes a part of https://github.com/pytorch/pytorch/issues/12013, which is summarized concretely in  https://github.com/pytorch/pytorch/issues/38115.

This PR allows TransformerEncoder and Decoder (alongwith the inner `Layer` classes) to accept inputs with 0-dimensional batch sizes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62800

Reviewed By: VitalyFedyunin

Differential Revision: D30303240

Pulled By: jbschlosser

fbshipit-source-id: 8f8082a6f2a9f9d7ce0b22a942d286d5db62bd12
2021-08-13 16:11:57 -07:00
Sameer Deshmukh
38a825c648 Allow Average Pooling modules to accept tensors with 0-dim batch sizes. (#62025)
Summary:
This issue fixes a part of https://github.com/pytorch/pytorch/issues/12013, which is summarized concretely in  https://github.com/pytorch/pytorch/issues/38115.

It introduces changes and tests for allowing the Average Pooling layers to accept tensors with 0 sized batch dimensions and return meaningful results.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62025

Reviewed By: VitalyFedyunin

Differential Revision: D30303256

Pulled By: jbschlosser

fbshipit-source-id: 5f727e62a7c58d2b8bb49fcc3bd7688474917ba5
2021-08-13 11:31:17 -07:00
Sameer Deshmukh
cb23976f9f Allow 0-dim batch sizes for AdaptiveMaxPool and MaxPool. (#62088)
Summary:
This issue fixes a part of https://github.com/pytorch/pytorch/issues/12013, which is summarized concretely in  https://github.com/pytorch/pytorch/issues/38115.

This PR allows `MaxPool` and `AdaptiveMaxPool` to accept tensors whose batch size is 0. Some changes have been made to modernize the tests so that they will show the name of the C++ function that throws an error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62088

Reviewed By: bdhirsh

Differential Revision: D30281285

Pulled By: jbschlosser

fbshipit-source-id: 52bffc67bfe45a78e11e4706b62cce1469eba1b9
2021-08-13 07:33:17 -07:00
Shen Li
1022443168 Revert D30279364: [codemod][lint][fbcode/c*] Enable BLACK by default
Test Plan: revert-hammer

Differential Revision:
D30279364 (b004307252)

Original commit changeset: c1ed77dfe43a

fbshipit-source-id: eab50857675c51e0088391af06ec0ecb14e2347e
2021-08-12 11:45:01 -07:00
Zsolt Dollenstein
b004307252 [codemod][lint][fbcode/c*] Enable BLACK by default
Test Plan: manual inspection & sandcastle

Reviewed By: zertosh

Differential Revision: D30279364

fbshipit-source-id: c1ed77dfe43a3bde358f92737cd5535ae5d13c9a
2021-08-12 10:58:35 -07:00
Christian Puhrsch
3beb65d45d test_cudnn_convolution_relu skipCUDAIfRocm
Summary: skip rocm test for test_cudnn_convolution_relu

Test Plan: This skips a test

Reviewed By: ngimel

Differential Revision: D30233620

fbshipit-source-id: 31eab8b03c3f15674e0d262a8f55965c1aa6b809
2021-08-10 15:15:23 -07:00
Sameer Deshmukh
9e7b6bb69f Allow LocalResponseNorm to accept 0 dim batch sizes (#62801)
Summary:
This issue fixes a part of https://github.com/pytorch/pytorch/issues/12013, which is summarized concretely in  https://github.com/pytorch/pytorch/issues/38115.

This PR allows `LocalResponseNorm` to accept tensors with 0 dimensional batch size.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62801

Reviewed By: zou3519

Differential Revision: D30165282

Pulled By: jbschlosser

fbshipit-source-id: cce0b2d12dbf47dc8ed6247c267bf2f2305f858a
2021-08-10 06:54:52 -07:00
=
084e92bb76 Use output memory format based on input for cudnn_convolution_relu (#62482)
Summary:
Currently when cudnn_convolution_relu is passed a channels last Tensor it will return a contiguous Tensor. This PR changes this behavior and bases the output format on the input format.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62482

Reviewed By: ngimel

Differential Revision: D30049905

Pulled By: cpuhrsch

fbshipit-source-id: 98521d14ee03466e7128a1912b9f754ffe10b448
2021-08-09 15:31:53 -07:00
Natalia Gimelshein
e6a3154519 Allow broadcasting along non-reduction dimension for cosine similarity (#62912)
Summary:
Checks introduced by https://github.com/pytorch/pytorch/issues/58559 are too strict and disable correctly working cases that people were relying on.
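
An example of a previously-working case that this re-enables (shapes are illustrative):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 5)  # broadcasts against y along dim 0
y = torch.randn(3, 5)
print(F.cosine_similarity(x, y, dim=1).shape)  # torch.Size([3])
```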

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62912

Reviewed By: jbschlosser

Differential Revision: D30165827

Pulled By: ngimel

fbshipit-source-id: f9229a9fc70142fe08a42fbf2d18dae12f679646
2021-08-06 19:17:04 -07:00
Sameer Deshmukh
f6c7081a16 Allow FractionalMaxPool 2D and 3D layers to accept 0 dim batch size tensors. (#62083)
Summary:
This issue fixes a part of https://github.com/pytorch/pytorch/issues/12013, which is summarized concretely in  https://github.com/pytorch/pytorch/issues/38115.

Allow `FractionalMaxPool` 2D and 3D layers to accept 0 dim batch sizes. Also make some minor corrections to error messages to make them more informative.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62083

Reviewed By: H-Huang

Differential Revision: D30134461

Pulled By: jbschlosser

fbshipit-source-id: 0ec50875d36c2083a7f06d9ca6a110fb3ec4f2e2
2021-08-05 17:40:10 -07:00
kshitij12345
64c54f92ca [opinfo] nn.functional.unfold (#62705)
Summary:
Reference: facebookresearch/functorch#78

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62705

Reviewed By: H-Huang

Differential Revision: D30138807

Pulled By: zou3519

fbshipit-source-id: 1d0b0e58feb13aec7b231c9f632a6d1694b9d272
2021-08-05 17:12:25 -07:00
Eddie Yan
878943c64f Preserve memory layout when aten batchnorm is used (#62773)
Summary:
https://github.com/pytorch/pytorch/issues/62594

CC cpuhrsch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62773

Reviewed By: H-Huang

Differential Revision: D30118658

Pulled By: cpuhrsch

fbshipit-source-id: bce9e92f5f8710c876a33cccbd1625155496ddea
2021-08-05 10:21:44 -07:00
yanbing-j
c7a7c2b62f Enable Gelu fp32/bf16 in CPU path using Mkldnn implementation (#58525)
Summary:
Enable Gelu bf16/fp32 in the CPU path using the MKL-DNN implementation. Users no longer need to call to_mkldnn() explicitly. The new fp32 Gelu performs better than the original one.

Add Gelu backward for https://github.com/pytorch/pytorch/pull/53615.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58525

Reviewed By: ejguan

Differential Revision: D29940369

Pulled By: ezyang

fbshipit-source-id: df9598262ec50e5d7f6e96490562aa1b116948bf
2021-08-03 06:52:23 -07:00
Joel Schlosser
a42345adee Support for target with class probs in CrossEntropyLoss (#61044)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/11959

Alternative approach to creating a new `CrossEntropyLossWithSoftLabels` class. This PR simply adds support for "soft targets" AKA class probabilities to the existing `CrossEntropyLoss` and `NLLLoss` classes.

Implementation is dumb and simple right now, but future work can add higher performance kernels for this case.
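
A minimal usage sketch of the new target format:

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 3)
# target may now be per-class probabilities instead of class indices
probs = torch.softmax(torch.randn(4, 3), dim=1)
loss = nn.CrossEntropyLoss()(logits, probs)
```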

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61044

Reviewed By: zou3519

Differential Revision: D29876894

Pulled By: jbschlosser

fbshipit-source-id: 75629abd432284e10d4640173bc1b9be3c52af00
2021-07-29 10:04:41 -07:00
Joel Schlosser
35307b131d Callable activation function support for Transformer modules (Python) (#61355)
Summary:
Fixes Python part of https://github.com/pytorch/pytorch/issues/60747

Enhances the Python versions of `Transformer`, `TransformerEncoderLayer`, and `TransformerDecoderLayer` to support callables as their activation functions. The old way of specifying activation function still works as well.
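
A minimal usage sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Previously only activation strings like "relu"/"gelu" were accepted;
# now the callable can be passed directly.
layer = nn.TransformerEncoderLayer(d_model=16, nhead=2, activation=F.gelu)
out = layer(torch.randn(10, 4, 16))
```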

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61355

Reviewed By: bdhirsh

Differential Revision: D29967302

Pulled By: jbschlosser

fbshipit-source-id: 8ee6f20083d49dcd3ab432a18e6ad64fe1e05705
2021-07-28 21:42:56 -07:00
Pritam Damania
cac4aa71ca Provide option to pass module instance to _load_state_dict_pre_hooks. (#62070)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62070

We have a custom Tensor:
https://github.com/pytorch/pytorch/blob/master/torch/distributed/_sharded_tensor/api.py#L67,
which doesn't show up in state_dict for the module. This was resolved by
using the _register_state_dict_hook:
https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/module.py#L1196
to parse and add custom tensors to state_dict.

However, the problem is that at load time `_register_load_state_dict_pre_hook`:
https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/module.py#L1272,
does not pass in the module instance, and as a result a ShardedTensor in the
state_dict cannot be appropriately added to a module at load time.

To resolve this issue, in this PR I've enhanced this hook to support two
variations, one which passes in the module instance (for the problem described
above) and one is the previous version for BC reasons.
ghstack-source-id: 134541391

Test Plan:
1) unit tests
2) waitforbuildbot

Reviewed By: jbschlosser

Differential Revision: D29867142

fbshipit-source-id: bcb136ff51eedd0b508cfb419e8b8a6b7d95539c
2021-07-28 19:22:47 -07:00
Thomas J. Fan
71a6ef17a5 ENH Adds no_batch_dim tests/docs for Maxpool1d & MaxUnpool1d (#62206)
Summary:
Towards https://github.com/pytorch/pytorch/issues/60585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62206

Reviewed By: ejguan

Differential Revision: D29942341

Pulled By: jbschlosser

fbshipit-source-id: a3fad774cee30478f7d6cdd49d2eec31be3fc518
2021-07-28 10:15:32 -07:00
leslie-fang-intel
7443c90f15 optimize non-last-dim softmax bf16 (#60371)
Summary:
This PR enables softmax computation in `bfloat16` when the reduction is not along the last dim.
* Use bf16 specialization for forward calculation to reduce the bf16/fp32 cast in vec template.
* Release the bf16 limitation for backward calculation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60371

Reviewed By: ejguan

Differential Revision: D29563109

Pulled By: cpuhrsch

fbshipit-source-id: f6b439fa3850a6c633f35db65ea3d735b747863e
2021-07-28 10:06:51 -07:00
Peter Bell
9776e1ff2f Migrate thnn_conv_depthwise2d from THC to ATen (#62281)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62281

Closes gh-24646, Closes gh-24647

There is no `TensorIterator` equivalent to these kernels so this is just
migrating the existing kernels over to the ATen style.

I've benchmarked for contiguous tensors with this script:
```
import torch
shape = (10, 10, 100, 100)
x = torch.randn(*shape, device='cuda')
w = torch.randn((10, 1, 5, 5), device='cuda')

for _ in range(100):
    torch.nn.functional.conv2d(x, w, groups=10)
```

and similarly for backwards. I see these as the same to within measurement error.

|                   | Master (us) | This PR (us) |
|------------------:|:-------------------:|:--------------------:|
|           Forward |        133.5        |         133.6        |
|  Backward (input) |        1,102        |         1,119        |
| Backward (weight) |        2,220        |         2,217        |

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D29943062

Pulled By: ngimel

fbshipit-source-id: fc5d16496eb733743face7c5a14e532d7b8ee26a
2021-07-27 16:51:23 -07:00
Sameer Deshmukh
4a15f4a902 Allow 0-dim batch sizes in Bilinear NN layer. (#47106)
Summary:
Part of the fix for https://github.com/pytorch/pytorch/issues/12013

Checks if the inputs and outputs are non-zero in order to allow the Bilinear layer to accept 0-dim batch sizes. The if-check for this checks for both input and output dim sizes since the `_trilinear` function is written to work with both forward and backward for Bilinear.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47106

Reviewed By: ejguan

Differential Revision: D29935589

Pulled By: jbschlosser

fbshipit-source-id: 607d3352bd4f88e2528c64408f04999960be049d
2021-07-27 13:59:42 -07:00
Erjia Guan
acaac70f63 Revert D29883676: Migrate thnn_conv_depthwise2d from THC to ATen
Test Plan: revert-hammer

Differential Revision:
D29883676 (de3a4eb583)

Original commit changeset: 9b2ac62cdd8a

fbshipit-source-id: d211d3cb7723b5d2e73de6941a7e649e5f78864f
2021-07-27 11:28:52 -07:00
Peter Bell
de3a4eb583 Migrate thnn_conv_depthwise2d from THC to ATen (#62006)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62006

Closes gh-24646, gh-24647

There is no `TensorIterator` equivalent to these kernels so this is just
migrating the existing kernels over to the ATen style.

I've benchmarked for contiguous tensors with this script:
```
import torch
shape = (10, 10, 100, 100)
x = torch.randn(*shape, device='cuda')
w = torch.randn((10, 1, 5, 5), device='cuda')

for _ in range(100):
    torch.nn.functional.conv2d(x, w, groups=10)
```

and similarly for backwards. I see these as the same to within measurement error.

|                   | Master (us) | This PR (us) |
|------------------:|:-------------------:|:--------------------:|
|           Forward |        133.5        |         133.6        |
|  Backward (input) |        1,102        |         1,119        |
| Backward (weight) |        2,220        |         2,217        |

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D29883676

Pulled By: ngimel

fbshipit-source-id: 9b2ac62cdd8a84e1a23ffcd66035b2b2fe2374d8
2021-07-27 10:00:25 -07:00
Thomas J. Fan
89ca638c18 ENH Adds no batch dim support for AdaptiveMaxPool*D (#61847)
Summary:
Towards https://github.com/pytorch/pytorch/issues/60585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61847

Reviewed By: suo

Differential Revision: D29883887

Pulled By: jbschlosser

fbshipit-source-id: de3fcf1cc3878b138ab766d2a50cc59c52ec5a60
2021-07-26 07:35:36 -07:00
Thomas J. Fan
f03e7170f0 ENH Updates docs and tests for regression modules that already support no-batch-dims (#61461)
Summary:
Towards https://github.com/pytorch/pytorch/issues/60585

This PR does not use `check_sum_reduction` because I wanted to test every reduction option.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61461

Reviewed By: suo

Differential Revision: D29883744

Pulled By: jbschlosser

fbshipit-source-id: cdad0effb41f0484938caad0d4c9d6d83e2aec07
2021-07-23 16:40:17 -07:00
Thomas J. Fan
1ec6205bd0 ENH Adds no_batch_dim support for maxpool and unpool for 2d and 3d (#61984)
Summary:
Towards https://github.com/pytorch/pytorch/issues/60585

(Interesting how the maxpool tests are currently in `test/test_nn.py`)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61984

Reviewed By: suo

Differential Revision: D29883846

Pulled By: jbschlosser

fbshipit-source-id: 1e0637c96f8fa442b4784a9865310c164cbf61c8
2021-07-23 16:14:10 -07:00
Joel Schlosser
f4ffaf0cde Fix type promotion for cosine_similarity() (#62054)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/61454

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62054

Reviewed By: suo

Differential Revision: D29881755

Pulled By: jbschlosser

fbshipit-source-id: 10499766ac07b0ae3c0d2f4c426ea818d1e77db6
2021-07-23 15:20:48 -07:00
Peter Bell
0df1679e5c BatchNorm: fix mixed precision usage with affine=False (#61962)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/61924

The fused backward kernel was using the weight dtype to detect mixed precision usage, but the weights can be None while `running_mean` and `running_var` are still in mixed precision. So I updated the check to look at those variables as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61962

Reviewed By: albanD

Differential Revision: D29825516

Pulled By: ngimel

fbshipit-source-id: d087fbf3bed1762770cac46c0dcec30c03a86fda
2021-07-23 09:55:52 -07:00
Vitaly Fedyunin
b60d1b713e Revert D26007050: add channels last support for thnn_conv2d (non-dilated)
Test Plan: revert-hammer

Differential Revision:
D26007050 (8b88c24670)

Original commit changeset: 1289e0687c24

fbshipit-source-id: 88b679efbcae572fe604d50e2199861cadbc3d4a
2021-07-22 08:31:15 -07:00
Thomas J. Fan
17d743ff04 ENH Adds test and docs for dropout for no batch dims (#61911)
Summary:
Towards https://github.com/pytorch/pytorch/issues/60585

I think `Dropout` is already tested in `test_Dropout` for no batch dims.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61911

Reviewed By: albanD

Differential Revision: D29810928

Pulled By: jbschlosser

fbshipit-source-id: 7716a1a808e9e34aae43573f38706212552afbb4
2021-07-21 09:07:10 -07:00
Thomas J. Fan
48af9de92f ENH Enables No-batch for *Pad1d Modules (#61060)
Summary:
Toward https://github.com/pytorch/pytorch/issues/60585

This PR adds a `single_batch_reference_fn` that uses the single batch implementation to check no-batch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61060

Reviewed By: mrshenli

Differential Revision: D29739823

Pulled By: jbschlosser

fbshipit-source-id: d90d88a3671177a647171801cc6ec7aa3df35482
2021-07-21 07:12:41 -07:00
Calvin McCarter
bdf439a958 Adds _LazyInstanceNorm and LazyInstanceNormXd (#60982)
Summary:
Signed-off-by: Calvin McCarter <calvin@lightmatter.co>

Fixes https://github.com/pytorch/pytorch/issues/60981

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60982

Reviewed By: albanD

Differential Revision: D29810547

Pulled By: jbschlosser

fbshipit-source-id: d933d4c7fe5cf7be9b09a5ab93f740b94cf08cc1
2021-07-21 06:45:45 -07:00
mingfeima
8b88c24670 add channels last support for thnn_conv2d (non-dilated) (#49582)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49582

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D26007050

Pulled By: VitalyFedyunin

fbshipit-source-id: 1289e0687c2459dd4eb8e4ba2efc8266397cfe5f
2021-07-20 12:50:24 -07:00
Xiong Wei
45751e0b34 Support integral target for the backward of nn.SmoothL1Loss (#61112)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/58816

- enhance the backward of `nn.SmoothL1Loss` to allow integral `target`
- add test cases in `test_nn.py` to check the `input.grad` between the integral input and its floating counterpart.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61112

Reviewed By: mrshenli

Differential Revision: D29775660

Pulled By: albanD

fbshipit-source-id: 544eabb6ce1ea13e1e79f8f18c70f148e92be508
2021-07-20 10:24:03 -07:00
Joel Schlosser
aa01a7a61c Fix for get_buffer(): check buffers by name instead of value (#61429)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/61242

Previous code was wrongly checking if a tensor is a buffer in a module by comparing values; fix compares names instead.
Docs need some updating as well; the current plan is to bump that to a separate PR, but I'm happy to do it here if preferred.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61429

Reviewed By: gchanan

Differential Revision: D29712341

Pulled By: jbschlosser

fbshipit-source-id: 41f29ab746505e60f13de42a9053a6770a3aac22
2021-07-15 09:55:09 -07:00
John Shen
343cb276b0 [pytorch] Add broadcasting support to add_relu kernel (#61584)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61584

add_relu does not work with broadcasting. This registers a scalar version of add_relu in native_functions that casts to a tensor before calling the regular function. TensorIterator handles broadcasting analogously to the existing add.
ghstack-source-id: 133480068

Test Plan: python3 test/test_nn.py TestAddRelu

Reviewed By: kimishpatel

Differential Revision: D29641768

fbshipit-source-id: 1b0ecfdb7eaf44afed83c9e9e74160493c048cbc
2021-07-14 10:32:20 -07:00
Joel Schlosser
4d842d909b Revert FC workaround for ReflectionPad3d (#61308)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/61248

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61308

Reviewed By: iramazanli

Differential Revision: D29566849

Pulled By: jbschlosser

fbshipit-source-id: 8ab443ffef7fd9840d64d71afc2f2d2b8a410ddb
2021-07-12 14:19:07 -07:00
Xiao Wang
5a17cb6f44 Add channels-last support for bilinear and nearest 2d interpolation on CUDA (#56322)
Summary:
Add channels-last support for bilinear and nearest 2d interpolation on CUDA

Benchmark (on 2070 Super) is available at

- nearest 2d: https://github.com/xwang233/code-snippet/tree/master/interpolate-channels-last/nearest-2d
- bilinear: https://github.com/xwang233/code-snippet/tree/master/interpolate-channels-last/bilinear

Some regressions are seen for tensors with small channel sizes. We may add a heuristic to dispatch between the contiguous and channels-last paths if needed.

Close https://github.com/pytorch/pytorch/issues/60137

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56322

Reviewed By: mruberry

Differential Revision: D29645980

Pulled By: ngimel

fbshipit-source-id: c36dff4ee4789bec9b01da4029f326d30067c6b7
2021-07-10 18:00:50 -07:00
mingfeima
8bec478a9e MaxPool2d: use channels_last format for both output and indice when input is channels_last (#61245)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61245

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D29557884

Pulled By: ezyang

fbshipit-source-id: 0d2b8cbaaf13411eefd7d867021bd6028d40e5cc
2021-07-07 07:50:28 -07:00
mingfeima
652d911f81 add BFloat16 support for LayerNorm CPU (#55210)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55210

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D28836793

Pulled By: VitalyFedyunin

fbshipit-source-id: 998298deedd7a18e45fb761a0a4e0d88b65f2e0c
2021-06-29 14:08:30 -07:00
Karen Zhou
965dad25a5 Allow resizing of parametrized tensors (#60418)
Summary:
Modify `parametrize.py` to allow resizing of parametrized tensors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60418

Test Plan:
`buck test mode/dev-nosan //caffe2/test:nn -- --exact 'caffe2/test:nn - test_register_and_remove_parametrization (test_nn.TestNN)'`

https://pxl.cl/1L0wh

Reviewed By: z-a-f

Differential Revision: D29279442

Pulled By: kazhou

fbshipit-source-id: 4d94915748f896e7761a40ad18f4c6444f505c3a
2021-06-28 12:57:11 -07:00
joerg-de
387289d4a5 support non-contiguous tensor in bilinear (#38409)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38409

Reviewed By: anjali411

Differential Revision: D29361043

Pulled By: albanD

fbshipit-source-id: 05147a9b0f7a47204bcd5ff70e281a464e8de1e6
2021-06-28 10:43:21 -07:00
Thomas J. Fan
e63db3ae46 ENH Adds byte support for nll_loss (CUDA) (#60650)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/59765

This PR adds byte support for nll_loss on CUDA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60650

Reviewed By: albanD

Differential Revision: D29429456

Pulled By: jbschlosser

fbshipit-source-id: 894c969ed6bfc6117dee8e844a7cb5b99977247c
2021-06-28 08:20:13 -07:00
Natalia Gimelshein
5b118a7f23 Don't reference reflection_pad3d in functional.py (#60837)
Summary:
To work around FC issue

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60837

Reviewed By: jbschlosser

Differential Revision: D29421142

Pulled By: ngimel

fbshipit-source-id: f5c1d9c324173b628e286f9005edf7109162066f
2021-06-27 20:54:32 -07:00
mingfeima
dd045ab540 add channels last for AdapativeMaxPool2d (#48920)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48920

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D25399467

Pulled By: VitalyFedyunin

fbshipit-source-id: d9d2cc728cc7a18a26983e96d3c3e81a23659e89
2021-06-25 16:36:20 -07:00
Hongbo Zhang
ad69e2fd11 [torch] Module fix on the support of LazyModule on bug #60132 (#60517)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60517

This fixes the module support in LazyModuleMixin for bug issue #60132.
Check the link: https://github.com/pytorch/pytorch/issues/60132

We will have to update lazy_extension given the dependency on module.py and update the unit test as well.

Test Plan:
Unit test passes

torchrec test passes

Reviewed By: albanD

Differential Revision: D29274068

fbshipit-source-id: 1c20f7f0556e08dc1941457ed20c290868346980
2021-06-25 16:20:19 -07:00
lezcano
3a838e4ce3 Parametrizations depending on several inputs (#60530)
Summary:
Resubmit of https://github.com/pytorch/pytorch/pull/58488

There was a line that had been changed in `test_nn.py` as caught in https://github.com/pytorch/pytorch/pull/58488#discussion_r651267668

I reverted that line, which should never have been changed. I reckon that should solve the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60530

Reviewed By: ngimel

Differential Revision: D29329865

Pulled By: albanD

fbshipit-source-id: 8dfd0cd968fe26a3924dae7ca366af2c8a8639b3
2021-06-25 09:16:57 -07:00
Xiaomeng Yang
963c983366 Improve numerical stability of LayerNorm (#59987)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59987

Similar to GroupNorm, improve the numerical stability of LayerNorm with Welford's algorithm and pairwise summation.
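
For reference, a sketch of Welford's single-pass update rule (illustrative only; the actual kernel is vectorized and additionally uses pairwise summation):

```python
import torch

def welford_mean_var(x):
    mean = torch.zeros(())
    m2 = torch.zeros(())
    for n, v in enumerate(x.flatten(), start=1):
        delta = v - mean
        mean = mean + delta / n       # running mean
        m2 = m2 + delta * (v - mean)  # running sum of squared deviations
    return mean, m2 / x.numel()       # population variance

# Large offsets are where the naive E[x^2] - E[x]^2 formula loses precision.
x = torch.randn(1024) + 1e4
print(welford_mean_var(x))
```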

Test Plan: buck test mode/dev-nosan //caffe2/test:nn -- "LayerNorm"

Reviewed By: ngimel

Differential Revision: D29115235

fbshipit-source-id: 5183346c3c535f809ec7d98b8bdf6d8914bfe790
2021-06-25 02:22:42 -07:00
mingfeima
5a077bb10b Optimize some reduction operators on CPU BFloat16 (#55202)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55202

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D28836790

Pulled By: VitalyFedyunin

fbshipit-source-id: f3a29633d85eb5a614652e568140e9b19509f959
2021-06-24 10:50:24 -07:00
Thomas J. Fan
99b641169b Migrates nll_loss_forward from TH to Aten (CUDA) (#60097)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/24610
Aten Umbrella issue https://github.com/pytorch/pytorch/issues/24507
Related to https://github.com/pytorch/pytorch/issues/59765

The performance does not change between this PR and master with the following benchmark script:

<details>
 <summary>Benchmark script</summary>

```python
import torch
import torch.nn as nn
import time

torch.manual_seed(0)

def _time():
    torch.cuda.synchronize()
    MS_PER_SECOND = 1000
    return time.perf_counter() * MS_PER_SECOND

device = "cuda"
C = 30
softmax = nn.LogSoftmax(dim=1)
n_runs = 250

for reduction in ["none", "mean", "sum"]:
    for N in [100_000, 500_000, 1_000_000]:
        fwd_t = 0
        bwd_t = 0
        data = torch.randn(N, C, device=device)
        target = torch.empty(N, dtype=torch.long, device=device).random_(0, C)
        loss = nn.NLLLoss(reduction=reduction)
        input = softmax(data)

        for i in range(n_runs):
            t1 = _time()
            result = loss(input, target)
            t2 = _time()
            fwd_t = fwd_t + (t2 - t1)
        fwd_avg = fwd_t / n_runs
        print(
            f"input size({N}, {C}), reduction: {reduction} "
            f"forward time is {fwd_avg:.2f} (ms)"
        )
    print()
```

</details>

## master

```
input size(100000, 30), reduction: none forward time is 0.02 (ms)
input size(500000, 30), reduction: none forward time is 0.08 (ms)
input size(1000000, 30), reduction: none forward time is 0.15 (ms)

input size(100000, 30), reduction: mean forward time is 1.81 (ms)
input size(500000, 30), reduction: mean forward time is 8.24 (ms)
input size(1000000, 30), reduction: mean forward time is 16.46 (ms)

input size(100000, 30), reduction: sum forward time is 1.66 (ms)
input size(500000, 30), reduction: sum forward time is 8.24 (ms)
input size(1000000, 30), reduction: sum forward time is 16.46 (ms)
```

## this PR

```
input size(100000, 30), reduction: none forward time is 0.02 (ms)
input size(500000, 30), reduction: none forward time is 0.08 (ms)
input size(1000000, 30), reduction: none forward time is 0.15 (ms)

input size(100000, 30), reduction: mean forward time is 1.80 (ms)
input size(500000, 30), reduction: mean forward time is 8.24 (ms)
input size(1000000, 30), reduction: mean forward time is 16.46 (ms)

input size(100000, 30), reduction: sum forward time is 1.66 (ms)
input size(500000, 30), reduction: sum forward time is 8.24 (ms)
input size(1000000, 30), reduction: sum forward time is 16.46 (ms)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60097

Reviewed By: mrshenli

Differential Revision: D29303099

Pulled By: ngimel

fbshipit-source-id: fc0d636543a79ea81158d286dcfb84043bec079a
2021-06-23 19:47:01 -07:00
Thomas J. Fan
da030c59e7 ENH Adds Byte support for nll_loss (CPU) (#60308)
Summary:
Addresses a part of https://github.com/pytorch/pytorch/issues/59765

This PR adds byte support for nll_loss on the CPU for `input.dim() == 2`.

CUDA support will be implemented when `nll_loss` migration to CUDA is completed in https://github.com/pytorch/pytorch/pull/60299 and https://github.com/pytorch/pytorch/pull/60097

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60308

Reviewed By: VitalyFedyunin

Differential Revision: D29329458

Pulled By: jbschlosser

fbshipit-source-id: d3585c4966030bc61e451f8aa817406a8a3acf47
2021-06-23 12:16:45 -07:00
Nikita Shulga
7b2d375148 Fix convolution_depthwise3x3_winograd for multichannel output (#60460)
Summary:
Before this change it was implemented under the assumption that the number of groups, input channels and output channels are all the same, which is not always the case.
Extend the implementation to support any number of output channels, as long as the number of groups equals the number of input channels (i.e. kernel.size(1) == 1).

Fixes https://github.com/pytorch/pytorch/issues/60176

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60460

Reviewed By: albanD

Differential Revision: D29299693

Pulled By: malfet

fbshipit-source-id: 31130c71ce86535ccfba2f4929eee3e2e287b2f0
2021-06-23 10:38:14 -07:00
Ilqar Ramazanli
79dc500a99 Add error message for sequence length to be equal to 0 case for RNNs (#60269)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50192

As discussed in the issue, the RNN APIs currently do not support inputs with `seq_len=0`, and the error message does not reflect this clearly. This PR addresses the issue by adding a clearer error message stating that none of the RNN APIs (nn.RNN, nn.GRU and nn.LSTM) support `seq_len=0`, for either one-directional or bi-directional layers.

```
import torch

input_size = 5
hidden_size = 6
rnn = torch.nn.GRU(input_size, hidden_size)

for seq_len in reversed(range(4)):
    output, h_n = rnn(torch.zeros(seq_len, 10, input_size))
    print('{}, {}'.format(output.shape, h_n.shape))
```

Previously this gave the output:

```
torch.Size([3, 10, 6]), torch.Size([1, 10, 6])
torch.Size([2, 10, 6]), torch.Size([1, 10, 6])
torch.Size([1, 10, 6]), torch.Size([1, 10, 6])
Traceback (most recent call last):
  File "test.py", line 8, in <module>
    output, h_n = rnn(torch.zeros(seq_len, 10, input_size))
  File "/opt/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/miniconda3/lib/python3.8/site-packages/torch/nn/modules/rnn.py", line 739, in forward
    result = _VF.gru(input, hx, self._flat_weights, self.bias, self.num_layers,
RuntimeError: stack expects a non-empty TensorList
```

However, after this PR, the error message changes for any combination of
[RNN, GRU and LSTM] x [one-directional, bi-directional].

Let's illustrate the change with the following code snippet:

```
import torch

input_size = 5
hidden_size = 6
rnn = torch.nn.LSTM(input_size, hidden_size, bidirectional=True)
output, h_n = rnn(torch.zeros(0, 10, input_size))
```

gives the following output:

```
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/fsx/users/iramazanli/pytorch/torch/nn/modules/module.py", line 1054, in _call_impl
    return forward_call(*input, **kwargs)
  File "/fsx/users/iramazanli/pytorch/torch/nn/modules/rnn.py", line 837, in forward
    result = _VF.gru(input, hx, self._flat_weights, self.bias, self.num_layers,
RuntimeError: Expected sequence length to be larger than 0 in RNN
```

***********************************

The change didn't seem necessary for PackedSequence, because the error message from the following code snippet is already clear about the issue:

```
import torch
import torch.nn.utils.rnn as rnn_utils
import torch.nn as nn
packed = rnn_utils.pack_sequence([])
```

returns:

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/fsx/users/iramazanli/pytorch/torch/nn/utils/rnn.py", line 398, in pack_sequence
    return pack_padded_sequence(pad_sequence(sequences), lengths, enforce_sorted=enforce_sorted)
  File "/fsx/users/iramazanli/pytorch/torch/nn/utils/rnn.py", line 363, in pad_sequence
    return torch._C._nn.pad_sequence(sequences, batch_first, padding_value)
RuntimeError: received an empty list of sequences
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60269

Reviewed By: mrshenli

Differential Revision: D29299914

Pulled By: iramazanli

fbshipit-source-id: 5ca98faa28d4e6a5a2f7600a30049de384a3b132
2021-06-22 21:25:05 -07:00
Philip Meier
0c916c8a4e up the priority of numpy array comparisons in self.assertEqual (#59067)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/58988.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59067

Reviewed By: jbschlosser

Differential Revision: D28986642

Pulled By: heitorschueroff

fbshipit-source-id: 3ef2d26b4010fc3519d0a1a020ea446ffeb46ba0
2021-06-22 13:07:07 -07:00
Jeffrey Wan
b34965435d Improve testing of inplace views (#59891)
Summary:
Partially addresses https://github.com/pytorch/pytorch/issues/49825 by improving the testing
 - Rename some of the old tests that had "inplace_view" in their names, but actually mean "inplace_[update_]on_view" so there is no confusion with the naming
 - Adds some tests in test_view_ops that verify basic behavior
 - Add tests that creation meta is properly handled for no-grad, multi-output, and custom function cases
 - Add a test that verifies that in the cross-dtype view case, the inplace views won't be accounted for in the backward graph on rebase, as mentioned in the issue.
 - Update inference mode tests to also check in-place

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59891

Reviewed By: albanD

Differential Revision: D29272546

Pulled By: soulitzer

fbshipit-source-id: b12acf5f0e3f788167ebe268423cdb58481b56f6
2021-06-22 12:28:09 -07:00
Thomas J. Fan
c16f87949f ENH Adds nn.ReflectionPad3d (#59791)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/27655

This PR adds a C++ and Python version of ReflectionPad3d with structured kernels. The implementation uses lambdas extensively to better share code from the backward and forward pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59791

Reviewed By: gchanan

Differential Revision: D29242015

Pulled By: jbschlosser

fbshipit-source-id: 18e692d3b49b74082be09f373fc95fb7891e1b56
2021-06-21 10:53:14 -07:00
Eddie Yan
3870e68644 TF32 threshold twiddling for tests (#60209)
Summary:
Following https://github.com/pytorch/pytorch/issues/59624 I observed some straggling failing tests on Ampere due to TF32 thresholds. This PR just twiddles some more thresholds to fix the (6) failing tests I saw on A100.

CC Flamefire ptrblck ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60209

Reviewed By: gchanan

Differential Revision: D29220508

Pulled By: ngimel

fbshipit-source-id: 7c83187a246e1b3a24b181334117c0ccf2baf311
2021-06-18 11:41:33 -07:00
Alban Desmaison
5c1d17e697 Revert D29100708: [pytorch][PR] Parametrizations depending on several inputs
Test Plan: revert-hammer

Differential Revision:
D29100708 (061e71b199)

Original commit changeset: b9e91f439cf6

fbshipit-source-id: bff6d8a3d7b24f4beb976383912033c250d91a53
2021-06-14 14:08:50 -07:00
lezcano
061e71b199 Parametrizations depending on several inputs (#58488)
Summary:
Makes it possible for the first registered parametrization to depend on a number of parameters rather than just one. Examples of these types of parametrizations are `torch.nn.utils.weight_norm` and low-rank parametrizations via the multiplication of an `n x k` tensor by a `k x m` tensor with `k <= m, n`.

Follows the plan outlined in https://github.com/pytorch/pytorch/pull/33344#issuecomment-768574924. A short summary of the idea is: we call `right_inverse` when registering a parametrization to generate the tensors that we are going to save. If `right_inverse` returns a sequence of tensors, then we save them as `original0`, `original1`...  If it returns a `Tensor` or a sequence of length 1, we save it as `original`.

We only allow many-to-one parametrizations for the first parametrization registered; any subsequent parametrizations need to be one-to-one.

There were a number of choices in the implementation:

If the `right_inverse` returns a sequence of parameters, then we unpack it in the forward. This allows writing code such as:
```python
class Sum(nn.Module):
  def forward(self, X, Y):
    return X + Y
  def right_inverse(self, Z):
    return Z, torch.zeros_like(Z)
```
rather than having to manually unpack a list or tuple within the `forward` function.

At the moment the error checking is a bit scattered. This is to avoid having to check some properties of `forward` and `right_inverse` when they are registered. I left it like this for now, but I believe it would be better to call these functions when they are registered, to make sure the invariants hold and to throw errors as early as possible.

The invariants are the following:
1. The following code should be well-formed
```python
X = module.weight
Y = param.right_inverse(X)
assert isinstance(Y, Tensor) or isinstance(Y, collections.Sequence)
Z = param(Y) if isinstance(Y, Tensor) else param(*Y)
```
in other words, if `Y` is a `Sequence` of `Tensor`s (we also check that the elements of the sequence are Tensors), then it is of the same length as the number of parameters `param.forward` accepts.

2. Always: `X.dtype == Z.dtype and X.shape == Z.shape`. This is to protect the user from shooting themselves in the foot, as it's too odd for a parametrization to change the metadata of a tensor.
3. If it's one-to-one: `X.dtype == Y.dtype`. This is to be able to do `X.set_(Y)`, so that if a user first instantiates the optimiser and then registers the parametrisation, we reuse `X` and the user does not need to add a new parameter to the optimiser. Alas, this is not possible when the parametrisation is many-to-one. The current implementation of `spectral_norm` and `weight_norm` does not seem to care about this, so this would not be a regression. I left a warning in the documentation though, as this case is a bit tricky.

I still need to go over the formatting of the documentation; I'll do that tomorrow.
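
A minimal usage sketch of the many-to-one case described above (the `Sum` module is the example from this message; the `original0`/`original1` naming follows the scheme above):

```python
import torch
import torch.nn as nn
import torch.nn.utils.parametrize as parametrize

class Sum(nn.Module):
    def forward(self, X, Y):
        return X + Y
    def right_inverse(self, Z):
        # decompose the current weight into the two saved originals
        return Z, torch.zeros_like(Z)

m = nn.Linear(3, 3)
parametrize.register_parametrization(m, "weight", Sum())
print(m.parametrizations.weight.original0.shape)  # torch.Size([3, 3])
```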

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58488

Reviewed By: soulitzer

Differential Revision: D29100708

Pulled By: albanD

fbshipit-source-id: b9e91f439cf6b5b54d5fa210ec97c889efb9da38
2021-06-14 11:11:47 -07:00
Xiaomeng Yang
ff15d93b88 Improve numerical stability of GroupNorm (#54921)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54921

Improve numerical stability of GroupNorm

Test Plan: buck test mode/dev-nosan //caffe2/test:nn -- "GroupNorm"

Reviewed By: ngimel

Differential Revision: D27414438

fbshipit-source-id: 815517240ca5ea3e2beb77ced3bd862e9c83d445
2021-06-13 16:13:32 -07:00
lezcano
1f6e39336f Simplify parametrizations.SpectralNorm and improve its initialization (#59564)
Summary:
Implements a number of changes discussed with soulitzer offline.
In particular:
- Initialise `u`, `v` in `__init__` rather than in `_update_vectors`
- Initialise `u`, `v` to some reasonable vectors by doing 15 power iterations at the start
- Simplify the code of `_reshape_weight_to_matrix` (and make it faster) by using `flatten`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59564

Reviewed By: ailzhang

Differential Revision: D29066238

Pulled By: soulitzer

fbshipit-source-id: 6a58e39ddc7f2bf989ff44fb387ab408d4a1ce3d
2021-06-11 19:52:44 -07:00
mingfeima
f3218568ad optimize channels last for BatchNorm2d on CPU (#59286)
Summary:
replacement of https://github.com/pytorch/pytorch/issues/48919
Optimize channels-last performance for BatchNorm2d on CPU.
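
A minimal sketch of the input layout this targets (sizes illustrative):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(32)
x = torch.randn(8, 32, 56, 56).to(memory_format=torch.channels_last)
y = bn(x)   # takes the optimized NHWC (channels-last) path on CPU
```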

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59286

Reviewed By: bdhirsh

Differential Revision: D29008198

Pulled By: VitalyFedyunin

fbshipit-source-id: 8a7d020bd6a42ab5c21ffe788b79a22f4ec82ac0
2021-06-11 16:30:16 -07:00
mingfeima
bb19dc14cc add channels last support for AvgPool2d on CPU (#58725)
Summary:
replacement of: https://github.com/pytorch/pytorch/pull/48918

enable test case on AvgPool2d channels last for CPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58725

Reviewed By: ngimel

Differential Revision: D28593169

Pulled By: VitalyFedyunin

fbshipit-source-id: 5de870fe1d9dd961fb0dab5f9d531ab14614a160
2021-06-09 21:06:45 -07:00
Kimish Patel
c5bee1ec4f [PyTorch] Parallelize gelu via tensoriterator (#58950)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58950

Use TensorIterator's API to set the grain size in order to parallelize the gelu op.
ghstack-source-id: 130947174

Test Plan: test_gelu

Reviewed By: ezyang

Differential Revision: D28689819

fbshipit-source-id: 0a02066d47a4d9648323c5ec27d7e0e91f4c303a
2021-06-09 16:09:38 -07:00
Alexander Grund
804f924504 Fix accuraccy failures when running test_nn on A100s (#59624)
Summary:
Make sure tests that run explicitly without TF32 don't use TF32 operations

Fixes https://github.com/pytorch/pytorch/issues/52278

After the TF32 accuracy tolerance was increased to 0.05, this is the only remaining change required to fix the above issue (for TestNN.test_Conv3d_1x1x1_no_bias_cuda)
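
For reference, the switches such tests toggle (a sketch, not the test-suite helper itself):

```python
import torch

# explicitly disable TF32 so fp32 matmuls/convolutions run at full precision
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
```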

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59624

Reviewed By: heitorschueroff

Differential Revision: D28996279

Pulled By: ngimel

fbshipit-source-id: 7f1b165fd52cfa0898a89190055b7a4b0985573a
2021-06-09 14:38:34 -07:00
Nikita Vedeneev
c51abf8fca Make binary_cross_entropy differentiable wrt target (#59447)
Summary:
As per title. Resolves https://github.com/pytorch/pytorch/issues/56683.
`gradgradcheck` will fail once `target.requires_grad == True` because of the limitations of the current double backward implementation.
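
A minimal sketch of what this enables (values illustrative):

```python
import torch
import torch.nn.functional as F

input = torch.tensor([0.2, 0.8], requires_grad=True)
target = torch.tensor([0.5, 0.5], requires_grad=True)  # may now require grad
F.binary_cross_entropy(input, target).backward()
print(target.grad)   # populated after this change
```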

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59447

Reviewed By: agolynski

Differential Revision: D28910140

Pulled By: albanD

fbshipit-source-id: 20934880eb4d22bec34446a6d1be0a38ef95edc7
2021-06-07 09:20:17 -07:00
Thomas J. Fan
7f2e620105 FIX Validates that weights are 2d in embedding (#59314)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55185

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59314

Reviewed By: H-Huang

Differential Revision: D28837753

Pulled By: jbschlosser

fbshipit-source-id: 683378244c61b0937c95563f91ef87ab09fd1653
2021-06-02 12:52:21 -07:00
Jagadish Krishnamoorthy
95c26b2806 [ROCm] disable test test_Conv2d_groups_nobias for ROCm (#59158)
Summary:
Disabling the test since it's failing on ROCm 4.2

Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59158

Reviewed By: mruberry

Differential Revision: D28808953

Pulled By: ngimel

fbshipit-source-id: 134f147ead6dc559d2cde49cf8343cd976e6c224
2021-06-01 15:10:06 -07:00
Joel Schlosser
ef32a29c97 Back out "[pytorch][PR] ENH Adds dtype to nn.functional.one_hot" (#59080)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59080

Original commit changeset: 3686579517cc

Test Plan: None; reverting diff

Reviewed By: albanD

Differential Revision: D28746799

fbshipit-source-id: 75a7885ab0bf3abadde9a42b56d479f71f57c89c
2021-05-27 15:40:52 -07:00
Adnios
09a8f22bf9 Add mish activation function (#58648)
Summary:
See issue: https://github.com/pytorch/pytorch/issues/58375
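
For reference, a minimal sketch of the definition, Mish(x) = x * tanh(softplus(x)):

```python
import torch
import torch.nn.functional as F

x = torch.randn(5)
print(torch.allclose(F.mish(x), x * torch.tanh(F.softplus(x))))  # True
```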

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58648

Reviewed By: gchanan

Differential Revision: D28625390

Pulled By: jbschlosser

fbshipit-source-id: 23ea2eb7d5b3dc89c6809ff6581b90ee742149f4
2021-05-25 10:36:21 -07:00
Thomas J. Fan
a7f4f80903 ENH Adds dtype to nn.functional.one_hot (#58090)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/33046
Related to https://github.com/pytorch/pytorch/issues/53785

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58090

Reviewed By: zou3519

Differential Revision: D28640893

Pulled By: jbschlosser

fbshipit-source-id: 3686579517ccc75beaa74f0f6d167f5e40a83fd2
2021-05-24 13:48:25 -07:00
Joel Schlosser
c58709b7bb Helper function for skipping module parameter / buffer initialization (#57555)
Summary:
This PR introduces a helper function named `torch.nn.utils.skip_init()` that accepts a module class object + `args` / `kwargs` and instantiates the module while skipping initialization of parameter / buffer values. See discussion at https://github.com/pytorch/pytorch/issues/29523 for more context. Example usage:

```python
import torch

m = torch.nn.utils.skip_init(torch.nn.Linear, 5, 1)
print(m.weight)

m2 = torch.nn.utils.skip_init(torch.nn.Linear, 5, 1, device='cuda')
print(m2.weight)

m3 = torch.nn.utils.skip_init(torch.nn.Linear, in_features=5, out_features=1)
print(m3.weight)
```
```
Parameter containing:
tensor([[-3.3011e+28,  4.5915e-41, -3.3009e+28,  4.5915e-41,  0.0000e+00]],
       requires_grad=True)
Parameter containing:
tensor([[-2.5339e+27,  4.5915e-41, -2.5367e+27,  4.5915e-41,  0.0000e+00]],
       device='cuda:0', requires_grad=True)
Parameter containing:
tensor([[1.4013e-45, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00]],
       requires_grad=True)
```

Bikeshedding on the name / namespace is welcome, as well as comments on the design itself - just wanted to get something out there for discussion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57555

Reviewed By: zou3519

Differential Revision: D28640613

Pulled By: jbschlosser

fbshipit-source-id: 5654f2e5af5530425ab7a9e357b6ba0d807e967f
2021-05-24 11:28:32 -07:00
Kyle Chen
52a8031e8c [ROCm] disable test test_Conv2d_groups_nobias_v2 for ROCm (#58701)
Summary:
Disable the test_Conv2d_groups_nobias_v2 test because it is failing on ROCm 4.2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58701

Reviewed By: ngimel

Differential Revision: D28626651

Pulled By: mruberry

fbshipit-source-id: a74bdf45335ae2afee0aa5e3bece6e208e75a63f
2021-05-23 15:43:36 -07:00
Rong Rong (AI Infra)
c1c9be16c4 port mm to structure kernel (#57755)
Summary:
Related to https://github.com/pytorch/pytorch/issues/57417.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57755

Reviewed By: ezyang

Differential Revision: D28426111

Pulled By: walterddr

fbshipit-source-id: 943d3e36433ca846990b940177fb040553961156
2021-05-22 19:24:14 -07:00
Thomas J. Fan
151ec56311 ENH Adds check for input sizes in cosine_similarity (#58559)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55273

Adds a check that input sizes are consistent with the docstring.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58559

Reviewed By: soulitzer

Differential Revision: D28562376

Pulled By: ailzhang

fbshipit-source-id: f292e8a26f11a40d146fbed94a28025794808216
2021-05-20 11:40:06 -07:00
Thomas J. Fan
ee93a348de ENH Raises nicer error when calling module.train with invalid modes (#58247)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46763

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58247

Reviewed By: ejguan

Differential Revision: D28418080

Pulled By: albanD

fbshipit-source-id: fef8f4f641ef75e801ed8b8d04c4016579aea8b0
2021-05-17 05:57:18 -07:00
Vitaly Fedyunin
49a8942a77 Revert D25399466: add channels last support for AvgPool2d on CPU
Test Plan: revert-hammer

Differential Revision:
D25399466 (8ac0917cc7)

Original commit changeset: 9477b0c281c0

fbshipit-source-id: e0245f0e390f5eca228445fd82d6e5142a827abc
2021-05-14 12:45:29 -07:00
Vitaly Fedyunin
0caec739a3 Revert D25399468: optimize channels last for BatchNorm2d on CPU
Test Plan: revert-hammer

Differential Revision:
D25399468 (0be334a1ba)

Original commit changeset: a4cd7a09cd4e

fbshipit-source-id: cef74881adcdf193355fa5a77e816addd2e2c56e
2021-05-14 12:44:14 -07:00
mingfeima
0be334a1ba optimize channels last for BatchNorm2d on CPU (#48919)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48919

move data indexing utils

parallel inference contiguous path

parallel inference channels last path

add dim apply

optimize update stats

add channels last support for backward

Revert "add channels last support for backward"

This reverts commit cc5e29dce44395250f8e2abf9772f0b99f4bcf3a.

Revert "optimize update stats"

This reverts commit 7cc6540701448b9cfd5833e36c745b5015ae7643.

Revert "add dim apply"

This reverts commit b043786d8ef72dee5cf85b5818fcb25028896ecd.

bug fix

add batchnorm nhwc test for cpu, including C=1 and HW=1

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D25399468

Pulled By: VitalyFedyunin

fbshipit-source-id: a4cd7a09cd4e1a8f5cdd79c7c32c696d0db386bd
2021-05-14 11:09:42 -07:00
Peter Bell
064923e635 Improve native_batch_norm_backward performance (CUDA) (#58240)
Summary:
Fixes  https://github.com/pytorch/pytorch/issues/38915

The original code uses a single kernel to do both the reduction and the elementwise backward calculations, whereas the `SyncBatchNorm` kernels are split, which makes them slightly slower in some cases. I try to use the fused kernel when it's beneficial, and otherwise choose the optimized channels-last split kernels. There is also eval mode, where the reduction is sometimes unnecessary; in that case the split kernels are a win even without channels last.

Benchmarks on my system show significant speedups for channels-last reductions and eval mode, with only a few minor slowdowns in training mode. These slowdowns are for the 2 x 2048 shape in training, which is a small channels-last input. But for larger batches or channels, the channels-last kernels are much faster.

|N   |C   |L   |training|backward|old   |new   |cudnn |
|----|----|----|--------|--------|------|------|------|
|1   |256 |3136|TRUE    |all     |70.25 |64.93 |68.90 |
|    |    |    |TRUE    |self    |69.77 |64.61 |69.42 |
|    |    |    |FALSE   |all     |70.10 |51.12 |x     |
|    |    |    |FALSE   |self    |70.17 |51.17 |x     |
|3136|256 |    |TRUE    |all     |554.08|76.63 |549.88|
|    |    |    |TRUE    |self    |553.34|78.19 |552.36|
|    |    |    |FALSE   |all     |565.40|55.09 |x     |
|    |    |    |FALSE   |self    |565.71|54.84 |x     |
|2   |8192|1   |TRUE    |all     |155.47|47.26 |202.26|
|    |    |    |TRUE    |self    |155.46|48.36 |203.72|
|    |    |    |FALSE   |all     |178.28|40.90 |x     |
|    |    |    |FALSE   |self    |178.21|40.69 |x     |
|2   |2048|1   |TRUE    |all     |43.50 |48.21 |57.47 |
|    |    |    |TRUE    |self    |43.63 |47.24 |55.22 |
|    |    |    |FALSE   |all     |49.36 |39.27 |x     |
|    |    |    |FALSE   |self    |49.25 |42.02 |x     |
|128 |8192|1   |TRUE    |all     |762.70|106.45|336.52|
|    |    |    |TRUE    |self    |765.79|107.04|337.32|
|    |    |    |FALSE   |all     |792.68|74.94 |x     |
|    |    |    |FALSE   |self    |793.86|74.83 |x     |
|128 |2048|1   |TRUE    |all     |188.37|46.20 |85.02 |
|    |    |    |TRUE    |self    |188.47|47.57 |85.04 |
|    |    |    |FALSE   |all     |191.57|40.44 |x     |
|    |    |    |FALSE   |self    |190.13|41.55 |x     |
|2   |8192|    |TRUE    |all     |156.03|43.01 |155.19|
|    |    |    |TRUE    |self    |156.24|46.59 |156.93|
|    |    |    |FALSE   |all     |179.34|40.06 |x     |
|    |    |    |FALSE   |self    |179.20|41.85 |x     |
|2   |2048|    |TRUE    |all     |44.05 |50.15 |44.21 |
|    |    |    |TRUE    |self    |44.10 |48.97 |44.11 |
|    |    |    |FALSE   |all     |49.72 |40.95 |x     |
|    |    |    |FALSE   |self    |49.87 |43.43 |x     |
|128 |8192|    |TRUE    |all     |775.19|96.60 |777.64|
|    |    |    |TRUE    |self    |776.20|96.85 |774.21|
|    |    |    |FALSE   |all     |797.64|68.01 |x     |
|    |    |    |FALSE   |self    |806.25|68.05 |x     |
|128 |2048|    |TRUE    |all     |188.49|48.10 |188.97|
|    |    |    |TRUE    |self    |188.07|46.97 |187.98|
|    |    |    |FALSE   |all     |192.32|43.78 |x     |
|    |    |    |FALSE   |self    |193.72|40.82 |x     |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58240

Reviewed By: bdhirsh

Differential Revision: D28435158

Pulled By: ngimel

fbshipit-source-id: bf62a1ee1c5d95a2caf55bee6176ae5c965688ec
2021-05-14 09:29:05 -07:00
Freey0
cf1daf571d Port silu to structured (#58050)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58050

Test Plan: Imported from OSS

Reviewed By: agolynski

Differential Revision: D28382790

Pulled By: ezyang

fbshipit-source-id: 5aeedfe39b5f15d14022d1e9edec1b30e98e5076
2021-05-14 00:49:10 -07:00
Freey0
f23e10f27b Port softshrink to structured (#57623)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57623

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D28224703

Pulled By: ezyang

fbshipit-source-id: 62e40d53eb130205f6c4d2775082e436e6adadce
2021-05-14 00:49:09 -07:00
Freey0
401d0fe8c5 Port leaky_relu to structured (#57621)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57621

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D28224706

Pulled By: ezyang

fbshipit-source-id: 168b175d0fd9e0cc3335ea00df4c7967fea77819
2021-05-14 00:49:05 -07:00
Freey0
9dba26eed1 Port softplus to structured (#57620)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57620

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D28224705

Pulled By: ezyang

fbshipit-source-id: a48419f5958e4d29427fb1fec7ff929f0297e4e4
2021-05-14 00:49:04 -07:00
Freey0
03398b7edb Port elu to structured (#57619)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57619

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D28224707

Pulled By: ezyang

fbshipit-source-id: 9e1cad3f5536c65ada2d951366de134ebcb6bb3f
2021-05-14 00:47:41 -07:00
mingfeima
8ac0917cc7 add channels last support for AvgPool2d on CPU (#48918)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48918

enable test case on AvgPool2d channels last for CPU

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D25399466

Pulled By: VitalyFedyunin

fbshipit-source-id: 9477b0c281c0de5ed981a97e2dcbe6072d7f0aef
2021-05-13 18:05:57 -07:00
Jeffrey Wan
e1bb9d2d99 Reimplement spectral_norm using new parametrization functionality (#57784)
Summary:
Adds a new file under `torch/nn/utils/parametrizations.py` which should contain all the parametrization implementations

For spectral_norm we add the `SpectralNorm` module which can be registered using `torch.nn.utils.parametrize.register_parametrization` or using a wrapper: `spectral_norm`, the same API the old implementation provided.

Most of the logic is borrowed from the old implementation:
 - Just like the old implementation, there are cases where retrieving the weight should perform another power iteration (thus updating the weight) and cases where it shouldn't. For example, in eval mode (`self.training == False`) we do not perform power iteration.

There are also some differences/difficulties with the new implementation:
 - Using the new parametrization functionality as-is, there doesn't seem to be a good way to tell whether a 'forward' call was triggered because parametrizations were unregistered (with leave_parametrized=True) or because the injected property's getter was invoked. We want to perform power iteration in the latter case but not the former, and we don't have this control as-is. So, in this PR I modified the parametrization functionality to change the module to eval mode before triggering its forward call
 - Updates the vectors based on the weight at initialization to fix https://github.com/pytorch/pytorch/issues/51800 (this avoids silently updating weights in eval mode). This also means that we perform twice as many power iterations by the first forward.
 - right_inverse is just the identity for now, but maybe it should assert that the passed value already satisfies the constraints
 - So far, all the old spectral_norm tests have been cloned, but maybe we don't need so much testing now that the core functionality is already well tested
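
A minimal usage sketch of the new wrapper (names per this PR):

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import spectral_norm

m = spectral_norm(nn.Linear(4, 4))
# the weight is now computed through a registered parametrization;
# power iteration drives its largest singular value towards 1
print(torch.linalg.norm(m.weight.detach(), ord=2))
```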

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57784

Reviewed By: ejguan

Differential Revision: D28413201

Pulled By: soulitzer

fbshipit-source-id: e8f1140f7924ca43ae4244c98b152c3c554668f2
2021-05-13 14:16:13 -07:00
lezcano
d8c6b74b0b Deprecate torch.solve (#57741)
Summary:
Deprecate deprecate deprecate.
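
For anyone migrating, a sketch of the replacement (note the swapped argument order; `torch.solve` has since been removed in later releases):

```python
import torch

A = torch.randn(3, 3)
b = torch.randn(3, 1)

x_old, _ = torch.solve(b, A)       # deprecated: (b, A) order, returns (solution, LU)
x_new = torch.linalg.solve(A, b)   # replacement: natural (A, b) order
print(torch.allclose(x_old, x_new))
```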

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57741

Reviewed By: agolynski

Differential Revision: D28379337

Pulled By: mruberry

fbshipit-source-id: a7a35ce1d3f25d8593698d89761c6c2d940db31a
2021-05-13 09:54:21 -07:00
Natalia Gimelshein
e573987bea remove syncs in one_hot (#57902)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55579
Now, on CUDA, one_hot relies on device-side asserts thrown by scatter

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57902

Reviewed By: bdhirsh

Differential Revision: D28328698

Pulled By: ngimel

fbshipit-source-id: 1cd13e2c123c733cde7dbe4cbe6ff5335063bb70
2021-05-11 17:54:08 -07:00
Sigmund_Rolfsjord
8b12c8e8b3 Fixes: register_full_backward_hook crash if first argument don't require a gradient (#57944) (#57945)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/57944

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57945

Reviewed By: mruberry

Differential Revision: D28351929

Pulled By: albanD

fbshipit-source-id: d0db898e6bf13d1877cd81892a5a65c7854c8102
2021-05-11 15:07:35 -07:00
Zheng Yan
ee48bd089c Support mix of int32 and int64 offsets/indices for EmbeddingBag and its variants (#55189)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55189

Currently EmbeddingBag and its variants support either int32 or int64 indices/offsets. We have use cases with a mix of int32 and int64 indices, which are not supported yet. To avoid introducing too many branches, we simply cast the offsets' dtype to the indices' dtype when they differ.
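
A minimal sketch of the newly supported mix (sizes illustrative):

```python
import torch

bag = torch.nn.EmbeddingBag(10, 3)
indices = torch.tensor([1, 2, 4, 5], dtype=torch.int64)
offsets = torch.tensor([0, 2], dtype=torch.int32)  # dtype differs from indices
out = bag(indices, offsets)   # offsets are cast to the indices dtype internally
```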

Test Plan: unit tests

Reviewed By: allwu

Differential Revision: D27482738

fbshipit-source-id: deeadd391d49ff65d17d016092df1839b82806cc
2021-05-10 23:23:50 -07:00
Thomas J. Fan
3ec16035f2 TST Migrates some of test_nn.py from assertEqualIgnoreTypes to assertEqual (#57642)
Summary:
Related to https://github.com/pytorch/pytorch/issues/38095, https://github.com/pytorch/pytorch/issues/50006

Migrates some of `test_nn.py` from `assertEqualIgnoreTypes` to `assertEqual`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57642

Reviewed By: bdhirsh

Differential Revision: D28317761

Pulled By: mruberry

fbshipit-source-id: 6bea6f669569922b2a391d1523917edde976f014
2021-05-10 23:10:29 -07:00
Richard Zou
0787d781c5 Fix compatibility problem with LSTMs and torch.save (#57558)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57558

Fixes #53359

If someone directly saves an nn.LSTM in PyTorch 1.7 and then loads it in PyTorch
1.8, it errors out with the following:
```
# In PyTorch 1.7:
import torch
model = torch.nn.LSTM(2, 3)
torch.save(model, 'lstm17.pt')

# In PyTorch 1.8:
model = torch.load('lstm17.pt')
# AttributeError: 'LSTM' object has no attribute 'proj_size'
```

Although we do not officially support this (directly saving modules via
torch.save), it used to work and the fix is very simple. This PR adds an
extra line to `__setstate__`: if the state we are passed does not have
a `proj_size` attribute, we assume it was saved from PyTorch 1.7 and
older and set `proj_size` equal to 0.
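
For illustration, a minimal sketch of that compatibility pattern (a hypothetical subclass, not the actual patch to `nn.LSTM`):

```python
import torch.nn as nn

class CompatLSTM(nn.LSTM):
    # Backfill attributes that checkpoints saved by PyTorch <= 1.7 lack
    # before restoring the state.
    def __setstate__(self, d):
        if 'proj_size' not in d:
            d = dict(d)
            d['proj_size'] = 0  # proj_size was introduced in PyTorch 1.8
        super().__setstate__(d)
```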

Test Plan:
I wrote a test that tests `__setstate__`. But also,

Run the following:
```
# In PyTorch 1.7:
import torch
x = torch.ones(32, 5, 2)
model = torch.nn.LSTM(2, 3)
torch.save(model, 'lstm17.pt')
y17 = model(x)

# Using this PR:
model = torch.load('lstm17.pt')
x = torch.ones(32, 5, 2)
y18 = model(x)
```
and finally compare y17 and y18.

Reviewed By: mrshenli

Differential Revision: D28198477

Pulled By: zou3519

fbshipit-source-id: e107d1ebdda23a195a1c3574de32a444eeb16191
2021-05-05 07:36:13 -07:00
Xiao Wang
ac72881f3f Fix a numerical issue of CUDA channels-last SyncBatchNorm (#57077)
Summary:
Fix a numerical issue of CUDA channels-last SyncBatchNorm

The added test is a repro for the numerical issue. Thanks to jjsjann123 for identifying the root cause. Since the PyTorch SBN channels-last code was migrated from [nvidia/apex](https://github.com/nvidia/apex), apex's SBN channels-last has the same issue. We will submit a fix there soon.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57077

Reviewed By: mruberry

Differential Revision: D28107672

Pulled By: ngimel

fbshipit-source-id: 0c80e79ddb48891058414ad8a9bedd80f0f7f8df
2021-04-29 21:38:52 -07:00
Joel Schlosser
f7fba854bf Implement module.to_empty() (#56610)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54600
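
A minimal usage sketch (assuming meta-device module construction):

```python
import torch

m = torch.nn.Linear(5, 1, device='meta')  # parameters have no real storage
m = m.to_empty(device='cpu')              # allocate uninitialized storage on CPU
print(m.weight.device)                    # cpu; values are uninitialized
```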

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56610

Reviewed By: malfet

Differential Revision: D27921653

Pulled By: jbschlosser

fbshipit-source-id: 10734b3eaa5b84bb4ba6eeba1043cfc8bb570a17
2021-04-27 06:19:54 -07:00
Xiao Wang
7b31ba4708 Fix cudnn ctc loss backward (#56639)
Summary:
Fix cudnn ctc loss backward

Fix https://github.com/pytorch/pytorch/issues/49046, which worked in PyTorch 1.1

Originally modified in this PR in Oct 2019, https://github.com/pytorch/pytorch/pull/27039/files#diff-25ec2c1108ee03e2167622588ec31d167897ef1cccb12a4cfe77eb98777316daR2383-R2392

According to the original code

90ffab6e37/tools/autograd/derivatives.yaml (L1387-L1388)

and the code after PR

f461184505/tools/autograd/templates/Functions.cpp (L2456-L2465)

This `at::zeros({0}, raw_grad.options())` in line 2460 seems suspicious, and is causing an `infer_size` runtime error

```
RuntimeError: The size of tensor a (0) must match the size of tensor b (177) at non-singleton dimension 2
Exception raised from infer_size at ..\aten\src\ATen\ExpandUtils.cpp:24 (most recent call first):
```

I've modified that to `at::zeros_like(raw_grad)`, which looks more accurate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56639

Reviewed By: mruberry

Differential Revision: D27987860

Pulled By: ngimel

fbshipit-source-id: 5ad65e78d017c26894fb26318a5992b0878d04d5
2021-04-25 22:51:19 -07:00
Joel Schlosser
7d2a9f2dc9 Fix instance norm input size validation + test (#56659)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45687

The fix changes the input size check for `InstanceNorm*d` to be more restrictive, correctly rejecting sizes with only a single spatial element, regardless of batch size, to avoid infinite variance.
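
A minimal sketch of an input that is now rejected (sizes illustrative):

```python
import torch

m = torch.nn.InstanceNorm1d(3)
x = torch.randn(2, 3, 1)   # a single spatial element: variance is undefined
m(x)                       # rejected after this fix, regardless of batch size
```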

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56659

Reviewed By: pbelevich

Differential Revision: D27948060

Pulled By: jbschlosser

fbshipit-source-id: 21cfea391a609c0774568b89fd241efea72516bb
2021-04-23 10:53:39 -07:00
albanD
22b151a3ba Make sure full backward hook fire when no input requires grad (#56693)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/56380

BC-breaking note:
This changes the behavior of full backward hooks, as they will now fire properly even if no input to the Module requires gradients.
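
A minimal sketch of the new behavior (hook body illustrative):

```python
import torch

m = torch.nn.Linear(2, 2)
m.register_full_backward_hook(lambda mod, gin, gout: print("hook fired"))

x = torch.randn(1, 2)      # the input itself does not require grad
m(x).sum().backward()      # the hook now fires (the parameters still need grads)
```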

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56693

Reviewed By: ezyang

Differential Revision: D27947030

Pulled By: albanD

fbshipit-source-id: e8353d769ba5a2c1b6bdf3b64e2d61308cf624a2
2021-04-23 08:46:49 -07:00
Joel Schlosser
e5fda07e80 Fix: Compare input against beta * threshold in softplus backwards (#56484)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55587

The fix converts the binary `TensorIterator` used by softplus backwards to a ternary one, adding in the original input for comparison against `beta * threshold`.
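
A sketch of the corrected rule in Python (not the actual `TensorIterator` kernel):

```python
import torch

def softplus_grad(grad_out, x, beta=1.0, threshold=20.0):
    z = (x * beta).exp()
    # where beta * input exceeds the threshold, softplus is the identity
    return torch.where(x * beta > threshold, grad_out, grad_out * z / (z + 1))
```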

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56484

Reviewed By: malfet

Differential Revision: D27908372

Pulled By: jbschlosser

fbshipit-source-id: 73323880a5672e0242879690514a17886cbc29cd
2021-04-23 07:58:51 -07:00
Kurt Mohler
1f04494c0e Consolidate nondeterministic error tests (#55631)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/51498

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55631

Reviewed By: malfet

Differential Revision: D27909953

Pulled By: mruberry

fbshipit-source-id: 9115b2433f9c276555be55bd51b270a7a2846829
2021-04-22 23:37:01 -07:00
Jeffrey Wan
d01302431c Enable fast gradcheck for real inputs and outputs (#55237)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55237

In this PR, we reenable fast-gradcheck and resolve misc issues that arise:
Before landing this PR, land #55182 so that slow tests are still being run periodically.

Bolded indicates the issue is handled in this PR, otherwise it is handled in a previous PR.

**Non-determinism issues**:
- ops that do not have a deterministic implementation (as documented at https://pytorch.org/docs/stable/generated/torch.use_deterministic_algorithms.html#torch.use_deterministic_algorithms)
  - test_pad_cuda (replication_pad2d) (test_nn)
  - interpolate (test_nn)
  - cummin, cummax (scatter_add_cuda_kernel) (test_ops)
  - test_fn_gradgrad_prod_cpu_float64 (test_ops)

Randomness:
  - RRelu (new module tests) - we fix this by using our own generator so as to avoid messing with user RNG state (handled in #54480)

Numerical precision issues:
- jacobian mismatch: test_gelu (test_nn, float32, not able to replicate locally) - we fixed this by disabling for float32 (handled in previous  PR)
- cholesky_solve (test_linalg): #56235 handled in previous PR
- **cumprod** (test_ops) - #56275 disabled fast gradcheck

Not yet replicated:
 - test_relaxed_one_hot_categorical_2d (test_distributions)

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D27920906

fbshipit-source-id: 894dd7bf20b74f1a91a5bc24fe56794b4ee24656
2021-04-22 19:46:37 -07:00
Jeffrey Wan
2ea3c24c06 Disable flaky tests (#56279)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56279

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D27916606

Pulled By: soulitzer

fbshipit-source-id: 60c07024f6eb818f4aa6730a5f9ff90d7bc2b80f
2021-04-22 19:45:41 -07:00
Nikita Shulga
9be2cabc45 Pass contiguous weight to NNPACK convolution (#56569)
Summary:
Added TestNN.test_conv2d_discontiguous_weight to prevent further regressions

Fixes https://github.com/pytorch/pytorch/issues/55781

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56569

Reviewed By: ngimel

Differential Revision: D27926509

Pulled By: malfet

fbshipit-source-id: fa5ce943c3e4db4aa4de1b1cba35bd399fb3c54d
2021-04-22 08:45:24 -07:00