**Previous behavior**: compute inner product, then normalize.
**This patch**: first normalize, then compute the inner product. This should be more numerically stable because it avoids losing precision in the inner product for inputs with large norms.
By design this ensures that the cosine similarity is within `[-1.0, +1.0]`, so it should fix [#29442](https://github.com/pytorch/pytorch/issues/29442).
P.S. I had to change tests because this implementation handles division by 0 differently.
This PR computes cosine similarity as follows: <x/max(eps, ||x||), y/max(eps, ||y||)>.
Let f(x,y) = <x,y>/(||x|| * ||y||), then
df/dx = y/(||x|| * ||y||) - (||y||/||x|| * <x,y> * x)/(||x|| * ||y||)^2.
The changed test checks division by zero in backward when x=0 and y != 0.
For this case the non-zero part of the gradient is just y / (||x|| * ||y||).
The previous implementation evaluates y/(||x|| * ||y||) as y / eps; with this PR it evaluates to (1/eps) * y/||y||.
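A minimal PyTorch-level sketch of the normalize-then-dot formulation described above (illustrative only, not the actual ATen kernel):
```python
import torch

def cosine_similarity_ref(x, y, dim=-1, eps=1e-8):
    # Normalize each input first, clamping the norm by eps, then take the inner product.
    x_n = x / x.norm(dim=dim, keepdim=True).clamp_min(eps)
    y_n = y / y.norm(dim=dim, keepdim=True).clamp_min(eps)
    return (x_n * y_n).sum(dim=dim)

# The result is guaranteed to lie in [-1.0, +1.0] by Cauchy-Schwarz, even for huge norms.
print(cosine_similarity_ref(torch.randn(3, 5) * 1e20, torch.randn(3, 5) * 1e20))
```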
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31378
Approved by: https://github.com/ezyang, https://github.com/albanD
Summary:
In this PR, we optimize the PReLU op on the CPU path and enable BFloat16 support on top of the optimized PReLU.
The original implementation uses parallel_for to parallelize the computation, but it is not vectorized. It can be optimized with TensorIterator, which provides both parallelization and vectorization.
The difference between PReLU and other activation ops is that PReLU supports a learnable parameter `weight`. When called without arguments, nn.PReLU() uses a single parameter `weight` across all input channels. If called with nn.PReLU(nChannels), a separate `weight` is used for each input channel. So we cannot use TensorIterator directly, because `weight` can differ per input channel.
In order to use TensorIterator, `weight` is broadcast to the `input` shape. With vectorization and parallel_for, this implementation is much faster than the original one. Another advantage is that the implementation no longer needs separate paths for `share weights` and `multiple weights`.
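A rough reference of the idea in plain PyTorch (not the TensorIterator-based kernel): `weight` is reshaped so it broadcasts against `input`, and the op is applied elementwise, which covers both the shared-weight and per-channel cases with one code path. Assumes the input has batch and channel dimensions.
```python
import torch

def prelu_reference(x, weight):
    # weight holds either 1 element (shared) or x.size(1) elements (one per channel);
    # reshape it so it broadcasts along the channel dimension of x.
    w = weight.view(1, -1, *([1] * (x.dim() - 2)))
    return torch.where(x >= 0, x, w * x)
```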
We compare the performance of the PReLU implementation in public PyTorch against the optimized PReLU in this PR, covering fp32/bf16, forward/backward, and share-weights/multiple-weights configurations. bf16 in public PyTorch directly reuses `Vectorized<scalar_t>` for `BFloat16`.
Share weights:


Multiple weights:


cc albanD mruberry jbschlosser walterddr
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63634
Reviewed By: yinghai
Differential Revision: D34031616
Pulled By: frank-wei
fbshipit-source-id: 04e2a0f9e92c658fba7ff56b1010eacb7e8ab44c
(cherry picked from commit ed262b15487557720bb0d498f9f2e8fcdba772d9)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75421
As part of FSDP work, we will be relying on `_register_load_state_dict_pre_hook` to manage some specific logic related to loading state dicts.
This PR adds a test to ensure that `_register_load_state_dict_pre_hook` can be used to register hooks on modules that will be used in a nested way, and that calling `load_state_dict` on the overall module still invokes those hooks appropriately.
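A rough sketch of the scenario the test covers (illustrative module names, not the test's actual code; the hook registration API is private):
```python
import torch.nn as nn

calls = []

class Sub(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(2, 2)
        # private API under test: the hook runs before this module's state dict is loaded
        self._register_load_state_dict_pre_hook(lambda *args: calls.append("sub"))

model = nn.Sequential(Sub(), Sub())          # hooks registered on nested submodules
model.load_state_dict(model.state_dict())    # loading the outer module...
assert len(calls) == 2                        # ...still fires each nested hook
```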
Differential Revision: [D35434726](https://our.internmc.facebook.com/intern/diff/D35434726/)
Approved by: https://github.com/albanD
Summary: The primary issue for enabling sparsity to work with QAT convert (unlike normal quantization convert) is that when the parametrized module undergoes the QAT convert, the parametrizations need to be maintained. If the parametrizations don't get transferred during the convert, the sparsifier would lose its connection to the model. In practice this was handled using the transfer_parametrizations_and_params function to move the weight and bias and any associated parametrizations to the new module. This PR also adds tests for transfer_parametrizations_and_params and type_before_parametrizations to test_nn.py and adds comments to the test code for composability.
Test Plan: python test/test_ao_sparsity.py TestComposability
python test/test_nn.py TestNN
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74848
Approved by: https://github.com/vkuzo, https://github.com/Lezcano
Summary:
Add BFloat16 support for logsigmoid, hardsigmoid, hardshrink, softshrink, hardswish and softplus on CPU, and optimize the performance of softshrink.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63134
Reviewed By: yinghai
Differential Revision: D34897992
Pulled By: frank-wei
fbshipit-source-id: 4c778f5271d6fa54dd78158258941def8d9252f5
(cherry picked from commit decda0e3debf56cc5c4d7faea41b1165a7cabe12)
For a GroupNorm module, if num_channels is not divisible by num_groups, the error should be reported when the module is defined rather than when it is run.
example:
```
import torch
m = torch.nn.GroupNorm(5, 6)
x = torch.randn(1, 6, 4, 4)
y = m(x)
```
before:
```
Traceback (most recent call last):
File "group_norm_test.py", line 8, in <module>
y = m(x)
File "/home/xiaobinz/miniconda3/envs/pytorch_mater/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1111, in _call_impl
return forward_call(*input, **kwargs)
File "/home/xiaobinz/miniconda3/envs/pytorch_mater/lib/python3.7/site-packages/torch/nn/modules/normalization.py", line 271, in forward
input, self.num_groups, self.weight, self.bias, self.eps)
File "/home/xiaobinz/miniconda3/envs/pytorch_mater/lib/python3.7/site-packages/torch/nn/functional.py", line 2500, in group_norm
return torch.group_norm(input, num_groups, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: Expected number of channels in input to be divisible by num_groups, but got input of shape [1, 6, 4, 4] and num_groups=5
```
after:
```
Traceback (most recent call last):
File "group_norm_test.py", line 6, in <module>
m = torch.nn.GroupNorm(5, 6)
File "/home/xiaobinz/miniconda3/envs/pytorch_test/lib/python3.7/site-packages/torch/nn/modules/normalization.py", line 251, in __init__
raise ValueError('num_channels must be divisible by num_groups')
```
This PR also updates the documentation of num_groups.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74293
Approved by: https://github.com/jbschlosser
Fixes #71415
I have implemented the changes that replicate what @to-mi did in this [PR](https://github.com/pytorch/pytorch/pull/65986#issue-1012959443) for the 3D case:
> Fixes #64977
>
> Avoids creating a tensor for and calculating `input` gradient if it's not needed in the backward pass of `grid_sample` (2d case, native CPU & CUDA kernels). Especially the tensor creation seemed time consuming (see #64977).
>
> Brief description of the changes:
>
> * I have tried to go with rather minimal changes. It would probably be possible to make a more elegant version with a bit larger refactoring (or possibly with better understanding of PyTorch internals and C++ functionalities).
>
> * Changed the `native_functions.yaml` and `derivatives.yaml` so that the gradient input mask is passed to the functions.
>
> * Changed the CPU kernels:
> (1) added `bool input_requires_grad` template parameter to the `backward` function,
> (2) added if branches based on it to remove `input` gradient computations if it's not requested,
> (3) feed in `TensorAccessor<scalar_t, 3>* gInp_slice_ptr` instead of `TensorAccessor<scalar_t, 3>& gInp_slice` so that I can pass a `nullptr` in case gradient for `input` is not requested. (A bit inelegant perhaps, but allows to keep one signature for `backward` function and not require breaking it to smaller pieces. Perhaps there's a more elegant way to achieve this?)
>
> * Changed CUDA kernel:
> (1) added ~`bool input_requires_grad` template parameter~ `const bool input_requires_grad` argument to the `backward` function,
> (2) added if branches based on it to remove `input` gradient computations if it's not requested,
> (3) feed in `TensorInfo<scalar_t, index_t>()` instead of `getTensorInfo<scalar_t, index_t>(grad_input)` in case gradient for `input` is not requested.
>
> * Modified tests in `test/test_nn.py` so that they run also cases with no `input` gradient needed.
>
> * Have not touched the CPU fallback kernel.
Note: the changes numbered (3) are N/A in this case.
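For reference, a hedged example of the case this optimizes in 3D: only `grid` requires grad, so the backward pass no longer needs to allocate or fill a gradient for `input`.
```python
import torch
import torch.nn.functional as F

inp = torch.randn(1, 1, 4, 4, 4)                          # input does not require grad
grid = torch.rand(1, 2, 2, 2, 3, requires_grad=True) * 2 - 1  # 3D case: 5-D grid in [-1, 1]
out = F.grid_sample(inp, grid, align_corners=False)
out.sum().backward()                                       # only grid.grad is computed
```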
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71759
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72941
Simple test for MHA; it uses cosine similarity as the metric since scaling generates mismatches. CUDA is validated; a CPU fix will follow (we can land this with the onlyCUDA flag and remove it once CPU is also done).
Test Plan:
For cuda:
buck build mode/opt -c fbcode.enable_gpu_sections=true caffe2/test:nn && buck-out/gen/caffe2/test/nn\#binary.par -r test_native_multihead_attention_cuda_float32 2>&1 | pastry
Reviewed By: swolchok
Differential Revision: D33906921
fbshipit-source-id: ad447401eb7002f22ed533d620a6b544524b3f58
(cherry picked from commit 45b778da27)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72944
Doesn't make sense to develop it in core right now.
ghstack-source-id: 149456040
Test Plan:
CI
run MHA benchmark in benchmark_transformers.py to make sure it doesn't crash
Reviewed By: zrphercule
Differential Revision: D34283104
fbshipit-source-id: 4f0c7a6bc066f938ceac891320d4cf4c3f8a9cd6
(cherry picked from commit b9df65e97c)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72671
The existing kernel did not handle cases where D % 4 != 0 or dim_per_head % 4 != 0. Now we have a non-vectorized kernel for these cases.
ghstack-source-id: 149201477
Test Plan: Updated test_nn to cover these cases.
Reviewed By: zrphercule, ngimel
Differential Revision: D34119371
fbshipit-source-id: 4e9b4d9b636224ef2c433593f6f236df040de782
(cherry picked from commit f5393878e4)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72464
We had some trouble getting this component (and this test!) right, so let's test it.
ghstack-source-id: 149201478
Test Plan: new test passes
Reviewed By: zrphercule
Differential Revision: D33992477
fbshipit-source-id: cc377eed5d4a4412b42bdabf360601c6e52947cf
(cherry picked from commit 9832867b12)
Summary:
https://github.com/pytorch/pytorch/issues/71521 attempted to fix an issue where the `test_conv_large` test was producing `NaN` values after the backward pass, yielding a bogus comparison between the result and the expected result. While tweaking the initialization of the conv layer seemed to fix this behavior, it was actually just masking the real issue, which was that `grad_weight` is not guaranteed to be initialized in `raw_cudnn_convolution_backward_weight_out` when the backward operation is split.
Specifically, the `grad_weight` tensor is expected to be directly written to by a `cudnn` kernel (which does occur in most cases), so it does not need to be initialized, but splitting introduces an intermediate `grad_weight_` tensor that holds the intermediate gradients and then accumulates into `grad_weight` without initializing it first. This PR tweaks this behavior so that the accumulation is now done into a zeroed tensor, and the accumulation is also performed in an accumulation dtype. The hacky workaround masking the issue is also reverted, with the safeguard against comparing `NaN` values (using the reference tensor for scale computation) kept in place.
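A conceptual Python sketch of the fixed accumulation (the real code is C++ inside the cuDNN path; the function and names here are illustrative): accumulate the per-split results into a zero-initialized buffer kept in an accumulation dtype, then cast back.
```python
import torch

def accumulate_split_grad_weights(split_grads, out_dtype):
    # Zero-initialize the accumulator instead of adding into uninitialized memory,
    # and accumulate in float32 before casting back to the output dtype.
    acc = torch.zeros_like(split_grads[0], dtype=torch.float32)
    for g in split_grads:
        acc += g.float()
    return acc.to(out_dtype)
```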
CC ngimel ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72157
Reviewed By: malfet
Differential Revision: D34147547
Pulled By: ngimel
fbshipit-source-id: 056c19f727eeef96347db557528272e24eae4223
(cherry picked from commit 24c7f77a81)
Summary:
The only difference with plain list/dict now is that nn.Parameters are
handled specially and registered as parameters properly.
test_nn and parametrization tests pass locally.
We will see in CI whether DP is fixed as well.
Tentative fix for https://github.com/pytorch/pytorch/issues/36035
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70499
Reviewed By: jbschlosser, alexeib
Differential Revision: D34005332
Pulled By: albanD
fbshipit-source-id: 7e76b0873d0fec345cb537e2a6ecba0258e662b9
(cherry picked from commit dc1e6f8d86)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/71720
This PR removes the old warnings for `recompute_scale_factor` and `align_corners`.
Looking at this, I realize that the tests I modified don't really catch whether or not a warning is created for `recompute_scale_factor`. If desired, I can add a couple lines into the tests there to pass a floating point in the `scale_factors` kwarg, along with `recompute_scale_factor=None`.
Let me know how this looks, thanks so much!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72093
Reviewed By: mruberry
Differential Revision: D33917615
Pulled By: albanD
fbshipit-source-id: e822f0a15b813ecf312cdc6ed0b693e7f1d1ca89
(cherry picked from commit c14852b85c)
Summary:
Pull Request resolved: https://github.com/pytorch/torchrec/pull/39
Pull Request resolved: https://github.com/facebookresearch/torchrec/pull/6
This makes it so that shared parameters get their own entry in `named_parameters`.
More broadly, this makes it so that
```
params_and_buffers = {**dict(mod.named_parameters(remove_duplicate=False)), **dict(mod.named_buffers(remove_duplicate=False))}
_stateless.functional_call(mod, params_and_buffers, args, kwargs)
```
is identical to calling the original module's forward pass.
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71542
Reviewed By: jbschlosser, albanD
Differential Revision: D33716716
Pulled By: Chillee
fbshipit-source-id: ff1ed9980bd1a3f7ebaf695ee5e401202b543213
(cherry picked from commit d6e3ad3cd0)
Summary:
Hi,
The PR fixes https://github.com/pytorch/pytorch/issues/71096. It scans all the test files and replaces `ALL_TENSORTYPES` and `ALL_TENSORTYPES2` with `get_all_fp_dtypes`.
I'm looking forward to your viewpoints!
Thanks!
cc: janeyx99 kshitij12345
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71153
Reviewed By: jbschlosser, mruberry
Differential Revision: D33533346
Pulled By: anjali411
fbshipit-source-id: 75e79ca2756c1ddaf0e7e0289257fca183a570b3
(cherry picked from commit da54b54dc5)
Summary:
This PR twiddles the parameters of the conv layer in `test_conv_large` to better avoid NaN values. Previously, this test would cause a NaN to be computed for `scale` (propagated from `.mean()` on the `.grad` tensor). This NaN would then be propagated to the scaled gradients via division, resulting in a bogus `assertEqual` check as `NaN == NaN` is by default true. (This behavior was observed on V100 and A100).
To improve visibility of failures in the event of NaNs in `grad1`, scale is now computed from `grad2`.
Interestingly enough, we discovered this issue when trying out some less common setups that broke this test; it turns out those breakages were cases where there were no NaN values (leading to an actual `assertEqual` check that would fail for `float16`).
CC ptrblck ngimel puririshi98
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71521
Reviewed By: anjali411
Differential Revision: D33776705
Pulled By: ngimel
fbshipit-source-id: a1ec4792cba04c6322b22ef5b80ce08579ea4cf6
(cherry picked from commit d207bd9b87)
Summary:
We found a discrepancy between CPU & CUDA when using RNN modules: input shapes containing 0s cause an invalid configuration argument error on CUDA (kernel grid size is 0), while a valid tensor is returned in the CPU case.
A reproducer:
```
import torch
x = torch.zeros((5, 0, 3)).cuda()
gru = torch.nn.GRU(input_size=3, hidden_size=4).to("cuda")
gru(x)
```
Run with `CUDA_LAUNCH_BLOCKING=1` set.
cc ngimel albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71696
Reviewed By: mikaylagawarecki
Differential Revision: D33743674
Pulled By: ngimel
fbshipit-source-id: e9334175d10969fdf1f9c63985910d944bbd26e7
(cherry picked from commit 70838ba69b)
Summary:
Helps fix a part of https://github.com/pytorch/pytorch/issues/69865
The first commit just migrates everything as is.
The second commit uses the "device" variable instead of passing "cuda" everywhere
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70872
Reviewed By: jbschlosser
Differential Revision: D33455941
Pulled By: janeyx99
fbshipit-source-id: 9d9ec8c95f1714c40d55800e652ccd69b0c314dc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69727
Still need to test the backward ones. We would need to update gradgradcheck to check forward over backward.
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D33031728
Pulled By: soulitzer
fbshipit-source-id: 86c59df5d2196b5c8dbbb1efed9321e02ab46d30
Summary:
Fixes https://github.com/pytorch/pytorch/issues/68476
We implemented all of the following `dict` methods for `ParameterDict`
- `get `
- `setdefault`
- `popitem`
- `fromkeys`
- `copy`
- `__or__`
- `__ior__`
- `__reversed__`
- `__ror__`
The behavior of these new methods matches the expected behavior of python `dict` as defined by the language itself: https://docs.python.org/3/library/stdtypes.html#typesmapping
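A hedged usage sketch of the new methods (illustrative values):
```python
import torch
import torch.nn as nn

pd = nn.ParameterDict({"w": nn.Parameter(torch.randn(2))})
pd.setdefault("b", nn.Parameter(torch.zeros(2)))        # insert only if the key is missing
w = pd.get("w")                                          # dict-style lookup, None if absent
pd2 = pd.copy()                                          # shallow copy
merged = pd | nn.ParameterDict({"u": nn.Parameter(torch.ones(1))})  # __or__
for key in reversed(pd):                                 # __reversed__
    print(key, pd[key].shape)
```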
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69403
Reviewed By: albanD
Differential Revision: D33187111
Pulled By: jbschlosser
fbshipit-source-id: ecaa493837dbc9d8566ddbb113b898997e2debcb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69272
In transformer encoder and MHA, masked_softmax's mask is a 2D tensor (B, D), while the input is a 4D tensor (B, H, D, D).
This mask could simply be broadcast to a (B, H, D, D) tensor like the input and a regular masked_softmax applied; however, that would make the mask non-contiguous and consume more memory.
In this diff, we keep the mask's shape unchanged and compute the corresponding mask element for the input in each CUDA thread.
This new layout is not supported on CPU yet.
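The reference semantics of the (B, D) mask applied to a (B, H, D, D) input, written in plain PyTorch for clarity (assuming True marks masked-out positions; the CUDA kernel computes the per-thread mask index instead of materializing the broadcast):
```python
import torch

def masked_softmax_ref(x, mask):
    # x: (B, H, D, D) attention scores; mask: (B, D) boolean, True = masked out (assumption)
    m = mask[:, None, None, :]                        # broadcastable to (B, H, D, D)
    return torch.softmax(x.masked_fill(m, float("-inf")), dim=-1)
```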
Test Plan: buck build mode/opt -c fbcode.enable_gpu_sections=true caffe2/test:nn && buck-out/gen/caffe2/test/nn\#binary.par -r test_masked_softmax
Reviewed By: ngimel
Differential Revision: D32605557
fbshipit-source-id: ef37f86981fdb2fb264d776f0e581841de5d68d2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69268
This diff enables native masked softmax on CUDA and also expands our current warp_softmax to accept masking.
The mask in this masked softmax has to have the same shape as the input and has to be contiguous.
In a follow-up diff, I will include the encoder mask layout, where the input is (B, H, D, D) and the mask is (B, D).
Test Plan: buck build mode/opt -c fbcode.enable_gpu_sections=true caffe2/test:nn && buck-out/gen/caffe2/test/nn\#binary.par -r test_masked_softmax
Reviewed By: ngimel
Differential Revision: D32338419
fbshipit-source-id: 48c3fde793ad4535725d9dae712db42e2bdb8a49
Summary:
Towards [convolution consolidation](https://fb.quip.com/tpDsAYtO15PO).
Introduces the general `convolution_backward` function that uses the factored-out backend routing logic from the forward function.
Some notes:
* `finput` is now recomputed in the backward pass for the slow 2d / 3d kernels instead of being saved from the forward pass. The logic for this is based on the forward computation and lives in the `compute_finput2d` / `compute_finput3d` functions in `ConvUtils.h`.
* Using structured kernels for `convolution_backward` requires extra copying since the backend-specific backward functions return tensors. Porting to structured is left as future work.
* The tests that check the routing logic have been renamed from `test_conv_backend_selection` -> `test_conv_backend` and now also include gradcheck validation using an `autograd.Function` hooking up `convolution` to `convolution_backward`. This was done to ensure that gradcheck passes for the same set of inputs / backends.
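A rough sketch of the kind of `autograd.Function` hookup described in the last bullet, assuming the current ATen signatures for `convolution` / `convolution_backward` (simplified; not the test's exact code):
```python
import torch

class MyConv2d(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, w, b):
        ctx.save_for_backward(x, w)
        return torch.convolution(x, w, b, stride=[1, 1], padding=[0, 0],
                                 dilation=[1, 1], transposed=False,
                                 output_padding=[0, 0], groups=1)

    @staticmethod
    def backward(ctx, grad_out):
        x, w = ctx.saved_tensors
        gx, gw, gb = torch.ops.aten.convolution_backward(
            grad_out, x, w, [w.size(0)],                  # bias_sizes
            [1, 1], [0, 0], [1, 1], False, [0, 0], 1,     # stride/padding/dilation/...
            [True, True, True])                           # output_mask
        return gx, gw, gb

x = torch.randn(1, 2, 5, 5, dtype=torch.double, requires_grad=True)
w = torch.randn(3, 2, 3, 3, dtype=torch.double, requires_grad=True)
b = torch.randn(3, dtype=torch.double, requires_grad=True)
torch.autograd.gradcheck(MyConv2d.apply, (x, w, b))
```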
The forward pass routing is done as shown in this flowchart (probably need to download it for it to be readable since it's ridiculous):


Pull Request resolved: https://github.com/pytorch/pytorch/pull/65219
Reviewed By: mruberry
Differential Revision: D32611368
Pulled By: jbschlosser
fbshipit-source-id: 26d759b7c908ab8f19ecce627acea7bd3d5f59ba
Summary:
Adds native_dropout to have a reasonable target for torchscript in autodiff. native_dropout has scale and train as arguments in its signature; this makes native_dropout more consistent with other operators and removes conditionals in the autodiff definition.
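For reference, a small (hedged) illustration of the signature described above on builds where the op is exposed in Python: `native_dropout` returns the output together with the mask, and `train` is passed explicitly.
```python
import torch

x = torch.randn(4, 4)
out, mask = torch.native_dropout(x, 0.5, True)   # (output, mask); train passed explicitly
print(out.shape, mask.dtype)
```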
cc gmagogsfm
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63937
Reviewed By: mruberry
Differential Revision: D32477657
Pulled By: ngimel
fbshipit-source-id: d37b137a37acafa50990f60c77f5cea2818454e4
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53647
With this, if a test forgets to add `dtypes` while using `dtypesIf`, the following error is raised:
```
AssertionError: dtypes is mandatory when using dtypesIf however 'test_exponential_no_zero' didn't specify it
```
**Tested Locally**
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68186
Reviewed By: VitalyFedyunin
Differential Revision: D32468581
Pulled By: mruberry
fbshipit-source-id: 805e0855f988b77a5d8d4cd52b31426c04c2200b
Summary:
This PR introduces a new function `_select_conv_backend` that returns a `ConvBackend` enum representing the selected backend for a given set of convolution inputs and params.
The function and enum are exposed to python for testing purposes through `torch/csrc/Module.cpp` (please let me know if there's a better place to do this).
A new set of tests validates that the correct backend is selected for several sets of inputs + params. Some backends aren't tested yet:
* nnpack (for mobile)
* xnnpack (for mobile)
* winograd 3x3 (for mobile)
Some flowcharts for reference:


Pull Request resolved: https://github.com/pytorch/pytorch/pull/67790
Reviewed By: zou3519
Differential Revision: D32280878
Pulled By: jbschlosser
fbshipit-source-id: 0ce55174f470f65c9b5345b9980cf12251f3abbb
Summary:
This PR makes several changes:
- Changed function `bool cudnn_conv_use_channels_last(...)` to `at::MemoryFormat cudnn_conv_suggest_memory_format(...)`
- Removed `resize_` in cudnn convolution code. Added a new overloading method `TensorDescriptor::set` that also passes the desired memory format of the tensor.
- Disabled the usage of double + channels_last on cuDNN Conv-Relu and Conv-Bias-Relu. Call `.contiguous(memory_format)` before passing data to cuDNN functions.
- Disabled the usage of cuDNN fused Conv-Bias-Relu in cuDNN < 8.0 version due to a CUDNN_STATUS_NOT_SUPPORTED error. Instead, use the native fallback path.
- Let Conv-Bias-Relu code respect the global `allow_tf32` flag.
According to the cuDNN documentation, double + NHWC is generally not supported.
Close https://github.com/pytorch/pytorch/pull/66968
Fix https://github.com/pytorch/pytorch/issues/55301
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65594
Reviewed By: jbschlosser, malfet
Differential Revision: D32175766
Pulled By: ngimel
fbshipit-source-id: 7ba079c9f7c46fc56f8bfef05bad0854acf380d7
Summary:
Partially fixes https://github.com/pytorch/pytorch/issues/66066
This PR:
- cleans up op-specific testing from test_autograd. test_autograd should be reserved for testing generic autograd functionality
- tests related to an operator are better colocated
- see the tracker for details
What to think about when moving tests to their correct test suite:
- naming: make sure it's not too generic
- how the test is parametrized, sometimes we need to add/remove a device/dtype parameter
- can this be merged with existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67413
Reviewed By: jbschlosser, albanD
Differential Revision: D32031480
Pulled By: soulitzer
fbshipit-source-id: 8e13da1e58a38d5cecbfdfd4fe2b4fe6f816897f
Summary:
Fix https://github.com/pytorch/pytorch/issues/67239
The CUDA kernels for `adaptive_max_pool2d` (forward and backward) were written for contiguous output. If outputs are non-contiguous, first create a contiguous copy and let the kernel write output to the contiguous memory space. Then copy the output from contiguous memory space to the original non-contiguous memory space.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67697
Reviewed By: ejguan
Differential Revision: D32112443
Pulled By: ngimel
fbshipit-source-id: 0e3bf06d042200c651a79d13b75484526fde11fe
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66879
This adds a quantized implementation for bilinear gridsample. Bicubic interpolation cannot be supported as easily since we rely on the linearity of quantization to operate on the raw values, i.e.
f(q(a), q(b)) = q(f(a, b)) where f is the linear interpolation function.
ghstack-source-id: 141321116
Test Plan: test_quantization
Reviewed By: kimishpatel
Differential Revision: D31656893
fbshipit-source-id: d0bc31da8ce93daf031a142decebf4a155943f0f
Summary:
Removes the 3D special case logic in `_convolution_double_backward()` that never worked.
The logic was never called previously since `convolution()` expands input / weight from 3D -> 4D before passing them to backends; backend-specific backward calls thus save the 4D version to pass to `_convolution_double_backward()`.
The new general `convolution_backward()` saves the original 3D input / weight, uncovering the bug.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67283
Reviewed By: anjali411
Differential Revision: D32021100
Pulled By: jbschlosser
fbshipit-source-id: 0916bcaa77ef49545848b344d6385b33bacf473d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64181
This PR replaces all the calls to:
- `transpose(-2, -1)` or `transpose(-1, -2)` by `mT()` in C++ and `mT` in Python
- `conj().transpose(-2, -1)` or `transpose(-2, -1).conj()` or `conj().transpose(-1, -2)` or `transpose(-1, -2).conj()` by `mH()` in C++ and `mH` in Python.
It also simplifies two pieces of code and fixes one bug where a pair
of parentheses was missing in the function `make_symmetric_matrices`.
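A quick illustration of the equivalences behind the replacements (using the `mT` / `mH` attributes, which exist on recent PyTorch):
```python
import torch

a = torch.randn(2, 3, 4)
assert torch.equal(a.mT, a.transpose(-2, -1))             # matrix transpose of the last two dims

c = torch.randn(2, 3, 4, dtype=torch.complex64)
assert torch.allclose(c.mH, c.conj().transpose(-2, -1))   # conjugate (Hermitian) transpose
```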
Test Plan: Imported from OSS
Reviewed By: H-Huang
Differential Revision: D31692896
Pulled By: anjali411
fbshipit-source-id: e9112c42343663d442dc5bd53ff2b492094b434a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64572
Fixes https://github.com/pytorch/pytorch/issues/64256
It also fixes an inconsistent treatment of the case `reduction = "mean"`
when the whole target is equal to `ignore_index`. It now returns `NaN`
in this case, consistently with what it returns when computing the mean
over an empty tensor.
We add tests for all these cases.
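A hedged illustration of the `reduction="mean"` corner case described above:
```python
import torch
import torch.nn.functional as F

logits = torch.randn(3, 5)
target = torch.full((3,), -100, dtype=torch.long)   # every target equals ignore_index
loss = F.nll_loss(torch.log_softmax(logits, dim=1), target,
                  ignore_index=-100, reduction="mean")
print(loss)  # nan, consistent with the mean over an empty tensor
```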
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D31116297
Pulled By: albanD
fbshipit-source-id: cc44e79205f5eeabf1efd7d32fe61e26ba701b52
Summary:
- Added 2D-Convolution NHWC support
- on ROCm 4.3, with `PYTORCH_MIOPEN_SUGGEST_NHWC=1` flag
- May need to force MIOpen to search for solutions (see examples below for flags)
**PYTORCH_MIOPEN_SUGGEST_NHWC Environment Flag**
MIOpen does not officially support NHWC yet, although convolution support has been added to tip-of-tree of MIOpen. This flag is intended to be a short-lived flag to explicitly turn on NHWC support until ROCm officially supports NHWC and performance is verified.
**Examples**
1. Example usage 1 : Run test on ROCm4.3
`PYTORCH_TEST_WITH_ROCM=1 PYTORCH_MIOPEN_SUGGEST_NHWC=1 MIOPEN_FIND_ENFORCE=4 MIOPEN_DEBUG_CONV_GEMM=0 MIOPEN_FIND_MODE=1 pytest test_nn.py -v -k "test_conv_cudnn_nhwc" `
2. Example usage 2: Run the following with `PYTORCH_MIOPEN_SUGGEST_NHWC=1` on ROCm4.3.
```
#!/usr/bin/env python3
import torch
model = torch.nn.Conv2d(8, 4, 3).cuda().half()
model = model.to(memory_format=torch.channels_last)
input = torch.randint(1, 10, (2, 8, 4, 4), dtype=torch.float32, requires_grad=True)
input = input.to(device="cuda", memory_format=torch.channels_last, dtype=torch.float16)
# should print True for is_contiguous(channels_last), and strides must match NHWC format
print(input.is_contiguous(memory_format=torch.channels_last), input.shape, input.stride() )
out = model(input)
# should print True for is_contiguous(channels_last), and strides must match NHWC format
print("Contiguous channel last :", out.is_contiguous(memory_format=torch.channels_last), " out shape :", out.shape, "out stride :", out.stride() )
```
See https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html for more examples.
cc jeffdaily sunway513 jithunnair-amd ROCmSupport
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63617
Reviewed By: saketh-are
Differential Revision: D30730800
Pulled By: ezyang
fbshipit-source-id: 61906a0f30be8299e6547d312ae6ac91cc7c3238
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63554
Following https://github.com/pytorch/pytorch/pull/61840#issuecomment-884087809, this deprecates all the dtype getters publicly exposed in the `torch.testing` namespace. The reason for this is twofold:
1. If someone is not familiar with the C++ dispatch macros PyTorch uses, the names are misleading. For example `torch.testing.floating_types()` will only give you `float32` and `float64` skipping `float16` and `bfloat16`.
2. The dtype getters provide very minimal functionality that can be easily emulated by downstream libraries.
We thought about [providing a replacement](https://gist.github.com/pmeier/3dfd2e105842ad0de4505068a1a0270a), but ultimately decided against it. The major problem is BC: by keeping it, either the namespace gets messy again after a new dtype is added, or we need to somehow version the return values of the getters.
Test Plan: Imported from OSS
Reviewed By: H-Huang
Differential Revision: D30662206
Pulled By: mruberry
fbshipit-source-id: a2bdb10ab02ae665df1b5b76e8afa9af043bbf56
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64385
It was deleted in https://github.com/pytorch/pytorch/pull/63276.
The numerics test was meant to check LayerNorm behavior on large inputs,
but we deleted it without realizing that.
Test Plan: - wait for tests.
Reviewed By: ngimel
Differential Revision: D30702950
Pulled By: zou3519
fbshipit-source-id: a480e26c45ec38fb628938b70416cdb22d976a46
Summary:
Implements an orthogonal / unitary parametrisation.
It does pass the tests, and I have trained a couple of models with this implementation, so I believe it should be somewhat correct. Now, the implementation is very subtle. I'm tagging nikitaved and IvanYashchuk as reviewers in case they have comments / see some room for optimisation of the code, in particular of the `forward` function.
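A hedged usage sketch of the new parametrisation (assuming the `torch.nn.utils.parametrizations.orthogonal` entry point added here):
```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

layer = orthogonal(nn.Linear(5, 5))
Q = layer.weight
print(torch.allclose(Q.T @ Q, torch.eye(5), atol=1e-5))  # weight is (numerically) orthogonal
```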
Fixes https://github.com/pytorch/pytorch/issues/42243
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62089
Reviewed By: ezyang
Differential Revision: D30639063
Pulled By: albanD
fbshipit-source-id: 988664f333ac7a75ce71ba44c8d77b986dff2fe6
Summary:
Fixes https://github.com/pytorch/pytorch/issues/64039
There are two distinct problems here.
1. If `grad_output` is channels last but the input is not, then the input would be read as if it were channels last, i.e. the wrong values are read.
2. `use_channels_last_kernels` doesn't guarantee that `suggest_memory_format` will actually return channels last, so use `empty_like` instead so the strides always match.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64100
Reviewed By: mruberry
Differential Revision: D30622127
Pulled By: ngimel
fbshipit-source-id: e28cc57215596817f1432fcdd6c49d69acfedcf2
Summary:
I think the original intention here is to only take effect in the case of align_corners (because output_size = 1 and the divisor will be 0), but it affects non-align_corners too. For example:
```python
import numpy as np
import torch

input = torch.tensor(
    np.arange(1, 5, dtype=np.float32).reshape((1, 1, 2, 2)))  # float so bilinear interpolation is supported
m = torch.nn.Upsample(scale_factor=0.5, mode="bilinear")
of_out = m(input)
```
The expected result is [[[[2.5]]]],
but PyTorch returns [[[[1.0]]]], which differs from OpenCV and PIL; this PR tries to fix it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61166
Reviewed By: malfet
Differential Revision: D30543178
Pulled By: heitorschueroff
fbshipit-source-id: 21a4035483981986b0ae4a401ef0efbc565ccaf1
Summary:
Fixes https://github.com/pytorch/pytorch/issues/62094
Introduces functionality for adding arbitrary objects to module state_dicts. To take advantage of this, the following functions can be defined on a module:
* `get_extra_state(self) -> dict` - Returns a dict defining any extra state this module wants to save
* `set_extra_state(self, state)` - Subsumes the given state within the module
Under the hood, a sub-dictionary is stored in the state_dict under the key `_extra_state` for each module that provides extra state.
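A minimal sketch of the hooks in use (hypothetical module, illustrative field name):
```python
import torch.nn as nn

class MyModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.version = 3                      # arbitrary non-tensor state

    def get_extra_state(self):
        return {"version": self.version}      # saved under the '_extra_state' key

    def set_extra_state(self, state):
        self.version = state["version"]

m, m2 = MyModule(), MyModule()
m2.version = 0
m2.load_state_dict(m.state_dict())
assert m2.version == 3
```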
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62976
Reviewed By: heitorschueroff
Differential Revision: D30518657
Pulled By: jbschlosser
fbshipit-source-id: 5fb35ab8e3d36f35e3e96dcd4498f8c917d1f386
Summary:
Interestingly enough, the original code did have a mechanism that aims to prevent this very issue,
but it performs a clone AFTER modifying u and v in-place.
This wouldn't work though because we can later use the cloned u and v in operations that save for backward, and the next time we execute forward, we modify the same cloned u and v in-place.
So if the idea is that we want to avoid modifying saved variable in-place we should clone it BEFORE the in-place operation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62293
Reviewed By: bdhirsh
Differential Revision: D30489750
Pulled By: soulitzer
fbshipit-source-id: cbe8dea885aef97adda8481f7a822e5bd91f7889
Summary:
As discussed in https://github.com/pytorch/pytorch/pull/62897, the BF16/non-last-dim Softmax path is missing the subtraction of the max value, which causes overflow in the `exp()` calculation when the input values are large, e.g. `1000.0`.
To avoid this issue, this PR adds the max-value subtraction and the corresponding test cases.
Note that without the max-value subtraction (e.g. due to accidental reverts or changes), the test case fails with the following error message:
```
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=0.05 and atol=0.05, found 103984 element(s) (out of 126720) whose difference(s) exceeded the margin of error (including 103984 nan comparisons). The greatest difference was nan (0.0 vs. nan), which occurred at index (0, 0, 0, 1).
```
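For reference, the standard max-subtraction trick the fix restores, written as plain PyTorch (not the vectorized kernel):
```python
import torch

def softmax_ref(x, dim):
    x = x - x.amax(dim=dim, keepdim=True)   # subtract the max so exp() cannot overflow
    e = x.exp()
    return e / e.sum(dim=dim, keepdim=True)

x = torch.full((2, 3), 1000.0, dtype=torch.bfloat16)
print(softmax_ref(x, dim=0))                # finite values, no NaN/inf
```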
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63132
Reviewed By: VitalyFedyunin
Differential Revision: D30280792
Pulled By: cpuhrsch
fbshipit-source-id: 722821debf983bbb4fec878975fa8a4da0d1d866
Summary:
This PR fixes a part of https://github.com/pytorch/pytorch/issues/12013, which is summarized concretely in https://github.com/pytorch/pytorch/issues/38115.
This PR allows `MaxPool` and `AdaptiveMaxPool` to accept tensors whose batch size is 0. Some changes have been made to modernize the tests so that they show the name of the C++ function that throws an error.
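A hedged illustration of the newly accepted zero-batch inputs:
```python
import torch

pool = torch.nn.AdaptiveMaxPool2d((2, 2))
x = torch.randn(0, 3, 8, 8)        # batch size 0
print(pool(x).shape)                # torch.Size([0, 3, 2, 2])
```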
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62088
Reviewed By: bdhirsh
Differential Revision: D30281285
Pulled By: jbschlosser
fbshipit-source-id: 52bffc67bfe45a78e11e4706b62cce1469eba1b9
Summary: Skip the ROCm test for test_cudnn_convolution_relu.
Test Plan: This skips a test
Reviewed By: ngimel
Differential Revision: D30233620
fbshipit-source-id: 31eab8b03c3f15674e0d262a8f55965c1aa6b809
Summary:
Currently when cudnn_convolution_relu is passed a channels last Tensor it will return a contiguous Tensor. This PR changes this behavior and bases the output format on the input format.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62482
Reviewed By: ngimel
Differential Revision: D30049905
Pulled By: cpuhrsch
fbshipit-source-id: 98521d14ee03466e7128a1912b9f754ffe10b448
Summary:
Enable Gelu bf16/fp32 in the CPU path using the MKL-DNN implementation. Users don't need to call to_mkldnn() explicitly. The new Gelu fp32 performs better than the original one.
Add Gelu backward for https://github.com/pytorch/pytorch/pull/53615.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58525
Reviewed By: ejguan
Differential Revision: D29940369
Pulled By: ezyang
fbshipit-source-id: df9598262ec50e5d7f6e96490562aa1b116948bf
Summary:
Fixes https://github.com/pytorch/pytorch/issues/11959
Alternative approach to creating a new `CrossEntropyLossWithSoftLabels` class. This PR simply adds support for "soft targets" AKA class probabilities to the existing `CrossEntropyLoss` and `NLLLoss` classes.
Implementation is dumb and simple right now, but future work can add higher performance kernels for this case.
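A hedged example of the class-probability ("soft target") support added here:
```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 3)
soft_targets = torch.softmax(torch.randn(4, 3), dim=1)   # each row is a probability vector
loss = F.cross_entropy(logits, soft_targets)              # targets given as probabilities, not indices
print(loss)
```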
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61044
Reviewed By: zou3519
Differential Revision: D29876894
Pulled By: jbschlosser
fbshipit-source-id: 75629abd432284e10d4640173bc1b9be3c52af00
Summary:
Fixes Python part of https://github.com/pytorch/pytorch/issues/60747
Enhances the Python versions of `Transformer`, `TransformerEncoderLayer`, and `TransformerDecoderLayer` to support callables as their activation functions. The old way of specifying activation function still works as well.
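A hedged example of passing a callable activation:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

layer = nn.TransformerEncoderLayer(d_model=16, nhead=4, activation=F.gelu)  # callable, not "gelu"
out = layer(torch.randn(5, 2, 16))
print(out.shape)   # torch.Size([5, 2, 16])
```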
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61355
Reviewed By: bdhirsh
Differential Revision: D29967302
Pulled By: jbschlosser
fbshipit-source-id: 8ee6f20083d49dcd3ab432a18e6ad64fe1e05705
Summary:
This PR enables the softmax calculation with the `bfloat16` data type when not along the last dim.
* Use bf16 specialization for forward calculation to reduce the bf16/fp32 cast in vec template.
* Remove the bf16 limitation for the backward calculation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60371
Reviewed By: ejguan
Differential Revision: D29563109
Pulled By: cpuhrsch
fbshipit-source-id: f6b439fa3850a6c633f35db65ea3d735b747863e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62281
Closes gh-24646, Closes gh-24647
There is no `TensorIterator` equivalent to these kernels so this is just
migrating the existing kernels over to the ATen style.
I've benchmarked for contiguous tensors with this script:
```
import torch
shape = (10, 10, 100, 100)
x = torch.randn(*shape, device='cuda')
w = torch.randn((10, 1, 5, 5), device='cuda')
for _ in range(100):
    torch.nn.functional.conv2d(x, w, groups=10)
```
and similarly for backwards. I see these as the same to within measurement error.
|                   | Master (us) | This PR (us) |
|------------------:|:-------------------:|:--------------------:|
| Forward | 133.5 | 133.6 |
| Backward (input) | 1,102 | 1,119 |
| Backward (weight) | 2,220 | 2,217 |
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D29943062
Pulled By: ngimel
fbshipit-source-id: fc5d16496eb733743face7c5a14e532d7b8ee26a
Summary:
Part of the fix for https://github.com/pytorch/pytorch/issues/12013
Checks whether the inputs and outputs are non-empty in order to allow the Bilinear layer to accept a batch size of 0. The if-check covers both input and output dim sizes since the `_trilinear` function is written to work with both forward and backward for Bilinear.
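A hedged illustration of the zero-batch case now accepted:
```python
import torch

m = torch.nn.Bilinear(4, 5, 3)
x1, x2 = torch.randn(0, 4), torch.randn(0, 5)   # batch size 0
print(m(x1, x2).shape)                           # torch.Size([0, 3])
```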
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47106
Reviewed By: ejguan
Differential Revision: D29935589
Pulled By: jbschlosser
fbshipit-source-id: 607d3352bd4f88e2528c64408f04999960be049d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62006
Closes gh-24646, gh-24647
There is no `TensorIterator` equivalent to these kernels so this is just
migrating the existing kernels over to the ATen style.
I've benchmarked for contiguous tensors with this script:
```
import torch
shape = (10, 10, 100, 100)
x = torch.randn(*shape, device='cuda')
w = torch.randn((10, 1, 5, 5), device='cuda')
for _ in range(100):
    torch.nn.functional.conv2d(x, w, groups=10)
```
and similarly for backwards. I see these as the same to within measurement error.
|                   | Master (us) | This PR (us) |
|------------------:|:-------------------:|:--------------------:|
| Forward | 133.5 | 133.6 |
| Backward (input) | 1,102 | 1,119 |
| Backward (weight) | 2,220 | 2,217 |
Test Plan: Imported from OSS
Reviewed By: jbschlosser
Differential Revision: D29883676
Pulled By: ngimel
fbshipit-source-id: 9b2ac62cdd8a84e1a23ffcd66035b2b2fe2374d8
Summary:
Fixes https://github.com/pytorch/pytorch/issues/61924
The fused backward kernel was using the weight dtype to detect mixed-precision usage, but the weights can be None while `running_mean` and `running_var` are still mixed precision. So, I update the check to look at those variables as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61962
Reviewed By: albanD
Differential Revision: D29825516
Pulled By: ngimel
fbshipit-source-id: d087fbf3bed1762770cac46c0dcec30c03a86fda
Summary:
Fixes https://github.com/pytorch/pytorch/issues/58816
- enhance the backward of `nn.SmoothL1Loss` to allow integral `target`
- add test cases in `test_nn.py` to compare `input.grad` between the integral `target` and its floating-point counterpart.
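A hedged sketch of the case being tested (assuming the forward already accepts an integral target via type promotion):
```python
import torch
import torch.nn.functional as F

inp = torch.randn(5, requires_grad=True)
target = torch.randint(0, 10, (5,))            # integral target
loss_ref = F.smooth_l1_loss(inp, target.float())  # floating-point reference
loss_int = F.smooth_l1_loss(inp, target)           # integral-target path exercised here
loss_int.backward()
print(inp.grad)
```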
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61112
Reviewed By: mrshenli
Differential Revision: D29775660
Pulled By: albanD
fbshipit-source-id: 544eabb6ce1ea13e1e79f8f18c70f148e92be508
Summary:
Fixes https://github.com/pytorch/pytorch/issues/61242
The previous code wrongly checked whether a tensor is a buffer in a module by comparing values; the fix compares names instead.
Docs need some updating as well; the current plan is to defer that to a separate PR, but I'm happy to do it here as well if preferred.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61429
Reviewed By: gchanan
Differential Revision: D29712341
Pulled By: jbschlosser
fbshipit-source-id: 41f29ab746505e60f13de42a9053a6770a3aac22
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61584
add_relu does not work with broadcasting. This registers a scalar version of add_relu in native_functions that casts to a tensor before calling the regular function. TensorIterator handles broadcasting analogously to the existing add.
ghstack-source-id: 133480068
Test Plan: python3 test/test_nn.py TestAddRelu
Reviewed By: kimishpatel
Differential Revision: D29641768
fbshipit-source-id: 1b0ecfdb7eaf44afed83c9e9e74160493c048cbc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60517
This fixes module support for LazyModuleMixin, addressing bug issue #60132.
See the link: https://github.com/pytorch/pytorch/issues/60132
We also have to update lazy_extension, given its dependency on module.py, and update the unit test as well.
Test Plan:
Unit test passes
torchrec test passes
Reviewed By: albanD
Differential Revision: D29274068
fbshipit-source-id: 1c20f7f0556e08dc1941457ed20c290868346980
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59987
Similar to GroupNorm, improve the numerical stability of LayerNorm via the Welford algorithm and pairwise summation.
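For reference, the textbook Welford update used for the running mean/variance (generic sketch, not the ATen kernel):
```python
def welford_update(count, mean, m2, new_value):
    # Numerically stable single-pass update of the mean and the sum of squared deviations.
    count += 1
    delta = new_value - mean
    mean += delta / count
    m2 += delta * (new_value - mean)
    return count, mean, m2          # variance = m2 / count
```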
Test Plan: buck test mode/dev-nosan //caffe2/test:nn -- "LayerNorm"
Reviewed By: ngimel
Differential Revision: D29115235
fbshipit-source-id: 5183346c3c535f809ec7d98b8bdf6d8914bfe790
Summary:
Fixes https://github.com/pytorch/pytorch/issues/24610
Aten Umbrella issue https://github.com/pytorch/pytorch/issues/24507
Related to https://github.com/pytorch/pytorch/issues/59765
The performance does not change between this PR and master with the following benchmark script:
<details>
<summary>Benchmark script</summary>
```python
import torch
import torch.nn as nn
import time
torch.manual_seed(0)
def _time():
    torch.cuda.synchronize()
    MS_PER_SECOND = 1000
    return time.perf_counter() * MS_PER_SECOND

device = "cuda"
C = 30
softmax = nn.LogSoftmax(dim=1)
n_runs = 250
for reduction in ["none", "mean", "sum"]:
    for N in [100_000, 500_000, 1_000_000]:
        fwd_t = 0
        bwd_t = 0
        data = torch.randn(N, C, device=device)
        target = torch.empty(N, dtype=torch.long, device=device).random_(0, C)
        loss = nn.NLLLoss(reduction=reduction)
        input = softmax(data)
        for i in range(n_runs):
            t1 = _time()
            result = loss(input, target)
            t2 = _time()
            fwd_t = fwd_t + (t2 - t1)
        fwd_avg = fwd_t / n_runs
        print(
            f"input size({N}, {C}), reduction: {reduction} "
            f"forward time is {fwd_avg:.2f} (ms)"
        )
    print()
```
</details>
## master
```
input size(100000, 30), reduction: none forward time is 0.02 (ms)
input size(500000, 30), reduction: none forward time is 0.08 (ms)
input size(1000000, 30), reduction: none forward time is 0.15 (ms)
input size(100000, 30), reduction: mean forward time is 1.81 (ms)
input size(500000, 30), reduction: mean forward time is 8.24 (ms)
input size(1000000, 30), reduction: mean forward time is 16.46 (ms)
input size(100000, 30), reduction: sum forward time is 1.66 (ms)
input size(500000, 30), reduction: sum forward time is 8.24 (ms)
input size(1000000, 30), reduction: sum forward time is 16.46 (ms)
```
## this PR
```
input size(100000, 30), reduction: none forward time is 0.02 (ms)
input size(500000, 30), reduction: none forward time is 0.08 (ms)
input size(1000000, 30), reduction: none forward time is 0.15 (ms)
input size(100000, 30), reduction: mean forward time is 1.80 (ms)
input size(500000, 30), reduction: mean forward time is 8.24 (ms)
input size(1000000, 30), reduction: mean forward time is 16.46 (ms)
input size(100000, 30), reduction: sum forward time is 1.66 (ms)
input size(500000, 30), reduction: sum forward time is 8.24 (ms)
input size(1000000, 30), reduction: sum forward time is 16.46 (ms)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60097
Reviewed By: mrshenli
Differential Revision: D29303099
Pulled By: ngimel
fbshipit-source-id: fc0d636543a79ea81158d286dcfb84043bec079a
Summary:
Before this change, it was implemented with the assumption that the number of groups, input channels, and output channels are the same, which is not always the case.
Extends the implementation to support any number of output channels as long as the number of groups equals the number of input channels (i.e. kernel.size(1) == 1).
Fixes https://github.com/pytorch/pytorch/issues/60176
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60460
Reviewed By: albanD
Differential Revision: D29299693
Pulled By: malfet
fbshipit-source-id: 31130c71ce86535ccfba2f4929eee3e2e287b2f0
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50192
It has been discussed in the issue that RNN APIs currently do not support inputs with `seq_len=0`, and the error message does not reflect this clearly. This PR addresses the issue by adding a clearer error message stating that none of the RNN APIs (nn.RNN, nn.GRU and nn.LSTM) support `seq_len=0`, for either one-directional or bi-directional layers.
```
import torch
input_size = 5
hidden_size = 6
rnn = torch.nn.GRU(input_size, hidden_size)
for seq_len in reversed(range(4)):
    output, h_n = rnn(torch.zeros(seq_len, 10, input_size))
    print('{}, {}'.format(output.shape, h_n.shape))
```
Previously this gave the following output:
```
torch.Size([3, 10, 6]), torch.Size([1, 10, 6])
torch.Size([2, 10, 6]), torch.Size([1, 10, 6])
torch.Size([1, 10, 6]), torch.Size([1, 10, 6])
Traceback (most recent call last):
File "test.py", line 8, in <module>
output, h_n = rnn(torch.zeros(seq_len, 10, input_size))
File "/opt/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/miniconda3/lib/python3.8/site-packages/torch/nn/modules/rnn.py", line 739, in forward
result = _VF.gru(input, hx, self._flat_weights, self.bias, self.num_layers,
RuntimeError: stack expects a non-empty TensorList
```
However, with this PR, the error message changes for any combination of
[RNN, GRU and LSTM] x [one-directional, bi-directional].
Let's illustrate the change with the following code snippet:
```
import torch
input_size = 5
hidden_size = 6
rnn = torch.nn.LSTM(input_size, hidden_size, bidirectional=True)
output, h_n = rnn(torch.zeros(0, 10, input_size))
```
gives the following output:
```
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/fsx/users/iramazanli/pytorch/torch/nn/modules/module.py", line 1054, in _call_impl
return forward_call(*input, **kwargs)
File "/fsx/users/iramazanli/pytorch/torch/nn/modules/rnn.py", line 837, in forward
result = _VF.gru(input, hx, self._flat_weights, self.bias, self.num_layers,
RuntimeError: Expected sequence length to be larger than 0 in RNN
```
***********************************
The change for PackedSequence did not seem necessary because, as the following code snippet shows, the error message is already clear about the issue:
```
import torch
import torch.nn.utils.rnn as rnn_utils
import torch.nn as nn
packed = rnn_utils.pack_sequence([])
```
returns:
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/fsx/users/iramazanli/pytorch/torch/nn/utils/rnn.py", line 398, in pack_sequence
return pack_padded_sequence(pad_sequence(sequences), lengths, enforce_sorted=enforce_sorted)
File "/fsx/users/iramazanli/pytorch/torch/nn/utils/rnn.py", line 363, in pad_sequence
return torch._C._nn.pad_sequence(sequences, batch_first, padding_value)
RuntimeError: received an empty list of sequences
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60269
Reviewed By: mrshenli
Differential Revision: D29299914
Pulled By: iramazanli
fbshipit-source-id: 5ca98faa28d4e6a5a2f7600a30049de384a3b132
Summary:
Partially addresses https://github.com/pytorch/pytorch/issues/49825 by improving the testing
- Rename some of the old tests that had "inplace_view" in their names, but actually mean "inplace_[update_]on_view" so there is no confusion with the naming
- Adds some tests in test_view_ops that verify basic behavior
- Add tests that creation meta is properly handled for no-grad, multi-output, and custom function cases
- Add a test that verifies that in the cross-dtype view case, the inplace views won't be accounted for in the backward graph on rebase, as mentioned in the issue.
- Update inference mode tests to also check in-place
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59891
Reviewed By: albanD
Differential Revision: D29272546
Pulled By: soulitzer
fbshipit-source-id: b12acf5f0e3f788167ebe268423cdb58481b56f6
Summary:
Fixes https://github.com/pytorch/pytorch/issues/27655
This PR adds a C++ and Python version of ReflectionPad3d with structured kernels. The implementation uses lambdas extensively to better share code from the backward and forward pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59791
Reviewed By: gchanan
Differential Revision: D29242015
Pulled By: jbschlosser
fbshipit-source-id: 18e692d3b49b74082be09f373fc95fb7891e1b56
Summary:
Following https://github.com/pytorch/pytorch/issues/59624 I observed some straggling failing tests on Ampere due to TF32 thresholds. This PR just twiddles some more thresholds to fix the (6) failing tests I saw on A100.
CC Flamefire ptrblck ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60209
Reviewed By: gchanan
Differential Revision: D29220508
Pulled By: ngimel
fbshipit-source-id: 7c83187a246e1b3a24b181334117c0ccf2baf311
Summary:
Makes it possible for the first registered parametrization to depend on several parameters rather than just one. Examples of these types of parametrizations are `torch.nn.utils.weight_norm` and low-rank parametrizations via the multiplication of an `n x k` tensor by a `k x m` tensor with `k <= m, n`.
Follows the plan outlined in https://github.com/pytorch/pytorch/pull/33344#issuecomment-768574924. A short summary of the idea is: we call `right_inverse` when registering a parametrization to generate the tensors that we are going to save. If `right_inverse` returns a sequence of tensors, then we save them as `original0`, `original1`... If it returns a `Tensor` or a sequence of length 1, we save it as `original`.
We only allow many-to-one parametrizations for the first parametrization registered; subsequent parametrizations need to be one-to-one.
There were a number of choices in the implementation:
If the `right_inverse` returns a sequence of parameters, then we unpack it in the forward. This allows writing code such as:
```python
class Sum(nn.Module):
    def forward(self, X, Y):
        return X + Y

    def right_inverse(self, Z):
        return Z, torch.zeros_like(Z)
```
rather than having to manually unpack a list or a tuple within the `forward` function.
At the moment the errors are a bit all over the place. This is to avoid having to check some properties of `forward` and `right_inverse` when they are registered. I left this like this for now, but I believe it'd be better to call these functions when they are registered to make sure the invariants hold and throw errors as soon as possible.
The invariants are the following:
1. The following code should be well-formed
```python
X = module.weight
Y = param.right_inverse(X)
assert isinstance(Y, Tensor) or isinstance(Y, collections.Sequence)
Z = param(Y) if isinstance(Y, Tensor) else param(*Y)
```
in other words, if `Y` is a `Sequence` of `Tensor`s (we also check that the elements of the sequence are Tensors), then it has the same length as the number of parameters `param.forward` accepts.
2. Always: `X.dtype == Z.dtype and X.shape == Z.shape`. This is to protect the user from shooting themselves in the foot, as it's too odd for a parametrization to change the metadata of a tensor.
3. If it's one-to-one: `X.dtype == Y.dtype`. This is to be able to do `X.set_(Y)` so that if a user first instantiates the optimiser and then puts the parametrisation, then we reuse `X` and the user does not need to add a new parameter to the optimiser. Alas, this is not possible when the parametrisation is many-to-one. The current implementation of `spectral_norm` and `weight_norm` does not seem to care about this, so this would not be a regression. I left a warning in the documentation though, as this case is a bit tricky.
I still need to go over the formatting of the documentation; I'll do that tomorrow.
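A hedged end-to-end sketch of registering the `Sum` example above as a many-to-one parametrization (the `original0` / `original1` names follow the convention described earlier):
```python
import torch
import torch.nn as nn
import torch.nn.utils.parametrize as parametrize

class Sum(nn.Module):
    def forward(self, X, Y):
        return X + Y

    def right_inverse(self, Z):
        return Z, torch.zeros_like(Z)

lin = nn.Linear(3, 3)
parametrize.register_parametrization(lin, "weight", Sum())
p = lin.parametrizations.weight
print(p.original0.shape, p.original1.shape)   # the two saved tensors
print(lin.weight)                              # recomputed as original0 + original1
```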
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58488
Reviewed By: soulitzer
Differential Revision: D29100708
Pulled By: albanD
fbshipit-source-id: b9e91f439cf6b5b54d5fa210ec97c889efb9da38
Summary:
Implements a number of changes discussed with soulitzer offline.
In particular:
- Initialise `u`, `v` in `__init__` rather than in `_update_vectors`
- Initialise `u`, `v` to some reasonable vectors by doing 15 power iterations at the start
- Simplify the code of `_reshape_weight_to_matrix` (and make it faster) by using `flatten`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59564
Reviewed By: ailzhang
Differential Revision: D29066238
Pulled By: soulitzer
fbshipit-source-id: 6a58e39ddc7f2bf989ff44fb387ab408d4a1ce3d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58950
Use tensor iterator's API to set grain size in order to parallelize gelu op.
ghstack-source-id: 130947174
Test Plan: test_gelu
Reviewed By: ezyang
Differential Revision: D28689819
fbshipit-source-id: 0a02066d47a4d9648323c5ec27d7e0e91f4c303a
Summary:
Make sure tests run explicitly without TF32 don't use TF32 operations.
Fixes https://github.com/pytorch/pytorch/issues/52278
After the tf32 accuracy tolerance was increased to 0.05 this is the only remaining change required to fix the above issue (for TestNN.test_Conv3d_1x1x1_no_bias_cuda)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59624
Reviewed By: heitorschueroff
Differential Revision: D28996279
Pulled By: ngimel
fbshipit-source-id: 7f1b165fd52cfa0898a89190055b7a4b0985573a
Summary:
As per title. Resolves https://github.com/pytorch/pytorch/issues/56683.
`gradgradcheck` will fail once `target.requires_grad() == True` because of the limitations of the current double backward implementation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59447
Reviewed By: agolynski
Differential Revision: D28910140
Pulled By: albanD
fbshipit-source-id: 20934880eb4d22bec34446a6d1be0a38ef95edc7
Summary:
This PR introduces a helper function named `torch.nn.utils.skip_init()` that accepts a module class object + `args` / `kwargs` and instantiates the module while skipping initialization of parameter / buffer values. See discussion at https://github.com/pytorch/pytorch/issues/29523 for more context. Example usage:
```python
import torch
m = torch.nn.utils.skip_init(torch.nn.Linear, 5, 1)
print(m.weight)
m2 = torch.nn.utils.skip_init(torch.nn.Linear, 5, 1, device='cuda')
print(m2.weight)
m3 = torch.nn.utils.skip_init(torch.nn.Linear, in_features=5, out_features=1)
print(m3.weight)
```
```
Parameter containing:
tensor([[-3.3011e+28, 4.5915e-41, -3.3009e+28, 4.5915e-41, 0.0000e+00]],
requires_grad=True)
Parameter containing:
tensor([[-2.5339e+27, 4.5915e-41, -2.5367e+27, 4.5915e-41, 0.0000e+00]],
device='cuda:0', requires_grad=True)
Parameter containing:
tensor([[1.4013e-45, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00]],
requires_grad=True)
```
Bikeshedding on the name / namespace is welcome, as well as comments on the design itself - just wanted to get something out there for discussion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57555
Reviewed By: zou3519
Differential Revision: D28640613
Pulled By: jbschlosser
fbshipit-source-id: 5654f2e5af5530425ab7a9e357b6ba0d807e967f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48919
move data indexing utils
parallel inference contiguous path
parallel inference channels last path
add dim apply
optimize update stats
add channels last support for backward
Revert "add channels last support for backward"
This reverts commit cc5e29dce44395250f8e2abf9772f0b99f4bcf3a.
Revert "optimize update stats"
This reverts commit 7cc6540701448b9cfd5833e36c745b5015ae7643.
Revert "add dim apply"
This reverts commit b043786d8ef72dee5cf85b5818fcb25028896ecd.
bug fix
add batchnorm nhwc test for cpu, including C=1 and HW=1
Test Plan: Imported from OSS
Reviewed By: glaringlee
Differential Revision: D25399468
Pulled By: VitalyFedyunin
fbshipit-source-id: a4cd7a09cd4e1a8f5cdd79c7c32c696d0db386bd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48918
enable test case on AvgPool2d channels last for CPU
Test Plan: Imported from OSS
Reviewed By: glaringlee
Differential Revision: D25399466
Pulled By: VitalyFedyunin
fbshipit-source-id: 9477b0c281c0de5ed981a97e2dcbe6072d7f0aef
Summary:
Adds a new file under `torch/nn/utils/parametrizations.py` which should contain all the parametrization implementations
For spectral_norm we add the `SpectralNorm` module which can be registered using `torch.nn.utils.parametrize.register_parametrization` or using a wrapper: `spectral_norm`, the same API the old implementation provided.
Most of the logic is borrowed from the old implementation:
- Just like the old implementation, there should be cases when retrieving the weight should perform another power iteration (thus updating the weight) and cases where it shouldn't. For example in eval mode (`self.training=False`), we do not perform power iteration.
There are also some differences/difficulties with the new implementation:
- Using the new parametrization functionality as-is, there doesn't seem to be a good way to tell whether a 'forward' call was the result of parametrizations being unregistered (with `leave_parametrizations=True`) or of the injected property's getter being invoked. The issue is that we want to perform power iteration in the latter case but not the former, but we don't have this control as-is. So, in this PR I modified the parametrization functionality to change the module to eval mode before triggering their forward call
- Updates the vectors based on the weight on initialization to fix https://github.com/pytorch/pytorch/issues/51800 (this avoids silently updating weights in eval mode). This also means that we perform twice as many power iterations by the first forward.
- right_inverse is just the identity for now, but maybe it should assert that the passed value already satisfies the constraints
- So far, all the old spectral_norm tests have been cloned, but maybe we don't need so much testing now that the core functionality is already well tested
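For reference, a minimal usage sketch of the registration described above (hedged; it assumes the `spectral_norm` wrapper lives under `torch.nn.utils.parametrizations`):
```python
import torch
from torch.nn.utils import parametrize
from torch.nn.utils.parametrizations import spectral_norm

m = spectral_norm(torch.nn.Linear(5, 5))         # wrapper with the same API as the old one
print(parametrize.is_parametrized(m, "weight"))  # True: weight is now recomputed on access
with torch.no_grad():
    # after the initial power iterations the spectral norm should be roughly 1
    print(torch.linalg.norm(m.weight, ord=2))
```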
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57784
Reviewed By: ejguan
Differential Revision: D28413201
Pulled By: soulitzer
fbshipit-source-id: e8f1140f7924ca43ae4244c98b152c3c554668f2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55189
Currently EmbeddingBag and its variants support either int32 or int64 indices/offsets. We have use cases with a mix of int32 and int64 indices, which is not supported yet. To avoid introducing too many branches, we simply cast the offsets type to the indices type when they are not the same.
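A hedged usage sketch of the behaviour described above (the cast happens internally; shapes and values here are just for illustration):
```python
import torch
import torch.nn.functional as F

weight = torch.randn(10, 3)
indices = torch.tensor([1, 2, 4, 5, 4, 3], dtype=torch.int32)
offsets = torch.tensor([0, 3], dtype=torch.int64)  # different integer dtype than indices
# per this change, the int64 offsets are expected to be cast to the indices dtype internally
out = F.embedding_bag(indices, weight, offsets)
print(out.shape)  # two bags of dimension 3 -> torch.Size([2, 3])
```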
Test Plan: unit tests
Reviewed By: allwu
Differential Revision: D27482738
fbshipit-source-id: deeadd391d49ff65d17d016092df1839b82806cc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57558
Fixes #53359
If someone directly saves an nn.LSTM in PyTorch 1.7 and then loads it in PyTorch
1.8, it errors out with the following:
```
(In PyTorch 1.7)
import torch
model = torch.nn.LSTM(2, 3)
torch.save(model, 'lstm17.pt')
(In PyTorch 1.8)
model = torch.load('lstm17.pt')
AttributeError: 'LSTM' object has no attribute 'proj_size'
```
Although we do not officially support this (directly saving modules via
torch.save), it used to work and the fix is very simple. This PR adds an
extra line to `__setstate__`: if the state we are passed does not have
a `proj_size` attribute, we assume it was saved from PyTorch 1.7 and
older and set `proj_size` equal to 0.
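Conceptually the fix amounts to something like the following (a simplified, hypothetical sketch, not the actual `nn.LSTM` code):
```python
# hypothetical stand-in for the real nn.LSTM __setstate__, just to illustrate the fallback
class RNNLike:
    def __setstate__(self, state):
        self.__dict__.update(state)
        # checkpoints from PyTorch 1.7 and older predate proj_size
        if 'proj_size' not in self.__dict__:
            self.proj_size = 0

old_state = {'input_size': 2, 'hidden_size': 3}   # no 'proj_size' key, as in a 1.7 checkpoint
m = RNNLike.__new__(RNNLike)
m.__setstate__(old_state)
print(m.proj_size)   # 0
```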
Test Plan:
I wrote a test that tests `__setstate__`. But also,
Run the following:
```
(In PyTorch 1.7)
import torch
x = torch.ones(32, 5, 2)
model = torch.nn.LSTM(2, 3)
torch.save(model, 'lstm17.pt')
y17 = model(x)
(Using this PR)
model = torch.load('lstm17.pt')
x = torch.ones(32, 5, 2)
y18 = model(x)
```
and finally compare y17 and y18.
Reviewed By: mrshenli
Differential Revision: D28198477
Pulled By: zou3519
fbshipit-source-id: e107d1ebdda23a195a1c3574de32a444eeb16191
Summary:
Fix a numerical issue of CUDA channels-last SyncBatchNorm
The added test is a repro for the numerical issue. Thanks for the help from jjsjann123 who identified the root cause. Since pytorch SBN channels-last code was migrated from [nvidia/apex](https://github.com/nvidia/apex), apex SBN channels-last also has this issue. We will submit a fix there soon.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57077
Reviewed By: mruberry
Differential Revision: D28107672
Pulled By: ngimel
fbshipit-source-id: 0c80e79ddb48891058414ad8a9bedd80f0f7f8df
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45687
Fix changes the input size check for `InstanceNorm*d` to be more restrictive and correctly reject sizes with only a single spatial element, regardless of batch size, to avoid infinite variance.
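A hedged illustration of the stricter check (assuming it surfaces as a `ValueError`):
```python
import torch

m = torch.nn.InstanceNorm1d(3)
x = torch.randn(5, 3, 1)      # any batch size, but only a single spatial element per channel
try:
    m(x)
except ValueError as err:
    print(err)                 # with this fix, an error is expected: variance is undefined
```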
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56659
Reviewed By: pbelevich
Differential Revision: D27948060
Pulled By: jbschlosser
fbshipit-source-id: 21cfea391a609c0774568b89fd241efea72516bb
Summary:
Fixes https://github.com/pytorch/pytorch/issues/56380
BC-breaking note:
This changes the behavior of full backward hooks as they will now fire properly even if no input to the Module require gradients.
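A small, hedged sketch of the case that now works:
```python
import torch

def hook(module, grad_input, grad_output):
    print("full backward hook fired")

m = torch.nn.Linear(3, 1)
m.register_full_backward_hook(hook)
x = torch.randn(2, 3)          # note: the input does NOT require grad
m(x).sum().backward()          # with this change the hook is expected to fire anyway
```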
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56693
Reviewed By: ezyang
Differential Revision: D27947030
Pulled By: albanD
fbshipit-source-id: e8353d769ba5a2c1b6bdf3b64e2d61308cf624a2
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55587
The fix converts the binary `TensorIterator` used by softplus backwards to a ternary one, adding in the original input for comparison against `beta * threshold`.
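For intuition, a rough sketch of the gradient rule that now needs the original input (an approximation, not the actual kernel):
```python
import torch

def softplus_backward_sketch(grad_output, x, beta=1.0, threshold=20.0):
    # d/dx softplus(x) = sigmoid(beta * x); above the threshold softplus is treated as identity
    grad = grad_output * torch.sigmoid(beta * x)
    return torch.where(beta * x > threshold, grad_output, grad)
```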
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56484
Reviewed By: malfet
Differential Revision: D27908372
Pulled By: jbschlosser
fbshipit-source-id: 73323880a5672e0242879690514a17886cbc29cd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55237
In this PR, we reenable fast-gradcheck and resolve misc issues that arise:
Before landing this PR, land #55182 so that slow tests are still being run periodically.
Bolded indicates the issue is handled in this PR, otherwise it is handled in a previous PR.
**Non-determinism issues**:
- ops that do not have deterministic implementation (as documented https://pytorch.org/docs/stable/generated/torch.use_deterministic_algorithms.html#torch.use_deterministic_algorithms)
- test_pad_cuda (replication_pad2d) (test_nn)
- interpolate (test_nn)
- cummin, cummax (scatter_add_cuda_kernel) (test_ops)
- test_fn_gradgrad_prod_cpu_float64 (test_ops)
Randomness:
- RRelu (new module tests) - we fix by using our own generator as to avoid messing with user RNG state (handled in #54480)
Numerical precision issues:
- jacobian mismatch: test_gelu (test_nn, float32, not able to replicate locally) - we fixed this by disabling for float32 (handled in previous PR)
- cholesky_solve (test_linalg): #56235 handled in previous PR
- **cumprod** (test_ops) - #56275 disabled fast gradcheck
Not yet replicated:
- test_relaxed_one_hot_categorical_2d (test_distributions)
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D27920906
fbshipit-source-id: 894dd7bf20b74f1a91a5bc24fe56794b4ee24656
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53964. cc albanD almson
## Major changes:
- Overhauled the actual loss calculation so that the shapes are now correct (in functional.py)
- added the missing doc in nn.functional.rst
## Minor changes (in functional.py):
- I removed the previous check on whether input and target were the same shape. This is to allow for broadcasting, say when you have 10 predictions that all have the same target.
- I added some comments to explain each shape check in detail. Let me know if these should be shortened/cut.
Screenshots of updated docs attached.
Let me know what you think, thanks!
## Edit: Description of change of behaviour (affecting BC):
The backwards-compatibility is only affected for the `reduction='none'` mode. This was the source of the bug. For tensors with size (N, D), the old returned loss had size (N), as incorrect summation was happening. It will now have size (N, D) as expected.
### Example
Define input tensors, all with size (2, 3).
`input = torch.tensor([[0., 1., 3.], [2., 4., 0.]], requires_grad=True)`
`target = torch.tensor([[1., 4., 2.], [-1., 2., 3.]])`
`var = 2*torch.ones(size=(2, 3), requires_grad=True)`
Initialise loss with reduction mode 'none'. We expect the returned loss to have the same size as the input tensors, (2, 3).
`loss = torch.nn.GaussianNLLLoss(reduction='none')`
Old behaviour:
`print(loss(input, target, var)) `
`# Gives tensor([3.7897, 6.5397], grad_fn=<MulBackward0>). This has size (2).`
New behaviour:
`print(loss(input, target, var)) `
`# Gives tensor([[0.5966, 2.5966, 0.5966], [2.5966, 1.3466, 2.5966]], grad_fn=<MulBackward0>)`
`# This has the expected size, (2, 3).`
To recover the old behaviour, sum along all dimensions except for the 0th:
`print(loss(input, target, var).sum(dim=1))`
`# Gives tensor([3.7897, 6.5397], grad_fn=<SumBackward1>).`


Pull Request resolved: https://github.com/pytorch/pytorch/pull/56469
Reviewed By: jbschlosser, agolynski
Differential Revision: D27894170
Pulled By: albanD
fbshipit-source-id: 197890189c97c22109491c47f469336b5b03a23f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54812
Needed for quantization since different attributes might refer to the same module instance
Test Plan: Imported from OSS
Reviewed By: vkuzo
Differential Revision: D27408376
fbshipit-source-id: cada85c4a1772d3dd9502c3f6f9a56d690d527e7
Summary:
Fixes https://github.com/pytorch/pytorch/issues/25100, #43112
EDIT: pardon my inexperience since this is my first PR here; I did not realize the doc should not have any trailing white spaces, and that `[E712] comparison to False should be 'if cond is False:' or 'if not cond:'`; both are now fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55285
Reviewed By: mruberry
Differential Revision: D27765694
Pulled By: jbschlosser
fbshipit-source-id: c34774fa065d67c0ac130de20a54e66e608bdbf4
Summary:
This PR adds a `padding_idx` parameter to `nn.EmbeddingBag` and `nn.functional.embedding_bag`. As with `nn.Embedding`'s `padding_idx` argument, if an embedding's index is equal to `padding_idx` it is ignored, so it is not included in the reduction.
This PR does not add support for `padding_idx` for quantized or ONNX `EmbeddingBag` for opset10/11 (opset9 is supported). In these cases, an error is thrown if `padding_idx` is provided.
Fixes https://github.com/pytorch/pytorch/issues/3194
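A short, hedged usage sketch of the new argument:
```python
import torch

bag = torch.nn.EmbeddingBag(10, 3, mode='sum', padding_idx=0)
indices = torch.tensor([0, 2, 0, 4])   # the zeros are padding entries and are skipped
offsets = torch.tensor([0, 2])
out = bag(indices, offsets)            # shape (2, 3); each bag reduces only non-padding rows
```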
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49237
Reviewed By: walterddr, VitalyFedyunin
Differential Revision: D26948258
Pulled By: jbschlosser
fbshipit-source-id: 3ca672f7e768941f3261ab405fc7597c97ce3dfc
Summary:
Fixes https://github.com/pytorch/pytorch/issues/25100, #43112
EDIT: pardon my inexperience since this is my first PR here; I did not realize the doc should not have any trailing white spaces, and that `[E712] comparison to False should be 'if cond is False:' or 'if not cond:'`; both are now fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55285
Reviewed By: ngimel
Differential Revision: D27710107
Pulled By: jbschlosser
fbshipit-source-id: c4363a4604548c0d84628c4997dd23d6b3afb4d9
Summary:
This PR adds the functionality to use channels_last_3d, aka NDHWC, in Conv3d. It's only enabled when the cuDNN version is greater than or equal to 8.0.5.
Todo:
- [x] add memory_format test
- [x] add random shapes functionality test
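A hedged usage sketch of the feature described above (requires a CUDA build with cuDNN >= 8.0.5):
```python
import torch

if torch.cuda.is_available():   # plus the cuDNN version requirement noted above
    conv = torch.nn.Conv3d(8, 16, kernel_size=3).cuda().to(memory_format=torch.channels_last_3d)
    x = torch.randn(2, 8, 8, 16, 16, device='cuda').to(memory_format=torch.channels_last_3d)
    out = conv(x)
    print(out.is_contiguous(memory_format=torch.channels_last_3d))
```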
Close https://github.com/pytorch/pytorch/pull/52547
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48430
Reviewed By: mrshenli
Differential Revision: D27641452
Pulled By: ezyang
fbshipit-source-id: 0e98957cf30c50c3390903d307dd43bdafd28880
Summary:
There was an error when removing a parametrization with `leave_parametrized=True`. It had escaped the previous tests. This PR should fix that.
**Edit.**
I also took this chance to fix a few mistakes that the documentation had, and to also write the `set_original_` in a more compact way.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55456
Reviewed By: mrshenli
Differential Revision: D27620481
Pulled By: albanD
fbshipit-source-id: f1298ddbcf24566ef48850c62a1eb4d8a3576152
Summary:
Non-backwards-compatible change introduced in https://github.com/pytorch/pytorch/pull/53843 is tripping up a lot of code. Better to set it to False initially and then potentially flip to True in the later version to give people time to adapt.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55169
Reviewed By: mruberry
Differential Revision: D27511150
Pulled By: jbschlosser
fbshipit-source-id: 1ac018557c0900b31995c29f04aea060a27bc525
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48917
max_pool2d channels last support forward path
max_pool2d channels last support backward path
vectorize channels last forward path
rename the header file
fix windows build
combine PoolingKernel.h into Pool.h
add data type check
loosen test_max_pool2d_nhwc to cover device CPU
Test Plan: Imported from OSS
Reviewed By: glaringlee
Differential Revision: D25399470
Pulled By: VitalyFedyunin
fbshipit-source-id: b49b9581f1329a8c2b9c75bb10f12e2650e4c65a
Summary:
This PR enables using MIOpen for RNN FP16 on ROCM.
It does this by altering use_miopen to allow fp16. In the special case where LSTMs use projections we use the default implementation, as it is not implemented in MIOpen at this time. We do send out a warning once to let the user know.
We then remove the various asserts that are no longer necessary since we handle the case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52475
Reviewed By: H-Huang
Differential Revision: D27449150
Pulled By: malfet
fbshipit-source-id: 06499adb94f28d4aad73fa52890d6ba361937ea6
Summary:
Skips the tests indicated as failing in https://github.com/pytorch/pytorch/issues/54535.
During the ROCm CI upgrade from 4.0.1 to 4.1, some tests regressed. Specifically, FFT tests in test_spectral_ops.py and test_grid_sample in test_nn.py. In order to keep a passing CI signal, we need to disable these temporarily.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54536
Reviewed By: H-Huang
Differential Revision: D27442974
Pulled By: malfet
fbshipit-source-id: 07dffb957757a5fc7afaa5bf78b935a427251ef4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54901
Some subtleties:
- Need to make sure not to clobber composite definitions when deciding when to generate
- I was lazy and so I didn't make inplace on TensorList work, nor did I make inplace functions that returned void work
- A few tests started complaining that these noop meta functions weren't raising the errors they needed. This is tracked in https://github.com/pytorch/pytorch/issues/54897
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: jbschlosser
Differential Revision: D27407232
Pulled By: ezyang
fbshipit-source-id: 5e706a267496368acdafd128942c310954e43d29
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54452
The assertion that fails in the issue is necessary to appease mypy. Instead, I fix `_ntuple` to always return a `tuple`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54911
Reviewed By: H-Huang
Differential Revision: D27411088
Pulled By: jbschlosser
fbshipit-source-id: 7f5045c58dd4f5f3b07b4826d9b4ca85606c5bce
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53655
Currently EmbeddingBag and its variants support either int32 or int64 indices/offsets. We have use cases with a mix of int32 and int64 indices, which is not supported yet. To avoid introducing too many branches, we simply cast the offsets type to the indices type when they are not the same.
Test Plan: unit tests
Reviewed By: qizzzh
Differential Revision: D26820202
fbshipit-source-id: 3e8f09523329ea12393ea92ee9a6315aa40a0b7f
Summary:
**BC-breaking note**: This change throws errors for cases that used to silently pass. The old behavior can be obtained by setting `error_if_nonfinite=False`
Fixes https://github.com/pytorch/pytorch/issues/46849
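For illustration, a hedged sketch of opting into the old behaviour via the new flag:
```python
import torch
from torch.nn.utils import clip_grad_norm_

m = torch.nn.Linear(3, 1)
m(torch.randn(2, 3)).sum().backward()
# error_if_nonfinite=True raises if the total gradient norm is nan/inf;
# passing False keeps the previous, silent behaviour
clip_grad_norm_(m.parameters(), max_norm=1.0, error_if_nonfinite=False)
```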
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53843
Reviewed By: malfet
Differential Revision: D27291838
Pulled By: jbschlosser
fbshipit-source-id: 216d191b26e1b5919a44a3af5cde6f35baf825c4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54744
Fixes https://github.com/pytorch/pytorch/issues/54590
After the porting the upsample operators to be structured, they now forward memory_format information to the output. This is a problem for the cuda kernels, which are not implemented to deal with `torch.channels_last` memory format. The operators are:
* upsample_nearest2d
* upsample_bilinear2d
* upsample_nearest3d
* upsample_trilinear3d
This fix just allocates a temporary, contiguous output tensor when that happens, writes the results to the temporary and copies the results back to the output tensor.
I held off on adding tests to get the fix out quickly, but I wrote a script and ran some manual tests, that basically just asserts that the outputs are the same for cpu and cuda, for some threshold. I ran it for all 4 operators:
```
import torch
def basically_equal(t1, t2):
    epsilon = 1e-4
    diffs = torch.abs(t1 - t2)
    print(torch.all(diffs < epsilon))
# upsample 2d
a = torch.arange(48).reshape(2, 2, 3, 4).contiguous(memory_format=torch.channels_last).float()
out_cpu = torch.nn.functional.interpolate(a, scale_factor=2, mode='nearest')
out_cuda = torch.nn.functional.interpolate(a.to('cuda'), scale_factor=2, mode='nearest')
basically_equal(out_cpu, out_cuda.to("cpu"))
out_cpu = torch.nn.functional.interpolate(a, scale_factor=2, mode='bilinear', align_corners=True)
out_cuda = torch.nn.functional.interpolate(a.to('cuda'), scale_factor=2, mode='bilinear', align_corners=True)
basically_equal(out_cpu, out_cuda.to("cpu"))
# upsample 3d
a = torch.arange(96).reshape(2, 2, 2, 3, 4).contiguous(memory_format=torch.channels_last_3d).float()
out_cpu = torch.nn.functional.interpolate(a, scale_factor=3, mode='nearest')
out_cuda = torch.nn.functional.interpolate(a.to('cuda'), scale_factor=3, mode='nearest')
basically_equal(out_cpu, out_cuda.to("cpu"))
out_cpu = torch.nn.functional.interpolate(a, scale_factor=3, mode='trilinear', align_corners=True)
out_cuda = torch.nn.functional.interpolate(a.to('cuda'), scale_factor=3, mode='trilinear', align_corners=True)
basically_equal(out_cpu, out_cuda.to("cpu"))
```
prints
```
tensor(True)
tensor(True)
tensor(True)
tensor(True)
```
One thing that was weird: `upsample_bilinear2d` and `upsample_trilinear3d` were only accurate across cpu/cuda with an epsilon of `1e-4`. That tentatively sounds close enough to say that cuda isn't "wrong" (?), but that's not exactly "equal"... and I also ran the script before my change, and `bilinear2d` and `trilinear3d` were also the same across cpu/cuda with an epsilon of `1e-4`.
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D27351393
Pulled By: bdhirsh
fbshipit-source-id: b33f46e4855dc8b49b363770190b639beebbf5a7
Summary:
The fallback thnn 2d convolution uses `im2col` to get patches and `gemm` to implement convolution.
It has a shortcut to use `gemm` directly for kernel size 1, but this only works for stride == 1 and padding == 0.
This PR adds checks for stride == 1 and padding == 0 when determining whether `im2col` can be skipped.
Fixes https://github.com/pytorch/pytorch/issues/54036
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54080
Reviewed By: ejguan
Differential Revision: D27170482
Pulled By: zou3519
fbshipit-source-id: 055d6502239d34945934de409d78144d8a5c56f4
Summary:
Also modify the `tf32_on_and_off` decorator to make it support functions without a `device` argument.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52871
Reviewed By: ngimel
Differential Revision: D27286674
Pulled By: mruberry
fbshipit-source-id: 14f6d558271bd6a1d0bc40691c170d47e81de1ff
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45667
First part of #3867 (Pooling operators still to do)
This adds a `padding='same'` mode to the interface of `conv{n}d` and `nn.Conv{n}d`. This should match the behaviour of `tensorflow`. I couldn't find it explicitly documented but through experimentation I found `tensorflow` returns the shape `ceil(len/stride)` and always adds any extra asymmetric padding onto the right side of the input.
Since the `native_functions.yaml` schema doesn't seem to support strings or enums, I've moved the function interface into python and it now dispatches between the numerically padded `conv{n}d` and the `_conv{n}d_same` variant. Underscores because I couldn't see any way to avoid exporting a function into the `torch` namespace.
A note on asymmetric padding. The total padding required can be odd if both the kernel-length is even and the dilation is odd. mkldnn has native support for asymmetric padding, so there is no overhead there, but for other backends I resort to padding the input tensor by 1 on the right hand side to make the remaining padding symmetrical. In these cases, I use `TORCH_WARN_ONCE` to notify the user of the performance implications.
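A brief, hedged usage sketch of the new mode (stride 1, which is the straightforward case):
```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 10)
conv = torch.nn.Conv1d(3, 8, kernel_size=3, padding='same')
print(conv(x).shape)                         # spatial size preserved: torch.Size([1, 8, 10])

w = torch.randn(8, 3, 4)                     # even kernel size -> the asymmetric padding case
print(F.conv1d(x, w, padding='same').shape)  # torch.Size([1, 8, 10])
```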
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D27170744
Pulled By: jbschlosser
fbshipit-source-id: b3d8a0380e0787ae781f2e5d8ee365a7bfd49f22
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53665
ngimel pointed out to me where we already test the behavior of the `Upsample` ops in `test_nn.py`. This PR deletes my bespoke tests in `test_torch.py` and updates those in `test_nn.py` to test memory format properly.
There were two reasons the original test didn't pick up on a memory format regression:
- They didn't test the memory format of the output tensor explicitly, i.e. `output.is_contiguous(memory_format=...)`
- Even with that change, the test tensors were too simple to fail the tests. From some trial and error, it looks like one of the first two dimensions in the inputs needs to be > 1 in order for the `channels_last` memory format to actually re-order the strides.
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D26929683
Pulled By: bdhirsh
fbshipit-source-id: d17bc660ff031e9b3e2c93c60a9e9308e56ea612
Summary:
Provides the implementation for feature request issue https://github.com/pytorch/pytorch/issues/28937.
Adds the `Parametrization` functionality and implements `Pruning` on top of it.
It adds the `auto` mode, in which the parametrization is computed just once per forward pass. The previous implementation computed the pruning on every forward, which is not optimal when pruning RNNs for example.
It implements a caching mechanism for parameters. This is implemented through the mechanism proposed at the end of the discussion https://github.com/pytorch/pytorch/issues/7313. In particular, it assumes that the user will not manually change the updated parameters between the call to `backward()` and the `optimizer.step()`. If they do so, they would need to manually call the `.invalidate()` function provided in the implementation. This could be made into a function that gets a model and invalidates all the parameters in it. It might be the case that this function has to be called in the `.cuda()`, `.to()` and related functions.
As described in https://github.com/pytorch/pytorch/issues/7313, this could be used to implement the `weight_norm` and `spectral_norm` functions in a cleaner way. It also allows, as described in https://github.com/pytorch/pytorch/issues/28937, for the implementation of constrained optimization on manifolds (i.e. orthogonal constraints, positive definite matrices, invertible matrices, weights on the sphere or the hyperbolic space...)
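For illustration, a hedged sketch written against the `torch.nn.utils.parametrize` API that this work evolved into (the names used in this PR, e.g. `.invalidate()`, may differ):
```python
import torch
from torch.nn.utils import parametrize

class Symmetric(torch.nn.Module):
    def forward(self, X):
        # constrain a square weight to be symmetric
        return X.triu() + X.triu(1).transpose(-1, -2)

m = torch.nn.Linear(4, 4)
parametrize.register_parametrization(m, "weight", Symmetric())
with parametrize.cached():          # compute the parametrized weight once and reuse it
    y = m(torch.randn(2, 4)) + m(torch.randn(2, 4))
```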
TODO (when implementation is validated):
- More thorough test
- Documentation
Resolves https://github.com/pytorch/pytorch/issues/28937
albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33344
Reviewed By: zhangguanheng66
Differential Revision: D26816708
Pulled By: albanD
fbshipit-source-id: 07c8f0da661f74e919767eae31335a9c60d9e8fe
Summary:
Fixes https://github.com/pytorch/pytorch/issues/38137
As mentioned in the issue, this is a workaround for [python issue 43367](https://bugs.python.org/issue43367). There are a number of other places where `sys.modules` is modified, if something changes in python perhaps those should be reviewed as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53107
Reviewed By: zou3519
Differential Revision: D26753571
Pulled By: ezyang
fbshipit-source-id: 2bda03bab39ff9ca58ce4bc13befe021da91b9c4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52671
Code is written with the assumption that new_size is an unsigned value,
and when the function is called with a negative value it silently returns a nullptr rather than raising an exception.
Fix the above-mentioned logic by converting new_size to an unsigned type and letting cpu_allocator raise an exception on a negative alloc.
Unroll nested if blocks by returning early if new_size is 0.
Add TestNN.test_adaptive_pooling_size_overflow to indirectly validate the fix.
Fixes https://github.com/pytorch/pytorch/issues/50960
Test Plan: Imported from OSS
Reviewed By: walterddr
Differential Revision: D26607549
Pulled By: malfet
fbshipit-source-id: e3d4f7548b098f24fa5aba42d8f4e9288ece1e2e
Summary:
Fixes https://github.com/pytorch/pytorch/issues/52257
## Background
Reverts MHA behavior for `bias` flag to that of v1.5: flag enables or disables both in and out projection biases.
Updates type annotations for both in and out projections biases from `Tensor` to `Optional[Tensor]` for `torch.jit.script` usage.
Note: With this change, `_LinearWithBias` defined in `torch/nn/modules/linear.py` is no longer utilized. Completely removing it would require updates to quantization logic in the following files:
```
test/quantization/test_quantized_module.py
torch/nn/quantizable/modules/activation.py
torch/nn/quantized/dynamic/modules/linear.py
torch/nn/quantized/modules/linear.py
torch/quantization/quantization_mappings.py
```
This PR takes a conservative initial approach and leaves these files unchanged.
**Is it safe to fully remove `_LinearWithBias`?**
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52537
Test Plan:
```
python test/test_nn.py TestNN.test_multihead_attn_no_bias
```
## BC-Breaking Note
In v1.6, the behavior of `MultiheadAttention`'s `bias` flag was incorrectly changed to affect only the in projection layer. That is, setting `bias=False` would fail to disable the bias for the out projection layer. This regression has been fixed, and the `bias` flag now correctly applies to both the in and out projection layers.
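A hedged illustration of the restored behaviour:
```python
import torch

mha = torch.nn.MultiheadAttention(embed_dim=8, num_heads=2, bias=False)
print(mha.in_proj_bias)    # None: in projection bias disabled
print(mha.out_proj.bias)   # None: out projection bias disabled again, matching v1.5
```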
Reviewed By: bdhirsh
Differential Revision: D26583639
Pulled By: jbschlosser
fbshipit-source-id: b805f3a052628efb28b89377a41e06f71747ac5b
Summary:
Some minor improvement for lazy modules introduced in https://github.com/pytorch/pytorch/issues/44538, https://github.com/pytorch/pytorch/issues/47350 and https://github.com/pytorch/pytorch/issues/51548.
This PR mainly turns the bias into an `UninitializedParameter`, and instead of creating empty tensors like
```python
self.bias = Parameter(torch.Tensor(0))
self.bias = UninitializedParameter()
```
I think it would be better to
```python
self.register_parameter('bias', None)
self.bias = UninitializedParameter()
```
In addition, I change the constructor of the `LazyBatchNorm` from
```python
self.running_mean = UninitializedBuffer()
```
to
```python
self.register_buffer('running_mean', UninitializedBuffer())
```
as the original one would not change the underlying `self._buffers`.
Thank you for your time on reviewing this PR :).
Gently ping albanD, mruberry
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52212
Reviewed By: jbschlosser
Differential Revision: D26504508
Pulled By: albanD
fbshipit-source-id: 7094d0bb4fa9e2a40a07b79d350ea12a6ebfd080
Summary:
Temporarily disabling OneDNN conv for group size = 24, as the OneDNN update came too late to be fully tested https://github.com/pytorch/pytorch/issues/50042
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52327
Reviewed By: agolynski
Differential Revision: D26474186
Pulled By: VitalyFedyunin
fbshipit-source-id: 8d6964d33c8dcab70e207088c3940810eabbd068
Summary:
Because this pull request (https://github.com/pytorch/pytorch/issues/40801) has become an important part of recent 3D models, brings a significant improvement in speed, and has also been open for a while, I decided to resolve the previous review comment and modify it a bit so that it can be merged into the latest version of PyTorch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51027
Reviewed By: albanD
Differential Revision: D26414116
Pulled By: ngimel
fbshipit-source-id: 562c099f4d7f6d603a9c2f2e2a518bc577b0d8ee
Summary:
Adding CUDA 11.2 to Windows CI.
Disabled tests:
The following ran into `CUDA error: misaligned address` for CUDA 11.2: (issue linked below)
`test_where_scalar_valid_combination_cuda_complex128` in test_torch.py
`test_sgn_complex_cuda` in test_autograd.py
The following ran into `CUDA error: too many resources requested for launch` for CUDA 11.2: (https://github.com/pytorch/pytorch/issues/52002)
test_EmbeddingBag_per_sample_weights_and_new_offsets_cuda_int64_float64
test_EmbeddingBag_per_sample_weights_and_offsets_cuda_int64_float64
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51598
Reviewed By: mrshenli
Differential Revision: D26344965
Pulled By: janeyx99
fbshipit-source-id: 3c9a4ed16d748969e96593220ec0a9f33e1ffcef
Summary:
For non-supported input, we should not do the check in a parallel region; this PR first does the dtype check, and then does the parallel_for.
Fixes https://github.com/pytorch/pytorch/issues/51352.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51443
Reviewed By: izdeby
Differential Revision: D26305584
Pulled By: ngimel
fbshipit-source-id: 6faa3148af5bdcd7246771c0ecb4db2b31ac82c6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50794
Original commit changeset: b4a7948088c0
There are some subtle extra tweaks on top of the original. I can unbundle them, but I've opted to keep it with the port because it's the easiest way to make sure the changes are exercised.
* There's a bugfix in the codegen to test if a dispatch key is structured *before* short circuiting because the dispatch key was missing in the table. This accounts for mixed structured-nonstructured situations where the dispatch table is present, but the relevant structured key isn't (because the dispatch table only exists to register, e.g., QuantizedCPU)
* Dispatch tables for functions which delegate to structured kernels don't have Math entries generated for them.
* It's now illegal to specify a structured dispatch key in a delegated structured kernel (it will be ignored!); add is now fixed to follow this
* There are some extra sanity checks for NativeFunctions validation
* Finally, unlike the original PR, I switched the .vec variant of upsample_nearest2d to also be DefaultBackend, bringing it inline with upsample_nearest1d.
ghstack-source-id: 120038038
Test Plan:
```
buck test mode/dev //coreai/tiefenrausch:python_tests -- --exact 'coreai/tiefenrausch:python_tests - test_can_run_local_async_inference_cpu (coreai.tiefenrausch.tests.python_test.TiefenrauschPY)' --run-disabled
```
Reviewed By: ngimel
Differential Revision: D25962873
fbshipit-source-id: d29a9c97f15151db3066ae5efe7a0701e6dc05a3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50739
This does not turn on batched grad testing for autogenerated NewModuleTest
tests and CriterionTest tests. Those are coming later.
Test Plan: - run tests
Reviewed By: ejguan
Differential Revision: D25997677
Pulled By: zou3519
fbshipit-source-id: b4b2d68e0f99c3d573faf237e1e531d0b3fced40
Summary:
Fixes [#24991](https://github.com/pytorch/pytorch/issues/24991)
I used a value of 0.75 as suggested in the forums by Thomas: https://discuss.pytorch.org/t/calculate-gain-tanh/20854/6
I verified that the value keeps the gradient stable for a 100-layer network.
Code to reproduce (from [jpeg729](https://discuss.pytorch.org/t/calculate-gain-tanh/20854/4)):
```python
import torch
import torch.nn.functional as F
import sys
a = torch.randn(1000,1000, requires_grad=True)
b = a
print (f"in: {a.std().item():.4f}")
for i in range(100):
    l = torch.nn.Linear(1000,1000, bias=False)
    torch.nn.init.xavier_normal_(l.weight, torch.nn.init.calculate_gain("selu"))
    b = getattr(F, 'selu')(l(b))
    if i % 10 == 0:
        print (f"out: {b.std().item():.4f}", end=" ")
        a.grad = None
        b.sum().backward(retain_graph=True)
        print (f"grad: {a.grad.abs().mean().item():.4f}")
```
Output:
```
in: 1.0008
out: 0.7968 grad: 0.6509
out: 0.3127 grad: 0.2760
out: 0.2404 grad: 0.2337
out: 0.2062 grad: 0.2039
out: 0.2056 grad: 0.1795
out: 0.2044 grad: 0.1977
out: 0.2005 grad: 0.2045
out: 0.2042 grad: 0.2273
out: 0.1944 grad: 0.2034
out: 0.2085 grad: 0.2464
```
I included the necessary documentation change, and it passes the _test_calculate_gain_nonlinear_ unittest.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50664
Reviewed By: mruberry
Differential Revision: D25942217
Pulled By: ngimel
fbshipit-source-id: 29ff1be25713484fa7c516df71b12fdaecfb9af8
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42588
The contiguity check used to be for memory format suggested by `grad_output->suggest_memory_format()`, but an invariant guaranteed by derivatives.yaml is `input->suggest_memory_format()`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50659
Reviewed By: mruberry
Differential Revision: D25938921
Pulled By: ngimel
fbshipit-source-id: a945bfef6ce3d91b17e7ff96babe89ffd508939a
Summary:
Building on top of the work of anjali411 (https://github.com/pytorch/pytorch/issues/46640)
Things added in this PR:
1. Modify backward and double-backward formulas
2. Add complex support for `new module tests` and criterion tests (and add complex tests for L1)
3. Modify some existing tests to support complex
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49912
Reviewed By: zhangguanheng66
Differential Revision: D25853036
Pulled By: soulitzer
fbshipit-source-id: df619f1b71c450ab2818eb17804e0c55990aa8ad
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49726
Just cleaned up the unnecessary `ModuleAttributeError`
BC-breaking note:
`ModuleAttributeError` was added in the previous unsuccessful [PR](https://github.com/pytorch/pytorch/pull/49879) and removed here. If a user catches `ModuleAttributeError` specifically, this will no longer work. They should catch `AttributeError` instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50298
Reviewed By: mrshenli
Differential Revision: D25907620
Pulled By: jbschlosser
fbshipit-source-id: cdfa6b1ea76ff080cd243287c10a9d749a3f3d0a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48378
This commit adds support for accepting custom importance scores to use for pruning mask computation, rather than only using the parameter.
This is useful if one wants to prune based on scores from different technique such as activations, gradients, weighted scoring of parameters, etc.
An alternative to the above approach would be to pass the custom mask to the already available interface. However, the ability to accept importance scores is easier since it can leverage the mask computation logic that has already been baked in.
In addition, the commit also makes some minor lint fixes.
Test Plan:
* Unit tests
* Circle CI
Differential Revision: D24997355
fbshipit-source-id: 30797897977b57d3e3bc197987da20e88febb1fa
Summary:
Fixes https://github.com/pytorch/pytorch/issues/598
This is BC-breaking as we now explicitly don't call the hook when there are no Tensors at the top level of the output.
This feature was not working anyways as the returned grad_input/grad_output were wrong (not respecting the output structure and wrong inputs for multi-Node Module).
This is also BC-breaking as we now report the correct gradients for `nn.Module`s that contain multiple autograd `Node`s while we use to return bad results before.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46163
Reviewed By: ailzhang, mruberry
Differential Revision: D24894180
Pulled By: albanD
fbshipit-source-id: e1b5d193d2818eb2f51e2a2722c7405c8bd13c2b
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46213
I didn't update the documentation yet; I will add those changes soon. There are a few other things that I didn't do, but want to clarify whether I should:
1. I didn't expose projections in c++ API: torch/csrc/api/src/nn/modules/rnn.cpp. Let me know if this is desirable and I will add those changes.
2. I didn't expose projections in "lstm_cell" function and "_thnn_differentiable_lstm_cell_backward" functions from aten/src/ATen/native/RNN.cpp. As far as I understand, they are not needed for nn.LSTM CPU execution. For lstm_cell, projections don't bring any real benefit, since if cell is used separately, it can be easily added in Python. For "_thnn_differentiable_lstm_cell_backward", I'm actually not sure where exactly that function is used, so I also disabled projections there for now. Please let me know if I should change that.
3. I added check that projections are not supported for quantized LSTMs to quantized_lstm_<data/input> functions. But I didn't add any checks to LSTMCell code. It seems that since I disabled projections in "lstm_cell" function, they should also not be available for quantized models through any other API than quantized_lstm_<data/input>. Please let me know if I'm not correct and I will add checks to other places.
4. Projections are not supported for CuDNN versions < 7.1.2. Should I add the check for CuDNN version and disable projections in that case? If so, what will be the best way to do that?
5. Currently I added projection weight as the last weight, so the layout is "w_ih, w_hh, b_ih, b_hh, w_hr". This breaks the assumption that biases come after weights and thus I had to add additional if-s in various places. Alternative way would be to have "w_ih, w_hh, w_hr, b_ih, b_hh" layout, in which case the assumption will be true. But in that case I will need to split the loop in get_parameters function from aten/src/ATen/native/cudnn/RNN.cpp. And in some cases, I will still need to add an "undefined" tensor in the 3rd position, because we get all 5 weights from CuDNN most of the time. So I'm not sure which way is better. Let me know if you think I should change to the weights-then-biases layout.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47725
Reviewed By: zou3519
Differential Revision: D25449794
Pulled By: ngimel
fbshipit-source-id: fe6ce59e481d1f5fd861a8ff7fa13d1affcedb0c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49187
Expands the implementation of PixelShuffle to support any number of batch dimensions
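A small, hedged usage sketch of the extended shape handling:
```python
import torch

x = torch.randn(2, 3, 4 * 9, 5, 5)          # two leading batch dimensions, upscale_factor=3
out = torch.nn.functional.pixel_shuffle(x, 3)
print(out.shape)                             # torch.Size([2, 3, 4, 15, 15])
```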
Test Plan: `buck test caffe2/test:nn -- test_pixel_shuffle`
Reviewed By: mruberry
Differential Revision: D25399058
fbshipit-source-id: ab0a7f593b276cafc9ebb46a177e2c1dce56d0de
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48916
optimize adaptive average pool2d forward path
optimize adaptive average pool2d backward path
remove unused headers
minor change
minor change
rename the header; add adaptive max pooling in future.
minor change
loosen adaptive_pool2d test on nhwc to cover both cuda and cpu devices
minor change
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D25399469
Pulled By: VitalyFedyunin
fbshipit-source-id: 86f9fda35194f21144bd4667b778c861c05a5bac
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46983.
The solution is based on two components:
1. The introduction of the `_initialized` attribute. This will be used during ParameterList/Dict creation methods `__init__` (introduced in https://github.com/pytorch/pytorch/issues/47772) and `__setstate__` to not trigger warnings when setting general `Module` attributes.
2. The introduction of the `not hasattr(self, key)` check to avoid triggering warnings when changing general `Module` attributes such as `.training` during the `train()` and `eval()` methods.
Tests related to the fix are added.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48315
Reviewed By: mrshenli
Differential Revision: D25130217
Pulled By: albanD
fbshipit-source-id: 79e2abf1eab616f5de74f75f370c2fe149bed4cb
Summary:
Fixed test:
- `test_is_nonzero`: this asserts an exact match, which is flaky when `TORCH_SHOW_CPP_STACKTRACES=1`, so I changed it to a non-exact assert
- `test_pinverse` TF32
- `test_symeig` TF32
- `test_triangular_solve_batched_many_batches_cpu_float64` precision on CPU BLAS
- `test_qr` TF32, as well as the tensor factory forgets a `dtype=dtype`
- `test_lu` TF32
- `ConvTranspose2d` TF32
- `Conv3d_1x1x1_no_bias` TF32
- `Transformer*` TF32
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46941
Reviewed By: heitorschueroff
Differential Revision: D24852725
Pulled By: mruberry
fbshipit-source-id: ccd4740cc643476178d81059d1c78da34e5082ed
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46758
It's in general helpful to support int32 indices and offsets, especially when such tensors are large and need to be transferred to accelerator backends. Since it may not be very useful to support the combination of int32 indices and int64 offsets, here we enforce that these two must have the same type.
Test Plan: unit tests
Reviewed By: ngimel
Differential Revision: D24470808
fbshipit-source-id: 94b8a1d0b7fc9fe3d128247aa042c04d7c227f0b
Summary:
Fix https://github.com/pytorch/pytorch/issues/44601
I added a bicubic grid sampler on both the cpu and cuda side, but haven't added it in AVX2.
There is a [colab notebook](https://colab.research.google.com/drive/1mIh6TLLj5WWM_NcmKDRvY5Gltbb781oU?usp=sharing) showing some test results. The notebook uses bilinear for the test, since I could only use the distributed version of PyTorch in it. You could just download it and modify `mode_torch=bicubic` to show the results.
There is some duplicate code for getting and setting values, since the helper function used in bilinear first clips the coordinate beyond the boundary, and then gets or sets the value. However, in bicubic, there are more points to consider. I could refactor that part after making sure the overall calculation is correct.
Thanks
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44780
Reviewed By: mrshenli
Differential Revision: D24681114
Pulled By: mruberry
fbshipit-source-id: d39c8715e2093a5a5906cb0ef040d62bde578567
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46558
This PR fixes a bug with how pooling output shape was computed.
## BC Breaking Notes
Previously, a bug in the pooling code allowed a sliding window to be entirely off bounds. Now, sliding windows must start inside the input or left padding (not right padding, see https://github.com/pytorch/pytorch/issues/46929) and may only go off-bounds if ceil_mode=True.
fixes #45357
TODO
- [x] Ensure existing tests are checking for the correct output size
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D24633372
Pulled By: heitorschueroff
fbshipit-source-id: 55925243a53df5d6131a1983076f11cab7516d6b
Summary:
This PR disables the test_softmax and test_softmax_results in test_nn.py that were enabled in https://github.com/pytorch/pytorch/issues/46363. The softmax tests are causing failure on gfx906 machines. Disabling those until we root cause and fix them on 906.
cc: jeffdaily ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46793
Reviewed By: izdeby
Differential Revision: D24539211
Pulled By: ezyang
fbshipit-source-id: 633cb9dc497ad6359af85b85a711c4549d772b2a
Summary:
This pull request enables the following tests on ROCm:
* TestCuda.test_tiny_half_norm_
* TestNNDeviceTypeCUDA.test_softmax_cuda_float16
* TestNNDeviceTypeCUDA.test_softmax_cuda_float32
* TestNNDeviceTypeCUDA.test_softmax_results_cuda_float16
* TestNNDeviceTypeCUDA.test_softmax_results_cuda_float32
The earlier failures, because of which the tests were skipped, were because of a precision issue for FP16 compute on MI25 hardware with ROCm 3.7 and older. The fix was delivered in the compiler in ROCm 3.8.
The pull request fixes https://github.com/pytorch/pytorch/issues/37493
cc: jeffdaily ezyang malfet mruberry
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46363
Reviewed By: heitorschueroff
Differential Revision: D24325639
Pulled By: ezyang
fbshipit-source-id: a7dbb238cf38c04b6592baad40b4d71725a358c9
Summary:
Close https://github.com/pytorch/pytorch/issues/31690
I have verified the functionality of ConvTranspose2d (with this PR) on roughly 32,000 random shapes on V100, A100, using cuDNN 8.0.4 and CUDA 11.1. The 32,000 shapes contain 4x8,000 of (fp16, fp32) x (nchw, nhwc) each.
The random shapes are sampled from
```jsonc
{
"batch_size": {"low": 1, "high": 8},
"in_channels": {"low": 16, "high": 128},
"out_channels": {"low": 16, "high": 128},
"height": {"low": 16, "high": 224},
"stride": {"set": [[1, 1], [2, 2]]},
"padding": {"set": [[0, 0]]},
"output_padding": {"set": [[0, 0], [1, 1], [0, 1], [1, 0]]},
"kernel_size": {"set": [[3, 3], [1, 1], [1, 3], [3, 1], [2, 2]]},
"dilation": {"set": [[1, 1]]},
"deterministic": {"set": [true, false]},
"benchmark": {"set": [true, false]},
"allow_tf32": {"set": [true, false]},
"groups": {"set": [1, IN_CHANNELS]}
}
```
- Input `width` is the same as `height`.
- `groups` can be either 1, or the same as `in_channels` (grouped convolution). When `groups` is 1, `out_channels` is random; when `groups` is the same as `in_channels`, `out_channels` is also the same as `in_channels`
All of the checked shapes can be found in csv files here https://github.com/xwang233/code-snippet/tree/master/convtranspose2d-dilation/functionality-check-cudnn8.0.4.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46290
Reviewed By: mruberry
Differential Revision: D24422091
Pulled By: ngimel
fbshipit-source-id: 9f0120f2995ae1575c0502f1b2742390d7937b24
Summary:
Follow-up of https://github.com/pytorch/pytorch/issues/46461 with a similar goal
Makes them more readable and possibly faster. Care has to be taken because `map` applies the function immediately while `(x for x in xs)` is a generator expression which gets evaluated later. This is a benefit in some cases where it is not required to actually create the list of values in memory (e.g. when passing to `tuple` or `extend` or `join`)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46462
Reviewed By: zou3519
Differential Revision: D24422343
Pulled By: ezyang
fbshipit-source-id: 252e33499c92ac0b15238f2df32681dbbda2b237
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46572
When `num_samples == 0`, the grid becomes zero. Although CUDA just silently proceeds, `cudaGetLastError()` will complain about `Error: invalid configuration argument`. So it actually fails in some later place, which becomes really hard to debug.
Reviewed By: jianyuh
Differential Revision: D24409874
fbshipit-source-id: ca54de13b1ab48204bbad265e3f55b56b94a1a2f
Summary:
This PR makes it possible to cast the parameters of nn.Module to complex dtypes.
The following code works with the proposed changes.
```python
In [1]: import torch
In [2]: lin = torch.nn.Linear(5, 1).to(torch.complex64)
In [3]: lin(torch.zeros(3, 5, dtype=torch.complex64))
Out[3]:
tensor([[-0.1739+0.j],
[-0.1739+0.j],
[-0.1739+0.j]], grad_fn=<AddmmBackward>)
```
Fixes https://github.com/pytorch/pytorch/issues/43477.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44788
Reviewed By: zou3519
Differential Revision: D24307225
Pulled By: anjali411
fbshipit-source-id: dacc4f5c8c9a99303f74d1f5d807cd657b3b69b5
Summary:
Retake on https://github.com/pytorch/pytorch/issues/40493 after all the feedback from albanD
This PR implements the generic Lazy mechanism and a sample `LazyLinear` layer with the `UninitializedParameter`.
The main differences with the previous PR are two;
Now `torch.nn.Module` remains untouched.
We don't require an explicit initialization or a dummy forward pass before starting the training or inference of the actual module, which makes this much simpler to use from the user side.
As we discussed offline, there was the suggestion of not using a mixin, but changing the `__class__` attribute of `LazyLinear` to become `Linear` once it's completely initialized. While this can be useful, for the time being we need `LazyLinear` to be a `torch.nn.Module` subclass since there are many checks that rely on the modules being instances of `torch.nn.Module`.
This can cause problems when we create complex modules such as
```
class MyNetwork(torch.nn.Module):
    def __init__(self):
        super(MyNetwork, self).__init__()
        self.conv = torch.nn.Conv2d(20, 4, 2)
        self.linear = torch.nn.LazyLinear(10)

    def forward(self, x):
        y = self.conv(x).clamp(min=0)
        return self.linear(y)
```
Here, when the __setattr__ function is called at the time LazyLinear is registered, it won't be added to the child modules of `MyNetwork`, so we have to manually do it later, but currently there is no way to do such thing as we can't access the parent module from LazyLinear once it becomes the Linear module. (We can add a workaround to this if needed).
TODO:
Add convolutions once the design is OK
Fix docstrings
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44538
Reviewed By: ngimel
Differential Revision: D24162854
Pulled By: albanD
fbshipit-source-id: 6d58dfe5d43bfb05b6ee506e266db3cf4b885f0c
Summary:
This PR patches the ReplicationPad modules in `torch.nn` to be compatible with 0-dim batch sizes.
EDIT: this is part of the work on gh-12013 (make all nn layers accept empty batch size)
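A minimal, hedged sketch of the now-supported case:
```python
import torch

pad = torch.nn.ReplicationPad2d(1)
x = torch.randn(0, 3, 4, 4)      # empty batch
print(pad(x).shape)              # expected after this change: torch.Size([0, 3, 6, 6])
```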
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39137
Reviewed By: albanD
Differential Revision: D24131386
Pulled By: ngimel
fbshipit-source-id: 3d93057cbe14d72571943c8979d5937e4bbf743a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45474
When batchnorm affine is set to false, weight and bias are set to None, which is not supported in this case. Added a fix to set the weights to 1 and the bias to 0 if they are not set.
Test Plan: Add unit test for testing fusing conv, batchnorm where batchnorm is in affine=False mode.
Reviewed By: z-a-f
Differential Revision: D23977080
fbshipit-source-id: 2782be626dc67553f3d27d8f8b1ddc7dea022c2a
Summary:
- The thresholds of some tests are bumped up. Depending on the random generator, sometimes these tests fail with things like 0.0059 is not smaller than 0.005. I ran `test_nn.py` and `test_torch.py` for 10+ times to check these are no longer flaky.
- Add `tf32_on_and_off` to new `matrix_exp` tests.
- Disable TF32 on test suites other than `test_nn.py` and `test_torch.py`
cc: ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44240
Reviewed By: mruberry
Differential Revision: D23882498
Pulled By: ngimel
fbshipit-source-id: 44a9ec08802c93a2efaf4e01d7487222478b6df8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43680
As discussed [here](https://github.com/pytorch/pytorch/issues/43342),
adding in a Python-only implementation of the triplet-margin loss that takes a
custom distance function. Still discussing whether this is necessary to add to
PyTorch Core.
Test Plan:
python test/run_tests.py
Imported from OSS
Reviewed By: albanD
Differential Revision: D23363898
fbshipit-source-id: 1cafc05abecdbe7812b41deaa1e50ea11239d0cb
Summary:
Fix https://discuss.pytorch.org/t/illegal-memory-access-when-i-use-groupnorm/95800
`dX` is a Tensor, comparing `dX` with `nullptr` was wrong.
cc BIT-silence who wrote the kernel.
The test couldn't pass with `rtol=0` and `x.requires_grad=True`, so I had to update that to `1e-5`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44863
Reviewed By: mruberry
Differential Revision: D23754101
Pulled By: BIT-silence
fbshipit-source-id: 2eb0134dd489480e5ae7113a7d7b84629104cd49
Summary:
This PR adds dilation to _ConvTransposeNd._output_padding method and tests using a bunch of different sized inputs.
Fixes https://github.com/pytorch/pytorch/issues/14272
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43793
Reviewed By: zou3519
Differential Revision: D23493313
Pulled By: ezyang
fbshipit-source-id: bca605c428cbf3a97d3d24316d8d7fde4bddb307
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44398
These end up executing the same tests, so no reason to have them separate.
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D23600855
Pulled By: gchanan
fbshipit-source-id: 0952492771498bf813f1bf8e1d7c8dce574ec965
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44382
This is to fix a typo that was introduced in #44032.
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D23601316
Pulled By: glaringlee
fbshipit-source-id: 17d6de5900443ea46c7a6ee9c7614fe6f2d92890
Summary:
Previously, `at::native::embedding` implicitly assumed that the `weight` argument would be 1-D or greater. Given a 0-D tensor, it would segfault. This change makes it throw a RuntimeError instead.
Fixes https://github.com/pytorch/pytorch/issues/41780
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42550
Reviewed By: smessmer
Differential Revision: D23040744
Pulled By: albanD
fbshipit-source-id: d3d315850a5ee2d2b6fcc0bdb30db2b76ffffb01
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41656
For the CPU version, this is a regression introduced in https://github.com/pytorch/pytorch/issues/10980 which vectorized the `grid_sampler_2d` implementation. It uses the AVX2 gather intrinsic which for `float` requires 32-bit indexing to match the number of floats in the AVX register. There is also an `i64gather_ps` variant but this only utilizes half of the vector width so would be expected to give worse performance in the more likely case where 32-bit indexing is acceptable. So, I've left the optimised AVX version as-is and reinstated the old non-vectorized version as a fallback.
For the CUDA version, this operation has never supported 32-bit indexing so this isn't a regression. I've templated the kernel on index type and added 64-bit variants. Although I gather in some places a simple `TORCH_CHECK(canUse32BitIndexMath(...))` is used instead. So, there is a decision to be made here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41923
Reviewed By: glaringlee
Differential Revision: D22925931
Pulled By: zou3519
fbshipit-source-id: 920816107aae26360c5e7f4e9c729fa9057268bb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42215
Specifically on https://github.com/pytorch/pytorch/pull/27477#discussion_r371402079
We would like `include_last=True` to be supported for other reduction types like mean and max as well. It now causes further code fragmentation in DPER (https://www.internalfb.com/intern/diff/D22794469/).
More details: https://www.internalfb.com/intern/diff/D22794469/?dest_fbid=309597093427021&transaction_id=631457624153457
ghstack-source-id: 108733009
Test Plan:
```
buck test mode/dev-nosan //caffe2/test:nn -- "test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu"
```
```
(base) [jianyuhuang@devbig281.ftw3.facebook.com: ~/fbsource/fbcode/caffe2/test] $ TORCH_SHOW_CPP_STACKTRACES=1 buck test mode/dev-nosan //caffe2/test:
nn -- "test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu" --print-passing-details
Parsing buck files: finished in 1.2 sec
Building: finished in 5.5 sec (100%) 10130/10130 jobs, 2 updated
Total time: 6.7 sec
More details at https://www.internalfb.com/intern/buck/build/dbdc2063-69d8-45cb-9146-308a9e8505ef
First unknown argument: --print-passing-details.
Falling back to TestPilot classic.
Trace available for this run at /tmp/testpilot.20200728-195414.1422748.log
TestPilot test runner for Facebook. See https://fburl.com/testpilot for details.
Testpilot build revision cd2638f1f47250eac058b8c36561760027d16add fbpkg f88726c8ebde4ba288e1172a348c7f46 at Mon Jul 27 18:11:43 2020 by twsvcscm from /usr/local/fbprojects/packages/testinfra.testpilot/887/t.par
Discovering tests
Running 1 test
Started new test run: https://our.intern.facebook.com/intern/testinfra/testrun/844425097242375
✓ caffe2/test:nn - test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu (test_nn.TestNNDeviceTypeCPU) 0.162 1/1 (passed)
Test output:
> /data/users/jianyuhuang/fbsource/fbcode/buck-out/dev/gen/caffe2/test/nn#binary,link-tree/torch/_utils_internal.py:103: DeprecationWarning: This is a NOOP in python >= 3.7, its just too dangerous with how we write code at facebook. Instead we patch os.fork and multiprocessing which can raise exceptions if a deadlock would happen.
> threadSafeForkRegisterAtFork()
> /usr/local/fbcode/platform007/lib/python3.7/importlib/_bootstrap.py:219: ImportWarning: can't resolve package from __spec__ or __package__, falling back on __name__
and __path__
> return f(*args, **kwds)
> test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu (test_nn.TestNNDeviceTypeCPU) ... Couldn't download test skip set, leaving all tests enabled...
> ok
>
> ----------------------------------------------------------------------
> Ran 1 test in 0.162s
>
> OK
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/844425097242375
Summary (total time 5.54s):
PASS: 1
FAIL: 0
SKIP: 0
FATAL: 0
TIMEOUT: 0
OMIT: 0
Did _not_ run with tpx. See https://fburl.com/tpx for details.
```
Reviewed By: dzhulgakov
Differential Revision: D22801881
fbshipit-source-id: 80a624465727081bb9bf55c28419695a3d79c6e5
Summary:
Reland PR https://github.com/pytorch/pytorch/issues/40056
A new overload of upsample_linear1d_backward_cuda was added in a recent commit, so I had to add the nondeterministic alert to it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41538
Reviewed By: zou3519
Differential Revision: D22608376
Pulled By: ezyang
fbshipit-source-id: 54a2aa127e069197471f1feede6ad8f8dc6a2f82
Summary:
This PR implements a feature extension discussed in https://github.com/pytorch/pytorch/issues/41516.
I followed PR https://github.com/pytorch/pytorch/issues/22245 as a template for adding the new module. While I was at it, I also added the missing `extra_repr()` method to `Flatten`.
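A quick usage sketch, assuming the new module is `nn.Unflatten` (the counterpart of `nn.Flatten` discussed in the linked issue):
```python
import torch
import torch.nn as nn

m = nn.Sequential(nn.Flatten(), nn.Linear(3 * 4, 3 * 4), nn.Unflatten(1, (3, 4)))
x = torch.randn(2, 3, 4)
print(m(x).shape)    # torch.Size([2, 3, 4])
print(nn.Flatten())  # repr now includes start_dim/end_dim via extra_repr()
```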
I see there are no unit tests for these modules. Should I add those too? If so, where would be the best place to put them?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41564
Reviewed By: gchanan
Differential Revision: D22636766
Pulled By: albanD
fbshipit-source-id: f9efdefd3ffe7d9af9482087625344af8f990943
Summary:
This test function is confusing since our `assertEqual` behavior allows for tolerance to be specified, and this is a redundant mechanism.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41514
Reviewed By: ngimel
Differential Revision: D22569348
Pulled By: mruberry
fbshipit-source-id: 2b2ff8aaa9625a51207941dfee8e07786181fe9f
Summary:
BCELoss currently uses different broadcasting semantics than numpy. Since previous versions of PyTorch have thrown a warning in these cases telling the user that input sizes should match, and since the CUDA and CPU results differ when sizes do not match, it makes sense to upgrade the size mismatch warning to an error.
We can consider supporting numpy broadcasting semantics in BCELoss in the future if needed.
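A small sketch of the new behavior (the exact exception type and message are assumptions here):
```python
import torch
import torch.nn as nn

criterion = nn.BCELoss()
pred = torch.rand(4, 1)
target = torch.rand(4, 4)  # broadcastable, but the sizes do not match
try:
    criterion(pred, target)  # previously only a warning; now an error
except (ValueError, RuntimeError) as e:
    print(e)
```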
Closes https://github.com/pytorch/pytorch/issues/40023
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41426
Reviewed By: zou3519
Differential Revision: D22540841
Pulled By: ezyang
fbshipit-source-id: 6c6d94c78fa0ae30ebe385d05a9e3501a42b3652
Summary:
Closes https://github.com/pytorch/pytorch/issues/36977
This avoids the division by zero that was causing NaNs to appear in the output. `AvgPool2d` and `AvgPool3d` both had this issue on CPU and CUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41368
Reviewed By: ailzhang
Differential Revision: D22520013
Pulled By: ezyang
fbshipit-source-id: 3ece7829f858f5bc17c2c1d905266ac510f11194
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39342
Many networks, such as ResNet, have adds followed by ReLUs. This op is the
first step toward enabling a fused implementation.
Once we have the fused add_relu op, a JIT pass will be written to
replace add + relu patterns with add_relu.
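For reference, a sketch of the pattern the future JIT pass would target; the fused op itself is internal, so only the unfused reference composition is shown here:
```python
import torch

def add_relu_reference(a, b):
    # The pattern to be fused: elementwise add followed by relu.
    return torch.relu(a + b)

a, b = torch.randn(4), torch.randn(4)
print(add_relu_reference(a, b))
```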
Test Plan:
python test/test_nn.py TestAddRelu
Imported from OSS
Differential Revision: D21822397
fbshipit-source-id: 03df83a3e46ddb48a90c5a6f755227a7e361a0e8
Summary:
fix https://github.com/pytorch/pytorch/issues/40227
Removed the key-sorting operation from the ModuleDict class and updated the docstring.
Also removed a sort operation in the corresponding unit test, which would otherwise cause the test to fail.
BC Note: from Python 3.7 onward (and in CPython 3.6 as an implementation detail), plain dicts preserve insertion order.
Example:
A Python 3.6+ user who initializes a ModuleDict from the plain Python dict
{
    "b": torch.nn.MaxPool2d(3),
    "a": torch.nn.MaxPool2d(3)
}
gets a ModuleDict that preserves the insertion order:
ModuleDict(
  (b): MaxPool2d(kernel_size=3, stride=3, padding=0, dilation=1, ceil_mode=False)
  (a): MaxPool2d(kernel_size=3, stride=3, padding=0, dilation=1, ceil_mode=False)
)
A Python 3.5 user with the same input could instead get:
ModuleDict(
  (a): MaxPool2d(kernel_size=3, stride=3, padding=0, dilation=1, ceil_mode=False)
  (b): MaxPool2d(kernel_size=3, stride=3, padding=0, dilation=1, ceil_mode=False)
)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40905
Differential Revision: D22357480
Pulled By: albanD
fbshipit-source-id: 0e2502769647bb64f404978243ca1ebe5346d573
Summary:
This PR aims at tackling https://github.com/pytorch/pytorch/issues/37823 by:
- ensuring that buffers are used for the normalization computation but are not updated when the buffers are not None and `track_running_stats=False`
- adding a corresponding unit test to ensure the expected behaviour
Any feedback is welcome!
_Note: we might want to update the docstrings of `BatchNorm*d`, feel free to share any suggestion!_
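A minimal sketch of the scenario in the first bullet above; how the buffers come to exist is an assumption here (tracking is enabled at construction and then turned off):
```python
import torch
import torch.nn as nn

# Construct with tracking enabled so the buffers exist, then turn tracking off.
bn = nn.BatchNorm2d(3)                 # track_running_stats=True by default
bn.track_running_stats = False         # buffers remain, but stop updating them

before = bn.running_mean.clone()
_ = bn(torch.randn(8, 3, 4, 4))        # training-mode forward
print(torch.equal(bn.running_mean, before))  # True: buffers not updated

bn.eval()
_ = bn(torch.randn(8, 3, 4, 4))        # eval still normalizes with the buffers
```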
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38084
Differential Revision: D22047871
Pulled By: ezyang
fbshipit-source-id: 5acbcad9773e7901f26d625db71d43d7dc236d3e
Summary:
This allows registering hooks that will be executed for every module.
This idea arose in a discussion with tkerola, and niboshi kindly proposed this approach.
The use case for this is to avoid boilerplate code when registering the same hook for all the modules in a complex model; the internal use case was to allow every model to accept a NumPy array in the forward pass in a simpler way. Other use cases include general mechanisms for plotting, tracing, and debugging.
Currently, the hooks are shared across all modules, but this could be extended to share hooks only per module type.
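A small usage sketch, assuming the global registration entry point is `torch.nn.modules.module.register_module_forward_hook` (the name is an assumption here, not stated in this summary):
```python
import torch
import torch.nn as nn

def log_shapes(module, inputs, output):
    # Fired after the forward of every module until the handle is removed.
    print(type(module).__name__, output.shape)

handle = nn.modules.module.register_module_forward_hook(log_shapes)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
_ = model(torch.randn(1, 4))
handle.remove()
```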
If this functionality is not needed feel free to close the PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38972
Differential Revision: D22091364
Pulled By: albanD
fbshipit-source-id: 204ff5f9e119eff5bdd9140c64cb5dc467bb23a2
Summary:
This updates assertEqual and assertEqual-like functions to either require both or neither of atol and rtol be specified. This should improve clarity around handling precision in the test suite, and it allows us to remove the legacy positional atol argument from assertEqual. In addition, the "message" kwarg is replaced with a kwarg-only "msg" argument whose name is consistent with unittest's assertEqual argument.
In the future we could make "msg" an optional third positional argument to be more consistent with unittest's assertEqual, but requiring it be specified should be clear, and we can easily update the signature to make "msg" an optional positional argument in the future, too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38872
Differential Revision: D21740237
Pulled By: mruberry
fbshipit-source-id: acbc027aa1d7877a49664d94db9a5fff91a07042
Summary:
Fix https://github.com/pytorch/pytorch/issues/38764
The current problem is that the `top_diff` and `top_mask` pointers are shifted cumulatively within the for-n and for-c loops. This may cause overflow and illegal memory access when the loop counts are greater than one, that is, n > 65535 or c > 65535 (the case in https://github.com/pytorch/pytorch/issues/38764). Since neither n > 65535 nor c > 65535 is common, this has not been seen before. The simple fix is to use new pointer variables for the n & c offsets instead of directly modifying `top_diff` or `top_mask`.
However, I think the current nchw max_pool2d GPU impl still has plenty of room for performance improvement. We can check that in a later PR if needed.
This PR also slightly cleans up the indentation and adds tests that use the CPU impl as a reference check.
cc skrah
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38953
Differential Revision: D21721930
Pulled By: ezyang
fbshipit-source-id: fef7d911d814f8ed9fd67c60cabe5d52f8fd3d57
Summary:
This updates assertEqual and assertEqual-like functions to either require both or neither of atol and rtol be specified. This should improve clarity around handling precision in the test suite, and it allows us to remove the legacy positional atol argument from assertEqual. In addition, the "message" kwarg is replaced with a kwarg-only "msg" argument whose name is consistent with unittest's assertEqual argument.
In the future we could make "msg" an optional third positional argument to be more consistent with unittest's assertEqual, but requiring it be specified should be clear, and we can easily update the signature to make "msg" an optional positional argument in the future, too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38872
Differential Revision: D21717199
Pulled By: mruberry
fbshipit-source-id: 9feb856f94eee911b44f6c7140a1d07c1b026d3a
Summary:
Fixes https://github.com/pytorch/pytorch/issues/38839. Previously, if the magnitude of the input values was large, the `log(sum)` term was essentially ignored when computing `max + log(sum)`. The result is now computed as `x - max - log(sum)`, which has a better chance of preserving accuracy.
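A small sketch of the precision issue in float32 (values chosen purely for illustration):
```python
import torch

x = torch.tensor([1e8, 1e8], dtype=torch.float32)
max_ = x.max()
lse = torch.log(torch.exp(x - max_).sum())
# Old order: 1e8 + 0.6931 rounds back to 1e8 in float32, losing log(sum).
print(x - (max_ + lse))  # tensor([0., 0.])
# New order keeps the small term:
print(x - max_ - lse)    # tensor([-0.6931, -0.6931])
```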
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38945
Differential Revision: D21712483
Pulled By: ngimel
fbshipit-source-id: c1a3599ed981ba7a7fd130cbd7040a706b7eace0
Summary:
CC ezyang xw285cornell sunway513
Commit 59d92e442b (https://github.com/pytorch/pytorch/issues/38557) has caused this test to regularly fail on ROCm CI gfx900 hosts. Skipping test until root cause analysis can complete.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38724
Differential Revision: D21645815
Pulled By: xw285cornell
fbshipit-source-id: 4087e9565710c271ca5c026a5ae0c5132e56f44d
Summary:
Per title.
We move all the individual gradient norms to a single device before stacking (a no-op if all the gradients are already on a single device). `clip_coef` is copied to the device of each gradient, which may be suboptimal as there could be multiple copies, but it is no worse than when we were synchronizing for each parameter. In the simple case where all gradients are on a single device, there should be no synchronization.
Also, we no longer error out if the parameter list is empty or none of the parameters have gradients; we return a total_norm of 0 instead.
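A tiny sketch of the new empty-input behavior:
```python
import torch
from torch.nn.utils import clip_grad_norm_

# No parameters (or no grads): no error, total norm reported as 0.
print(clip_grad_norm_([], max_norm=1.0))  # tensor(0.)

p = torch.nn.Parameter(torch.randn(3))    # has no .grad yet
print(clip_grad_norm_([p], max_norm=1.0)) # tensor(0.)
```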
Fixes https://github.com/pytorch/pytorch/issues/38605
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38615
Reviewed By: ailzhang
Differential Revision: D21634588
Pulled By: ngimel
fbshipit-source-id: ea4d08d4f3445438260052820c7ca285231a156b
Summary:
Edit: this has been updated to reflect the PR's current status, which has changed after review.
This PR updates the behavior of the assertEqual, assertNotEqual, and assert_allclose to be consistent with each other and torch.isclose. It corrects several additional bugs in the current implementations and adds extensive testing and comments, too.
These updates follow from changes to assertEqual like https://github.com/pytorch/pytorch/pull/34258 and https://github.com/pytorch/pytorch/pull/37069, and from our discussion of torch.isclose for complex tensors (see https://github.com/pytorch/pytorch/issues/36462), where we decided to implement a NumPy-compatible mathematical notion of "closeness" for complex tensors that is not a great fit for our testing framework.
The detailed changelist is:
- New test framework functions for comparing tensors and scalars
  - Tensors are compared using isclose; the real and imaginary parts of complex tensors are compared independently
  - Scalars are compared using the same algorithm
  - assertEqual and assert_allclose now use this common comparison function, instead of each implementing their own with divergent behavior
  - assertEqual-like debug messages are now available for all tensor and scalar comparisons, with additional context when comparing the components of sparse, quantized, and complex tensors
  - Extensive testing of the comparison behavior and debug messages
- Small Updates
  - assertEqual now takes an "exact_device" argument, analogous to "exact_dtype", which should be useful in multidevice tests
  - assertEqual now takes an "equal_nan" argument for argument consistency with torch.isclose
  - assertEqual no longer takes the "allow_inf" keyword, which misleadingly only applied to scalar comparisons, was only ever set (rarely) to true, and is not supported by torch.isclose
- Bug fixes:
  - the exact_dtype attribute has been removed (no longer needed after https://github.com/pytorch/pytorch/pull/38103)
  - message arguments passed to assertEqual are now handled correctly
  - bool x other dtype comparisons are now supported
  - uint8 and int8 tensor comparisons now function properly
  - rtol for integer comparisons is now supported (default is zero)
  - rtol and atol for scalar comparisons are now supported
  - complex scalar comparisons are now supported, analogous to complex tensor comparisons
  - assertNotEqual is now equivalent to the logical negation of assertEqual
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37294
Differential Revision: D21596830
Pulled By: mruberry
fbshipit-source-id: f2576669f7113a06f82581fc71883e6b772de19b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35620
Python 2 has reached end-of-life and is no longer supported by PyTorch.
`self.subTest` can be used directly in Python 3.
Test Plan: CI
Differential Revision: D20842872
Pulled By: dreiss
fbshipit-source-id: 6ad42550c01e6959821ff07df767fc14b58c5a9e
Summary:
Add read/write vectorization to non-persistent softmax kernels only. At this point launch logic has minimal changes, and `ILP=vectorization=2` is always used (the code can handle other values, but `ILP=2` has been the most consistent performer).
Dispatch to persistent / non-persistent kernels is unchanged.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36485
Differential Revision: D21477775
Pulled By: ngimel
fbshipit-source-id: 9ff7fd243695d7bbf4121390085b64db0bbdef35
Summary:
Fix https://github.com/pytorch/pytorch/issues/37680
Makes two changes:
- Add `argmin`, `argmax` and `argsort` to the list of non-differentiable functions to prevent them from generating outputs with `requires_grad=True` (see the short sketch after this list).
- Add a check to make sure we don't add such functions to the codegen by mistake.
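A short sketch of the resulting behavior:
```python
import torch

x = torch.randn(5, requires_grad=True)
print(x.argmax().requires_grad)   # False
print(x.argsort().requires_grad)  # False
```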
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37789
Differential Revision: D21389201
Pulled By: albanD
fbshipit-source-id: 6a7617e389e893f6f813d50f02700d32300b1386