Summary:
* Create a private global-scope function _generate_subsequent because static class-attribute member functions are not supported by TorchScript, resulting in TorchScripting errors.
* Make TransformerEncoder and TransformerDecoder consistent w.r.t. is_causal handling by calling _detect_causal_mask
* Clarify documentation that is_causal is a hint
* Move causal mask detection into a method _detect_causal_mask
* only accept input-size compatible causal mask as causal mask
* update _generate_subsequent_causal_mask to include factory kwargs for dtype and device: avoid extra copies and conversions by passing them directly to torch.full (see the sketch after this list)
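A minimal sketch of what such a module-level helper might look like (the exact name and signature in the PR may differ):
```python
import torch

def _generate_subsequent_causal_mask(sz, device=None, dtype=None):
    # Square causal mask: -inf above the diagonal, 0 on and below it.
    # Passing device/dtype straight to torch.full avoids extra copies and conversions.
    return torch.triu(
        torch.full((sz, sz), float("-inf"), device=device, dtype=dtype),
        diagonal=1,
    )
```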
Test Plan: sandcastle & github CICD
Continuation of #101487 (due to a tooling issue) which is a continuation-in-part of https://github.com/pytorch/pytorch/pull/98327 by @janEbert
Differential Revision: D47427117
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105265
Approved by: https://github.com/mikaylagawarecki
Add semantics for creating a buffer object that mirror those for creating a parameter. This is done by introducing a new `Buffer` class that can be used for type disambiguation. The underlying functionality of registering a buffer remains the same, as the `register_buffer` method has not been changed. The `persistent` parameter in the `Buffer` type indicates whether a buffer object should be persistent or not. Other non-test changes have to do with getting the new `Buffer` type recognized by inductor and dynamo. The remaining changes are test changes to make sure that the `Buffer` type can be used as a drop-in replacement for `register_buffer`, as it just leads to `register_buffer` being called. The addition of this new functionality still allows normal tensors to be used as buffers, so these changes are intended to be backwards compatible.
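A minimal sketch of the usage this change enables, assuming the new class is exposed as `torch.nn.Buffer`:
```python
import torch
from torch import nn

class RunningStats(nn.Module):
    def __init__(self):
        super().__init__()
        # Assigning a Buffer is equivalent to calling
        # self.register_buffer("mean", torch.zeros(10), persistent=False)
        self.mean = nn.Buffer(torch.zeros(10), persistent=False)

m = RunningStats()
print("mean" in dict(m.named_buffers()))  # True
print("mean" in m.state_dict())           # False, since persistent=False
```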
Fixes #35735
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104069
Approved by: https://github.com/mikaylagawarecki
This PR is to fix https://github.com/pytorch/pytorch/issues/101935.
Only when the input, parameters, and hidden states are all on the CPU device will LSTM go into the oneDNN fast-path implementation. Otherwise, it will fall back to the original implementation.
Note that if the input and parameters are indeed not on the same device, the error `Input and parameter tensors are not at the same device, found input tensor......` will be raised in `check_attributes`. Therefore, the proper usage of LSTM is to call `input.to(device)` and `model.to(device)` together.
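A minimal sketch of the intended usage (module and input moved to the same device; sizes are illustrative):
```python
import torch
from torch import nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=2).to(device)
x = torch.randn(5, 3, 10, device=device)  # move the input together with the model

output, (h_n, c_n) = lstm(x)  # CPU-only inputs take the oneDNN fast path
```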
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102050
Approved by: https://github.com/XiaobingSuper, https://github.com/albanD
Enable more tests on ASAN. Meanwhile, we disable float-divide-by-zero and float-cast-overflow; both are disabled because they are also disabled by default in the latest Clang.
The following excerpt from the Clang documentation explains the reasons.
```
-fsanitize=float-cast-overflow: Conversion to, from, or between floating-point types
which would overflow the destination. Because the range of representable values
for all floating-point types supported by Clang is [-inf, +inf], the only cases detected are
conversions from floating point to integer types.
-fsanitize=float-divide-by-zero: Floating point division by zero.
This is undefined per the C and C++ standards,
but is defined by Clang (and by ISO/IEC/IEEE 60559 / IEEE 754) as producing
either an infinity or NaN value,
so is not included in -fsanitize=undefined.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103647
Approved by: https://github.com/kit1980
Fixes #64601 and #98906
Adds an `assign` argument to `load_state_dict` that loads params/buffers by assignment instead of doing `param.copy_(param_from_state_dict)`.
Primarily intended to remove the need for the `.to_empty()` in
```
with torch.device('meta'):
    m = SomeModule()
m.to_empty()
state_dict = torch.load('...pth')
m.load_state_dict(state_dict)
```
so we can instead do
```
with torch.device('meta'):
    m = SomeModule()
state_dict = torch.load('...pth')
m.load_state_dict(state_dict, assign=True)
```
**An open problem with this PR, for the case where the model is initialized on meta: what happens to nonpersistent buffers/params corresponding to keys missing from the state dict?**
What happens when `load_state_dict(state_dict, strict=False, assign=True)` is called and the state_dict is missing some keys? The params missing from the `state_dict` and the nonpersistent buffers would still be on `meta` and would need to be manually initialized. However, I don't think we offer an API that would initialize these.
One solution would be to make these empty tensors, but it might not be semantically correct...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102212
Approved by: https://github.com/albanD
## Description
This is a bug fix for rare cases that can happen with a specific scale and antialias=False, where the output for a particular line can be wrong. For example:
```
line 14
output uint8: [76, 78, 80, 81, 83, 85, 87, 88, 90]
expected float: [149, 152, 155, 158, 161, 164, 167, 170, 173]
diff: [-73, -74, -75, -77, -78, -79, -80, -82, -83]
opencv ref: [149 152 155 158 161 164 167 170 173]
```
It appears that for this line we have 3 weight coefficients instead of 2:
```
line 13 | 351, 2
k: 1130 15254
line 14 | 378, 3
k: 0 16384 -6780 <------- We should have 2 weights and not 3
line 15 | 432, 2
k: 15254 1130
```
which comes from our `_compute_weights_aa` function that is specifically used for AA=False and uint8.
```
xmin = std::max(
    static_cast<int64_t>(center - support + 0.5 + align_corners_delta), static_cast<int64_t>(0));
xsize = std::min(
    static_cast<int64_t>(center + support + 0.5 + align_corners_delta), input_size) - xmin;
```
```
center - support + 0.5 + align_corners_delta: 14.999999999999998
static_cast<int64_t>(center - support + 0.5 + align_corners_delta): 14
xmin -> 14
center + support + 0.5 + align_corners_delta: 17.0
static_cast<int64_t>(center + support + 0.5 + align_corners_delta): 17
xsize -> 17 - 14 = 3 <------ 3 instead of 2
```
For the float dtype, AA=False weights and indices are computed differently, because that code path was implemented first.
In any case, `xsize` should not be larger than `max_interp_size`, so we decided to clip `xsize`.
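A Python paraphrase of the bound computation above, with the clip applied (the function name `_aa_false_bounds` is hypothetical; this is a sketch, not the actual kernel code):
```python
def _aa_false_bounds(center, support, input_size, align_corners_delta, max_interp_size):
    # Same arithmetic as the C++ snippet above; int() truncates like static_cast<int64_t>.
    xmin = max(int(center - support + 0.5 + align_corners_delta), 0)
    xsize = min(int(center + support + 0.5 + align_corners_delta), input_size) - xmin
    # The fix: never produce more weights than max_interp_size allows.
    xsize = min(xsize, max_interp_size)
    return xmin, xsize
```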
Once fixed, the computed indices and weights are the same as for the float dtype code path:
```
# Option: xsize = min(xsize, max_interp_size)
Line Num | xmin, xsize
14 | 378, 2 xmin=378 <---> xmin = i * stride = i * 3 * 9 => i = 14
k: 0 16384 16384 = w * (1 << 14) => w = 1.0
=> i=14, w=0 and i=15, w=1
```
vs
```
Line Num | index0, index1
F32: 14 | 15, 16
F32: lambda0, lambda1: 0.999999, 9.53674e-07
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101403
Approved by: https://github.com/NicolasHug
### Description
This PR is to fix #99413, which shows the limitation of double backward when using oneDNN in LSTM.
This PR does not implement double backward function itself, because that is pretty hard to spell out. Instead, it implements mkldnn_rnn_layer_backward using differentiable operations, so that double backward can be done automatically.
During the backward pass, it needs the gates and hidden states between cells within one layer. However, these intermediate variables are stored in the `workspace`, and it is hard to extract them. Therefore, in backward, we need to re-calculate them first.
A corresponding UT has been added based on the failing case in #99413. The existing UT with gradcheck and gradgradcheck added in https://github.com/pytorch/pytorch/pull/26660 cannot test LSTM using oneDNN, because that UT only supports the `double` datatype, which oneDNN does not support.
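A minimal sketch of the double-backward pattern this fix enables on the CPU path (module and input sizes are illustrative):
```python
import torch
from torch import nn

lstm = nn.LSTM(input_size=4, hidden_size=4, num_layers=1)
x = torch.randn(3, 2, 4, requires_grad=True)

out, _ = lstm(x)
# First-order gradient with a graph, then backprop through it (double backward).
(grad_x,) = torch.autograd.grad(out.sum(), x, create_graph=True)
grad_x.sum().backward()
```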
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100627
Approved by: https://github.com/jgong5, https://github.com/soulitzer
Description:
- Fixed a memory-format bug:
When the input is a channels-last 4D tensor produced as follows:
```
t = torch.ones(1, 3, 32, 32).contiguous(memory_format=torch.channels_last)
t = t[0]
t = t[None, ...]
```
upsampling will produce output with channels-first memory format, but our AVX code does not take that into account.
Here is repro code showing that the nightly is broken for this particular case:
```python
import torch
torch.manual_seed(0)
input = torch.randint(0, 256, size=(1, 3, 256, 256), dtype=torch.uint8).contiguous(memory_format=torch.channels_last)
input = input[0]
input = input[None, ...]
assert input.is_contiguous(memory_format=torch.channels_last)
output = torch.nn.functional.interpolate(input, (224, 224), mode="bilinear", antialias=True)
expected = torch.nn.functional.interpolate(input.float(), (224, 224), mode="bilinear", antialias=True)
assert output.is_contiguous()
assert expected.is_contiguous()
torch.testing.assert_close(expected, output.float(), atol=1, rtol=1)
# >
# Traceback (most recent call last):
# File "<stdin>", line 1, in <module>
# File "/pytorch/torch/testing/_comparison.py", line 1511, in assert_close
# raise error_metas[0].to_error(msg)
# AssertionError: Tensor-likes are not close!
#
# Mismatched elements: 14120 / 150528 (9.4%)
# Greatest absolute difference: 214.6112518310547 at index (0, 1, 152, 13) (up to 1 allowed)
# Greatest relative difference: 17.005144119262695 at index (0, 2, 26, 2) (up to 1 allowed)
```
- Also renamed `needs_unpacking` to `skip_unpacking`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100258
Approved by: https://github.com/NicolasHug
Issue: #93684
# Problem
Reduce graph breaks when dynamo compiles python functions containing numpy functions and ndarray operations.
# Design (as I know it)
* Use torch_np.ndarray (a wrapper around a tensor) to back a `VariableTracker`: `NumpyTensorVariable`.
* Translate all attribute and method calls on ndarray to their torch_np.ndarray equivalents.
This PR adds `NumpyTensorVariable` and supports (see the sketch after this list):
1. tensor to ndarray, ndarray to tensor
2. numpy functions such as numpy.meshgrid()
3. ndarray attributes such as `itemsize`, `stride`
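A sketch of the kind of mixed NumPy/tensor function this work targets (the function name is illustrative; whether it traces without graph breaks depends on how far the support has progressed, but it runs correctly either way):
```python
import numpy as np
import torch

@torch.compile
def mix(t):
    a = t.numpy()               # tensor -> ndarray
    b = a * 2 + a.itemsize      # numpy arithmetic and ndarray attribute access
    return torch.from_numpy(b)  # ndarray -> tensor

print(mix(torch.arange(3)))
```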
The next PR will handle returning `np.ndarray` and add support for ndarray methods.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95849
Approved by: https://github.com/ezyang
Some modules, like lazyModule, may override `_save_to_state_dict()`; in this case, the pre-state-dict hook will not be called. So move the pre-state-dict hook out of `_save_to_state_dict()` to make sure the pre-hook is called.
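A minimal sketch of a state-dict pre-hook, which should now fire even for modules that override `_save_to_state_dict()` (assuming the public `register_state_dict_pre_hook` API with a `(module, prefix, keep_vars)` hook signature):
```python
import torch
from torch import nn

def log_prefix(module, prefix, keep_vars):
    # Runs before the module serializes its own entries into the state dict.
    print(f"about to serialize {type(module).__name__} under prefix '{prefix}'")

m = nn.Linear(2, 2)
m.register_state_dict_pre_hook(log_prefix)
sd = m.state_dict()
```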
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98964
Approved by: https://github.com/albanD
Fixes #99148, raising an error if output_ratio's size > 2.
Justification for changes:
If an output size is not specified but an output ratio is, we call fractional_max_pool2d_with_indices. We then generate the value of output_size based on the first two elements of the output_ratio (around line 480 of torch/nn/functional.py).
Thus, we should raise a value error in the case that the user passes an output_ratio (instead of an output_size) and the number of elements in output_ratio exceeds two. We must raise an error before calling torch._C._nn.fractional_max_pool2d, as the value of output_size passed into torch._C._nn.fractional_max_pool2d is guaranteed to be of size 2 (the existing code generates it from the first two indices of the passed-in ratio).
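A minimal sketch of the behavior described above (the exact error type and message may differ):
```python
import torch
from torch import nn

x = torch.randn(1, 3, 16, 16)

pool = nn.FractionalMaxPool2d(kernel_size=2, output_ratio=(0.5, 0.5))  # two elements: OK
y = pool(x)

# With this change, an output_ratio with more than two elements should raise an error
# instead of being silently truncated to its first two entries:
# nn.FractionalMaxPool2d(kernel_size=2, output_ratio=(0.5, 0.5, 0.5))(x)
```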
I would be happy to iterate on this if there are any issues.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99507
Approved by: https://github.com/mikaylagawarecki
## BC-breaking note:
This is technically a bugfix. Prior to this PR, for `torch.nn.functional.grid_sample(mode='nearest')` the 2D kernel used `std::nearbyint` whereas the 3D kernel used `std::round` in order to determine the nearest pixel locations after un-normalization of the grid. This PR fixes the 3D kernel to use `std::nearbyint`, which rounds values that are exactly `<>.5` to the nearest even, consistent with the behavior of `torch.round`. Unnormalized indices that are exactly `<>.5` will now be rounded to the nearest even instead of being rounded away from 0.
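For reference, a quick illustration of the round-half-to-even behavior of `torch.round` that the 3D kernel now matches:
```python
import torch

# Values exactly at <>.5 round to the nearest even integer, not away from zero.
print(torch.round(torch.tensor([0.5, 1.5, 2.5, 3.5])))  # tensor([0., 2., 2., 4.])
```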
## Description
In the nearest neighbor interpolation mode, the 2D GridSample rounds the index to the nearest even using [std::nearbyint](https://github.com/pytorch/pytorch/blob/v2.0.0/aten/src/ATen/native/cpu/zmath.h#L182) whereas the 3D GridSample rounds the index away from zero using std::round. This discrepancy needs to be resolved. We are making both 2D GridSample and 3D GridSample round to the nearest even.
## Unit Test Goals
1. Make sure the x dimension and y dimension rounding behaviors are the same for 2D GridSample.
2. ~~Make sure the 2D GridSample rounding mode is rounding to the nearest even.~~
3. Make sure the x dimension, y dimension, and z dimension rounding behaviors are the same for 3D GridSample.
4. ~~Make sure the 3D GridSample rounding mode is rounding to the nearest even.~~
5. The 2D GridSample and 3D GridSample rounding behaviors are exactly the same.
After some experiments, I found 2 and 4 are difficult to achieve. Even though I can compute the normalized coordinates corresponding to the unnormalized coordinates including [0, 0.5, 1.0, 1.5, 2.0, 2.5, ..., 10.0], the unnormalization process in the GridSample implementations always has a small chance of introducing floating-point error. Therefore, it is not possible to unit test the rounding mode from the normalized coordinates.
## Unit Test Methods
The unit test is simple: using the same values along the dimension to be tested in the input tensor and the same normalized indices in the grid tensor, interpolation along the 2D GridSample x-dimension, 2D GridSample y-dimension, 3D GridSample x-dimension, 3D GridSample y-dimension, and 3D GridSample z-dimension should produce exactly the same interpolated values.
If one CPU/CUDA 2D/3D implementation uses a different rounding mode from the others, the unit test shall fail.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97000
Approved by: https://github.com/mikaylagawarecki
Summary:
This fixes an issue raised in [is_causal parameter in torch.nn.TransformerEncoderLayer.forward does not work #96941](https://github.com/pytorch/pytorch/issues/96941) where results computed with is_causal do not properly reflect causal masking.
In PyTorch 2.0, Accelerated PT Transformers added the is_causal parameter to the legacy nn.Transformer* and nn.MHA APIs, aligned with and intended to engage the is_causal parameter of the new scaled_dot_product_attention (SDPA) operator.
At present, is_causal works differently for the Transformer* modules, nn.MHA, and F.MHA:
* The nn.Transformer* modules treat is_causal as an optional indicator about the format of attn_mask. This is because some layers (such as the CLIP layer) use the attention mask inside the layer, and thus attn_mask was a required feature.
* Initially, nn.MHA and F.MHA were defined to align with F.SDPA in behavior: a user may specify either the attention mask or is_causal, but not both. It seemed to make sense at the time to align SDPA and MHA, especially since there was a larger overlap of parameters, which have since changed, e.g., with the removal of need_weights from SDPA. (See below for why this makes sense.)
Unfortunately, this does not work because of how MHA was changed to handle the need_weights parameter. When need_weights is present, we no longer call SDPA, because support for need_weights was removed from SDPA before the release. The rationale is that need_weights defeats all of the optimizations at the foundation of SDPA performance. Having the flag might thus mislead users into thinking they get good performance, only to be disappointed when they enable a legacy feature of MHA which massively degrades performance. (They might not think anything of enabling it, because it is on by default in MHA today, which leads to more issues.)
Since SDPA no longer supports need_weights, we need to pick a separate path which implements attention using a set of discrete operations and allocates a tensor for the weights. Alas, this code path does not support is_causal, because attention is implemented as matmul using the attention mask. Thus, is_causal has no impact. (A substantially similar situation arises with how key_padding_mask is handled today, because Nested Tensors are not supported by torch.compile() in 2.0.)
This problem was masked because all uses of legacy nn.MHA (and F.MHA) come through nn.Transformer*, which called self-attention (i.e., nn.MHA) only ever with the attention mask attn_mask, and never with is_causal, a missed optimization opportunity that would have been addressed in a future performance update.
Regrettably, always calling nn.MHA with attn_mask prevented diagnosing the issue of not having a suitable attention mask when need_weights support was dropped from SDPA and a discrete implementation of attention was added for that scenario, as well as for the execution path with key_padding_mask.
We have two options to address this issue:
Solution 1: Whenever nn.MHA and F.MHA are executed with is_causal set, we internally create a causal mask at significant expense of allocating a tensor and filling it with a triangular causal matrix. This increases memory usage, and runtime, for allocating a causal mask. To add insult to injury, in all current (and likely future) execution scenarios, MHA is called by a model using the nn.Transformer API which already has that matrix and passes it from nn.module to nn.module. Then the passing in of attn_mask has to be suppressed by nn.TransformerEncoderLayer, only for nn.MHA to immediately allocate the very same tensor again to satisfy the requirement to have an attention mask for the computation. (We expect new use cases to use SDPA directly.)
Solution 2: We align the behavior of nn.MHA and F.MHA with the rest of the existing nn.Transformer API, and require the attention mask to be passed into nn.MHA in addition to is_causal as an optional indicator about the nature of the attention mask rather than as an alternative to attn_mask. Then, when we choose the code path for processing MHA with need_weights or a key_padding_mask, we have the attn_mask passed down through the nn.Transformer* hierarchy, without the added overhead of allocating an attention mask as in scenario 1.
This PR implements solution 2 which offers better performance and in retrospect aligns MHA better with the rest of the Transformer modules as the definition of SDPA evolved into a more streamlined high-performance operator. It ostensibly changes how is_causal works, by requiring the attention mask to be specified. However, as described here, and as shown in the submitted issue, is_causal is not working as intended today, so it requires a change regardless.
In that sense, a change in API does not occur per se, as the current implementation is not working, and a change has to occur either way to resolve the submitted issue, breaking any use cases that depend on the current implementation. Checks exist (and more can be added) that flag any scenario where is_causal is passed as True but no attention mask is provided, ensuring that there is no silent change from even the faulty behavior present in 2.0.
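A sketch of the calling convention under solution 2 (is_causal is a hint about attn_mask, which still has to be supplied; sizes are illustrative):
```python
import torch
from torch import nn

mha = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)
x = torch.randn(2, 5, 8)

# The causal mask is built once and passed down; is_causal merely asserts its nature.
mask = nn.Transformer.generate_square_subsequent_mask(5)
out, _ = mha(x, x, x, attn_mask=mask, is_causal=True, need_weights=False)
```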
As an upside, the present implementation will improve performance by addressing the passing of the is_causal flag from Transformer modules to MHA, speeding up training for these examples, e.g., finetuning BERT, RoBERTa, XLM-R models.
Differential Revision: D44245725
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97214
Approved by: https://github.com/albanD
Fixes #96813.
Comments:
1. Wasn't able to test since tools/nightly.py does not allow for GPU build (and I don't want to build from scratch).
2. In theory, the bug (i.e. NaNs) can still occur when beta is very small (e.g. `beta=1e-50`), but I am not sure whether anybody cares (see the sketch after this list).
3. Some checks within the smooth_l1_loss C++ code could be changed to check for `beta > 0` instead of `beta >= 0`.
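A minimal sketch of the small/zero-beta regime the fix concerns (the precise reproducer is in #96813; the tensor names and shapes here are only illustrative of the call pattern):
```python
import torch
import torch.nn.functional as F

pred = torch.zeros(3, requires_grad=True)
target = torch.zeros(3)

# Degenerate case: beta == 0 and pred == target; gradients should be finite, not NaN.
loss = F.smooth_l1_loss(pred, target, beta=0.0)
loss.backward()
print(pred.grad)
```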
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97022
Approved by: https://github.com/jbschlosser
Fixes #96429
This PR is also a follow-up to #90427. In that PR, we also discussed whether the grid index calculations in `grid_sampler_compute_source_index` should also be upcast to `opmath_t` (https://github.com/pytorch/pytorch/pull/90427/files#r1048876708). Due to another unit test failure, we didn't upcast those calculations in that PR.
After some investigation, I found that the inaccurate results have nothing to do with the internals of `affine_grid`, even if it is calculated using `double` internally. As long as the input `grid` is passed to `grid_sample` in **half** precision, the results will be less inaccurate than with a **float** `grid`. This can be verified with a short C++ program like the one below (by setting `TYPE_T` to `__half` or `float` at compile time):
```cpp
#include <cuda.h>
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <iostream>
#include <cstdio>  // for printf
#ifndef TYPE_T
#define TYPE_T float
#endif
int main() {
    using type_t = TYPE_T;
    type_t d = static_cast<__half>((double)2.0 / 3.0);
    type_t s = (((float)d + 1.f) * 3 - 1) / 2;
    printf("%.15f %.15f\n", (double)d, (double)s);
}
```
Outputs are
```
./float.out
0.666503906250000 1.999755859375000
./half.out
0.666503906250000 2.000000000000000
```
To resolve the discussion back in https://github.com/pytorch/pytorch/pull/90427/files#r1048876708, I've also increased the test tolerance in the failed unit test `issue_24823_1(torch.half)`.
For the original script in #96429, I got more accurate results with `align_corners = True`
```
align_corners = True
Expected result has mean absolute value of 0.5285 and maximum absolute value of 3.2067.
Half precision result is off by 0.0001 (0.02%) on average and 0.0010 (0.03%) at maximum.
align_corners = False
Expected result has mean absolute value of 0.5189 and maximum absolute value of 3.0101.
Half precision result is off by 0.0001 (0.02%) on average and 0.0010 (0.03%) at maximum.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96586
Approved by: https://github.com/ngimel
Fixes#88951
The output shape of upsample is computed through `(i64)idim * (double)scale` and then cast back to `i64`. If the input scale is ill-formed (say, a negative number as in #88951), making `(double)(idim * scale)` fall out of the range of `i64`, the cast is undefined behaviour.
To fix it, we just check whether `(double)(idim * scale)` can fit into `i64`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94290
Approved by: https://github.com/malfet
Optimize away unnecessary collection casts (unnecessary calls to list, tuple, and dict) and simplify calls to the sorted builtin. This should strictly improve speed and readability.
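A couple of illustrative (hypothetical) examples of the kind of simplification applied:
```python
data = {"b": 2, "a": 1}

# Before: redundant list() call inside sorted()
keys = sorted(list(data.keys()))

# After: equivalent, with one fewer intermediate list
keys = sorted(data)

# Before: casting an already-materialized list to list again
values = list([v for v in data.values()])

# After
values = [v for v in data.values()]
```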
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94323
Approved by: https://github.com/albanD
Attempts to fix #92656
BC-breaking! This changes the default of zero_grad in optim and in nn so that grads are set to None instead of zero tensors. We are changing the default because there are proven perf wins, and existing code has typically not regressed due to this change. (This note will probably have to be fleshed out more.)
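A minimal sketch of the new default and how to opt back into the old behavior:
```python
import torch
from torch import nn

model = nn.Linear(2, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

model(torch.randn(1, 2)).sum().backward()
opt.zero_grad()            # new default: grads become None
print(model.weight.grad)   # None

model(torch.randn(1, 2)).sum().backward()
opt.zero_grad(set_to_none=False)  # old behavior: grads become zero-filled tensors
print(model.weight.grad)          # tensor of zeros
```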
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92731
Approved by: https://github.com/ngimel