Summary:
- Added 2D convolution NHWC support on ROCm 4.3, enabled with the `PYTORCH_MIOPEN_SUGGEST_NHWC=1` flag
- May need to force MIOpen to search for solutions (see the examples below for the relevant flags)
**PYTORCH_MIOPEN_SUGGEST_NHWC Environment Flag**
MIOpen does not officially support NHWC yet, although convolution support has been added to tip-of-tree of MIOpen. This flag is intended to be a short-lived flag to explicitly turn on NHWC support until ROCm officially supports NHWC and performance is verified.
**Examples**
1. Example usage 1 : Run test on ROCm4.3
`PYTORCH_TEST_WITH_ROCM=1 PYTORCH_MIOPEN_SUGGEST_NHWC=1 MIOPEN_FIND_ENFORCE=4 MIOPEN_DEBUG_CONV_GEMM=0 MIOPEN_FIND_MODE=1 pytest test_nn.py -v -k "test_conv_cudnn_nhwc" `
2. Example usage 2: Run the following with `PYTORCH_MIOPEN_SUGGEST_NHWC=1` on ROCm4.3.
```
#!/usr/bin/env python3
import torch
model = torch.nn.Conv2d(8, 4, 3).cuda().half()
model = model.to(memory_format=torch.channels_last)
input = torch.randint(1, 10, (2, 8, 4, 4), dtype=torch.float32, requires_grad=True)
input = input.to(device="cuda", memory_format=torch.channels_last, dtype=torch.float16)
# should print True for is_contiguous(channels_last), and strides must match NHWC format
print(input.is_contiguous(memory_format=torch.channels_last), input.shape, input.stride() )
out = model(input)
# should print True for is_contiguous(channels_last), and strides must match NHWC format
print("Contiguous channel last :", out.is_contiguous(memory_format=torch.channels_last), " out shape :", out.shape, "out stride :", out.stride() )
```
See https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html for more examples.
cc jeffdaily sunway513 jithunnair-amd ROCmSupport
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63617
Reviewed By: saketh-are
Differential Revision: D30730800
Pulled By: ezyang
fbshipit-source-id: 61906a0f30be8299e6547d312ae6ac91cc7c3238
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63554
Following https://github.com/pytorch/pytorch/pull/61840#issuecomment-884087809, this deprecates all the dtype getters publicly exposed in the `torch.testing` namespace. The reason for this is twofold:
1. If someone is not familiar with the C++ dispatch macros PyTorch uses, the names are misleading. For example `torch.testing.floating_types()` will only give you `float32` and `float64` skipping `float16` and `bfloat16`.
2. The dtype getters provide very minimal functionality that can be easily emulated by downstream libraries.
We thought about [providing a replacement](https://gist.github.com/pmeier/3dfd2e105842ad0de4505068a1a0270a), but ultimately decided against it. The major problem is BC: if we keep it, either the namespace gets messy again whenever a new dtype is added, or we need to somehow version the return values of the getters.
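For illustration, a hypothetical downstream emulation of the getters (the names here are made up for this sketch, not a torch API):

```python
import torch

# A plain tuple is enough. Adding a new dtype later is then an explicit,
# versioned change in the downstream library rather than in torch.testing,
# and the names can be as precise as the library wants.
FLOATING_TYPES = (torch.float32, torch.float64)
FLOATING_TYPES_AND_HALF = FLOATING_TYPES + (torch.float16, torch.bfloat16)

for dtype in FLOATING_TYPES_AND_HALF:
    # e.g. parametrize a test over all floating dtypes
    assert torch.ones(3, dtype=dtype).sum().item() == 3
```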
Test Plan: Imported from OSS
Reviewed By: H-Huang
Differential Revision: D30662206
Pulled By: mruberry
fbshipit-source-id: a2bdb10ab02ae665df1b5b76e8afa9af043bbf56
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64385
It was deleted in https://github.com/pytorch/pytorch/pull/63276.
The numerics test was meant to check LayerNorm behavior on large inputs,
but we deleted it without realizing that.
Test Plan: - wait for tests.
Reviewed By: ngimel
Differential Revision: D30702950
Pulled By: zou3519
fbshipit-source-id: a480e26c45ec38fb628938b70416cdb22d976a46
Summary:
Implements an orthogonal / unitary parametrisation.
It passes the tests and I have trained a couple of models with this implementation, so I believe it should be somewhat correct. That said, the implementation is very subtle. I'm tagging nikitaved and IvanYashchuk as reviewers in case they have comments or see some room for optimisation of the code, in particular of the `forward` function.
Fixes https://github.com/pytorch/pytorch/issues/42243
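A short usage sketch, assuming the parametrisation is exposed as `torch.nn.utils.parametrizations.orthogonal` (the entry point this PR adds):

```python
import torch

# Register the orthogonal parametrisation on a Linear layer's weight.
m = torch.nn.utils.parametrizations.orthogonal(torch.nn.Linear(4, 4))

# The weight is now computed through the parametrisation and is orthogonal.
w = m.weight
assert torch.allclose(w @ w.t(), torch.eye(4), atol=1e-4)
```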
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62089
Reviewed By: ezyang
Differential Revision: D30639063
Pulled By: albanD
fbshipit-source-id: 988664f333ac7a75ce71ba44c8d77b986dff2fe6
Summary:
Fixes https://github.com/pytorch/pytorch/issues/64039
There are two distinct problems here.
1. If `grad_output` is channels last but the input is not, then the input would be read as if it were channels last, i.e. the wrong values would be read.
2. `use_channels_last_kernels` doesn't guarantee that `suggest_memory_format` will actually return channels last, so use `empty_like` instead so the strides always match.
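A minimal CPU illustration of the second point, assuming the fix relies on `empty_like`'s default `preserve_format` behavior:

```python
import torch

grad_output = torch.randn(2, 8, 4, 4).to(memory_format=torch.channels_last)

# empty_like (default memory_format=torch.preserve_format) copies
# grad_output's exact strides, so a kernel that indexes both tensors with
# one set of strides reads and writes matching locations.
grad_input = torch.empty_like(grad_output)
assert grad_input.stride() == grad_output.stride()
```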
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64100
Reviewed By: mruberry
Differential Revision: D30622127
Pulled By: ngimel
fbshipit-source-id: e28cc57215596817f1432fcdd6c49d69acfedcf2
Summary:
I think the original intention here was for this to only take effect in the align_corners case (because output_size = 1 makes the divisor 0), but it affects the non-align_corners case too. For example:
```python
import numpy as np
import torch

# bilinear interpolation requires a floating-point input
input = torch.tensor(
    np.arange(1, 5, dtype=np.float32).reshape((1, 1, 2, 2)))
m = torch.nn.Upsample(scale_factor=0.5, mode="bilinear")
of_out = m(input)
```
The expected result is `[[[[2.5]]]]`, but PyTorch returns `[[[[1.0]]]]`, which differs from OpenCV and PIL. This PR fixes that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61166
Reviewed By: malfet
Differential Revision: D30543178
Pulled By: heitorschueroff
fbshipit-source-id: 21a4035483981986b0ae4a401ef0efbc565ccaf1
Summary:
Fixes https://github.com/pytorch/pytorch/issues/62094
Introduces functionality for adding arbitrary objects to module state_dicts. To take advantage of this, the following functions can be defined on a module:
* `get_extra_state(self) -> dict` - Returns a dict defining any extra state this module wants to save
* `set_extra_state(self, state)` - Subsumes the given state within the module
Under the hood, a sub-dictionary is stored in the state_dict under the key `_extra_state` for each module that provides extra state.
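A minimal sketch of the mechanism (the module name and the stored fields are illustrative; only the `_extra_state` key is part of the design):

```python
import torch

class WithExtraState(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(2, 2)
        self.note = "hello"  # arbitrary non-tensor state to persist

    def get_extra_state(self):
        # Returned dict is saved under the "_extra_state" key
        return {"note": self.note}

    def set_extra_state(self, state):
        # Called by load_state_dict with the saved dict
        self.note = state["note"]

m = WithExtraState()
sd = m.state_dict()
assert "_extra_state" in sd

m2 = WithExtraState()
m2.note = "changed"
m2.load_state_dict(sd)
assert m2.note == "hello"
```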
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62976
Reviewed By: heitorschueroff
Differential Revision: D30518657
Pulled By: jbschlosser
fbshipit-source-id: 5fb35ab8e3d36f35e3e96dcd4498f8c917d1f386
Summary:
Interestingly enough, the original code did have a mechanism that aims to prevent this very issue, but it performs the clone AFTER modifying u and v in-place.
That doesn't work, because we can later use the cloned u and v in operations that save them for backward, and the next time we execute forward, we modify those same cloned u and v in-place.
So if the idea is to avoid modifying a saved variable in-place, we should clone it BEFORE the in-place operation.
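A simplified sketch of the ordering principle (not the actual spectral-norm code; `u` stands in for the persistent power-iteration buffer):

```python
import torch

w = torch.randn(3, requires_grad=True)
u = torch.randn(3)  # persistent buffer updated on every forward

# Safe ordering: clone BEFORE the in-place update, so the tensor this
# forward's graph saves for backward is never the one a later forward
# mutates. Cloning after the update would hand the graph a tensor that
# the next forward modifies in-place.
u_local = u.clone()
u_local.div_(u_local.norm())  # in-place update touches only the fresh clone
out = (w * u_local).sum()     # u_local is saved for backward here
out.backward()                # no "modified by an inplace operation" error
```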
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62293
Reviewed By: bdhirsh
Differential Revision: D30489750
Pulled By: soulitzer
fbshipit-source-id: cbe8dea885aef97adda8481f7a822e5bd91f7889
Summary:
As discussed in https://github.com/pytorch/pytorch/pull/62897, the BF16/non-last-dim Softmax path is missing the subtraction of the max value, which causes overflow in the `exp()` calculation when the input tensor's values are large, such as `1000.0`.
To avoid this issue, this PR adds the max-value subtraction and the corresponding test cases.
Note that without the max-value subtraction (e.g. after an accidental revert), the test case fails with the following error message:
```
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=0.05 and atol=0.05, found 103984 element(s) (out of 126720) whose difference(s) exceeded the margin of error (including 103984 nan comparisons). The greatest difference was nan (0.0 vs. nan), which occurred at index (0, 0, 0, 1).
```
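For reference, the max-subtraction trick in plain PyTorch (a generic sketch, not the vectorized bf16 kernel this PR touches):

```python
import torch

def stable_softmax(x, dim):
    # Subtracting the per-slice max leaves the result unchanged
    # mathematically, but ensures exp(x - max) <= 1, so large inputs
    # (e.g. 1000.0) no longer overflow to inf and produce nan.
    x = x - x.max(dim=dim, keepdim=True).values
    e = x.exp()
    return e / e.sum(dim=dim, keepdim=True)

x = torch.full((2, 3), 1000.0)
out = stable_softmax(x, dim=0)
assert torch.isfinite(out).all()
```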
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63132
Reviewed By: VitalyFedyunin
Differential Revision: D30280792
Pulled By: cpuhrsch
fbshipit-source-id: 722821debf983bbb4fec878975fa8a4da0d1d866
Summary:
This PR fixes part of https://github.com/pytorch/pytorch/issues/12013, which is summarized concretely in https://github.com/pytorch/pytorch/issues/38115.
It allows `MaxPool` and `AdaptiveMaxPool` to accept tensors whose batch size is 0. Some changes have been made to modernize the tests so that they show the name of the C++ function that throws an error.
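A quick sketch of what the change permits:

```python
import torch

x = torch.randn(0, 3, 8, 8)  # batch size 0

# MaxPool now passes the empty batch through with the pooled spatial dims
pool = torch.nn.MaxPool2d(2)
out = pool(x)
assert out.shape == (0, 3, 4, 4)

# AdaptiveMaxPool likewise
apool = torch.nn.AdaptiveMaxPool2d(1)
assert apool(x).shape == (0, 3, 1, 1)
```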
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62088
Reviewed By: bdhirsh
Differential Revision: D30281285
Pulled By: jbschlosser
fbshipit-source-id: 52bffc67bfe45a78e11e4706b62cce1469eba1b9
Summary: Skip the ROCm test for test_cudnn_convolution_relu
Test Plan: This skips a test
Reviewed By: ngimel
Differential Revision: D30233620
fbshipit-source-id: 31eab8b03c3f15674e0d262a8f55965c1aa6b809
Summary:
Currently, when `cudnn_convolution_relu` is passed a channels-last Tensor, it returns a contiguous Tensor. This PR changes that behavior and bases the output format on the input format.
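Regular convolutions already propagate the input's memory format this way; a minimal CPU sketch of the expected behavior (`cudnn_convolution_relu` itself is an internal CUDA-only op, so `conv2d` is used as a stand-in here):

```python
import torch

x = torch.randn(2, 8, 8, 8).to(memory_format=torch.channels_last)
w = torch.randn(4, 8, 3, 3)

out = torch.nn.functional.conv2d(x, w)
# the output memory format follows the input format
assert out.is_contiguous(memory_format=torch.channels_last)
```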
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62482
Reviewed By: ngimel
Differential Revision: D30049905
Pulled By: cpuhrsch
fbshipit-source-id: 98521d14ee03466e7128a1912b9f754ffe10b448
Summary:
Enables Gelu bf16/fp32 in the CPU path using the MKL-DNN implementation. Users don't need to call `to_mkldnn()` explicitly. The new fp32 Gelu performs better than the original one.
Add Gelu backward for https://github.com/pytorch/pytorch/pull/53615.
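A quick sanity check of the public-API behavior this enables, using plain `torch.nn.functional.gelu` with no explicit `to_mkldnn()` call:

```python
import torch

x = torch.randn(64, 128)

# fp32 path
y = torch.nn.functional.gelu(x)

# bf16 path: same public API, dtype is preserved
xb = x.to(torch.bfloat16)
yb = torch.nn.functional.gelu(xb)
assert yb.dtype == torch.bfloat16
# bf16 result tracks the fp32 result within bf16 precision
assert torch.allclose(y, yb.float(), atol=5e-2, rtol=5e-2)
```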
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58525
Reviewed By: ejguan
Differential Revision: D29940369
Pulled By: ezyang
fbshipit-source-id: df9598262ec50e5d7f6e96490562aa1b116948bf
Summary:
Fixes https://github.com/pytorch/pytorch/issues/11959
An alternative approach to creating a new `CrossEntropyLossWithSoftLabels` class: this PR simply adds support for "soft targets", a.k.a. class probabilities, to the existing `CrossEntropyLoss` and `NLLLoss` classes.
The implementation is dumb and simple right now, but future work can add higher-performance kernels for this case.
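A short sketch of the new target form: a floating-point target with the same shape as the logits is treated as per-class probabilities, and a one-hot distribution reproduces the classic hard-target loss:

```python
import torch

logits = torch.randn(4, 3)

# Hard targets: class indices (the classic form)
hard = torch.tensor([0, 2, 1, 0])
loss_hard = torch.nn.functional.cross_entropy(logits, hard)

# Soft targets: per-class probabilities, same shape as the logits.
# A one-hot distribution gives the same loss as the index form.
one_hot = torch.nn.functional.one_hot(hard, num_classes=3).float()
loss_oh = torch.nn.functional.cross_entropy(logits, one_hot)
assert torch.allclose(loss_hard, loss_oh, atol=1e-5)

# Arbitrary probability distributions also work
soft = torch.softmax(torch.randn(4, 3), dim=1)
loss_soft = torch.nn.functional.cross_entropy(logits, soft)
```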
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61044
Reviewed By: zou3519
Differential Revision: D29876894
Pulled By: jbschlosser
fbshipit-source-id: 75629abd432284e10d4640173bc1b9be3c52af00
Summary:
Fixes Python part of https://github.com/pytorch/pytorch/issues/60747
Enhances the Python versions of `Transformer`, `TransformerEncoderLayer`, and `TransformerDecoderLayer` to support callables as their activation functions. The old way of specifying the activation function as a string still works as well.
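A minimal sketch of both spellings:

```python
import torch

# Old way: activation given as a string
layer_str = torch.nn.TransformerEncoderLayer(
    d_model=16, nhead=2, activation="relu")

# New way: activation given as a callable
layer_fn = torch.nn.TransformerEncoderLayer(
    d_model=16, nhead=2, activation=torch.nn.functional.relu)

x = torch.randn(5, 3, 16)  # (seq_len, batch, d_model)
out_str = layer_str(x)
out_fn = layer_fn(x)
assert out_str.shape == x.shape and out_fn.shape == x.shape
```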
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61355
Reviewed By: bdhirsh
Differential Revision: D29967302
Pulled By: jbschlosser
fbshipit-source-id: 8ee6f20083d49dcd3ab432a18e6ad64fe1e05705
Summary:
This PR enables the softmax calculation with the `bfloat16` data type when not along the last dim.
* Use a bf16 specialization for the forward calculation to reduce bf16/fp32 casts in the vec template.
* Lift the bf16 limitation on the backward calculation.
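A short sketch of the case this covers (softmax over a non-last dim of a bf16 tensor):

```python
import torch

x = torch.randn(4, 8, 16, dtype=torch.bfloat16)
out = torch.softmax(x, dim=1)  # dim=1 is not the last dim
assert out.dtype == torch.bfloat16
# probabilities along dim=1 still sum to ~1 despite the reduced precision
assert torch.allclose(out.float().sum(dim=1), torch.ones(4, 16), atol=3e-2)
```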
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60371
Reviewed By: ejguan
Differential Revision: D29563109
Pulled By: cpuhrsch
fbshipit-source-id: f6b439fa3850a6c633f35db65ea3d735b747863e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62281
Closes gh-24646, Closes gh-24647
There is no `TensorIterator` equivalent to these kernels so this is just
migrating the existing kernels over to the ATen style.
I've benchmarked for contiguous tensors with this script:
```
import torch
shape = (10, 10, 100, 100)
x = torch.randn(*shape, device='cuda')
w = torch.randn((10, 1, 5, 5), device='cuda')
for _ in range(100):
    torch.nn.functional.conv2d(x, w, groups=10)
```
and similarly for backwards. I see these as the same to within measurement error.
| | Master (us) | This PR (us) |
|------------------:|:-------------------:|:--------------------:|
| Forward | 133.5 | 133.6 |
| Backward (input) | 1,102 | 1,119 |
| Backward (weight) | 2,220 | 2,217 |
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D29943062
Pulled By: ngimel
fbshipit-source-id: fc5d16496eb733743face7c5a14e532d7b8ee26a
Summary:
Part of the fix for https://github.com/pytorch/pytorch/issues/12013
Checks that the input and output dims are non-zero in order to allow the Bilinear layer to accept 0-dim batch sizes. The if-check covers both input and output dim sizes since the `_trilinear` function is written to work with both forward and backward for Bilinear.
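A quick sketch of the enabled case, including the backward pass through `_trilinear`:

```python
import torch

m = torch.nn.Bilinear(4, 5, 3)
x1 = torch.randn(0, 4, requires_grad=True)
x2 = torch.randn(0, 5, requires_grad=True)

out = m(x1, x2)          # empty batch passes through
assert out.shape == (0, 3)

out.sum().backward()     # backward also handles the empty batch
assert x1.grad.shape == (0, 4)
```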
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47106
Reviewed By: ejguan
Differential Revision: D29935589
Pulled By: jbschlosser
fbshipit-source-id: 607d3352bd4f88e2528c64408f04999960be049d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62006
Closes gh-24646, gh-24647
There is no `TensorIterator` equivalent to these kernels so this is just
migrating the existing kernels over to the ATen style.
I've benchmarked for contiguous tensors with this script:
```
import torch
shape = (10, 10, 100, 100)
x = torch.randn(*shape, device='cuda')
w = torch.randn((10, 1, 5, 5), device='cuda')
for _ in range(100):
    torch.nn.functional.conv2d(x, w, groups=10)
```
and similarly for backwards. I see these as the same to within measurement error.
| | Master (us) | This PR (us) |
|------------------:|:-------------------:|:--------------------:|
| Forward | 133.5 | 133.6 |
| Backward (input) | 1,102 | 1,119 |
| Backward (weight) | 2,220 | 2,217 |
Test Plan: Imported from OSS
Reviewed By: jbschlosser
Differential Revision: D29883676
Pulled By: ngimel
fbshipit-source-id: 9b2ac62cdd8a84e1a23ffcd66035b2b2fe2374d8