Summary:
* Create a private global-scope function _generate_subsequent because static class-attribute member functions are not supported by TorchScript, resulting in TorchScripting errors.
* Make TransformerEncoder and TransformerDecoder consistent w.r.t. is_causal handling by calling _detect_causal_mask
* Clarify documentation that is_causal is a hint
* Move causal mask detection into a method _detect_causal_mask
* only accept input-size compatible causal mask as causal mask
* update _generate_subsequent_causal_mask to include factory kwargs for dtype and device: avoid extra copies and conversions by passing them directly to torch.full (see the sketch after this list)
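A minimal sketch of what such a module-level helper might look like (the exact name and signature in the PR may differ):
```python
import torch

def _generate_subsequent_causal_mask(sz, device=None, dtype=None):
    # Square causal mask: -inf above the diagonal, 0 on and below it.
    # Passing device/dtype straight to torch.full avoids extra copies and conversions.
    return torch.triu(
        torch.full((sz, sz), float("-inf"), device=device, dtype=dtype),
        diagonal=1,
    )
```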
Test Plan: sandcastle & github CICD
Continuation of #101487 (due to a tooling issue) which is a continuation-in-part of https://github.com/pytorch/pytorch/pull/98327 by @janEbert
Differential Revision: D47427117
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105265
Approved by: https://github.com/mikaylagawarecki
Add semantics for creating a buffer object that mirror those for creating a parameter. This is done by introducing a new `Buffer` class that can be used for type disambiguation. The underlying functionality of registering a buffer remains the same, as the `register_buffer` method has not been changed. The `persistent` parameter in the `Buffer` type indicates whether a buffer object should be persistent or not. Other non-test changes have to do with getting the new `Buffer` type recognized by inductor and dynamo. The remaining changes are test changes to make sure that the `Buffer` type can be used as a drop-in replacement for `register_buffer`, as it just leads to `register_buffer` being called. The addition of this new functionality still allows normal tensors to be used as buffers, so these changes are intended to be backwards compatible.
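A minimal sketch of the usage this change enables, assuming the new class is exposed as `torch.nn.Buffer`:
```python
import torch
from torch import nn

class RunningStats(nn.Module):
    def __init__(self):
        super().__init__()
        # Assigning a Buffer is equivalent to calling
        # self.register_buffer("mean", torch.zeros(10), persistent=False)
        self.mean = nn.Buffer(torch.zeros(10), persistent=False)

m = RunningStats()
print("mean" in dict(m.named_buffers()))  # True
print("mean" in m.state_dict())           # False, since persistent=False
```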
Fixes #35735
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104069
Approved by: https://github.com/mikaylagawarecki
This PR is to fix https://github.com/pytorch/pytorch/issues/101935.
Only when the input, parameters, and hidden states are all on the CPU device will LSTM go into the oneDNN fast-path implementation. Otherwise, it will fall back to the original implementation.
Note that if the input and parameters are indeed not on the same device, the error `Input and parameter tensors are not at the same device, found input tensor......` will be raised in `check_attributes`. Therefore, the proper usage of LSTM is to call `input.to(device)` and `model.to(device)` together.
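A minimal sketch of the intended usage (module and input moved to the same device; sizes are illustrative):
```python
import torch
from torch import nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=2).to(device)
x = torch.randn(5, 3, 10, device=device)  # move the input together with the model

output, (h_n, c_n) = lstm(x)  # CPU-only inputs take the oneDNN fast path
```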
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102050
Approved by: https://github.com/XiaobingSuper, https://github.com/albanD
Enable more tests on ASAN. Meanwhile, we disable float-divide-by-zero and float-cast-overflow; both are disabled because they are also disabled by default in the latest Clang.
The following excerpt from the Clang documentation explains the reasons.
```
-fsanitize=float-cast-overflow: Conversion to, from, or between floating-point types
which would overflow the destination. Because the range of representable values
for all floating-point types supported by Clang is [-inf, +inf], the only cases detected are
conversions from floating point to integer types.
-fsanitize=float-divide-by-zero: Floating point division by zero.
This is undefined per the C and C++ standards,
but is defined by Clang (and by ISO/IEC/IEEE 60559 / IEEE 754) as producing
either an infinity or NaN value,
so is not included in -fsanitize=undefined.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103647
Approved by: https://github.com/kit1980
Fixes #64601 and #98906
Adds an `assign` argument to `load_state_dict` that loads params/buffers by assignment instead of doing `param.copy_(param_from_state_dict)`.
Primarily intended to remove the need for the `.to_empty()` in
```
with torch.device('meta'):
    m = SomeModule()
m.to_empty()
state_dict = torch.load('...pth')
m.load_state_dict(state_dict)
```
so we can instead do
```
with torch.device('meta'):
    m = SomeModule()
state_dict = torch.load('...pth')
m.load_state_dict(state_dict, assign=True)
```
**An open problem with this PR, for the case where the model is initialized on meta: what happens to nonpersistent buffers/params corresponding to keys missing from the state dict?**
What happens when `load_state_dict(state_dict, strict=False, assign=True)` is called and the state_dict is missing some keys? The params missing from the `state_dict` and the nonpersistent buffers would still be on `meta` and would need to be manually initialized. However, I don't think we offer an API that would initialize these.
One solution would be to make these empty tensors, but it might not be semantically correct...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102212
Approved by: https://github.com/albanD
## Description
This is a bug fix for rare cases that can happen with a specific scale and antialias=False, where the output for a particular line can be wrong. For example:
```
line 14
output uint8: [76, 78, 80, 81, 83, 85, 87, 88, 90]
expected float: [149, 152, 155, 158, 161, 164, 167, 170, 173]
diff: [-73, -74, -75, -77, -78, -79, -80, -82, -83]
opencv ref: [149 152 155 158 161 164 167 170 173]
```
It appears that for this line we have 3 weight coefficients instead of 2:
```
line 13 | 351, 2
k: 1130 15254
line 14 | 378, 3
k: 0 16384 -6780 <------- We should have 2 weights and not 3
line 15 | 432, 2
k: 15254 1130
```
which comes from our `_compute_weights_aa` function that is specifically used for AA=False and uint8.
```
xmin = std::max(
    static_cast<int64_t>(center - support + 0.5 + align_corners_delta), static_cast<int64_t>(0));
xsize = std::min(
    static_cast<int64_t>(center + support + 0.5 + align_corners_delta), input_size) - xmin;
```
```
center - support + 0.5 + align_corners_delta: 14.999999999999998
static_cast<int64_t>(center - support + 0.5 + align_corners_delta): 14
xmin -> 14
center + support + 0.5 + align_corners_delta: 17.0
static_cast<int64_t>(center + support + 0.5 + align_corners_delta): 17
xsize -> 17 - 14 = 3 <------ 3 instead of 2
```
For the float dtype, AA=False weights and indices are computed differently, because that code path was implemented first.
In any case, `xsize` should not be larger than `max_interp_size`, so we decided to clip `xsize`.
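A Python paraphrase of the bound computation above, with the clip applied (the function name `_aa_false_bounds` is hypothetical; this is a sketch, not the actual kernel code):
```python
def _aa_false_bounds(center, support, input_size, align_corners_delta, max_interp_size):
    # Same arithmetic as the C++ snippet above; int() truncates like static_cast<int64_t>.
    xmin = max(int(center - support + 0.5 + align_corners_delta), 0)
    xsize = min(int(center + support + 0.5 + align_corners_delta), input_size) - xmin
    # The fix: never produce more weights than max_interp_size allows.
    xsize = min(xsize, max_interp_size)
    return xmin, xsize
```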
Once fixed, the computed indices and weights are the same as for the float dtype code path:
```
# Option: xsize = min(xsize, max_interp_size)
Line Num | xmin, xsize
14 | 378, 2 xmin=378 <---> xmin = i * stride = i * 3 * 9 => i = 14
k: 0 16384 16384 = w * (1 << 14) => w = 1.0
=> i=14, w=0 and i=15, w=1
```
vs
```
Line Num | index0, index1
F32: 14 | 15, 16
F32: lambda0, lambda1: 0.999999, 9.53674e-07
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101403
Approved by: https://github.com/NicolasHug
### Description
This PR is to fix #99413, which shows the limitation of double backward when using oneDNN in LSTM.
This PR does not implement double backward function itself, because that is pretty hard to spell out. Instead, it implements mkldnn_rnn_layer_backward using differentiable operations, so that double backward can be done automatically.
During the backward pass, it needs the gates and hidden states between cells within one layer. However, these intermediate variables are stored in the `workspace`, and it is hard to extract them. Therefore, in backward, we need to re-calculate them first.
A corresponding UT has been added based on the failing case in #99413. The existing UT with gradcheck and gradgradcheck added in https://github.com/pytorch/pytorch/pull/26660 cannot test LSTM using oneDNN, because that UT only supports the `double` datatype, which oneDNN does not support.
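A minimal sketch of the double-backward pattern this fix enables on the CPU path (module and input sizes are illustrative):
```python
import torch
from torch import nn

lstm = nn.LSTM(input_size=4, hidden_size=4, num_layers=1)
x = torch.randn(3, 2, 4, requires_grad=True)

out, _ = lstm(x)
# First-order gradient with a graph, then backprop through it (double backward).
(grad_x,) = torch.autograd.grad(out.sum(), x, create_graph=True)
grad_x.sum().backward()
```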
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100627
Approved by: https://github.com/jgong5, https://github.com/soulitzer
Description:
- Fixed a memory-format bug:
When the input is a channels-last 4D tensor produced as follows:
```
t = torch.ones(1, 3, 32, 32).contiguous(memory_format=torch.channels_last)
t = t[0]
t = t[None, ...]
```
upsampling will produce output with channels-first memory format, but our AVX code does not take that into account.
Here is repro code showing that the nightly is broken for this particular case:
```python
import torch
torch.manual_seed(0)
input = torch.randint(0, 256, size=(1, 3, 256, 256), dtype=torch.uint8).contiguous(memory_format=torch.channels_last)
input = input[0]
input = input[None, ...]
assert input.is_contiguous(memory_format=torch.channels_last)
output = torch.nn.functional.interpolate(input, (224, 224), mode="bilinear", antialias=True)
expected = torch.nn.functional.interpolate(input.float(), (224, 224), mode="bilinear", antialias=True)
assert output.is_contiguous()
assert expected.is_contiguous()
torch.testing.assert_close(expected, output.float(), atol=1, rtol=1)
# >
# Traceback (most recent call last):
# File "<stdin>", line 1, in <module>
# File "/pytorch/torch/testing/_comparison.py", line 1511, in assert_close
# raise error_metas[0].to_error(msg)
# AssertionError: Tensor-likes are not close!
#
# Mismatched elements: 14120 / 150528 (9.4%)
# Greatest absolute difference: 214.6112518310547 at index (0, 1, 152, 13) (up to 1 allowed)
# Greatest relative difference: 17.005144119262695 at index (0, 2, 26, 2) (up to 1 allowed)
```
- Also renamed `needs_unpacking` to `skip_unpacking`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100258
Approved by: https://github.com/NicolasHug
Issue: #93684
# Problem
Reduce graph breaks when dynamo compiles python functions containing numpy functions and ndarray operations.
# Design (as I know it)
* Use torch_np.ndarray (a wrapper around a tensor) to back a `VariableTracker`: `NumpyTensorVariable`.
* Translate all attribute and method calls on ndarray to their torch_np.ndarray equivalents.
This PR adds `NumpyTensorVariable` and supports (see the sketch after this list):
1. tensor to ndarray, ndarray to tensor
2. numpy functions such as numpy.meshgrid()
3. ndarray attributes such as `itemsize`, `stride`
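A sketch of the kind of mixed NumPy/tensor function this work targets (the function name is illustrative; whether it traces without graph breaks depends on how far the support has progressed, but it runs correctly either way):
```python
import numpy as np
import torch

@torch.compile
def mix(t):
    a = t.numpy()               # tensor -> ndarray
    b = a * 2 + a.itemsize      # numpy arithmetic and ndarray attribute access
    return torch.from_numpy(b)  # ndarray -> tensor

print(mix(torch.arange(3)))
```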
The next PR will handle returning `np.ndarray` and add support for ndarray methods.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95849
Approved by: https://github.com/ezyang
Some modules, like lazyModule, may override `_save_to_state_dict()`; in this case, the pre-state-dict hook will not be called. So move the pre-state-dict hook out of `_save_to_state_dict()` to make sure the pre-hook is called.
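A minimal sketch of a state-dict pre-hook, which should now fire even for modules that override `_save_to_state_dict()` (assuming the public `register_state_dict_pre_hook` API with a `(module, prefix, keep_vars)` hook signature):
```python
import torch
from torch import nn

def log_prefix(module, prefix, keep_vars):
    # Runs before the module serializes its own entries into the state dict.
    print(f"about to serialize {type(module).__name__} under prefix '{prefix}'")

m = nn.Linear(2, 2)
m.register_state_dict_pre_hook(log_prefix)
sd = m.state_dict()
```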
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98964
Approved by: https://github.com/albanD
Fixes #99148, raising an error if output_ratio's size > 2.
Justification for changes:
If an output size is not specified but an output ratio is, we call fractional_max_pool2d_with_indices. We then generate the value of output_size based on the first two elements of the output_ratio (around line 480 of torch/nn/functional.py).
Thus, we should raise a value error in the case that the user passes an output_ratio (instead of an output_size) and the number of elements in output_ratio exceeds two. We must raise an error before calling torch._C._nn.fractional_max_pool2d, as the value of output_size passed into torch._C._nn.fractional_max_pool2d is guaranteed to be of size 2 (the existing code generates it from the first two indices of the passed-in ratio).
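A minimal sketch of the behavior described above (the exact error type and message may differ):
```python
import torch
from torch import nn

x = torch.randn(1, 3, 16, 16)

pool = nn.FractionalMaxPool2d(kernel_size=2, output_ratio=(0.5, 0.5))  # two elements: OK
y = pool(x)

# With this change, an output_ratio with more than two elements should raise an error
# instead of being silently truncated to its first two entries:
# nn.FractionalMaxPool2d(kernel_size=2, output_ratio=(0.5, 0.5, 0.5))(x)
```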
I would be happy to iterate on this if there are any issues.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99507
Approved by: https://github.com/mikaylagawarecki
## BC-breaking note:
This is technically a bugfix. Prior to this PR, for `torch.nn.functional.grid_sample(mode='nearest')` the 2D kernel used `std::nearbyint` whereas the 3D kernel used `std::round` in order to determine the nearest pixel locations after un-normalization of the grid. This PR fixes the 3D kernel to use `std::nearbyint`, which rounds values that are exactly `<>.5` to the nearest even, consistent with the behavior of `torch.round`. Unnormalized indices that are exactly `<>.5` will now be rounded to the nearest even instead of being rounded away from 0.
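For reference, a quick illustration of the round-half-to-even behavior of `torch.round` that the 3D kernel now matches:
```python
import torch

# Values exactly at <>.5 round to the nearest even integer, not away from zero.
print(torch.round(torch.tensor([0.5, 1.5, 2.5, 3.5])))  # tensor([0., 2., 2., 4.])
```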
## Description
In the nearest neighbor interpolation mode, the 2D GridSample rounds the index to the nearest even using [std::nearbyint](https://github.com/pytorch/pytorch/blob/v2.0.0/aten/src/ATen/native/cpu/zmath.h#L182) whereas the 3D GridSample rounds the index away from zero using std::round. This discrepancy needs to be resolved. We are making both 2D GridSample and 3D GridSample round to the nearest even.
## Unit Test Goals
1. Make sure the x dimension and y dimension rounding behaviors are the same for 2D GridSample.
2. ~~Make sure the 2D GridSample rounding mode is rounding to the nearest even.~~
3. Make sure the x dimension, y dimension, and z dimension rounding behaviors are the same for 3D GridSample.
4. ~~Make sure the 3D GridSample rounding mode is rounding to the nearest even.~~
5. The 2D GridSample and 3D GridSample rounding behaviors are exactly the same.
After some experiments, I found 2 and 4 are difficult to achieve. Even though I can compute the normalized coordinates corresponding to the unnormalized coordinates including [0, 0.5, 1.0, 1.5, 2.0, 2.5, ..., 10.0], the unnormalization process in the GridSample implementations always has a small chance of introducing floating-point error. Therefore, it is not possible to unit test the rounding mode from the normalized coordinates.
## Unit Test Methods
The unit test is simple: using the same values along the dimension to be tested in the input tensor and the same normalized indices in the grid tensor, interpolation along the 2D GridSample x-dimension, 2D GridSample y-dimension, 3D GridSample x-dimension, 3D GridSample y-dimension, and 3D GridSample z-dimension should produce exactly the same interpolated values.
If one CPU/CUDA 2D/3D implementation uses a different rounding mode from the others, the unit test shall fail.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97000
Approved by: https://github.com/mikaylagawarecki
Summary:
This fixes an issue raised in [is_causal parameter in torch.nn.TransformerEncoderLayer.forward does not work #96941](https://github.com/pytorch/pytorch/issues/96941) where results computed with is_causal do not properly reflect causal masking.
In PyTorch 2.0, Accelerated PT Transformers added the is_causal parameter to the legacy nn.Transformer* and nn.MHA APIs, aligned with and intended to engage the is_causal parameter of the new scaled_dot_product_attention (SDPA) operator.
At present, is_causal works differently for the Transformer* modules, nn.MHA, and F.MHA:
* The nn.Transformer* modules treat is_causal as an optional indicator about the format of attn_mask. This is because some layers (such as the CLIP layer) use the attention mask inside the layer, and thus attn_mask was a required feature.
* Initially, nn.MHA and F.MHA were defined to align with F.SDPA in behavior: a user may specify either the attention mask or is_causal, but not both. It seemed to make sense at the time to align SDPA and MHA, especially since there was a larger overlap of parameters, which have since changed, e.g., with the removal of need_weights from SDPA. (See below for why this makes sense.)
Unfortunately, this does not work because of how MHA was changed to handle the need_weights parameter. When need_weights is present, we no longer call SDPA, because support for need_weights was removed from SDPA before the release. The rationale is that need_weights defeats all of the optimizations at the foundation of SDPA performance. Having the flag might thus mislead users into thinking they get good performance, only to be disappointed when they enable a legacy feature of MHA which massively degrades performance. (They might not think anything of enabling it, because it is on by default in MHA today, which leads to more issues.)
Since SDPA no longer supports need_weights, we need to pick a separate path which implements attention using a set of discrete operations and allocates a tensor for the weights. Alas, this code path does not support is_causal, because attention is implemented as matmul using the attention mask. Thus, is_causal has no impact. (A substantially similar situation arises with how key_padding_mask is handled today, because Nested Tensors are not supported by torch.compile() in 2.0.)
This problem was masked because all uses of legacy nn.MHA (and F.MHA) come through nn.Transformer*, which called self-attention (i.e., nn.MHA) only ever with the attention mask attn_mask, and never with is_causal, a missed optimization opportunity that would have been addressed in a future performance update.
Regrettably, always calling nn.MHA with attn_mask prevented diagnosing the issue of not having a suitable attention mask when need_weights support was dropped from SDPA and a discrete implementation of attention was added for that scenario, as well as for the execution path with key_padding_mask.
We have two options to address this issue:
Solution 1: Whenever nn.MHA and F.MHA are executed with is_causal set, we internally create a causal mask at significant expense of allocating a tensor and filling it with a triangular causal matrix. This increases memory usage, and runtime, for allocating a causal mask. To add insult to injury, in all current (and likely future) execution scenarios, MHA is called by a model using the nn.Transformer API which already has that matrix and passes it from nn.module to nn.module. Then the passing in of attn_mask has to be suppressed by nn.TransformerEncoderLayer, only for nn.MHA to immediately allocate the very same tensor again to satisfy the requirement to have an attention mask for the computation. (We expect new use cases to use SDPA directly.)
Solution 2: We align the behavior of nn.MHA and F.MHA with the rest of the existing nn.Transformer API, and require the attention mask to be passed into nn.MHA in addition to is_causal as an optional indicator about the nature of the attention mask rather than as an alternative to attn_mask. Then, when we choose the code path for processing MHA with need_weights or a key_padding_mask, we have the attn_mask passed down through the nn.Transformer* hierarchy, without the added overhead of allocating an attention mask as in scenario 1.
This PR implements solution 2 which offers better performance and in retrospect aligns MHA better with the rest of the Transformer modules as the definition of SDPA evolved into a more streamlined high-performance operator. It ostensibly changes how is_causal works, by requiring the attention mask to be specified. However, as described here, and as shown in the submitted issue, is_causal is not working as intended today, so it requires a change regardless.
In that sense, a change in API does not occur per se, as the current implementation is not working, and a change has to occur either way to resolve the submitted issue, breaking any use cases that depend on the current implementation. Checks exist (and more can be added) that flag any scenario where is_causal is passed as True but no attention mask is provided, ensuring that there is no silent change from even the faulty behavior present in 2.0.
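A sketch of the calling convention under solution 2 (is_causal is a hint about attn_mask, which still has to be supplied; sizes are illustrative):
```python
import torch
from torch import nn

mha = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)
x = torch.randn(2, 5, 8)

# The causal mask is built once and passed down; is_causal merely asserts its nature.
mask = nn.Transformer.generate_square_subsequent_mask(5)
out, _ = mha(x, x, x, attn_mask=mask, is_causal=True, need_weights=False)
```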
As an upside, the present implementation will improve performance by addressing the passing of the is_causal flag from Transformer modules to MHA, speeding up training for these examples, e.g., finetuning BERT, RoBERTa, XLM-R models.
Differential Revision: D44245725
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97214
Approved by: https://github.com/albanD
Fixes #96813.
Comments:
1. Wasn't able to test since tools/nightly.py does not allow for GPU build (and I don't want to build from scratch).
2. In theory, the bug (i.e. NaNs) can still occur when beta is very small (e.g. `beta=1e-50`), but I am not sure whether anybody cares (see the sketch after this list).
3. Some checks within the smooth_l1_loss C++ code could be changed to check for `beta > 0` instead of `beta >= 0`.
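A minimal sketch of the small/zero-beta regime the fix concerns (the precise reproducer is in #96813; the tensor names and shapes here are only illustrative of the call pattern):
```python
import torch
import torch.nn.functional as F

pred = torch.zeros(3, requires_grad=True)
target = torch.zeros(3)

# Degenerate case: beta == 0 and pred == target; gradients should be finite, not NaN.
loss = F.smooth_l1_loss(pred, target, beta=0.0)
loss.backward()
print(pred.grad)
```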
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97022
Approved by: https://github.com/jbschlosser
Fixes #96429
This PR is also a follow-up to #90427. In that PR, we also discussed whether the grid index calculations in `grid_sampler_compute_source_index` should also be upcast to `opmath_t` (https://github.com/pytorch/pytorch/pull/90427/files#r1048876708). Due to another unit test failure, we didn't upcast those calculations in that PR.
After some investigation, I found that the inaccurate results have nothing to do with the internals of `affine_grid`, even if it is calculated using `double` internally. As long as the input `grid` is passed to `grid_sample` in **half** precision, the results will be less inaccurate than with a **float** `grid`. This can be verified with a short C++ program like the one below (by setting `TYPE_T` to `__half` or `float` at compile time):
```cpp
#include <cuda.h>
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <iostream>
#include <cstdio>  // for printf
#ifndef TYPE_T
#define TYPE_T float
#endif
int main() {
    using type_t = TYPE_T;
    type_t d = static_cast<__half>((double)2.0 / 3.0);
    type_t s = (((float)d + 1.f) * 3 - 1) / 2;
    printf("%.15f %.15f\n", (double)d, (double)s);
}
```
Outputs are
```
./float.out
0.666503906250000 1.999755859375000
./half.out
0.666503906250000 2.000000000000000
```
To resolve the discussion back in https://github.com/pytorch/pytorch/pull/90427/files#r1048876708, I've also increased the test tolerance in the failed unit test `issue_24823_1(torch.half)`.
For the original script in #96429, I got more accurate results with `align_corners = True`
```
align_corners = True
Expected result has mean absolute value of 0.5285 and maximum absolute value of 3.2067.
Half precision result is off by 0.0001 (0.02%) on average and 0.0010 (0.03%) at maximum.
align_corners = False
Expected result has mean absolute value of 0.5189 and maximum absolute value of 3.0101.
Half precision result is off by 0.0001 (0.02%) on average and 0.0010 (0.03%) at maximum.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96586
Approved by: https://github.com/ngimel
Fixes#88951
The output shape of upsample is computed through `(i64)idim * (double)scale` and then cast back to `i64`. If the input scale is ill-formed (say, a negative number as in #88951), making `(double)(idim * scale)` fall out of the range of `i64`, the cast is undefined behaviour.
To fix it, we just check whether `(double)(idim * scale)` can fit into `i64`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94290
Approved by: https://github.com/malfet
Optimize away unnecessary collection casts (unnecessary calls to list, tuple, and dict) and simplify calls to the sorted builtin. This should strictly improve speed and readability.
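A couple of illustrative (hypothetical) examples of the kind of simplification applied:
```python
data = {"b": 2, "a": 1}

# Before: redundant list() call inside sorted()
keys = sorted(list(data.keys()))

# After: equivalent, with one fewer intermediate list
keys = sorted(data)

# Before: casting an already-materialized list to list again
values = list([v for v in data.values()])

# After
values = [v for v in data.values()]
```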
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94323
Approved by: https://github.com/albanD
Attempts to fix #92656
BC-breaking! This changes the default of zero_grad in optim and in nn so that grads are set to None instead of zero tensors. We are changing the default because there are proven perf wins, and existing code has typically not regressed due to this change. (This note will probably have to be fleshed out more.)
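A minimal sketch of the new default and how to opt back into the old behavior:
```python
import torch
from torch import nn

model = nn.Linear(2, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

model(torch.randn(1, 2)).sum().backward()
opt.zero_grad()            # new default: grads become None
print(model.weight.grad)   # None

model(torch.randn(1, 2)).sum().backward()
opt.zero_grad(set_to_none=False)  # old behavior: grads become zero-filled tensors
print(model.weight.grad)          # tensor of zeros
```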
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92731
Approved by: https://github.com/ngimel