Commit Graph

1598 Commits

Author SHA1 Message Date
vfdev-5
a1dd367716 Fixed error in bicubic upsampling aa=false for uint8 input (#118389)
Description:
- Fixed error in bicubic upsampling aa=false for uint8 input. This is seen in the test suite:
```diff
- self.assertLess(diff.max(), 15)
+ self.assertLess(diff.max(), 5)
```
Even with the reduced input range, the clipping effect is not fully removed, which is why the threshold is 5 rather than around 1.

- Renamed methods
- The error is mostly visible in upsampling (smaller -> larger) mode at boundary values

More details on the bug:
For uint8 input and antialiasing=False we use a separable algorithm (using temp buffers and interpolating dimensions one by one) where interpolation weights and input indices are computed and stored using index ranges: `index_min` and `index_size`; weights outside of `index_size` are zero. For example, an output point can have index_min=10, index_size=4 and 4 non-zero weights, so the output value is computed as
```
out_value = sum([src[i + index_min] * w for i, w in zip(range(4), weights) ])
```
When computing index ranges and weights for output points near the boundaries, `index_min` is clamped between 0 and input_size and `index_size` becomes smaller than 4. This approach is OK for antialiasing=True but is not correct for antialiasing=False, where the weights end up being computed incorrectly:
```
-- output index i= 0
regular float32 approach:
source indices: [-2, -1, 0, 1] -> out-of-bounds values are clamped to boundaries -> [0, 0, 0, 1]
interp weights: [-0.07200000000000006, 0.4600000000000001, 0.72, -0.1080000000000001]

separable uint8 approach:
source indices coming from index ranges (min, size): [0, 1]
incorrect interp weights computed with the current implementation: [1.1764705882352944, -0.17647058823529432, 0.0, 0.0]
fixed interp weights in the PR: [1.108, -0.1080000000000001, 0.0, 0.0]
Note: the weight corresponding to source index 0 is 1.108 = -0.07200000000000006 + 0.4600000000000001 + 0.72, and the weight corresponding to source index 1, -0.1080000000000001, is the same as in the float32 approach.
```
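
For illustration, here is a small Python sketch (not the actual C++ kernel) of how out-of-bounds bicubic taps can be folded onto the clamped boundary index so their weights are preserved; the cubic kernel with A=-0.75 and the helper names are assumptions. With 5x upsampling and align_corners=False, the source center of output index 0 falls at -0.4, which reproduces the corrected weights above.

```python
import math

def cubic_weight(x, A=-0.75):
    # Cubic convolution kernel; A=-0.75 is assumed to match the bicubic kernel used here.
    x = abs(x)
    if x < 1.0:
        return ((A + 2) * x - (A + 3)) * x * x + 1
    if x < 2.0:
        return (((x - 5) * x + 8) * x - 4) * A
    return 0.0

def folded_weights(center, in_size):
    # The 4 bicubic taps around the (possibly out-of-bounds) source center.
    base = math.floor(center) - 1
    taps = [(base + i, cubic_weight(center - (base + i))) for i in range(4)]
    # Fold each out-of-bounds tap onto the clamped boundary index instead of dropping it,
    # so the weights still sum to 1 (the essence of the fix described above).
    folded = {}
    for i, w in taps:
        j = min(max(i, 0), in_size - 1)
        folded[j] = folded.get(j, 0.0) + w
    return folded

# Output index 0 when upsampling 5x (align_corners=False): center = (0 + 0.5) / 5 - 0.5 = -0.4
print(folded_weights(-0.4, in_size=10))
# {0: 1.108..., 1: -0.108...}  -- matches the "fixed interp weights" above
```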

Quick benchmark to ensure there is no performance regression:

```
[------------------------------------------------------------------------------------ Resize ------------------------------------------------------------------------------------]
                                                                               |  torch (2.3.0a0+gitfda85a6) PR  |  torch (2.3.0a0+git0d1e705) Nightly  |  Speed-up: PR vs Nightly
1 threads: -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_first bilinear (400, 400) -> (224, 224) aa=False  |        440.996 (+-2.044)        |          470.824 (+-5.927)           |      1.068 (+-0.000)
      3 torch.uint8 channels_first bicubic (400, 400) -> (224, 224) aa=False   |        463.565 (+-1.519)        |          497.231 (+-10.825)          |      1.073 (+-0.000)
      3 torch.uint8 channels_first bilinear (400, 400) -> (700, 700) aa=False  |       1717.000 (+-28.589)       |         1915.570 (+-43.397)          |      1.116 (+-0.000)
      3 torch.uint8 channels_first bicubic (400, 400) -> (700, 700) aa=False   |       1801.954 (+-22.391)       |         1981.501 (+-37.034)          |      1.100 (+-0.000)
      3 torch.uint8 channels_last bilinear (400, 400) -> (224, 224) aa=False   |        199.599 (+-0.851)        |          196.535 (+-3.788)           |      0.985 (+-0.000)
      3 torch.uint8 channels_last bicubic (400, 400) -> (224, 224) aa=False    |        243.126 (+-0.681)        |          240.695 (+-2.306)           |      0.990 (+-0.000)
      3 torch.uint8 channels_last bilinear (400, 400) -> (700, 700) aa=False   |        686.270 (+-2.870)        |          687.769 (+-17.863)          |      1.002 (+-0.000)
      3 torch.uint8 channels_last bicubic (400, 400) -> (700, 700) aa=False    |        899.509 (+-5.377)        |          899.063 (+-9.001)           |      1.000 (+-0.000)

Times are in microseconds (us).
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118389
Approved by: https://github.com/NicolasHug
ghstack dependencies: #118388
2024-02-01 14:14:32 +00:00
vfdev-5
eba4bd6b86 Updated test_upsamplingBiMode2d_consistency (#118388)
Description:
- Lowered error thresholds and added an input range for bicubic to expose the inconsistency in the upsampling (smaller -> larger) bicubic aa=false implementation for the uint8 input dtype
- Updated outdated comments
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118388
Approved by: https://github.com/NicolasHug
2024-02-01 09:22:23 +00:00
Peter Bell
19e8ba95e5 [RELAND] Remove deprecated fbgemm operators (#112153)
These operators are not used and have been deprecated since #72690
(Feb 2022).

BC-breaking message:

`TorchScript` models that were exported with the deprecated
`torch.jit.quantized` API will no longer be loadable, as the required
internal operators have been removed.
Please re-export your models using the newer `torch.ao.quantization` API
instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112153
Approved by: https://github.com/jerryzh168
2024-01-30 16:32:37 +00:00
Oguz Ulgen
3b38f7b266 Remove skips for passing tests (#118000)
These tests were already passing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118000
Approved by: https://github.com/yanboliang
2024-01-23 16:11:38 +00:00
PyTorch MergeBot
bb28965924 Revert "Remove skips for passing tests (#118000)"
This reverts commit 3c339b5b21.

Reverted https://github.com/pytorch/pytorch/pull/118000 on behalf of https://github.com/oulgen due to test passing on diff but failing on hud... ([comment](https://github.com/pytorch/pytorch/pull/118000#issuecomment-1905351752))
2024-01-23 06:10:25 +00:00
Oguz Ulgen
3c339b5b21 Remove skips for passing tests (#118000)
These tests were already passing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118000
Approved by: https://github.com/yanboliang
2024-01-23 03:41:23 +00:00
haozhe.zhu@intel.com
0ae952db76 enable mkldnn bf32 matmul (#116015)
### Testing
FP32 matmul vs. mkldnn BF32 matmul on SPR

single core:

Input | BF32 / ms | FP32 / ms | Speed up
-- | -- | -- | --
M: 128, N: 128, K: 128, trans_a: False, trans_b: False | 32.842 | 38.279 | 1.165
M: 128, N: 256, K: 128, trans_a: False, trans_b: False | 38.590 | 73.967 | 1.917
M: 8192, N: 768, K: 768, trans_a: False, trans_b: False | 18456.267 | 74588.002 | 4.041

56 cores:

Input | BF32 / ms | FP32 / ms | Speed up
-- | -- | -- | --
M: 8192, N: 768, K: 768, trans_a: False, trans_b: False | 1199.400 | 1715.548 | 1.430
M: 8192, N: 768, K: 768, trans_a: False, trans_b: True | 1129.204 | 1708.912 | 1.513
M: 8192, N: 768, K: 3072, trans_a: False, trans_b: False | 3655.915 | 7992.877 | 2.186
M: 8192, N: 768, K: 3072, trans_a: False, trans_b: True | 3707.993 | 8026.191 | 2.165
Batch: 768, M: 128, N: 64, K: 128 | 1296.419 | 1308.411 | 1.009
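
A rough sketch of how such a comparison might be run; the exact knob that enables the bf32 path is an assumption here (gating on the float32 matmul precision setting, mirroring TF32 on CUDA):

```python
import torch
import torch.utils.benchmark as benchmark

M, N, K = 8192, 768, 768
a = torch.randn(M, K)
b = torch.randn(K, N)

def time_matmul(label):
    # Times a @ b with torch.utils.benchmark, which handles warmup and averaging.
    t = benchmark.Timer(stmt="a @ b", globals={"a": a, "b": b}, label=label)
    return t.blocked_autorange()

print(time_matmul("fp32"))

# Assumption: allowing reduced-precision fp32 matmul ("medium") enables the
# bfloat16-based (bf32) path on CPU via mkldnn, mirroring TF32 on CUDA.
torch.set_float32_matmul_precision("medium")
print(time_matmul("bf32 (reduced-precision fp32 matmul)"))
```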

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116015
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-01-20 09:30:23 +00:00
CaoE
c9528a11dd Add Half support for masked_softmax on CPU (#117028)
Add Half support for `masked_softmax` on CPU.
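
A public-API sketch of what `masked_softmax` computes, now exercisable with half inputs on CPU (`masked_softmax` itself is an internal kernel, so this uses the equivalent masked-fill-plus-softmax formulation):

```python
import torch

scores = torch.randn(2, 4, 8, 8, dtype=torch.half)   # e.g. attention scores on CPU
mask = torch.zeros(2, 4, 8, 8, dtype=torch.bool)
mask[..., -2:] = True                                 # positions to exclude

# Masked positions get -inf before the softmax, so they receive zero probability.
out = scores.masked_fill(mask, float("-inf")).softmax(dim=-1)
print(out.dtype, out[0, 0, 0, -2:])                   # torch.float16, zeros

```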
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117028
Approved by: https://github.com/jgong5, https://github.com/cpuhrsch
2024-01-18 08:59:20 +00:00
vfdev-5
1a57c18760 Fixed cuda grads for interpolate::trilinear on non-contig grad output (#117373)
Fixes #113642

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117373
Approved by: https://github.com/lezcano
2024-01-15 18:05:47 +00:00
vmoens
6f0f4f12ca [BugFix] Prevent LSTM to run with wrong input shape (#115542)
Fixes #114874
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115542
Approved by: https://github.com/mikaylagawarecki
2024-01-11 02:57:09 +00:00
Xinya Zhang
e3ca7346ce Re-add initial Flash Attention support on ROCM (#115981)
Note about the Updates:

This PR:
1. Skips more flash-attention-related UTs on MI200
2. Fixes additional ATen compilation errors after hipification
3. Fixes the author "root" of a specific commit
4. Includes the patch from Nikita in favor of block-level static initialization.

CAVEAT: This revised PR has a commit that modifies the CI to force its running on MI200 nodes. That specific commit must be reverted before merge.

Original PR (https://github.com/pytorch/pytorch/pull/114309) Note:

This pull request adds initial Flash Attention support for the AMD/ROCM platform. It adds a specialized Triton repository/branch as a compile-time dependency for the Flash Attention math library on AMD/ROCM. This Triton submodule is not used at runtime and will not be shipped in the final pytorch package. We plan to release this specialized Triton as a separate project.

Known limitations:

- Only supports MI200 series GPUs (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`).
- Only supports power-of-two sequence lengths.
- No support for varlen APIs.
- Only supports head dimensions 16, 32, 64, 128.
- Performance is still being optimized.

Fixes #112997

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115981
Approved by: https://github.com/malfet
2024-01-04 22:21:31 +00:00
Aaron Gokaslan
3fe437b24b [BE]: Update flake8 to v6.1.0 and fix lints (#116591)
Updates flake8 to v6.1.0 and fixes a few lints using sed and some ruff tooling.
- Replace `assert(0)` with `raise AssertionError()`
- Remove extraneous parentheses, i.e.
  - `assert(a == b)` -> `assert a == b`
  - `if(x > y or y < z):`->`if x > y or y < z:`
  - And `return('...')` -> `return '...'`

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116591
Approved by: https://github.com/albanD, https://github.com/malfet
2024-01-03 06:04:44 +00:00
Sun, Jiayi
c173a9d9b3 add Half support for layer_norm on CPU (#99590)
### Testing
Single socket (icx, 32 cores):
| shape | fp32 forward (ms) | fp16 forward (ms) | mixed fp32 fp16 forward (ms) | fp32 backward (ms) | fp16 backward (ms) | mixed fp32 fp16 backward (ms) |
| -- | -- | -- | -- | -- | -- | -- |
| (1, 8, 16) | 0.012 | 0.011 | 0.011 | 0.051 | 0.051 | 0.050 |
| (8, 8, 16) | 0.013 | 0.013 | 0.013 | 0.054 | 0.053 | 0.051 |
| (32, 8, 16) | 0.015 | 0.014 | 0.014 | 0.059 | 0.054 | 0.052 |
| (64, 128, 56, 56) | 1.875 | 0.790 | 1.016 | 12.845 | 7.151 | 6.985 |
| (64, 128, 256, 256) | 50.226 | 25.462 | 35.736 | 328.957 | 179.615 | 175.618 |

Single core (icx):

| shape | fp32 forward (ms) | fp16 forward (ms) | mixed fp32 fp16 forward (ms) | fp32 backward (ms) | fp16 backward (ms) | mixed fp32 fp16 backward (ms) |
| -- | -- | -- | -- | -- | -- | -- |
| (1, 8, 16) | 0.012 | 0.011 | 0.011 | 0.040 | 0.041 | 0.041 |
| (8, 8, 16) | 0.012 | 0.012 | 0.012 | 0.042 | 0.042 | 0.042 |
| (32, 8, 16) | 0.027 | 0.014 | 0.014 | 0.048 | 0.048 | 0.046 |
| (64, 128, 56, 56) | 58.054 | 11.034 | 17.928 | 108.603 | 48.816 | 50.244 |
| (64, 128, 256, 256) | 1327.758 | 352.394 | 496.994 | 2846.182 | 1224.247 | 1218.422 |
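
A minimal sketch of how the fp32 vs. fp16 forward timings above could be reproduced (the shape and the benchmarking harness are assumptions; the half path needs a build that includes this CPU support):

```python
import torch
import torch.nn.functional as F
import torch.utils.benchmark as benchmark

shape = (64, 128, 56, 56)
normalized_shape = shape[1:]

def time_layer_norm(dtype):
    x = torch.randn(shape, dtype=dtype)
    w = torch.randn(normalized_shape, dtype=dtype)
    b = torch.randn(normalized_shape, dtype=dtype)
    t = benchmark.Timer(
        stmt="F.layer_norm(x, ns, w, b)",
        globals={"F": F, "x": x, "ns": normalized_shape, "w": w, "b": b},
    )
    return t.blocked_autorange()

print(time_layer_norm(torch.float32))
print(time_layer_norm(torch.float16))  # exercises the new Half path on CPU
```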

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99590
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/cpuhrsch
2023-12-20 01:11:15 +00:00
eqy
d55365dc05 [CUDA] Workaround shmem limit for certain input sizes in AdaptiveAvgPool1D (#115231)
Reference issue #68248

CC @ptrblck @malfet @xwang233

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115231
Approved by: https://github.com/mikaylagawarecki
2023-12-19 22:40:10 +00:00
PyTorch MergeBot
c006c8b50e Revert "markDynamoStrictTest some more (#115885)"
This reverts commit 55ce4693ff.

Reverted https://github.com/pytorch/pytorch/pull/115885 on behalf of https://github.com/atalman due to OSSCI oncall, broke inductor ([comment](https://github.com/pytorch/pytorch/pull/115885#issuecomment-1858409669))
2023-12-15 19:51:24 +00:00
rzou
55ce4693ff markDynamoStrictTest some more (#115885)
Featuring
test_native_mha.py
test_nn.py
test_prims.py
test_schema_check.py
test_serialization.py
test_show_pickle.py
test_sort_and_select.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115885
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845, #115855, #115856, #115857, #115858, #115870, #115871, #115879
2023-12-15 13:19:52 +00:00
eqy
9056903b09 [CUDA] 64-bit indexing for avg_pool_backward (#114193)
Fixes #113833

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114193
Approved by: https://github.com/malfet
2023-12-15 03:58:46 +00:00
Mikayla Gawarecki
f5919335db Fix _load_from_state_dict for num_batches_tracked in batchnorm (#115285)
I approved https://github.com/pytorch/pytorch/pull/110850 which did the following

Previously:
`num_batches_tracked` not in state_dict when doing `m.load_state_dict(state_dict)` --> always overwrite module's `num_batches_tracked` in `load_from_state_dict` with a 0 cpu tensor

Now:
`num_batches_tracked` not in state_dict loaded when doing `m.load_state_dict(state_dict)` --> only overwrite module's `num_batches_tracked`  in `load_from_state_dict` with a 0 cpu tensor if module does not have `num_batches_tracked`

This causes the following issue:

```
with torch.device('meta'):
     m = BatchNorm(...)
m.load_state_dict(state_dict, assign=True)
```

If `num_batches_tracked` is not in `state_dict`, since the module's `num_batches_tracked` is present on the meta device, it is not overwritten with a 0 cpu tensor. When compiling, this error is raised

```
AssertionError: Does not support mixing cuda+meta
```

I am not sure whether the explicit check for the meta device makes sense as a fix; I will add testing if this fix is OK.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115285
Approved by: https://github.com/albanD
2023-12-07 22:48:26 +00:00
Jeff Daily
4c04ae2451 [ROCm] fix test_softmax_forward_64bit_indexing_cuda OOM (#113093)
TestNNDeviceTypeCUDA.test_softmax_forward_64bit_indexing_cuda started failing for ROCm after #112096 with the message

torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 13.35 GiB. GPU 0 has a total capacity of 31.98 GiB of which 3.89 GiB is free. Of the allocated memory 26.69 GiB is allocated by PyTorch, and 18.91 MiB is reserved by PyTorch but unallocated.

This amounts to approximately 41GB. The test is currently decorated with `largeTensorTest("30GB", "cuda")` but this is not sufficient for ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113093
Approved by: https://github.com/malfet
2023-11-07 03:00:37 +00:00
Eddie Yan
e39668770a [CUDA] 64-bit indexing fixes for cross-entropy kernels (#112096)
For #108345, #111484

Addresses the forward kernels implicated in the issues, but will take another look at the backward kernels (in follow-up PRs if necessary).

The spatial softmax kernel is changed to use signed integer indexing rather than unsigned as `ScalarType` only has signed integer types declared for now, but this should be a minor change.

CC @ptrblck @crcrpar (who landed a few related PRs recently).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112096
Approved by: https://github.com/mikaylagawarecki
2023-11-06 17:37:08 +00:00
Tobias Ringwald
29716e865c Enforce both input tensor shapes of CosineEmbeddingLoss to be equal. (#112782)
Added a test to prevent regressions.

Fixes #112732.
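
A minimal illustration of the enforced check (shapes are arbitrary and the exact error type is assumed): previously, broadcastable-but-unequal input shapes were silently broadcast; now they are rejected.

```python
import torch
import torch.nn as nn

loss = nn.CosineEmbeddingLoss()
x1 = torch.randn(8, 128)
x2 = torch.randn(8, 1)       # broadcastable to x1's shape, but not equal
target = torch.ones(8)

try:
    loss(x1, x2, target)     # with this change, unequal input shapes are rejected
except (RuntimeError, ValueError) as err:  # exact error type assumed
    print("rejected:", err)
```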

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112782
Approved by: https://github.com/lezcano
2023-11-03 15:15:06 +00:00
Tristan Rice
013f622dd2 grid_sample: support bfloat16 (#112331)
This adds bfloat16 support to `torch.nn.functional.grid_sample`. This is particularly important when doing feature sampling, such as for rendering techniques used in PyTorch3d or for camera projections to voxel grids as in SimpleBEV.

Related to #57707

Test plan:

```
pytest test/test_nn.py -k grid_sample
pytest test/test_ops.py -k grid_sample
```
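
A minimal usage sketch of bfloat16 feature sampling (the device choice and bfloat16 grid_sample support on that device are assumptions of this sketch):

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"

# Image features and a normalized sampling grid in [-1, 1], both in bfloat16.
feats = torch.randn(2, 64, 32, 32, dtype=torch.bfloat16, device=device)
grid = torch.rand(2, 16, 16, 2, dtype=torch.bfloat16, device=device) * 2 - 1

out = F.grid_sample(feats, grid, mode="bilinear", align_corners=False)
print(out.shape, out.dtype)  # torch.Size([2, 64, 16, 16]) torch.bfloat16
```
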
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112331
Approved by: https://github.com/zou3519
2023-10-30 19:31:41 +00:00
Cao E
1c89ea7f72 Add Half support for softmax and log_softmax on CPU (#103315)
Add Half support for softmax and log_softmax on CPU.
Note: This introduces a correctness issue with MPS https://github.com/pytorch/pytorch/issues/111416 and https://github.com/pytorch/pytorch/issues/111479.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103315
Approved by: https://github.com/jgong5, https://github.com/mikaylagawarecki, https://github.com/malfet
2023-10-26 08:38:54 +00:00
pbialecki
17b732eb04 increase CPU memory requirement for test_nll_loss_large (#110963)
Running `python test_nn.py -v -k test_nll_loss_large_tensor` on a machine with limited host RAM (e.g. ~50GB) fails with a `SIGKILL` even though the currently specified memory requirements for CPU (and GPU) are set to 48GB and are thus met.

Profiling the peak memory usage via:
```
\time -v python test_nn.py -v -k test_nll_loss_large_tensor
```
and adding `print(torch.cuda.memory_summary())` at the end of the test shows a higher host RAM usage of >100GB and a device memory usage of ~32GB.
```
	Command being timed: "python test_nn.py -v -k test_nll_loss_large_tensor"
	User time (seconds): 81.66
	System time (seconds): 229.02
	Percent of CPU this job got: 671%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:46.30
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 118150096
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 90280839
	Voluntary context switches: 1669
	Involuntary context switches: 1214548
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
```
```
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |  32769 MiB |  32769 MiB |  81923 MiB |  49154 MiB |
|       from large pool |  32768 MiB |  32768 MiB |  81921 MiB |  49152 MiB |
|       from small pool |      0 MiB |      0 MiB |      1 MiB |      1 MiB |
|---------------------------------------------------------------------------|
| Active memory         |  32769 MiB |  32769 MiB |  81923 MiB |  49154 MiB |
|       from large pool |  32768 MiB |  32768 MiB |  81921 MiB |  49152 MiB |
|       from small pool |      0 MiB |      0 MiB |      1 MiB |      1 MiB |
|---------------------------------------------------------------------------|
| Requested memory      |  32769 MiB |  32769 MiB |  81923 MiB |  49154 MiB |
|       from large pool |  32768 MiB |  32768 MiB |  81921 MiB |  49152 MiB |
|       from small pool |      0 MiB |      0 MiB |      1 MiB |      1 MiB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |  32774 MiB |  32774 MiB |  81938 MiB |  49164 MiB |
|       from large pool |  32772 MiB |  32772 MiB |  81930 MiB |  49158 MiB |
|       from small pool |      2 MiB |      2 MiB |      8 MiB |      6 MiB |
|---------------------------------------------------------------------------|
...
```

We haven't seen this issue before as the majority of our runners have sufficient host RAM and I just ran into it by chance.

CC @atalman @malfet @crcrpar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110963
Approved by: https://github.com/mikaylagawarecki, https://github.com/eqy, https://github.com/malfet
2023-10-25 23:45:47 +00:00
PyTorch MergeBot
5ce8002d24 Revert "Remove deprecated fbgemm operators (#104535)"
This reverts commit 57c7aa12db.

Reverted https://github.com/pytorch/pytorch/pull/104535 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/104535#issuecomment-1779650412))
2023-10-25 16:34:16 +00:00
Oleg Bulatov
192477b5ba Enable flake8-bugbear B020 lint (#110823)
Fixes part of https://github.com/pytorch/pytorch/issues/106571

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110823
Approved by: https://github.com/Skylion007
2023-10-24 22:43:47 +00:00
FFFrog
0e0f6a248d Fix num_batches_tracked of BatchNorm when load_state_dict (#110850)
Fixes #110361

as the title shows

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110850
Approved by: https://github.com/mikaylagawarecki
2023-10-24 04:20:38 +00:00
Peter Bell
57c7aa12db Remove deprecated fbgemm operators (#104535)
These operators are not used and have been deprecated since #72690 (Feb 2022). Additionally, the `torch.jit.quantized` interface has been deprecated since #40102 (June 2020).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104535
Approved by: https://github.com/ezyang
2023-10-22 06:10:09 +00:00
CaoE
54c28c564f add Half support for BatchNorm on CPU (#102070)
Fixes #106543

### Testing

Single core:

shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
(1, 4, 256, 256) | 0.7116 | 0.1427 | 0.1744 | 0.2638 | 0.2002 | 0.2556
(1, 32, 100, 100) | 0.8579 | 0.1725 | 0.2077 | 0.3023 | 0.2399 | 0.2995
(32, 16, 200, 200) | 57.3466 | 12.2179 | 13.1320 | 45.9524 | 24.1526 | 24.9882

28 cores:

shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
(1, 4, 256, 256) | 0.2571 | 0.0713 | 0.0846 | 0.1140 | 0.0883 |  0.1043
(1, 32, 100, 100) | 0.1077 | 0.0510 | 0.0548 | 0.0700 | 0.0645 | 0.0713
(32, 16, 200, 200) | 5.5060 | 1.4195 | 1.4663 | 6.773 | 3.0886 | 3.1343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102070
Approved by: https://github.com/jgong5, https://github.com/mikaylagawarecki, https://github.com/mingfeima
2023-09-19 10:43:33 +00:00
lezcano
653c1564bf Fix broadcasting cosine_similarity (#109363)
Fixes https://github.com/pytorch/pytorch/issues/109333
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109363
Approved by: https://github.com/peterbell10
2023-09-15 17:12:35 +00:00
PyTorch MergeBot
b226373d16 Revert "add Half support for BatchNorm on CPU (#102070)"
This reverts commit b6a1d3fb97.

Reverted https://github.com/pytorch/pytorch/pull/102070 on behalf of https://github.com/clee2000 due to I'm very sorry but it looks like #106543 was not fixed, I still see it failing on main b6a1d3fb97 https://github.com/pytorch/pytorch/actions/runs/6185704949/job/16793975677 ([comment](https://github.com/pytorch/pytorch/pull/102070#issuecomment-1719747065))
2023-09-14 16:13:34 +00:00
CaoE
b6a1d3fb97 add Half support for BatchNorm on CPU (#102070)
Fixes #106543

### Testing

Single core:

shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
(1, 4, 256, 256) | 0.7116 | 0.1427 | 0.1744 | 0.2638 | 0.2002 | 0.2556
(1, 32, 100, 100) | 0.8579 | 0.1725 | 0.2077 | 0.3023 | 0.2399 | 0.2995
(32, 16, 200, 200) | 57.3466 | 12.2179 | 13.1320 | 45.9524 | 24.1526 | 24.9882

28 cores:

shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
(1, 4, 256, 256) | 0.2571 | 0.0713 | 0.0846 | 0.1140 | 0.0883 |  0.1043
(1, 32, 100, 100) | 0.1077 | 0.0510 | 0.0548 | 0.0700 | 0.0645 | 0.0713
(32, 16, 200, 200) | 5.5060 | 1.4195 | 1.4663 | 6.773 | 3.0886 | 3.1343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102070
Approved by: https://github.com/jgong5, https://github.com/mikaylagawarecki
2023-09-14 12:23:59 +00:00
PyTorch MergeBot
04a765f95d Revert "add Half support for BatchNorm on CPU (#102070)"
This reverts commit 6065e7a97c.

Reverted https://github.com/pytorch/pytorch/pull/102070 on behalf of https://github.com/clee2000 due to sorry it looks like this is causing an unexpected success for `test_jit_fuser_te.py::TestNNCOpInfoCPU::test_nnc_correctness_nn_functional_batch_norm_cpu_float16` 6065e7a97c https://github.com/pytorch/pytorch/actions/runs/6178069462/job/16770849782 ([comment](https://github.com/pytorch/pytorch/pull/102070#issuecomment-1718402208))
2023-09-13 22:38:42 +00:00
CaoE
6065e7a97c add Half support for BatchNorm on CPU (#102070)
Fixes #106543

### Testing

Single core:

shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
(1, 4, 256, 256) | 0.7116 | 0.1427 | 0.1744 | 0.2638 | 0.2002 | 0.2556
(1, 32, 100, 100) | 0.8579 | 0.1725 | 0.2077 | 0.3023 | 0.2399 | 0.2995
(32, 16, 200, 200) | 57.3466 | 12.2179 | 13.1320 | 45.9524 | 24.1526 | 24.9882

28 cores:

shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
(1, 4, 256, 256) | 0.2571 | 0.0713 | 0.0846 | 0.1140 | 0.0883 |  0.1043
(1, 32, 100, 100) | 0.1077 | 0.0510 | 0.0548 | 0.0700 | 0.0645 | 0.0713
(32, 16, 200, 200) | 5.5060 | 1.4195 | 1.4663 | 6.773 | 3.0886 | 3.1343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102070
Approved by: https://github.com/jgong5, https://github.com/mikaylagawarecki
2023-09-13 17:30:16 +00:00
Kurt Mohler
3f88e3105f Reland: Remove remaining global set_default_dtype calls from tests (#108088)
Fixes #68972

Relands #107246

To avoid causing Meta-internal CI failures, this PR avoids always asserting that the default dtype is float in the `TestCase.setUp/tearDown` methods. Instead, the assert is only done if `TestCase._default_dtype_check_enabled == True`. `_default_dtype_check_enabled` is set to True in the `if __name__ == "__main__":` blocks of all the relevant test files that have required changes for this issue

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108088
Approved by: https://github.com/ezyang
2023-09-07 03:04:34 +00:00
CaoE
8f02884569 add Half support for GroupNorm on CPU (#100234)
### Testing
Single socket (28 cores):

* Contiguous:

shape | fp32 forward / s | mixed fp32 fp16 forward / s | fp32 backward / s | mixed fp32 fp16 backward / s
-- | -- | -- | -- | --
[10, 128, 10, 10] | 2.45E-05 | 3.26E-05 | 6.87E-05 | 7.40E-05
[10, 128, 80, 80] | 0.000726 | 0.000606 | 0.002183 | 0.001112

* Channels Last:

shape | fp32 forward / s | mixed fp32 fp16 forward / s | fp32 backward / s | mixed fp32 fp16 backward / s
-- | -- | -- | -- | --
[10, 128, 10, 10] | 2.88E-05 | 2.72E-05 | 6.56E-05 | 6.63E-05
[10, 128, 80, 80] | 0.00076 | 0.000256 | 0.002385 | 0.000735

Single core:

* Contiguous:

shape | fp32 forward / s | mixed fp32 fp16 forward / s | fp32 backward / s | mixed fp32 fp16 backward / s
-- | -- | -- | -- | --
[10, 128, 10, 10] | 9.47E-05 | 1.90E-04 | 2.03E-04 | 3.10E-04
[10, 128, 80, 80] | 6.25E-03 | 8.98E-03 | 0.016485 | 0.01369

* Channels Last:

shape | fp32 forward / s | mixed fp32 fp16 forward / s | fp32 backward / s | mixed fp32 fp16 backward / s
-- | -- | -- | -- | --
[10, 128, 10, 10] | 8.66E-05 | 7.89E-05 | 1.95E-04 | 1.43E-04
[10, 128, 80, 80] | 5.97E-03 | 3.13E-03 | 0.01626 | 8.70E-03

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100234
Approved by: https://github.com/jgong5, https://github.com/mikaylagawarecki
2023-09-01 21:25:24 +00:00
Mikayla Gawarecki
3817de5d84 Fix layernorm cpu precision issues (#108089)
#108072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108089
Approved by: https://github.com/mingfeima, https://github.com/albanD
2023-08-30 23:55:10 +00:00
Xia, Weiwen
97a291f6bd [ONEDNN][BC-breaking] update onednn from v2.7.3 to v3.1.1 (#97957)
**Summary**
Update onednn from v2.7.3 to v3.1.1.
It is bc-breaking as some APIs are changed on oneDNN side. Changes include:
- PyTorch code where oneDNN is directly called
- Submodule `third_party/ideep` to adapt to oneDNN's new API.
- CMAKE files to fix build issues.

**Test plan**
Building issues and correctness are covered by CI checks.
For performance, we have run TorchBench models to ensure there is no regression. Below is the comparison before and after oneDNN update.
![image](https://github.com/pytorch/pytorch/assets/12522207/415a4ff0-7566-40c6-aed0-24997a475b0e)

Note:
- Base commit of PyTorch: da322ea
- CPU: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Ice Lake)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97957
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2023-08-25 12:13:18 +00:00
Aaron Gokaslan
660e8060ad [BE]: Update ruff to 0.285 (#107519)
This updates ruff to 0.285, which is faster, better, and fixes a bunch of false negatives with regard to f-strings.

I also enabled RUF017, which looks for accidental quadratic list summation. Luckily, there seem to be no instances of it in our codebase, so enabling it ensures it stays that way. :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang
2023-08-22 23:16:38 +00:00
PyTorch MergeBot
d59a6864fb Revert "[BE]: Update ruff to 0.285 (#107519)"
This reverts commit 88ab3e4322.

Reverted https://github.com/pytorch/pytorch/pull/107519 on behalf of https://github.com/ZainRizvi due to Sorry, but this PR breaks internal tests. @ezyang, can you please help them get unblocked? It seems like one of the strings was prob accidentally modified ([comment](https://github.com/pytorch/pytorch/pull/107519#issuecomment-1688833480))
2023-08-22 19:53:32 +00:00
Aaron Gokaslan
88ab3e4322 [BE]: Update ruff to 0.285 (#107519)
This updates ruff to 0.285, which is faster, better, and fixes a bunch of false negatives with regard to f-strings.

I also enabled RUF017, which looks for accidental quadratic list summation. Luckily, there seem to be no instances of it in our codebase, so enabling it ensures it stays that way. :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang
2023-08-20 01:36:18 +00:00
lcskrishna
bc662ffff9 [ROCm] Update ROCm skip decorators (#106138)
This PR adds a msg argument for skipIfRocm and skipCUDAIfRocm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106138
Approved by: https://github.com/jataylo, https://github.com/jeffdaily, https://github.com/pruthvistony, https://github.com/albanD
2023-08-18 22:02:06 +00:00
Kurt Mohler
6af6b8f728 Reland: Remove set_default_dtype from nn tests (#107069)
Part of #68972
Relands #105775

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107069
Approved by: https://github.com/ezyang
2023-08-14 17:01:57 +00:00
PyTorch MergeBot
ec0f3fda7d Revert "Remove set_default_dtype from nn tests (#105775)"
This reverts commit 4d6a891baf.

Reverted https://github.com/pytorch/pytorch/pull/105775 on behalf of https://github.com/huydhn due to Sorry for reverting you change, it is failing one of the slow test in trunk ([comment](https://github.com/pytorch/pytorch/pull/105775#issuecomment-1675460195))
2023-08-11 22:14:17 +00:00
Kurt Mohler
4d6a891baf Remove set_default_dtype from nn tests (#105775)
Part of #68972

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105775
Approved by: https://github.com/ezyang
2023-08-10 14:56:13 +00:00
Jason Lu
bc88028e8e Back out "Reland "Make adding buffers more like adding parameters (#104069)" (#106224)" (#106743)
Summary:
Original commit changeset: 81319beb97f3

Original Phabricator Diff: D47961182

Test Plan: revert to maintain backward compat with legacy ads_dper3 production package. Read details in: S357822

Reviewed By: atuljangra

Differential Revision: D48131623

@diff-train-skip-merge
(D48131623 landed internally)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106743
Approved by: https://github.com/malfet
2023-08-08 15:27:34 +00:00
Michael Gschwind
63d45275f4 is causal hints for transformer (#106143)
Summary:
Make is_causal hint flags available for the top-level transformer module.

It's debatable whether this is useful -- at present we autodetect causal masks for src and tgt masks in the transformer encoder and decoder, respectively. Making is_causal flags available would enable users to short-cut this check by asserting whether their mask is causal or not.

I am putting this diff up for discussion, not as a solution. Not doing anything may be the right solution, unless there is strong (data-driven) user demand. -- it appears the consensus is to move ahead with this, as per the discussions below.

@cpuhrsch @mikaylagawarecki @jbschlosser @janEbert

Test Plan: sandcastle

Differential Revision: D47373260

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106143
Approved by: https://github.com/mikaylagawarecki
2023-08-04 14:16:48 +00:00
CaoE
f82e6ff29e add channel last 3d support for batch_norm on CPU (#97774)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97774
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/mikaylagawarecki
2023-08-03 01:16:05 +00:00
Mikayla Gawarecki
c9be60cd0e Add error inputs to ModuleInfo (mirroring OpInfo) (#106325)
Add infra for error inputs to ModuleInfos, migrate first few error inputs tests from test_nn.py (more to come!)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106325
Approved by: https://github.com/albanD
2023-08-01 12:49:56 +00:00
Mikayla Gawarecki
d8e5f2aa6d Reland "Make adding buffers more like adding parameters (#104069)" (#106224)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106224
Approved by: https://github.com/atalman, https://github.com/albanD
2023-07-31 17:18:56 +00:00
Mikayla Gawarecki
ca7ece9b50 [easy] improve hint on error message in nn.Module.load_state_dict (#106042)
Fix #105963

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106042
Approved by: https://github.com/albanD
2023-07-27 19:56:02 +00:00
Nikita Karetnikov
eac9e1b35f [OpInfo] add reference and error inputs for multilabel_margin_loss (#105523)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105523
Approved by: https://github.com/ezyang
2023-07-23 02:16:29 +00:00
Aaron Gokaslan
6d43c89f37 [BE]: Update Ruff to 0.0.280 (#105724)
Removes unused loop values in python dictionary iteration. Automated fix from Ruff master.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105724
Approved by: https://github.com/ezyang, https://github.com/janeyx99
2023-07-22 23:03:34 +00:00
Andrey Talman
c6653b65d8 Back out "Make adding buffers more like adding parameters (#104069)" (#105581)
Summary:
D47537831 is breaking pyper tests: https://fb.workplace.com/groups/802176577445480/posts/1018902842439518/

with `TypeError: register_buffer() takes 3 positional arguments but 4 were given`

Original commit changeset: d4b4069fbd38

Original Phabricator Diff: D47537831

Test Plan:
```
buck2 run //caffe2/torch/fb/training_toolkit/integration_tests/training_lifecycle/cogwheel_tests/pyper_release_v2:cogwheel_smallworld_inline_cvr_infer_pyper_pyper__canary_offline_training-launcher -- --run-harness-in-tupperware --build-fbpkg ads_dper3 --build-fbpkg training_platform
```

Reviewed By: atalman

Differential Revision: D47600140

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105581
Approved by: https://github.com/mikaylagawarecki
2023-07-20 03:39:53 +00:00
Justin Chu
73e1455327 [BE] Enable ruff's UP rules and autoformat test/ (#105434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105434
Approved by: https://github.com/albanD
2023-07-19 20:36:06 +00:00
Michael Gschwind
11b753af01 Refactor causal mask generation and detection for nn.transformer (#105265)
Summary:
* Create a private global-scope function _generate_subsequent because static class-attribute member functions are not supported by TorchScript, resulting in torchscripting errors.
* Make TransformerEncoder and TransformerDecoder consistent w.r.t. is_causal handling by calling _detect_causal_mask
* Clarify documentation that is_causal is a hint
* Move causal mask detection into a method _detect_causal_mask
* Only accept an input-size-compatible causal mask as a causal mask
* Update _generate_subsequent_causal_mask to include factory kwargs for dtype and device:
   avoid extra copies & conversions by passing them directly to torch.full.

Test Plan: sandcastle & github CICD
Continuation of #101487 (due to a tooling issue) which is a continuation-in-part of https://github.com/pytorch/pytorch/pull/98327 by @janEbert

Differential Revision: D47427117

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105265
Approved by: https://github.com/mikaylagawarecki
2023-07-19 01:26:50 +00:00
Danni Li
1b78f23a1a Allow nn.ChannelShuffle to run without erroring on CUDA tensors (#105351)
Summary: Include GPU support for `nn.ChannelShuffle` & update test.

Fix: #104603

Test Plan: Please see GitHub Actions.

Differential Revision: D47523764

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105351
Approved by: https://github.com/mikaylagawarecki
2023-07-18 16:24:30 +00:00
ekamiti
32d422f335 Make adding buffers more like adding parameters (#104069)
Add semantics for creating a buffer object similar to those for creating a parameter. This is done by introducing a new `Buffer` class that can be used for type disambiguation. The underlying functionality of registering a buffer remains the same, as the `register_buffer` method has not been changed. The `persistent` parameter in the `Buffer` type indicates whether a buffer object should be persistent or not. Other non-test changes have to do with getting the new `Buffer` type recognized by inductor and dynamo. The remaining changes are test changes to make sure that the `Buffer` type can be used as a drop-in replacement for `register_buffer`, as it just leads to `register_buffer` being called. The addition of this new functionality still allows normal tensors to be used as buffers, so these changes are intended to be backwards compatible.

Fixes #35735

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104069
Approved by: https://github.com/mikaylagawarecki
2023-07-17 17:59:05 +00:00
Nikita Karetnikov
0c89596e4f [OpInfo] add reference and error inputs for multi_margin_loss (#104850)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104850
Approved by: https://github.com/ezyang
2023-07-14 21:16:09 +00:00
yanbing-j
3fe2b73416 Update use_mkldnn in LSTM op to avoid input and parameter not in the same device (#102050)
This PR is to fix https://github.com/pytorch/pytorch/issues/101935.

Only when the input, parameters and hidden states are all on the CPU device will LSTM go into the oneDNN fast-path implementation. Otherwise, it will fall back to the original implementation.

Note that if the input and parameters are indeed not on the same device, it will hit the error `Input and parameter tensors are not at the same device, found input tensor......` in `check_attributes`. Therefore, the proper usage of LSTM is `input.to(device)` and `model.to(device)` together, as sketched below.
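
A minimal sketch of the intended usage:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Move the module and the input (and any provided hidden states) to the same device.
lstm = nn.LSTM(input_size=16, hidden_size=32, num_layers=2).to(device)
x = torch.randn(5, 3, 16, device=device)   # (seq_len, batch, input_size), same device as the model

out, (h, c) = lstm(x)
print(out.shape)  # torch.Size([5, 3, 32])
```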

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102050
Approved by: https://github.com/XiaobingSuper, https://github.com/albanD
2023-07-13 01:13:59 +00:00
Masaki Kozuki
6929e9e947 Use int64_t accordingly in cunn_SoftMaxBackward to avoid int overflow (#104270)
Fixes #103501

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104270
Approved by: https://github.com/malfet, https://github.com/mikaylagawarecki
2023-06-30 21:39:46 +00:00
cyy
54cb61f7d9 enable ASAN on some tests (#103647)
Enable more tests under ASAN; meanwhile we disable float-divide-by-zero and float-cast-overflow, both of which are also disabled by default in the latest clang.
The following cited doc explains the reasons.
```
-fsanitize=float-cast-overflow: Conversion to, from, or between floating-point types
which would overflow the destination. Because the range of representable values
for all floating-point types supported by Clang is [-inf, +inf], the only cases detected are
conversions from floating point to integer types.
-fsanitize=float-divide-by-zero: Floating point division by zero.
This is undefined per the C and C++ standards,
 but is defined by Clang (and by ISO/IEC/IEEE 60559 / IEEE 754) as producing
either an infinity or NaN value,
so is not included in -fsanitize=undefined.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103647
Approved by: https://github.com/kit1980
2023-06-28 02:17:14 +00:00
Mikayla Gawarecki
b93ed8164e Add non-recursive module.to_empty option (#104197)
Fixes https://github.com/pytorch/pytorch/issues/97049, related to https://github.com/pytorch/pytorch/issues/104187

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104197
Approved by: https://github.com/albanD
2023-06-26 21:47:22 +00:00
Ryan Smith
6bda97e2c1 Raise type error message for interpolate if size contains non-integer elements (#99243)
Raise a TypeError for interpolate when the output size is a tuple containing elements that are not `int`.

Fixes #98287

The check is only performed if `size` is an instance of `list` or `tuple`.
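
A minimal illustration of the new check (error message elided):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)

F.interpolate(x, size=(16, 16), mode="nearest")      # fine: all elements are ints

try:
    # A non-int element in a tuple `size` now raises a TypeError up front.
    F.interpolate(x, size=(16.0, 16), mode="nearest")
except TypeError as err:
    print(err)
```
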
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99243
Approved by: https://github.com/Skylion007, https://github.com/Neilblaze, https://github.com/MovsisyanM, https://github.com/albanD
2023-06-23 00:48:45 +00:00
Mikayla Gawarecki
d1cecd9c32 Add assign kwarg to module.load_state_dict (#102212)
Fixes #64601 and #98906

Adds an `assign` argument to `load_state_dict` that loads params/buffers by assignment instead of doing `param.copy_(param_from_state_dict)`.

Primarily intended to remove the need for the `.to_empty()` in

```
with torch.device('meta'):
    m = SomeModule()
m.to_empty()
state_dict = torch.load('...pth')
m.load_state_dict(state_dict)
```

so we can instead do

```
with torch.device('meta'):
    m = SomeModule()
state_dict = torch.load('...pth')
m.load_state_dict(state_dict, assign=True)
```

**A problem with this PR, for the case where the model is initialized on meta, is what happens to non-persistent buffers/params corresponding to keys missing from the state dict.**
What happens in the case of `load_state_dict(state_dict, strict=False, assign=True)` when the state_dict is missing some keys? The params missing from the `state_dict` and the non-persistent buffers would still be on `meta` and need to be manually initialized. However, I don't think we offer an API that would initialize these.

One solution would be to make these empty tensors but it might not be semantically correct...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102212
Approved by: https://github.com/albanD
2023-06-15 18:41:00 +00:00
Nicolas Hug
3766c04736 Add uint8 support for CPU images in interpolate(mode='bicubic') (#103252)
CC @vfdev-5

Proposed strategy: Be as close as possible to PIL when `antialias=True`. Be as close as possible to float path when `antialias=False`.

Ad-hoc tests:

<details>

```py
import random

import torch
import pytest
import numpy as np
from PIL import Image
from torch.nn.functional import interpolate

@pytest.mark.parametrize("C", (1, 3, 6))
@pytest.mark.parametrize("batch_size", (1, 4))
@pytest.mark.parametrize("memory_format", (torch.contiguous_format, torch.channels_last, "strided", "cropped"))
@pytest.mark.parametrize("antialias", (True, False))
# @pytest.mark.parametrize("mode", ("bilinear", "bicubic",))
@pytest.mark.parametrize("mode", ("bicubic",))
@pytest.mark.parametrize("seed", range(100))
def test_resize(C, batch_size, memory_format, antialias, mode, seed):
    torch.manual_seed(seed)
    random.seed(seed)

    Hi = 2**random.randint(3, 10) + random.randint(0, 30)
    Wi = 2**random.randint(3, 10) + random.randint(0, 30)
    Ho = 2**random.randint(3, 10) + random.randint(0, 30)
    Wo = 2**random.randint(3, 10) + random.randint(0, 30)
    # print(Hi, Wi, Ho, Wo)

    img = torch.randint(0, 256, size=(batch_size, C, Hi, Wi), dtype=torch.uint8)

    if memory_format in (torch.contiguous_format, torch.channels_last):
        img = img.to(memory_format=memory_format, copy=True)
    elif memory_format == "strided":
        img = img[:, :, ::2, ::2]
    elif memory_format == "cropped":
        a = random.randint(1, Hi // 2)
        b = random.randint(Hi // 2 + 1, Hi)
        c = random.randint(1, Wi // 2)
        d = random.randint(Wi // 2 + 1, Wi)
        img = img[:, :, a:b, c:d]
    else:
        raise ValueError("Uh?")

    margin = 0
    img = img.clip(margin, 255 - margin)
    out_uint8 = interpolate(img, size=[Ho, Wo], mode=mode, antialias=antialias)

    if antialias and C == 3:
        out_pil_tensor = resize_with_pil(img, Wo, Ho, mode=mode, antialias=antialias)
        atol = {"bicubic": 2, "bilinear": 1}[mode]  # TODO: is 2 expected when comparing with PIL bicubic? Why not 1 as for bilinear?
        torch.testing.assert_close(out_uint8, out_pil_tensor, rtol=0, atol=atol)

    out_float = interpolate(img.to(torch.float), size=[Ho, Wo], mode=mode, antialias=antialias).round().clip(0, 255).to(torch.uint8)
    if mode == "bicubic":
        diff = (out_float.float() - out_uint8.float()).abs()
        assert diff.max() < 30

        percent = .03 if antialias else .1
        assert (diff > 2).float().mean() < percent

        mae = .4 if antialias else .8
        assert diff.mean() < mae
    else:
        torch.testing.assert_close(out_uint8, out_float, rtol=0, atol=1)

def resize_with_pil(batch, Wo, Ho, mode, antialias):
    resample = {"bicubic": Image.BICUBIC, "bilinear": Image.BILINEAR}[mode]
    out_pil = [
        Image.fromarray(img.permute((1, 2, 0)).numpy()).resize((Wo, Ho), resample=resample)
        for img in batch
    ]
    out_pil_tensor = torch.cat(
        [
            torch.as_tensor(np.array(img, copy=True)).permute((2, 0, 1))[None]
            for img in out_pil
        ]
    )
    return out_pil_tensor
```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103252
Approved by: https://github.com/vfdev-5, https://github.com/H-Huang, https://github.com/malfet, https://github.com/atalman
2023-06-12 18:25:33 +00:00
ecao
73fd7235ad add function specializations for the case of parameters in BFloat16 data type (#100233)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100233
Approved by: https://github.com/jgong5, https://github.com/ngimel
2023-05-31 02:01:07 +00:00
vfdev-5
7042e10215 Fixed issue with bicubic interpolation on uint8 input and antialising (#102296)
Description:

- Fixed issue with bicubic interpolation on uint8 input and antialiasing, discovered by @NicolasHug
- Unified `_separable_upsample_generic_Nd_kernel_impl_single_dim` on the `antialias` arg.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102296
Approved by: https://github.com/NicolasHug
2023-05-30 14:57:19 +00:00
ecao
af1d437654 Improve precision and performance for BFloat16 upsampling (#91169)
### Description
- Fix precision issue for BFloat16 upsampling: https://github.com/pytorch/pytorch/issues/89212
- Improve performance for BFloat16 upsampling.
### Testing
data type: BFloat16

- Single core

contiguous:
mode | scale_factor | shape  | before backward / ms |  after backward / ms
-- | -- | -- | -- | --
nearest | 2 | [10, 3, 200, 200] | 14.47 | 8.34
linear | 2 | [3, 200, 200] | 3.69 | 2.74
bilinear | 2 | [3, 5, 200, 200] | 87.99 | 49.05
trilinear | 2 | [3, 3, 3, 100, 100]  | 171.02 | 72.53
bicubic | 2 | [3, 3, 200, 200 ] | 176.29 | 78

channels last:
mode | scale_factor | shape | before backward / ms |  after backward / ms
-- | -- | -- | -- | --
nearest | 2 | [10, 3, 200, 200] | 17.70 | 10.30
linear | 2 | [3, 200, 200] | \ | \
bilinear | 2 | [3, 5, 200, 200] | 50.90 | 18.83
trilinear | 2 | [3, 3, 3, 100, 100] | 121.56 | 42.60
bicubic | 2 | [3, 3, 200, 200 ] | 179.40 | 80

- 20 cores

contiguous:
mode | scale_factor | shape | before backward / ms |  after backward / ms
-- | -- | -- | -- | --
nearest | 2 | [10, 3, 200, 200] | 1.17 | 1.01
linear | 2 | [3, 200, 200] | 0.41 | 0.26
bilinear | 2 | [3, 5, 200, 200] | 7.19 | 4.07
trilinear | 2 | [3, 3, 3, 100, 100]  | 21.32 | 9.33
bicubic | 2 | [3, 3, 200, 200 ] | 178.67 | 10

channels last:
mode | scale_factor | shape | before backward / ms |  after backward / ms
-- | -- | -- | -- | --
nearest | 2 | [10, 3, 200, 200] |  2.25 | 1.55
linear | 2 | [3, 200, 200] | \ | \
bilinear | 2 | [3, 5, 200, 200] |  20.17 | 7.20
trilinear | 2 | [3, 3, 3, 100, 100] | 43.33 | 15.66
bicubic | 2 | [3, 3, 200, 200 ] | 176.76 | 10

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91169
Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/Skylion007
2023-05-29 01:35:57 +00:00
ecao
3f4fee735a add Half support for logsigmoid, threshold, elu, gelu, hardtanh, hardsigmoid, hardswish, hardshrink, softshrink, leakyrelu, softplus, glu, silu, mish, and prelu on CPU (#98745)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98745
Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/ngimel
2023-05-27 16:20:21 +00:00
ts
563d8058f4 Fix inconsistent torch.nn.MaxPool1d output on cpu and gpu (#99843)
Fixes #99412, correctly raising an error when an output of invalid size is calculated.

Would be happy to iterate on this if there are any issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99843
Approved by: https://github.com/mikaylagawarecki
2023-05-15 20:27:43 +00:00
vfdev
a8ea4178ab Fixed bug in interpolate when interpolation size is larger than max (#101403)
## Description

This is a bug fix for rare cases that can happen with a specific scale and antialias=False, where the output for a random line can be wrong. For example:
```
line 14
output uint8: [76, 78, 80, 81, 83, 85, 87, 88, 90]
expected float: [149, 152, 155, 158, 161, 164, 167, 170, 173]
diff: [-73, -74, -75, -77, -78, -79, -80, -82, -83]
opencv ref: [149 152 155 158 161 164 167 170 173]
```

It appears that for this line we have 3 weight coefficients instead of 2:
```
line 13 | 351, 2
k: 1130 15254
line 14 | 378, 3
k: 0 16384 -6780            <-------  We should have 2 weights and not 3
line 15 | 432, 2
k: 15254 1130
```
which comes from our `_compute_weights_aa` function that is specifically used for AA=False and uint8.
```
    xmin = std::max(
        static_cast<int64_t>(center - support + 0.5 + align_corners_delta), static_cast<int64_t>(0));
    xsize = std::min(
        static_cast<int64_t>(center + support + 0.5 + align_corners_delta), input_size) - xmin;
```
```
center - support + 0.5 + align_corners_delta: 14.999999999999998
static_cast<int64_t>(center - support + 0.5 + align_corners_delta): 14
xmin -> 14

center + support + 0.5 + align_corners_delta: 17.0
static_cast<int64_t>(center + support + 0.5 + align_corners_delta): 17.0
xsize -> 17 - 14 = 3  <------ 3 instead of 2
```

For the float dtype, the AA=False weights and indices are computed differently, because that code path was implemented first, historically.

In any case, `xsize` should not be larger than `max_interp_size`, so we decided to clip `xsize`.

Once fixed, the computed indices and weights are the same as in the float dtype code path:
```
# Option: xsize = min(xsize, max_interp_size)
Line Num | xmin, xsize

14 | 378, 2                 xmin=378 <---> xmin = i * stride = i * 3 * 9 => i = 14
k: 0 16384                  16384 = w * (1 << 14) => w = 1.0

=> i=14, w=0 and i=15, w=1
```
vs
```
Line Num | index0, index1
F32: 14 | 15, 16
F32: lambda0, lambda1: 0.999999, 9.53674e-07
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101403
Approved by: https://github.com/NicolasHug
2023-05-15 15:55:42 +00:00
vfdev-5
a3700571e1 Fixed a bug in interpolate uint8 AVX2 on non-contig input (#101136)
Description:
- Fixed a bug in interpolate uint8 AVX2 on non-contig input
- Added tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101136
Approved by: https://github.com/NicolasHug
2023-05-12 17:17:10 +00:00
yanbing-j
36d91b5513 Add differentiable mkldnn_rnn_layer_backward to support double backward of LSTM (#100627)
### Description

This PR is to fix #99413, which shows the limitation of double backward using oneDNN in LSTM.

This PR does not implement the double backward function itself, because that is pretty hard to spell out. Instead, it implements mkldnn_rnn_layer_backward using differentiable operations, so that double backward can be done automatically.

During the backward process, we need the gates and hidden states between cells within one layer. However, these intermediate variables are stored in the `workspace`, and it is hard to extract them. Therefore, in backward, we need to re-calculate them first.

A corresponding UT has been added based on the failing case in #99413. The UT with gradcheck and gradgradcheck added in https://github.com/pytorch/pytorch/pull/26660 cannot test LSTM using oneDNN, because that UT only supports the `double` datatype, which oneDNN does not support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100627
Approved by: https://github.com/jgong5, https://github.com/soulitzer
2023-05-09 12:58:57 +00:00
vfdev-5
ff974cd962 Fixing interpolate on uint8 unsqueezed 3D CL tensor (#100258)
Description:

- Fixed a bug with memory format issue:

When the input is a channels-last 4d tensor that was produced as follows:
```
t = torch.ones(1, 3, 32, 32).contiguous(memory_format=torch.channels_last)
t = t[0]
t = t[None, ...]
```
upsampling will produce output in channels-first memory format, but our AVX code does not take that into account.

Here is repro code showing that nightly is broken for this particular case:
```python
import torch

torch.manual_seed(0)

input = torch.randint(0, 256, size=(1, 3, 256, 256), dtype=torch.uint8).contiguous(memory_format=torch.channels_last)
input = input[0]
input = input[None, ...]

assert input.is_contiguous(memory_format=torch.channels_last)

output = torch.nn.functional.interpolate(input, (224, 224), mode="bilinear", antialias=True)
expected = torch.nn.functional.interpolate(input.float(), (224, 224), mode="bilinear", antialias=True)

assert output.is_contiguous()
assert expected.is_contiguous()

torch.testing.assert_close(expected, output.float(), atol=1, rtol=1)
# >
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
#   File "/pytorch/torch/testing/_comparison.py", line 1511, in assert_close
#     raise error_metas[0].to_error(msg)
# AssertionError: Tensor-likes are not close!
#
# Mismatched elements: 14120 / 150528 (9.4%)
# Greatest absolute difference: 214.6112518310547 at index (0, 1, 152, 13) (up to 1 allowed)
# Greatest relative difference: 17.005144119262695 at index (0, 2, 26, 2) (up to 1 allowed)
```

- Also renamed needs_unpacking to skip_unpacking

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100258
Approved by: https://github.com/NicolasHug
2023-05-04 13:28:33 +00:00
Larry Liu
687afeb686 [dynamo][numpy] Add NumpyTensorVariable to translate ndarray attribute calls to tensor attributes (#95849)
Issue: #93684

# Problem

Reduce graph breaks when dynamo compiles python functions containing numpy functions and ndarray operations.

# Design (as I know it)

* Use torch_np.ndarray (a wrapper of tensor) to back a `VariableTracker`: `NumpyTensorVariable`.
* Translate all attribute and method calls on ndarray to their torch_np.ndarray equivalents.

This PR adds `NumpyTensorVariable` and supports:
1.  tensor to ndarray, ndarray to tensor
2. numpy functions such as numpy.meshgrid()
3. ndarray attributes such as `itemsize`, `stride`

Next PR will handle returning `np.ndarray` and add support for ndarray methods
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95849
Approved by: https://github.com/ezyang
2023-04-27 16:18:35 +00:00
Yanli Zhao
9bc03db670 Move nn.module state dict pre hook (#98964)
Some modules, like lazyModule, may override '_save_to_state_dict()'; in that case, the pre_state_dict hook will not be called. So move the pre_state_dict hook out of '_save_to_state_dict()' to make sure the pre hook is always called.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98964
Approved by: https://github.com/albanD
2023-04-26 16:51:13 +00:00
soulitzer
5ee5afb82c Update channel shuffle to return alias instead of self as-is (#99745)
Partially addresses https://github.com/pytorch/pytorch/issues/99655
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99745
Approved by: https://github.com/albanD
2023-04-24 14:02:14 +00:00
ts
dbf0db958f Fix torch.nn.FractionalMaxPool2d output_size error (#99507)
Fixes #99148 by raising an error if output_ratio's size > 2.

Justification for changes:

If an output size is not specified but an output ratio is, we call fractional_max_pool2d_with_indices. We then generate the value of output_size based on the first two elements of the output_ratio (around line 480 of torch/nn/functional.py).

Thus, we should raise a ValueError when the user passes an output_ratio (instead of an output_size) and the number of elements in output_ratio exceeds two. The error must be raised before calling torch._C._nn.fractional_max_pool2d, since the value of output_size passed into torch._C._nn.fractional_max_pool2d is guaranteed to be of size 2 (the existing code generates it from the first two elements of the passed-in ratio).

I would be happy to iterate on this if there are any issues.
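For illustration, a hedged sketch of the kind of check described above (names are illustrative, not the exact diff):

```python
def check_output_ratio(output_ratio) -> None:
    # fractional_max_pool2d only consumes the first two ratio values,
    # so silently accepting longer sequences would hide user errors
    ratio = tuple(output_ratio) if hasattr(output_ratio, "__iter__") else (output_ratio,)
    if len(ratio) > 2:
        raise ValueError(
            f"fractional_max_pool2d requires output_ratio to have 1 or 2 elements, got {len(ratio)}"
        )
```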

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99507
Approved by: https://github.com/mikaylagawarecki
2023-04-21 14:38:25 +00:00
vfdev-5
5907173022 Updated upsampling test to use parametrize_test decorator (#97769)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97769
Approved by: https://github.com/NicolasHug
2023-04-11 12:20:00 +00:00
Kiersten Stokes
2a48f43fe2 Add check for 0 to 1 inclusive for elements of target tensor in BCE loss (#97814)
TODO for @mikaylagawarecki : add BC breaking description

Fixes #87373

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97814
Approved by: https://github.com/mikaylagawarecki
2023-04-05 23:26:09 +00:00
Lei Mao
937ba248eb Make the Index Rounding Mode Consistent Between the 2D and 3D GridSample Nearest Neighbor Interpolations (#97000)
## BC-breaking note:

This is technically a bugfix. Prior to this PR, for `torch.nn.functional.grid_sample(mode='nearest')` the 2D kernel used `std::nearbyint` whereas the 3D kernel used `std::round` in order to determine the nearest pixel locations after un-normalization of the grid. This PR fixes the 3D kernel to use `std::nearbyint` which rounds values that are exactly `<>.5` to the nearest even which is consistent with the behavior of `torch.round`. Unnormalized indices that are exactly `<>.5` will now be rounded to the nearest even instead of being rounded away from 0.

## Description
In the nearest neighbor interpolation mode, the 2D GridSample rounds the index to the nearest even using [std::nearbyint](https://github.com/pytorch/pytorch/blob/v2.0.0/aten/src/ATen/native/cpu/zmath.h#L182) whereas the 3D GridSample rounds the index away from zero using std::round. This discrepancy needs to be resolved, so we are making both 2D GridSample and 3D GridSample round to the nearest even.
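For reference, the two rounding modes in question can be compared directly in Python (`torch.round`, like `std::nearbyint` under the default rounding mode, rounds halves to even):

```python
import torch

x = torch.tensor([0.5, 1.5, 2.5, -0.5, -1.5])
print(torch.round(x))                               # round half to even: 0., 2., 2., -0., -2.
print(torch.sign(x) * torch.floor(x.abs() + 0.5))   # round half away from zero: 1., 2., 3., -1., -2.
```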

## Unit Test Goals
1. Make sure the x dimension and y dimension rounding behaviors are the same for 2D GridSample.
2. ~~Make sure the 2D GridSample rounding mode is rounding to the nearest even.~~
3. Make sure the x dimension, y dimension, and z dimension rounding behaviors are the same for 3D GridSample.
4. ~~Make sure the 3D GridSample rounding mode is rounding to the nearest even.~~
5. The 2D GridSample and 3D GridSample rounding behaviors are exactly the same.

After some experiments, I found 2 and 4 are difficult to achieve. Even though I can compute the normalized coordinates corresponding to the unnormalized coordinates [0, 0.5, 1.0, 1.5, 2.0, 2.5, ..., 10.0], the unnormalization process in the GridSample implementations always has a small chance of introducing floating point error. Therefore, it is not possible to unit test the rounding mode from the normalized coordinates.

## Unit Test Methods

The unit test is simple: by using the same values along the dimension under test in the input tensor and the same normalized indices in the grid tensor, the interpolations along the 2D GridSample x- and y-dimensions and the 3D GridSample x-, y-, and z-dimensions should produce exactly the same interpolated values.
If any CPU/CUDA 2D/3D implementation uses a different rounding mode from the others, the unit test will fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97000
Approved by: https://github.com/mikaylagawarecki
2023-04-05 18:47:03 +00:00
Michael Gschwind
c757647dd8 [Better Transformer] make is_causal a hint and force attn_mask to be set on is_causal=True in F.MHA (#97214)
Summary:
This fixes an issue raised in [is_causal parameter in torch.nn.TransformerEncoderLayer.forward does not work #96941](https://github.com/pytorch/pytorch/issues/96941) where results computed with is_causal do not properly reflect causal masking.

In PyTorch 2.0, Accelerated PT Transformers added the is_causal parameter to legacy nn.Transformer* and nn.MHA APIs aligned with and intended to engage the is_causal parameter of the new scaled_dot_product_attention (SDPA) operator.

At present is_causal works differently for Transformer* modules, the nn.MHA and F.MHA:
* The nn.Transformer* modules treat is_causal as an optional indicator about the format of attn_mask. This is because some layers (such as the CLIP layer) use the attention mask in the layer, and thus attn_mask was a required feature.
* Initially, nn.MHA and F.MHA were defined to align with F.SDPA in behavior: a user may specify either the attention mask or is_causal, but not both.  It seemed to make sense at the time to align SDPA and MHA, especially since there was a larger overlap of parameters, which has since changed, e.g., with the removal of need_weights from SDPA. (See below for why this makes sense.)

Unfortunately, this does not work because of how MHA was changed to handle the need_weights parameter.  When need_weights is present, we do not (any more) call SDPA because support for need_weights was removed from SDPA before the release.  The rationale is that need_weights defeats all optimization at the foundation of SDPA performance.  Having the flag might thus mislead users into thinking they get good performance and have them disappointed when they enable a legacy feature of MHA which massively degrades performance.  (They might not think anything of enabling that, because it is on by default in MHA today, which leads to more  issues.)

Since SDPA does not (no longer) support need_weights, we need to pick a separate path which implements attention using a set of discrete operations that allocates a tensor for weights.  Alas, this code path does not have support for is_causal, because attention is implemented as matmul and using the attention mask.  Thus, is_causal has no impact.  (A substantially similar situation arises with how kpm is implemented today because Nested Tensors are not supported by torch.compile() in 2.0)

This problem was masked because all uses of legacy nn.MHA (and F.MHA) come through nn.Transformer*, which called self-attention (i.e., nn.MHA) only ever with the attention mask attn_mask, and never with is_causal, a missed optimization opportunity that would have been addressed in a future performance update.

Regrettably, always calling nn.MHA with attn_mask prevented diagnosing of the issue of not having a suitable attention mask when need_weights support was dropped from SDPA and a discrete implementation of attention was added for that scenario, and for the execution path with key_padding_mask.

We have two options to address this issue:

Solution 1: Whenever nn.MHA and F.MHA are executed with is_causal set, we internally create a causal mask at significant expense of allocating a tensor and filling it with a triangular causal matrix.  This increases memory usage, and runtime, for allocating a causal mask.  To add insult to injury, in all current (and likely future) execution scenarios, MHA is called by a model using the nn.Transformer API which already has that matrix and passes it from nn.module to nn.module.  Then the passing in of attn_mask has to be suppressed by nn.TransformerEncoderLayer, only for nn.MHA to immediately allocate the very same tensor again to satisfy the requirement to have an attention mask for the computation. (We expect new use cases to use SDPA directly.)

Solution 2: We align the behavior of nn.MHA and F.MHA with the rest of the existing nn.Transformer API, and require the attention mask to be passed into nn.MHA in addition to is_causal as an optional indicator about the nature of the attention mask rather than as an alternative to attn_mask.  Then, when we choose the code path for processing MHA with need_weights or a key_padding_mask, we have the attn_mask passed down through the nn.Transformer* hierarchy, without the added overhead of allocating an attention mask as in scenario 1.

This PR implements solution 2 which offers better performance and in retrospect aligns MHA better with the rest of the Transformer modules as the definition of SDPA evolved into a more streamlined high-performance operator.  It ostensibly changes how is_causal works, by requiring the attention mask to be specified.  However, as described here, and as shown in the submitted issue, is_causal is not working as intended today, so it requires a change regardless.

In that sense, a change in API does not occur per se, as the current implementation is not working, and a change has to occur either way to resolve the submitted issue, breaking any use cases that depend on the current implementation.  Checks exist (and more can be added) that flag any scenario where is_causal is passed as True but no attention mask is provided, ensuring that there is no quiet change from even the faulty behavior present in 2.0.

As an upside, the present implementation will improve performance by addressing the passing of the is_causal flag from Transformer modules to MHA, speeding up training for use cases such as finetuning BERT, RoBERTa, and XLM-R models.
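A hedged usage sketch of solution 2: the caller provides an explicit causal mask, with is_causal=True acting as a hint about that mask rather than as a substitute for it.

```python
import torch
import torch.nn as nn

L, E = 6, 16
mha = nn.MultiheadAttention(embed_dim=E, num_heads=4, batch_first=True)
x = torch.rand(2, L, E)

# explicit causal mask, plus the is_causal hint describing it
causal_mask = nn.Transformer.generate_square_subsequent_mask(L)
out, _ = mha(x, x, x, attn_mask=causal_mask, is_causal=True, need_weights=False)
```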

Differential Revision: D44245725

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97214
Approved by: https://github.com/albanD
2023-03-25 01:36:30 +00:00
CedricPicron
cf0ba1b9c0 Use L1 loss for Smooth L1 loss with beta=0 (#97022)
Fixes #96813.

Comments:

1. Wasn't able to test since tools/nightly.py does not allow for GPU build (and I don't want to build from scratch).
2. In theory, the bug (i.e. NaNs) can still occur when beta is very small (e.g. `beta=1e-50`), but not sure whether anybody cares.
3. Some checks within the smooth_l1_loss C++ code could be changed to check for `beta > 0` instead of `beta >= 0`.
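A small hedged check of the fix (per #96813, the gradient at x == y with beta=0 could previously be NaN):

```python
import torch
import torch.nn.functional as F

x = torch.zeros(3, requires_grad=True)
y = torch.zeros(3)
loss = F.smooth_l1_loss(x, y, beta=0.0)   # with beta=0 this should behave like l1_loss
loss.backward()
print(loss.item(), x.grad)                # expected: 0.0 and finite (zero) gradients
```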
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97022
Approved by: https://github.com/jbschlosser
2023-03-24 19:10:32 +00:00
Xiao Wang
1716709d46 [CUDA] Use accumulate type to improve accuracy of grid_sample on half precision inputs [v2] (#96586)
Fixes #96429

This PR is also a follow up for #90427. In that PR, we also discussed whether calculations of grid indices `grid_sampler_compute_source_index` should also be upcasted to `opmath_t` https://github.com/pytorch/pytorch/pull/90427/files#r1048876708. Due to another unit test failure, we didn't upcast those calculations in that PR.

After some investigations, I found that the inaccurate results have nothing to do with the internals of `affine_grid`, even if it is calculated using `double` internally. As long as the input `grid` is passed to `grid_sample` in **half** precision, the results will be less inaccurate than with a **float** `grid`. This can be verified with a short C++ program like the one below (by setting `TYPE_T` to `__half` and `float` in the two compilations)

```cpp
#include <cuda.h>
#include <cuda_runtime.h>
#include <cuda_fp16.h>

#include <iostream>

#ifndef TYPE_T
    #define TYPE_T float
#endif

int main() {
    using type_t = TYPE_T;
    type_t d = static_cast<__half>((double)2.0 / 3.0);
    type_t s = (((float)d + 1.f) * 3 - 1) / 2;

    printf("%.15f %.15f\n", (double)d, (double)s);
}
```

Outputs are
```
./float.out
0.666503906250000 1.999755859375000

./half.out
0.666503906250000 2.000000000000000
```

To resolve the discussion back in https://github.com/pytorch/pytorch/pull/90427/files#r1048876708, I've also increased the test tolerance in the failed unit test `issue_24823_1(torch.half)`.

For the original script in #96429, I got more accurate results with `align_corners = True`
```
align_corners = True
Expected result has mean absolute value of 0.5285 and maximum absolute value of 3.2067.
Half precision result is off by 0.0001 (0.02%) on average and 0.0010 (0.03%) at maximum.

align_corners = False
Expected result has mean absolute value of 0.5189 and maximum absolute value of 3.0101.
Half precision result is off by 0.0001 (0.02%) on average and 0.0010 (0.03%) at maximum.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96586
Approved by: https://github.com/ngimel
2023-03-15 19:25:20 +00:00
Eddie Yan
70090b4daf [CUDA] Abate spurious resize warnings in MultiMarginLoss backward (#96382)
Follow-up of #75000 for backward.

CC @ngimel @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96382
Approved by: https://github.com/ngimel
2023-03-14 05:54:23 +00:00
soulitzer
7ff9612e34 Improve error message for instance norm when channels is incorrect (#94624)
Fixes https://github.com/pytorch/pytorch/issues/90514

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94624
Approved by: https://github.com/jbschlosser
2023-03-04 02:06:48 +00:00
puririshi98
8aa34602f7 Jetson Update for CI Redo (#94549)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94549
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-02-21 17:13:38 +00:00
Xuehai Pan
b005ec62b9 [BE] Remove dependency on six and future (#94709)
Remove the Python 2 and 3 compatibility library [six](https://pypi.org/project/six) and [future](https://pypi.org/project/future) and `torch._six`. We only support Python 3.8+ now. It's time to retire them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94709
Approved by: https://github.com/malfet, https://github.com/Skylion007
2023-02-14 09:14:14 +00:00
Xuehai Pan
046e88a291 [BE] [3/3] Rewrite super() calls in test (#94592)
Rewrite Python built-in class `super()` calls. Only non-semantic changes should be applied.

- #94587
- #94588
- #94592

Also, methods with only a `super()` call are removed:

```diff
class MyModule(nn.Module):
-   def __init__(self):
-       super().__init__()
-
    def forward(self, ...):
        ...
```

Some cases that change the semantics should be kept unchanged. E.g.:

f152a79be9/caffe2/python/net_printer.py (L184-L190)

f152a79be9/test/test_jit_fuser_te.py (L2628-L2635)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94592
Approved by: https://github.com/ezyang, https://github.com/seemethere
2023-02-12 22:20:53 +00:00
ganler
0176405c69 fix: check if double to i64 is in well-formed range (#94290)
Fixes #88951

The output shape of upsample is computed as `(i64)idim * (double)scale` and then cast back to `i64`. If the input scale is ill-formed (say, a negative number as in #88951), which makes `(double)(idim * scale)` fall outside the range of `i64`, the cast is undefined behaviour.

To fix it, we just check if `(double)(idim * scale)` can fit into `i64`.
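A hedged Python sketch of that check (the actual fix lives in the C++ shape computation):

```python
def checked_output_size(idim: int, scale: float) -> int:
    # idim * scale must fit into a signed 64-bit integer before the cast
    v = float(idim) * scale
    i64_min, i64_max = -(2**63), 2**63 - 1
    if not (i64_min <= v <= i64_max):
        raise ValueError(f"computed output size {v} is outside the int64 range; check scale_factor")
    return int(v)
```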
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94290
Approved by: https://github.com/malfet
2023-02-10 22:35:22 +00:00
Jiayi Sun
01de5ddafc add mixed data type support for LayerNorm backward on CPU (#88064)
### Motivation
Amp provides convenience methods for mixed precision. If users use amp to run bfloat16 models, torch.autocast will keep module parameters in acc dtype which will leave gamma and beta in float while input/output will be in bfloat16. The same goes for backward: parameters are in float, and X & dX & dY are in bfloat16.
Mixed data type support for LayerNorm backward is also needed for model training with LayerNorm.
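A hedged illustration of the dtype combination this covers on CPU: float32 gamma/beta with bfloat16 activations and gradients (assumes the mixed-dtype support described above is available).

```python
import torch

ln = torch.nn.LayerNorm(64)                                       # weight/bias stay float32
x = torch.randn(8, 64, dtype=torch.bfloat16, requires_grad=True)  # bf16 activations
y = ln(x)                                                         # mixed fp32 params / bf16 input
y.sum().backward()                                                # backward sees bf16 dY, fp32 gamma/beta
print(y.dtype, ln.weight.dtype, x.grad.dtype)
```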

### Testing
Single socket (icx, 32cores):
| shape | fp32 forward (ms) | bf16 forward (ms) | mix forward (ms) | fp32 backward (ms) | bf16 backward (ms) | mix backward (ms) |
| -- | -- | -- | -- | -- | -- | -- |
| (1, 8, 16) | 0.012 | 0.012 | 0.012 | 0.071 | 0.065 | 0.062 |
| (8, 8, 16) | 0.015 | 0.014 | 0.015 | 0.074 | 0.070 | 0.063 |
| (32, 8, 16) | 0.062 | 0.016 | 0.016 | 0.073 | 0.073 | 0.072 |
| (64, 128, 56, 56) | 2.467 | 0.907 | 0.0897 | 12.993 | 7.603 | 7.777 |
| (64, 128, 256, 256) | 48.904 | 25.589 | 25.472 | 343.992 | 183.133 | 188.222 |

Single core(icx):
| shape | fp32 forward (ms) | bf16 forward (ms) | mix forward (ms) | fp32 backward (ms) | bf16 backward (ms) | mix backward (ms) |
| -- | -- | -- | -- | -- | -- | -- |
| (1, 8, 16) | 0.012 | 0.012 | 0.012 | 0.050 | 0.050 | 0.050 |
| (8, 8, 16) | 0.014 | 0.014 | 0.014 | 0.052 | 0.054 | 0.053 |
| (32, 8, 16) | 0.034 | 0.019 | 0.018 | 0.059 | 0.067 | 0.066 |
| (64, 128, 56, 56) | 66.791| 17.725 | 19.799 | 119.431 | 106.123 | 107.446 |
| (64, 128, 256, 256) | 1542.477 | 402.132 | 527.044 | 3019.437 | 2336.318 | 2448.320 |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88064
Approved by: https://github.com/jgong5, https://github.com/malfet
2023-02-10 03:10:14 +00:00
Nicolas Hug
544c04f2df Add uint8 support for interpolate for CPU images (#90771)
Joint work with @vfdev-5

This PR introduces native uint8 support for `interpolate()`, for `bilinear` ~and `bicubic`~ modes for CPU images (`mode=nearest[_exact]` was already supported).

On a typical torchvision training job on ImageNet, the speedups are ~4X when AVX2 is supported, comparing the native uint8 path (this PR) vs torchvision's current `Resize()`:

```
AA = antialias
float = uint8->float->interpolate()->round()->clamp()->uint8 (what Resize() currently does)

input_size         output_size channels_last AA    mode       num_threads  speed-up float vs uint8 (this PR)
(1, 3, 270, 268) -> (224, 224)     True    True    bilinear   num_threads=1   4X    2.6ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     True    False   bilinear   num_threads=1   2.1X  1.3ms vs 0.6ms
(1, 3, 270, 268) -> (224, 224)     False   True    bilinear   num_threads=1   3X    2.1ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     False   False   bilinear   num_threads=1   4X    2.4ms vs 0.6ms

(Note: we removed bicubic support for now)
(1, 3, 270, 268) -> (224, 224)     True    True    bicubic    num_threads=1   4X    2.9ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     True    False   bicubic    num_threads=1   5X    3.1ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     False   True    bicubic    num_threads=1   3X    2.4ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     False   False   bicubic    num_threads=1   4X    2.8ms vs 0.7ms

```

There is still room for further speed-ups (see TODOs in the code).
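A hedged usage sketch comparing the new direct uint8 path with the float round trip it replaces:

```python
import torch
import torch.nn.functional as F

img = torch.randint(0, 256, (1, 3, 270, 268), dtype=torch.uint8)

# direct uint8 path (this PR)
out_u8 = F.interpolate(img, size=(224, 224), mode="bilinear", antialias=True)

# previous float round trip
out_f32 = F.interpolate(img.float(), size=(224, 224), mode="bilinear", antialias=True)
out_ref = out_f32.round().clamp(0, 255).to(torch.uint8)

print(out_u8.dtype, (out_u8.float() - out_ref.float()).abs().max())
```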

#### More benchmark details

with AVX2 support - speedups typically range from 1.5X to 10X. A few edge cases are slower; it is worth investigating why.

<details>

```
AA = antialias
float = uint8->float->interpolate()->round()->clamp()->uint8 (what Resize() currently does)

input_size         output_size channels_last AA    mode       num_threads  speed-up float vs uint8 (this PR)
(1, 3, 64, 64) -> (224, 224)       True    True    bilinear   num_threads=1   5X    1.1ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       True    False   bilinear   num_threads=1   5X    1.2ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       False   True    bilinear   num_threads=1   2.8X  0.6ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       False   False   bilinear   num_threads=1   7X    1.6ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       True    True    bicubic    num_threads=1   5X    1.2ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       True    False   bicubic    num_threads=1   12X   2.9ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       False   True    bicubic    num_threads=1   3X    0.8ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       False   False   bicubic    num_threads=1   7X    1.8ms vs 0.2ms

(1, 3, 64, 64) -> (224, 224)       True    True    bilinear   num_threads=2   2.6X  0.6ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       True    False   bilinear   num_threads=2   2.8X  0.6ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       False   True    bilinear   num_threads=2   1.7X  0.4ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       False   False   bilinear   num_threads=2   1.4X  0.3ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       True    True    bicubic    num_threads=2   2.7X  0.7ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       True    False   bicubic    num_threads=2   7X    1.6ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       False   True    bicubic    num_threads=2   1.8X  0.4ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       False   False   bicubic    num_threads=2   4X    1.0ms vs 0.2ms

(1, 3, 224, 224) -> (270, 268)     True    True    bilinear   num_threads=1   4X    2.5ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     True    False   bilinear   num_threads=1   3.0X  1.8ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     False   True    bilinear   num_threads=1   3X    1.8ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     False   False   bilinear   num_threads=1   4X    2.3ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     True    True    bicubic    num_threads=1   4X    2.7ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     True    False   bicubic    num_threads=1   7X    4.3ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     False   True    bicubic    num_threads=1   3X    2.1ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     False   False   bicubic    num_threads=1   4X    2.6ms vs 0.6ms

(1, 3, 224, 224) -> (270, 268)     True    True    bilinear   num_threads=2   2.7X  1.6ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     True    False   bilinear   num_threads=2   2.6X  1.5ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     False   True    bilinear   num_threads=2   2.1X  1.2ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     False   False   bilinear   num_threads=2   1.6X  0.9ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     True    True    bicubic    num_threads=2   2.8X  1.7ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     True    False   bicubic    num_threads=2   5X    2.8ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     False   True    bicubic    num_threads=2   2.3X  1.4ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     False   False   bicubic    num_threads=2   3X    1.9ms vs 0.6ms

(1, 3, 256, 256) -> (1024, 1024)   True    True    bilinear   num_threads=1   4X    26.6ms vs 6.7ms
(1, 3, 256, 256) -> (1024, 1024)   True    False   bilinear   num_threads=1   4X    23.9ms vs 6.8ms
(1, 3, 256, 256) -> (1024, 1024)   False   True    bilinear   num_threads=1   2.5X  16.8ms vs 6.8ms
(1, 3, 256, 256) -> (1024, 1024)   False   False   bilinear   num_threads=1   5X    33.1ms vs 6.8ms
(1, 3, 256, 256) -> (1024, 1024)   True    True    bicubic    num_threads=1   4X    25.9ms vs 7.3ms
(1, 3, 256, 256) -> (1024, 1024)   True    False   bicubic    num_threads=1   8X    59.6ms vs 7.3ms
(1, 3, 256, 256) -> (1024, 1024)   False   True    bicubic    num_threads=1   1.9X  14.3ms vs 7.4ms
(1, 3, 256, 256) -> (1024, 1024)   False   False   bicubic    num_threads=1   5X    35.4ms vs 7.3ms

(1, 3, 256, 256) -> (1024, 1024)   True    True    bilinear   num_threads=2   2.0X  13.6ms vs 6.8ms
(1, 3, 256, 256) -> (1024, 1024)   True    False   bilinear   num_threads=2   2.2X  14.8ms vs 6.7ms
(1, 3, 256, 256) -> (1024, 1024)   False   True    bilinear   num_threads=2   1.3X  8.8ms vs 6.9ms
(1, 3, 256, 256) -> (1024, 1024)   False   False   bilinear   num_threads=2   1.2X  8.4ms vs 6.8ms
(1, 3, 256, 256) -> (1024, 1024)   True    True    bicubic    num_threads=2   1.8X  12.8ms vs 7.3ms
(1, 3, 256, 256) -> (1024, 1024)   True    False   bicubic    num_threads=2   4X    32.1ms vs 7.2ms
(1, 3, 256, 256) -> (1024, 1024)   False   True    bicubic    num_threads=2   1.4X  10.1ms vs 7.3ms
(1, 3, 256, 256) -> (1024, 1024)   False   False   bicubic    num_threads=2   2.9X  20.9ms vs 7.3ms

(1, 3, 224, 224) -> (64, 64)       True    True    bilinear   num_threads=1   1.4X  0.5ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       True    False   bilinear   num_threads=1   0.7X  0.2ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   True    bilinear   num_threads=1   1.3X  0.4ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   False   bilinear   num_threads=1   1.4X  0.4ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       True    True    bicubic    num_threads=1   2.1X  0.7ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       True    False   bicubic    num_threads=1   1.3X  0.4ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   True    bicubic    num_threads=1   1.9X  0.6ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   False   bicubic    num_threads=1   1.0X  0.3ms vs 0.3ms

(1, 3, 224, 224) -> (64, 64)       True    True    bilinear   num_threads=2   1.0X  0.3ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       True    False   bilinear   num_threads=2   0.6X  0.2ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   True    bilinear   num_threads=2   0.8X  0.3ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   False   bilinear   num_threads=2   1.4X  0.4ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       True    True    bicubic    num_threads=2   1.4X  0.5ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       True    False   bicubic    num_threads=2   1.2X  0.4ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   True    bicubic    num_threads=2   1.2X  0.4ms vs 0.4ms
(1, 3, 224, 224) -> (64, 64)       False   False   bicubic    num_threads=2   0.9X  0.3ms vs 0.3ms

(1, 3, 270, 268) -> (224, 224)     True    True    bilinear   num_threads=1   4X    2.6ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     True    False   bilinear   num_threads=1   2.1X  1.3ms vs 0.6ms
(1, 3, 270, 268) -> (224, 224)     False   True    bilinear   num_threads=1   3X    2.1ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     False   False   bilinear   num_threads=1   4X    2.4ms vs 0.6ms
(1, 3, 270, 268) -> (224, 224)     True    True    bicubic    num_threads=1   4X    2.9ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     True    False   bicubic    num_threads=1   5X    3.1ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     False   True    bicubic    num_threads=1   3X    2.4ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     False   False   bicubic    num_threads=1   4X    2.8ms vs 0.7ms

(1, 3, 270, 268) -> (224, 224)     True    True    bilinear   num_threads=2   1.5X  1.0ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     True    False   bilinear   num_threads=2   1.2X  0.8ms vs 0.6ms
(1, 3, 270, 268) -> (224, 224)     False   True    bilinear   num_threads=2   2.3X  1.5ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     False   False   bilinear   num_threads=2   1.9X  1.2ms vs 0.6ms
(1, 3, 270, 268) -> (224, 224)     True    True    bicubic    num_threads=2   1.6X  1.2ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     True    False   bicubic    num_threads=2   4X    2.4ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     False   True    bicubic    num_threads=2   2.4X  1.6ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     False   False   bicubic    num_threads=2   2.8X  1.8ms vs 0.6ms

(1, 3, 1024, 1024) -> (256, 256)   True    True    bilinear   num_threads=1   2.1X  12.8ms vs 6.1ms
(1, 3, 1024, 1024) -> (256, 256)   True    False   bilinear   num_threads=1   0.6X  3.8ms vs 5.9ms
(1, 3, 1024, 1024) -> (256, 256)   False   True    bilinear   num_threads=1   1.2X  7.1ms vs 6.1ms
(1, 3, 1024, 1024) -> (256, 256)   False   False   bilinear   num_threads=1   1.9X  11.0ms vs 5.9ms
(1, 3, 1024, 1024) -> (256, 256)   True    True    bicubic    num_threads=1   2.0X  12.6ms vs 6.4ms
(1, 3, 1024, 1024) -> (256, 256)   True    False   bicubic    num_threads=1   1.0X  6.1ms vs 6.0ms
(1, 3, 1024, 1024) -> (256, 256)   False   True    bicubic    num_threads=1   1.8X  11.3ms vs 6.4ms
(1, 3, 1024, 1024) -> (256, 256)   False   False   bicubic    num_threads=1   0.8X  4.6ms vs 6.0ms

(1, 3, 1024, 1024) -> (256, 256)   True    True    bilinear   num_threads=2   1.6X  9.3ms vs 6.0ms
(1, 3, 1024, 1024) -> (256, 256)   True    False   bilinear   num_threads=2   0.3X  2.0ms vs 5.8ms
(1, 3, 1024, 1024) -> (256, 256)   False   True    bilinear   num_threads=2   1.2X  7.2ms vs 6.0ms
(1, 3, 1024, 1024) -> (256, 256)   False   False   bilinear   num_threads=2   0.3X  1.6ms vs 5.8ms
(1, 3, 1024, 1024) -> (256, 256)   True    True    bicubic    num_threads=2   1.1X  7.1ms vs 6.5ms
(1, 3, 1024, 1024) -> (256, 256)   True    False   bicubic    num_threads=2   0.6X  3.3ms vs 5.9ms
(1, 3, 1024, 1024) -> (256, 256)   False   True    bicubic    num_threads=2   0.9X  5.9ms vs 6.3ms
(1, 3, 1024, 1024) -> (256, 256)   False   False   bicubic    num_threads=2   0.4X  2.4ms vs 5.9ms
```

</details>

without AVX2 support - no significant speed-up, but there are various possible improvements (see TODOs)

<details>

```
AA = antialias
float = uint8->float->interpolate()->round()->clamp()->uint8 (what Resize() currently does)

input_size         output_size channels_last AA    mode       num_threads  speed-up float vs uint8 (this PR)
(1, 3, 64, 64) -> (224, 224)       True    True    bilinear   num_threads=1   0.9X  1.5ms vs 1.6ms
(1, 3, 64, 64) -> (224, 224)       True    False   bilinear   num_threads=1   0.9X  1.5ms vs 1.6ms
(1, 3, 64, 64) -> (224, 224)       False   True    bilinear   num_threads=1   0.8X  0.9ms vs 1.1ms
(1, 3, 64, 64) -> (224, 224)       False   False   bilinear   num_threads=1   1.5X  1.7ms vs 1.1ms
(1, 3, 64, 64) -> (224, 224)       True    True    bicubic    num_threads=1   0.9X  1.6ms vs 1.8ms
(1, 3, 64, 64) -> (224, 224)       True    False   bicubic    num_threads=1   2.1X  3.9ms vs 1.9ms
(1, 3, 64, 64) -> (224, 224)       False   True    bicubic    num_threads=1   0.8X  1.1ms vs 1.4ms
(1, 3, 64, 64) -> (224, 224)       False   False   bicubic    num_threads=1   1.7X  2.4ms vs 1.5ms

(1, 3, 64, 64) -> (224, 224)       True    True    bilinear   num_threads=2   0.9X  0.8ms vs 0.8ms
(1, 3, 64, 64) -> (224, 224)       True    False   bilinear   num_threads=2   0.9X  0.8ms vs 0.8ms
(1, 3, 64, 64) -> (224, 224)       False   True    bilinear   num_threads=2   0.9X  0.5ms vs 0.6ms
(1, 3, 64, 64) -> (224, 224)       False   False   bilinear   num_threads=2   0.7X  0.5ms vs 0.7ms
(1, 3, 64, 64) -> (224, 224)       True    True    bicubic    num_threads=2   0.9X  0.9ms vs 1.0ms
(1, 3, 64, 64) -> (224, 224)       True    False   bicubic    num_threads=2   2.1X  2.0ms vs 1.0ms
(1, 3, 64, 64) -> (224, 224)       False   True    bicubic    num_threads=2   0.8X  0.6ms vs 0.8ms
(1, 3, 64, 64) -> (224, 224)       False   False   bicubic    num_threads=2   1.7X  1.3ms vs 0.8ms

(1, 3, 224, 224) -> (270, 268)     True    True    bilinear   num_threads=1   1.0X  3.0ms vs 3.0ms
(1, 3, 224, 224) -> (270, 268)     True    False   bilinear   num_threads=1   1.0X  2.8ms vs 2.9ms
(1, 3, 224, 224) -> (270, 268)     False   True    bilinear   num_threads=1   1.0X  2.3ms vs 2.2ms
(1, 3, 224, 224) -> (270, 268)     False   False   bilinear   num_threads=1   1.4X  3.3ms vs 2.3ms
(1, 3, 224, 224) -> (270, 268)     True    True    bicubic    num_threads=1   1.0X  3.5ms vs 3.5ms
(1, 3, 224, 224) -> (270, 268)     True    False   bicubic    num_threads=1   1.7X  6.1ms vs 3.5ms
(1, 3, 224, 224) -> (270, 268)     False   True    bicubic    num_threads=1   0.9X  2.6ms vs 2.9ms
(1, 3, 224, 224) -> (270, 268)     False   False   bicubic    num_threads=1   1.4X  4.2ms vs 2.9ms

(1, 3, 224, 224) -> (270, 268)     True    True    bilinear   num_threads=2   1.0X  1.7ms vs 1.7ms
(1, 3, 224, 224) -> (270, 268)     True    False   bilinear   num_threads=2   0.9X  1.6ms vs 1.8ms
(1, 3, 224, 224) -> (270, 268)     False   True    bilinear   num_threads=2   0.9X  1.3ms vs 1.4ms
(1, 3, 224, 224) -> (270, 268)     False   False   bilinear   num_threads=2   0.7X  1.1ms vs 1.6ms
(1, 3, 224, 224) -> (270, 268)     True    True    bicubic    num_threads=2   1.0X  2.0ms vs 2.0ms
(1, 3, 224, 224) -> (270, 268)     True    False   bicubic    num_threads=2   1.7X  3.2ms vs 1.9ms
(1, 3, 224, 224) -> (270, 268)     False   True    bicubic    num_threads=2   0.8X  1.5ms vs 1.9ms
(1, 3, 224, 224) -> (270, 268)     False   False   bicubic    num_threads=2   1.2X  2.3ms vs 1.9ms

(1, 3, 256, 256) -> (1024, 1024)   True    True    bilinear   num_threads=1   1.1X  34.7ms vs 32.4ms
(1, 3, 256, 256) -> (1024, 1024)   True    False   bilinear   num_threads=1   1.0X  31.2ms vs 32.4ms
(1, 3, 256, 256) -> (1024, 1024)   False   True    bilinear   num_threads=1   1.0X  23.5ms vs 22.7ms
(1, 3, 256, 256) -> (1024, 1024)   False   False   bilinear   num_threads=1   1.9X  42.5ms vs 22.7ms
(1, 3, 256, 256) -> (1024, 1024)   True    True    bicubic    num_threads=1   0.9X  33.9ms vs 37.4ms
(1, 3, 256, 256) -> (1024, 1024)   True    False   bicubic    num_threads=1   2.2X  84.0ms vs 37.5ms
(1, 3, 256, 256) -> (1024, 1024)   False   True    bicubic    num_threads=1   1.0X  28.4ms vs 28.8ms
(1, 3, 256, 256) -> (1024, 1024)   False   False   bicubic    num_threads=1   2.0X  56.7ms vs 28.8ms

(1, 3, 256, 256) -> (1024, 1024)   True    True    bilinear   num_threads=2   1.1X  17.5ms vs 16.4ms
(1, 3, 256, 256) -> (1024, 1024)   True    False   bilinear   num_threads=2   1.1X  17.7ms vs 16.4ms
(1, 3, 256, 256) -> (1024, 1024)   False   True    bilinear   num_threads=2   0.8X  8.8ms vs 11.4ms
(1, 3, 256, 256) -> (1024, 1024)   False   False   bilinear   num_threads=2   1.0X  11.1ms vs 11.4ms
(1, 3, 256, 256) -> (1024, 1024)   True    True    bicubic    num_threads=2   1.1X  19.9ms vs 18.8ms
(1, 3, 256, 256) -> (1024, 1024)   True    False   bicubic    num_threads=2   2.3X  42.5ms vs 18.7ms
(1, 3, 256, 256) -> (1024, 1024)   False   True    bicubic    num_threads=2   1.0X  14.1ms vs 14.5ms
(1, 3, 256, 256) -> (1024, 1024)   False   False   bicubic    num_threads=2   2.0X  28.4ms vs 14.5ms

(1, 3, 224, 224) -> (64, 64)       True    True    bilinear   num_threads=1   1.0X  0.6ms vs 0.6ms
(1, 3, 224, 224) -> (64, 64)       True    False   bilinear   num_threads=1   0.7X  0.3ms vs 0.4ms
(1, 3, 224, 224) -> (64, 64)       False   True    bilinear   num_threads=1   0.9X  0.5ms vs 0.6ms
(1, 3, 224, 224) -> (64, 64)       False   False   bilinear   num_threads=1   1.7X  0.6ms vs 0.4ms
(1, 3, 224, 224) -> (64, 64)       True    True    bicubic    num_threads=1   1.0X  0.8ms vs 0.8ms
(1, 3, 224, 224) -> (64, 64)       True    False   bicubic    num_threads=1   1.1X  0.5ms vs 0.5ms
(1, 3, 224, 224) -> (64, 64)       False   True    bicubic    num_threads=1   0.9X  0.7ms vs 0.8ms
(1, 3, 224, 224) -> (64, 64)       False   False   bicubic    num_threads=1   0.9X  0.4ms vs 0.4ms

(1, 3, 224, 224) -> (64, 64)       True    True    bilinear   num_threads=2   1.0X  0.4ms vs 0.4ms
(1, 3, 224, 224) -> (64, 64)       True    False   bilinear   num_threads=2   0.8X  0.2ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   True    bilinear   num_threads=2   0.9X  0.3ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   False   bilinear   num_threads=2   1.3X  0.3ms vs 0.2ms
(1, 3, 224, 224) -> (64, 64)       True    True    bicubic    num_threads=2   1.0X  0.5ms vs 0.5ms
(1, 3, 224, 224) -> (64, 64)       True    False   bicubic    num_threads=2   1.3X  0.4ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   True    bicubic    num_threads=2   0.9X  0.5ms vs 0.5ms
(1, 3, 224, 224) -> (64, 64)       False   False   bicubic    num_threads=2   1.2X  0.3ms vs 0.3ms

(1, 3, 270, 268) -> (224, 224)     True    True    bilinear   num_threads=1   0.8X  2.1ms vs 2.5ms
(1, 3, 270, 268) -> (224, 224)     True    False   bilinear   num_threads=1   0.7X  1.6ms vs 2.4ms
(1, 3, 270, 268) -> (224, 224)     False   True    bilinear   num_threads=1   1.2X  2.4ms vs 2.1ms
(1, 3, 270, 268) -> (224, 224)     False   False   bilinear   num_threads=1   1.3X  2.6ms vs 2.0ms
(1, 3, 270, 268) -> (224, 224)     True    True    bicubic    num_threads=1   1.1X  3.4ms vs 3.0ms
(1, 3, 270, 268) -> (224, 224)     True    False   bicubic    num_threads=1   1.7X  4.8ms vs 2.8ms
(1, 3, 270, 268) -> (224, 224)     False   True    bicubic    num_threads=1   1.1X  2.9ms vs 2.7ms
(1, 3, 270, 268) -> (224, 224)     False   False   bicubic    num_threads=1   1.4X  3.5ms vs 2.4ms

(1, 3, 270, 268) -> (224, 224)     True    True    bilinear   num_threads=2   0.9X  1.2ms vs 1.3ms
(1, 3, 270, 268) -> (224, 224)     True    False   bilinear   num_threads=2   1.3X  1.6ms vs 1.2ms
(1, 3, 270, 268) -> (224, 224)     False   True    bilinear   num_threads=2   0.8X  0.9ms vs 1.1ms
(1, 3, 270, 268) -> (224, 224)     False   False   bilinear   num_threads=2   1.3X  1.3ms vs 1.0ms
(1, 3, 270, 268) -> (224, 224)     True    True    bicubic    num_threads=2   1.4X  2.2ms vs 1.6ms
(1, 3, 270, 268) -> (224, 224)     True    False   bicubic    num_threads=2   1.9X  2.8ms vs 1.5ms
(1, 3, 270, 268) -> (224, 224)     False   True    bicubic    num_threads=2   0.8X  1.1ms vs 1.4ms
(1, 3, 270, 268) -> (224, 224)     False   False   bicubic    num_threads=2   1.7X  2.1ms vs 1.3ms

(1, 3, 1024, 1024) -> (256, 256)   True    True    bilinear   num_threads=1   1.0X  10.0ms vs 9.9ms
(1, 3, 1024, 1024) -> (256, 256)   True    False   bilinear   num_threads=1   0.7X  4.6ms vs 6.2ms
(1, 3, 1024, 1024) -> (256, 256)   False   True    bilinear   num_threads=1   0.9X  9.1ms vs 9.8ms
(1, 3, 1024, 1024) -> (256, 256)   False   False   bilinear   num_threads=1   1.7X  9.4ms vs 5.7ms
(1, 3, 1024, 1024) -> (256, 256)   True    True    bicubic    num_threads=1   1.0X  15.2ms vs 14.8ms
(1, 3, 1024, 1024) -> (256, 256)   True    False   bicubic    num_threads=1   1.0X  7.6ms vs 7.5ms
(1, 3, 1024, 1024) -> (256, 256)   False   True    bicubic    num_threads=1   0.9X  13.3ms vs 14.4ms
(1, 3, 1024, 1024) -> (256, 256)   False   False   bicubic    num_threads=1   0.8X  5.9ms vs 7.0ms

(1, 3, 1024, 1024) -> (256, 256)   True    True    bilinear   num_threads=2   1.2X  6.0ms vs 5.2ms
(1, 3, 1024, 1024) -> (256, 256)   True    False   bilinear   num_threads=2   0.7X  2.3ms vs 3.2ms
(1, 3, 1024, 1024) -> (256, 256)   False   True    bilinear   num_threads=2   1.0X  4.8ms vs 5.0ms
(1, 3, 1024, 1024) -> (256, 256)   False   False   bilinear   num_threads=2   0.7X  1.9ms vs 2.9ms
(1, 3, 1024, 1024) -> (256, 256)   True    True    bicubic    num_threads=2   1.6X  12.3ms vs 7.5ms
(1, 3, 1024, 1024) -> (256, 256)   True    False   bicubic    num_threads=2   1.0X  3.9ms vs 3.9ms
(1, 3, 1024, 1024) -> (256, 256)   False   True    bicubic    num_threads=2   1.0X  7.0ms vs 7.3ms
(1, 3, 1024, 1024) -> (256, 256)   False   False   bicubic    num_threads=2   0.9X  3.0ms vs 3.5ms

```

</details>

Benchmark code
<details>

```py
import operator_benchmark as op_bench
import torch

"""Microbenchmarks for interpolate operator."""

class InterpolateBenchmark(op_bench.TorchBenchmarkBase):
    def init(self, input_size, output_size, channels_last=False, mode='linear', antialias=False, dtype=torch.float):

        input_image = torch.randint(0, 256, size=input_size, dtype=torch.uint8, device='cpu')

        if channels_last:
            input_image = input_image.contiguous(memory_format=torch.channels_last)

        self.inputs = {
            "input_image": input_image,
            "output_size": output_size,
            "mode": mode,
            "antialias": antialias,
            "dtype":dtype,
        }

        self.set_module_name("interpolate")

    def forward(self, input_image, output_size, mode, antialias, dtype):
        if dtype == torch.float:
            input_image = input_image.float()

        out = torch.nn.functional.interpolate(input_image, size=output_size, mode=mode, align_corners=False, antialias=antialias)
        if dtype == torch.float:
            out = out.round().clamp(min=0, max=256).to(torch.uint8)

def make_config():
    sizes = (
        ((224, 224), (64, 64)),
        ((270, 268), (224, 224)),
        ((256, 256), (1024, 1024)),
    )

    attrs = []
    for (HW1, HW2) in sizes:
        attrs.append([(1, 3, *HW1), HW2])  # 3 channels
        # attrs.append([(1, 1, *HW1), HW2])  # 1 channel

        attrs.append([(1, 3, *HW2), HW1])  # 3 channels
        # attrs.append([(1, 1, *HW2), HW1])  # 1 channel

    config = op_bench.config_list(
        attr_names=["input_size", "output_size"],
        attrs=attrs,
        cross_product_configs={
            'channels_last': [True, False],
            'mode': ["bilinear", "bicubic"],
            'antialias': [True, False],
            # 'dtype': [torch.float, torch.uint8]
            # 'dtype': [torch.uint8]
            'dtype': [torch.float]
        },
        tags=["short"],
    )

    return config

config = make_config()
op_bench.generate_pt_test(config, InterpolateBenchmark)

if __name__ == "__main__":
    op_bench.benchmark_runner.main()

```

```py
import re
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("f1", nargs="?", default="main")
parser.add_argument("f2", nargs="?", default="new")
args = parser.parse_args()

with open(args.f1) as f:
    main = f.readlines()
with open(args.f2) as f:
    new = f.readlines()

out = []

for main_line, new_line in zip(main, new):
    # num_threads=1  # TODO: remove
    if main_line.startswith("num_threads="):
        num_threads = int(main_line.split("=")[-1])
    if main_line.startswith("# Input"):
        deets = f"{main_line.strip()}, {num_threads=}"
    if main_line.startswith("Forward"):
        main_time = float(main_line.split()[-1])
        new_time = float(new_line.split()[-1])
        ratio = main_time / new_time
        fmt = ".1f" if ratio < 3 else ".0f"
        improv = f"{ratio:{fmt}}X"
        time_fmt = ",.3f" if new_time < 100 else ",.1f"
        deets = deets.strip().replace("# Input: ", "")
        deets = deets.replace(": ", "=")
        deets = deets.replace("input_size=", "")
        deets = deets.replace(", output_size=", " -> ")
        deets = deets.replace("dtype=torch.", "")
        deets = deets.replace("mode=", "")
        deets = deets.replace("antialias=", "")
        deets = deets.replace("channels_last=", "")
        # deets = deets.replace("channels_last=True, ", "")
        split = deets.split(",")

        # size = ','.join(split[:-3])
        # mode, dtype, threads = split[-3:]
        # deets = f"{size:<30} {mode:<15} {dtype:<10} {threads:<15}"

        size = ','.join(split[:-5])
        channels_last, mode, antialias, dtype, threads= split[-5:]
        deets = f"{size:<33} {channels_last:<7} {antialias:<7} {mode:<10} {threads:<15}"

        l = f"{deets}  {improv:<5} {main_time / 1000:{time_fmt}}ms vs {new_time / 1000:{time_fmt}}ms"
        out.append(l)

def key(s):
    # s = ''.join(s.split()[1:]) # remove "N.nX" part
    num_threads = (int(re.findall(r"num_threads=(\d+)", s)[0]),)

    input_shape, output_shape = re.findall(r"\(.*?\)", s)
    input_shape = input_shape[1:-1]  # remove parenthesis
    input_HW = tuple(int(x) for x in input_shape.split(",")[-2:])
    input_C = (-int(input_shape.split(",")[1]),)

    output_HW = tuple(int(x) for x in output_shape[1:-1].split(","))
    is_downsample = (output_HW[0] < input_HW[0],)
    if "linear" in s:
        mode = "linear"
    elif "nearest-exact" in s:
        mode = "nearest-exact"
    else:
        # assert "nearest" in s
        mode = "nearest"
    mode = (mode,)
    return is_downsample + input_HW + output_HW + num_threads + input_C + mode

for i, l in enumerate(sorted(out, key=key)):
    if i % 8 == 0:
        print()
    # if i % 10 == 0 and i % 40 != 0:
    #     print()
    # if i % 40 == 0:
    #     print("-" * 100)
    print(l)

```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90771
Approved by: https://github.com/peterbell10, https://github.com/ngimel
2023-02-10 01:43:54 +00:00
ecao
81e318353f Align input memory format and grad memory format for GroupNorm backward (#92668)
Fixes the skipped part of the test on https://github.com/pytorch/pytorch/pull/92671. Align the input memory format and the grad memory format for GroupNorm backward.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92668
Approved by: https://github.com/jgong5, https://github.com/malfet
2023-02-09 08:56:43 +00:00
Nikita Shulga
768e547543 Fix SIGFPE in slow_conv3d_forward_out_cpu (#94325)
Set the number of groups to 0 if the weight's second dimension is zero.

`slow_conv_shape_check` will raise an exception if groups are zero anyway.

Fixes SIGFPE reported in https://github.com/pytorch/pytorch/issues/94125

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94325
Approved by: https://github.com/albanD
2023-02-08 14:15:39 +00:00
Aaron Gokaslan
3ce1ebb6fb Apply some safe comprehension optimizations (#94323)
Optimize unnecessary collection casts, remove unnecessary calls to list, tuple, and dict, and simplify calls to the sorted builtin. This should strictly improve speed and readability.
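A few hedged before/after examples of the kind of clean-up involved (illustrative, not specific hunks from this PR):

```python
d = {"b": 2, "a": 1}

keys = list(d)            # instead of list(d.keys())
pairs = dict(d.items())   # instead of {k: v for k, v in d.items()}
ordered = sorted(d)       # instead of sorted(list(d.keys()))

assert keys == list(d.keys()) and ordered == sorted(list(d.keys()))
```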

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94323
Approved by: https://github.com/albanD
2023-02-07 23:53:46 +00:00
Vivswan Shah
8c1ee89f19 Added super init to Module (#91819)
Added a super().__init__() call to Module to support complex user modules derived from multiple Python classes.
The call is added at the end of Module's __init__ so it does not change any existing functionality of the Module class.

I am working on building a package for simulating analog neural networks on PyTorch,
and this small change is really useful for that; there are many other useful cases as well, especially for deeper module or MRO hierarchies.
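A generic (not PyTorch-specific) illustration of why a base class calling super().__init__() matters for cooperative multiple inheritance:

```python
class Base:                      # stand-in for a framework base class
    def __init__(self):
        self.base_ready = True
        super().__init__()       # without this call, Mixin.__init__ below never runs

class Mixin:
    def __init__(self):
        self.mixin_ready = True
        super().__init__()

class Child(Base, Mixin):
    def __init__(self):
        super().__init__()       # walks the MRO: Base -> Mixin -> object

c = Child()
print(c.base_ready, c.mixin_ready)   # True True
```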

Issues: https://github.com/pytorch/pytorch/issues/28746, https://github.com/pytorch/pytorch/issues/48626, https://github.com/pytorch/pytorch/issues/61662, https://github.com/pytorch/pytorch/issues/74036
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91819
Approved by: https://github.com/albanD
2023-02-01 22:17:59 +00:00
Michael Gschwind
64d0624cee Explicit Name needed to run with buck test (#93035)
Summary: Explicit Name needed to run with buck test

Test Plan: sandcastle

Differential Revision: D42763774

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93035
Approved by: https://github.com/cpuhrsch
2023-01-27 14:36:46 +00:00
Jane Xu
b90496eef5 [nn] zero_grad() set_to_none default True (#92731)
Attempts to fix #92656

BC-breaking! This changes the default of zero_grad() in optim and in nn so that grads are set to None instead of zero tensors. We are changing the default because there are proven perf wins and existing code has typically not regressed due to this change. (Will probably have to flesh out this note more.)
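For reference, the behavioral change in code form:

```python
import torch

lin = torch.nn.Linear(2, 2)
lin(torch.randn(1, 2)).sum().backward()

lin.zero_grad()                      # new default: set_to_none=True
print(lin.weight.grad)               # None

lin(torch.randn(1, 2)).sum().backward()
lin.zero_grad(set_to_none=False)     # opt back into the old behavior
print(lin.weight.grad)               # tensor of zeros
```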

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92731
Approved by: https://github.com/ngimel
2023-01-26 01:04:28 +00:00
Nikita Shulga
97b7e4cdd5 Fix GroupNorm backward prop on CUDA (#92671)
Fixes regression introduced by https://github.com/pytorch/pytorch/pull/89485

Adds a test to prevent those regressions from happening in the future. In the process, discovered that GroupNormBackwards on CPU does not produce the same results if the input and gradient memory_format differ.

Fixes #92166

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92671
Approved by: https://github.com/ngimel, https://github.com/xuzhao9
2023-01-20 22:22:01 +00:00
milesial
e4d83d54a6 Foreach gradient clipping (#91846)
Faster gradient clipping using the foreach functions

```
[------------------------ (tensors, scalar) -------------------------]
                                   |  without foreach  |  with foreach |    apex
1 threads: ----------------------------------------------------------------------
      10 tensors of size 4         |         120.5     |       61.1    |     50.3
      100 tensors of size 4        |         946.2     |      239.5    |    136.3
      1000 tensors of size 4       |        9808.5     |     2151.1    |   1006.9
      10000 tensors of size 4      |       96871.2     |    22637.4    |  10119.1
      10 tensors of size 16        |         121.0     |       64.1    |     52.5
      100 tensors of size 16       |         993.4     |      252.6    |    136.7
      1000 tensors of size 16      |        9427.7     |     2151.2    |   1049.5
      10000 tensors of size 16     |       97437.1     |    22203.1    |  10340.0
      10 tensors of size 256       |         118.9     |       62.3    |     51.5
      100 tensors of size 256      |         955.2     |      243.1    |    134.2
      1000 tensors of size 256     |        9374.9     |     2140.7    |   1009.6
      10000 tensors of size 256    |       95302.5     |    21849.4    |  10215.5
      10 tensors of size 65536     |         118.5     |       62.4    |     51.1
      100 tensors of size 65536    |        1740.7     |      243.3    |    225.3
      1000 tensors of size 65536   |       17364.1     |     2228.7    |   2004.5
      10000 tensors of size 65536  |      177510.1     |    25410.4    |  20678.2
```
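A hedged usage sketch; the foreach argument selects the fused torch._foreach_* path benchmarked above.

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.Linear(16, 1))
model(torch.randn(4, 16)).sum().backward()

# clip all parameter gradients with the foreach implementation
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0, foreach=True)
print(total_norm)
```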
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91846
Approved by: https://github.com/janeyx99
2023-01-20 21:43:29 +00:00
vfdev-5
5f55335c2e Fixed output memory format mismatch for bicubic2d (#90470)
Description:

- output memory format is matching input for bicubic2d

Problem: output tensor's memory format does not match input format for bicubic2d

```python
import torch

i = torch.rand(1, 3, 32, 32).contiguous(memory_format=torch.channels_last)
assert i.is_contiguous(memory_format=torch.channels_last)
o = torch.nn.functional.interpolate(i, size=(4, 4), mode="bicubic")
assert o.is_contiguous(memory_format=torch.channels_last), f"Should be channels last but given channels first ({o.is_contiguous(memory_format=torch.contiguous_format)})"

> AssertionError: Should be channels last but given channels first (True)
```

Related PR fixing bilinear ops: https://github.com/pytorch/pytorch/pull/53535 (cc @VitalyFedyunin @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @bdhirsh )

Discovered together with @NicolasHug while working on https://github.com/pytorch/pytorch/tree/interpolate_uint8_images_linear_cpu_support_dev

- Updated code to match grad input / output memory formats
- temporary tensor creation matches memory format in `separable_upsample_generic_Nd_kernel_impl`
- Updated tests
- Added missing forward AD support for bicubic with antialiasing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90470
Approved by: https://github.com/NicolasHug, https://github.com/lezcano
2023-01-12 19:52:28 +00:00
Aleksandar Samardžić
8612ec5b90 Implement hybrid sparse to/from dense conversions. (#90177)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90177
Approved by: https://github.com/cpuhrsch, https://github.com/pearu
2023-01-12 03:31:30 +00:00
anjali411
c887837ec3 Reland "Fix dynamo handling for tensor attributes: T, H, mT, mH (#90463)" (#91897)
This reverts commit 84266ae670.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91897
Approved by: https://github.com/ngimel
2023-01-10 08:16:07 +00:00
ecao
5030929c5d add channels last with mixed data type support for GroupNorm backward (#89485)
### Motivation
1. Add channels last support for GroupNorm backward to make sure GroupNorm fully supports channels last.
2. Same as #88663, mixed data type support is also needed for channels last implementation of GroupNorm backward.

### Testing
Single socket (28cores):

* Contiguous:

shape | fp32 forward / s | mixed fp32 bf16 forward / s | fp32 backward / s | mixed fp32 bf16 backward / s
-- | -- | -- | -- | --
[10, 128, 20, 20] | 3.20E-05 | 3.60E-05 | 8.31E-05 | 8.13E-05
[10, 128, 50, 50] | 0.000126 | 0.000115 | 0.000356 | 0.000257

* Channels Last:

shape | fp32 forward / s | mixed fp32 bf16 forward / s | fp32 backward / s | mixed fp32 bf16 backward / s
-- | -- | -- | -- | --
[10, 128, 20, 20] | 4.11E-05 | 4.12E-05 | 9.74E-05 | 9.66E-05
[10, 128, 50, 50] | 0.000179 | 0.000178 | 0.000393 | 0.000317

Single core:

* Contiguous:

shape | fp32 forward / s | mixed fp32 bf16 forward / s | fp32 backward / s | mixed fp32 bf16 backward / s
-- | -- | -- | -- | --
[10, 128, 20, 20] | 2.47E-04 | 2.53E-04 | 5.92E-04 | 4.50E-04
[10, 128, 50, 50] | 0.001559 | 0.001384 | 0.004343 | 0.002436

* Channels Last:

shape | fp32 forward / s | mixed fp32 bf16 forward / s | fp32 backward / s | mixed fp32 bf16 backward / s
-- | -- | -- | -- | --
[10, 128, 20, 20] | 2.27E-04 | 3.24E-04 | 0.0006224 | 0.000459
[10, 128, 50, 50] | 0.00167 | 0.001278 | 0.0041858 | 0.003027

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89485
Approved by: https://github.com/jgong5, https://github.com/malfet
2022-12-29 07:19:39 +00:00
ecao
59a5be3b45 add mixed data type support for GroupNorm backward on CPU (#88663)
### Motivation
Amp provides convenience methods for mixed precision. If users use amp to run bfloat16 models, torch.autocast will keep module parameters in acc dtype which will leave gamma and beta in float while input/output will be in bfloat16. The same goes for backward: parameters are in float, and X & dX & dY are in bfloat16.
Mixed data type support for GroupNorm backward is also needed for model training with GroupNorm.

### Testing

Single socket (28cores):
* Contiguous:

shape | fp32 forward / s | mixed fp32 bf16 forward / s | fp32 backward / s | mixed fp32 bf16 backward / s
-- | -- | -- | -- | --
[10, 128, 20, 20] | 3.08E-05 | 3.50E-05 | 8.06E-05 | 7.69E-05
[10, 128, 50, 50] | 0.000121 | 0.000114 | 0.000358 | 0.000203

* Channels Last (inputs and outputs will be converted to contiguous):

shape | fp32 forward / s | mixed fp32 bf16 forward / s | fp32 backward / s | mixed fp32 bf16 backward / s
-- | -- | -- | -- | --
[10, 128, 20, 20] | 4.04E-05 | 4.41E-05 | 0.000226 | 0.000305
[10, 128, 50, 50] | 0.000169 | 0.000166 | 0.001628 | 0.001169

Single core:

* Contiguous:

shape | fp32 forward / s | mixed fp32 bf16 forward / s | fp32 backward / s | mixed fp32 bf16 backward / s
-- | -- | -- | -- | --
[10, 128, 20, 20] | 2.38E-04 | 2.51E-04 | 5.94E-04 | 4.50E-04
[10, 128, 50, 50] | 0.00171 | 0.001395 | 0.0044455 | 0.00243

* Channels Last (inputs and outputs will be converted to contiguous):

shape | fp32 forward / s | mixed fp32 bf16 forward / s | fp32 backward / s | mixed fp32 bf16 backward / s
-- | -- | -- | -- | --
[10, 128, 20, 20] | 2.28E-04 | 3.26E-04 | 0.0016528 | 0.003165
[10, 128, 50, 50] | 0.001788 | 0.001302 | 0.0276621 | 0.019447

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88663
Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/malfet
2022-12-22 01:12:42 +00:00
mingfeima
4bf22fcfe2 add mixed data type support for GroupNorm (#81852)
1. If the user uses amp to run bfloat16 models, `torch.autocast` will
keep module parameters in the accumulation dtype, which leaves `gamma` and `beta`
in float while input/output will be in bfloat16.

2. If the user explicitly casts the model to bfloat16,
the input/output and gamma/beta will all be in bfloat16.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81852
Approved by: https://github.com/jgong5, https://github.com/malfet
2022-12-19 07:59:40 +00:00
Xiao Wang
670efb974a [CUDA] Use accumulate type to improve accuracy of grid_sample on half precision inputs (#90427)
Fixes https://github.com/pytorch/pytorch/issues/89836

This PR changes the CUDA kernels of grid_sample 2d and 3d, forward, to use accumulate type to improve accuracy on half precision inputs.

Also, the backward error on grad with half input is in the order of 1e-4, unlike 1e2 in forward process. The backward kernels are thus unchanged.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90427
Approved by: https://github.com/ngimel
2022-12-15 03:41:35 +00:00
Rohan Varma
9c80f13692 [Resubmit] state_dict_pre_hook (#90435)
Resubmit of https://github.com/pytorch/pytorch/pull/88541 which got stale.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90435
Approved by: https://github.com/fegin
2022-12-08 07:54:14 +00:00
Sergii Dymchenko
f09e7b5ce7 Replace assertEqualIgnoreType in test_nn.py (#90242)
See https://github.com/pytorch/pytorch/issues/38095.

Also removed some redundant separate `dtype` checks when `dtype` is already checked by the next line's `assertEqual`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90242
Approved by: https://github.com/malfet
2022-12-06 22:34:01 +00:00
PyTorch MergeBot
cba96366a2 Revert "remove torch.equal usages (#89527)"
This reverts commit 4095ef8b80.

Reverted https://github.com/pytorch/pytorch/pull/89527 on behalf of https://github.com/clee2000 due to broke periodic multigpu tests 4095ef8b80 https://github.com/pytorch/pytorch/actions/runs/3592806602/jobs/6049368502
2022-12-02 21:36:13 +00:00
Philip Meier
4095ef8b80 remove torch.equal usages (#89527)
Preparation for the next PR in this stack: #89559.

I replaced

- `self.assertTrue(torch.equal(...))` with `self.assertEqual(..., rtol=0, atol=0, exact_device=True)`,
- the same for `self.assertFalse(...)` with `self.assertNotEqual(...)`, and
- `assert torch.equal(...)` with `torch.testing.assert_close(..., rtol=0, atol=0)` (note that we don't need to set `check_device=True` here since that is the default).

There were a few instances where the result of `torch.equal` is used directly. In those cases I've replaced it with `(... == ...).all().item()`, sometimes also dropping the `.item()` depending on the context.
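
For example, the third pattern above becomes (illustrative snippet, not from the PR):

```python
import torch

a = torch.arange(4.0)
b = a.clone()

# before: assert torch.equal(a, b)
torch.testing.assert_close(a, b, rtol=0, atol=0)  # exact comparison; device is checked by default
```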

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89527
Approved by: https://github.com/mruberry
2022-12-01 11:22:52 +00:00
mingfeima
f1978b18f9 add mixed data type support for LayerNorm (#81851)
1. If the user runs bfloat16 models with amp, `torch.autocast` keeps
module parameters in the accumulation dtype, which leaves `gamma` and `beta`
in float while input/output will be in bfloat16.

2. If the user explicitly casts the model to bfloat16, for example:
```
  x = torch.randn(n, t, c).bfloat16()
  ln = nn.LayerNorm(c).bfloat16()
  y = ln(x)
```
The input/output and gamma/beta will all be in bfloat16.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81851
Approved by: https://github.com/ezyang
2022-12-01 04:48:34 +00:00
kshitij12345
8314d403a6 [test_nn] split multihead_attention from test_nn (#89748)
Ref: https://github.com/pytorch/pytorch/issues/63085
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89748
Approved by: https://github.com/albanD
2022-11-29 18:15:18 +00:00
Jiong Gong
620994cd7a Guard the boundary of index computed in compute_source_index_and_lambda (#89252)
Improve the fix in https://github.com/pytorch/pytorch/pull/89210
See discussion in https://github.com/pytorch/pytorch/issues/89212#issuecomment-1318911969
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89252
Approved by: https://github.com/mingfeima, https://github.com/weiwangmeta
2022-11-29 13:55:22 +00:00
Yuxin Wu
56e40fe054 Let SyncBatchNorm fallback to BN if not using distributed training (#89706)
Fixes #63662
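
A rough sketch of the behavior described by the title, assuming a `SyncBatchNorm` layer now runs like a regular batch norm when `torch.distributed` is not initialized (illustrative only):

```python
import torch
import torch.nn as nn

bn = nn.SyncBatchNorm(8)   # no process group has been initialized
x = torch.randn(4, 8, 16, 16)
y = bn(x)                  # previously errored without distributed; now falls back to plain BN
print(y.shape)
```
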
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89706
Approved by: https://github.com/soumith
2022-11-27 05:55:24 +00:00
kshitij12345
d3c012f409 [test_nn] split pruning tests from test_nn (#89590)
Ref: https://github.com/pytorch/pytorch/issues/63085

Note: Doesn't need corresponding XLA PR as the migrated tests were not run on XLA (as they weren't in TestNNDeviceType).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89590
Approved by: https://github.com/albanD
2022-11-24 21:41:22 +00:00
Nikita Karetnikov
0a1a53083e [primTorch] Enable regex error testing for some refs (#87765)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87765
Approved by: https://github.com/mruberry
2022-11-23 23:36:27 +00:00
kshitij12345
1333fdcff1 [test_nn] split parametrization test from test_nn (#89552)
Ref: https://github.com/pytorch/pytorch/issues/63085

Note: Doesn't need corresponding XLA PR as the migrated tests were not run on XLA (as they weren't in TestNNDeviceType).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89552
Approved by: https://github.com/albanD
2022-11-23 17:27:40 +00:00
Kshiteej K
c651944f92 [test_nn] split hooks test from test_nn (#89201)
Ref: https://github.com/pytorch/pytorch/issues/63085

Note: Doesn't need corresponding XLA PR as the migrated tests were not run on XLA (as they weren't in TestNNDeviceType).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89201
Approved by: https://github.com/albanD
2022-11-23 08:39:45 +00:00
Kshiteej K
dd140fc351 [test_nn] move init tests from test_nn (#89202)
Ref: https://github.com/pytorch/pytorch/issues/63085

Note: Doesn't need corresponding XLA PR as the migrated tests were not run on XLA (as they weren't in TestNNDeviceType).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89202
Approved by: https://github.com/albanD
2022-11-23 08:30:51 +00:00
ecao
3beccbc299 Add BFloat16 support and optimization for mish, hardtanh backward, and silu on CPU (#82460)
### Description
* add BFloat16 support for mish and hardtanh backward on CPU.
* optimize the performance for silu

### Testing

- optimize the performance for silu: bfloat16

single socket (28 cores):
```
before: 1x128x1024  forward 0.090 s  backward  0.218 s
        10x128x1024 forward 0.146 s  backward  0.314 s

after:  1x128x1024   forward  0.064 s backward  0.100 s
        10x128x1024  forward  0.085 s backward  0.133 s
```
single core:
```
before: 1x128x1024   forward 0.300 s  backward  0.606 s
        10x128x1024  forward 2.825 s  backward  5.834 s

after:  1x128x1024   forward 0.156 s backward   0.239 s
        10x128x1024  forward 1.447 s backward   2.165 s
```

- Add BFloat16 support for mish and backward of hardtanh on CPU.

single socket (20 cores):
op | shape | fp32 forward / s | fp32 backward / s | bf16 forward / s | bf16 backward / s
-- | -- | -- | -- | -- | --
silu | [10, 128, 10, 10] | 4.41E-05 | 7.67E-05 | 5.32E-05 | 9.38E-05
silu | [10, 128, 80, 80] | 0.0008 | 0.001788 | 0.00067 | 0.001031
mish | [10, 128, 10, 10] | 0.000356 | 0.000427 | 0.000367 | 0.000436
mish | [10, 128, 80, 80] | 0.004527 | 0.005807 | 0.004757 | 0.005393
hardtanh | [10, 128, 10, 10] | / | 3.97E-05 | / | 4.45E-05
hardtanh | [10, 128, 80, 80] | / | 0.001748 | / | 0.000645

single core:
op | shape | fp32 forward / s | fp32 backward / s | bf16 forward / s | bf16 backward / s
-- | -- | -- | -- | -- | --
silu | [10, 128, 10, 10] | 1.17E-04 | 1.91E-04 | 1.35E-04 | 2.23E-04
silu | [10, 128, 80, 80] | 0.007434 | 0.013141 | 0.008464 | 0.013044
mish | [10, 128, 10, 10] | 0.00103 | 0.00122 | 0.00106 | 0.001227
mish | [10, 128, 80, 80] | 0.065629 | 0.078418 | 0.067779 | 0.077214
hardtanh | [10, 128, 10, 10] | / | 1.18E-04 | / | 9.30E-05
hardtanh | [10, 128, 80, 80] | / | 0.010773 | / | 0.005834

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82460
Approved by: https://github.com/mingfeima, https://github.com/malfet
2022-11-17 08:15:52 +00:00
ecao
44c9185f91 Fix empty input issue of convolution for channels last memory format (#86521)
Fixes an empty-input convolution issue: when the input is empty, e.g. of shape (0, 3, 3, 4), and the weight is in channels last format, at::_unsafe_view raises "view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead."

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86521
Approved by: https://github.com/jgong5, https://github.com/malfet
2022-11-17 04:47:45 +00:00
Jerry Zhang
1adb7b9b84 [nn][utils] Preserve requires_grad from original weight and bias in fuse conv/linear bn weights (#89100)
Summary:
As titled: previously we just called nn.Parameter, which has requires_grad=True by default; after
this PR we preserve the requires_grad of the original weight and bias.
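
A small sketch of the preserved flag, using `torch.nn.utils.fusion.fuse_conv_bn_eval` for illustration (the exact entry points covered by the PR may differ):

```python
import torch.nn as nn
from torch.nn.utils.fusion import fuse_conv_bn_eval

conv = nn.Conv2d(3, 8, 3, bias=True)
bn = nn.BatchNorm2d(8)
conv.weight.requires_grad_(False)   # pretend the conv weight is frozen

fused = fuse_conv_bn_eval(conv.eval(), bn.eval())
print(fused.weight.requires_grad)   # False after this change (previously always True)
```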

Test Plan:
python test/test_nn.py TestFusionUtils

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D41343694](https://our.internmc.facebook.com/intern/diff/D41343694)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89100
Approved by: https://github.com/ngimel
2022-11-17 03:58:16 +00:00
Xiao Wang
f5df685090 Enable channels_last_3d on SyncBatchNorm (#88401)
This PR enabled the use of fast channels_last kernels on SyncBatchNorm with channels_last_3d memory format.

With a small benchmark script here https://github.com/pytorch/pytorch/issues/88021#issuecomment-1299059859, on V100, I got

master:
```
DDP channels_last=False, run_forward_backward, time: 0.8945400714874268 sec
DDP channels_last=True, run_forward_backward, time: 1.4736433029174805 sec
```

This PR:
```
DDP channels_last=False, run_forward_backward, time: 0.8927242755889893 sec
DDP channels_last=True, run_forward_backward, time: 0.48697471618652344 sec
```

This PR is a follow-up of https://github.com/pytorch/pytorch/pull/46906

Close https://github.com/pytorch/pytorch/issues/88021
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88401
Approved by: https://github.com/ngimel
2022-11-15 19:25:53 +00:00
Grigory Sizov
7ad87f63e2 Support src_mask and src_key_padding_mask for Better Transformer (#88488)
Fixes T135842750 (follow-up for #87377)

## Description

At present, having both `src_key_padding_mask` and `src_mask` at the same time is not supported on the fastpath in Transformer and Multi-Head Attention.

This PR enables using both masks on the fastpath on CPU and GPU: if both masks are passed, we merge them into a 4D mask in Python and change mask type to 2 before passing downstream.
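
A sketch of what such a merge could look like (the exact merging code in the PR may differ; `True` marks a masked-out position):

```python
import torch

B, H, L = 2, 4, 5
src_mask = torch.zeros(L, L, dtype=torch.bool)              # (L, L) attention mask
src_key_padding_mask = torch.zeros(B, L, dtype=torch.bool)  # (B, L) padding mask
src_key_padding_mask[:, -1] = True                          # e.g. last token is padding

# Broadcast both masks to (B, num_heads, L, L) and combine them.
merged = src_mask.view(1, 1, L, L) | src_key_padding_mask.view(B, 1, 1, L)
merged = merged.expand(B, H, L, L)
print(merged.shape)  # torch.Size([2, 4, 5, 5])
```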

Downstream processing in native code is not changed, as it already supports 4D mask. Indeed, it is done depending on the device:
- on CUDA, by `SoftMax.cu::masked_softmax_cuda`. When mask type is 2, it calls either `dispatch_softmax_forward` -> `softmax_warp_forward` or `at::softmax` (depending on the input size). In both cases 4D mask is supported.
- on CPU, by `SoftMax.cpp::masked_softmax_cpp`. It calls `hosted_softmax` which supports 4D mask.

## Tests
- Extended `test_mask_check_fastpath` to check that fast path is indeed taken in Transformer when two masks are passed
- Added `test_multihead_self_attn_two_masks_fast_path_mock` to check that fast path is taken in MHA when two masks are passed
- Added `test_multihead_self_attn_two_masks_fast_path` to check that fast and slow paths give the same result when two masks are passed in MHA
- `test_masked_softmax_mask_types` now covers mask type 2
- `test_transformerencoderlayer_fast_path` (CPU smoke test) is expanded to the case of both masks provided simultaneously
- `test_masked_softmax_devices_parity` checks that mask type 2 is accepted by CPU and CUDA paths

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88488
Approved by: https://github.com/mikekgfb
2022-11-10 08:12:56 +00:00
Samantha Andow
87238e6491 [nn] add remove_duplicate flag to named_parameters (#759) (#88090)
Summary:
X-link: https://github.com/pytorch/torchrec/pull/759

Since the remove_duplicate flag was added to named_buffers in D39493161 (c12f829cce), this adds the same flag to named_parameters
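
Illustrative use of the new flag (mirroring the existing `named_buffers` behavior):

```python
import torch.nn as nn

lin = nn.Linear(2, 2)
tied = nn.Sequential(lin, lin)  # the same parameters are reachable under two prefixes

print(len(list(tied.named_parameters())))                        # 2: duplicates removed (default)
print(len(list(tied.named_parameters(remove_duplicate=False))))  # 4: duplicates kept
```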

Test Plan:
python test/test_nn.py -k test_buffers_and_named_buffers

OSS Tests

Differential Revision: D40801899

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88090
Approved by: https://github.com/albanD
2022-11-09 00:09:20 +00:00
Nikita Karetnikov
bbaa0637df Add error inputs to gaussian_nll_loss OpInfo (#88486)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88486
Approved by: https://github.com/lezcano
2022-11-05 20:10:54 +00:00
Philip Meier
bc73affdad prepare removal of deprecated functionality in torch.testing (#87969)
_Redo of #86586 with all BC breaking changes granularly placed into separate commits._

---

Per title. Deprecation happened on Feb 25, 2022 in c6f1bbc0ac, which made it into the 1.12 release. Since it is now 245 days later and the next release will be 1.14, the removals later in the stack comply with the [BC policy](https://github.com/pytorch/pytorch/wiki/PyTorch's-Python-Frontend-Backward-and-Forward-Compatibility-Policy#minimizing-the-disruption-of-bc-breaking-changes).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87969
Approved by: https://github.com/mruberry
2022-11-02 14:04:48 +00:00
Grigory Sizov
4c78c7c82a Enable src_mask in fast path of TransformerEncoderLayer (#87377)
## Issues
Fixes https://github.com/pytorch/pytorch/issues/81129#issuecomment-1179435674

## Description

Passing a 2D attention mask `src_mask` into the fast path of `TransformerEncoderLayer` on CPU was causing an error and so was disabled in https://github.com/pytorch/pytorch/pull/81277. This PR lifts that restriction, enabling `src_mask` on the fast path (a usage sketch follows the list below):

- Either attention mask `src_mask` of shape `(L, L)` or padding mask `src_key_padding_mask` of shape `(B, L)` are now allowed on the CPU fast path. If softmax is applied along the last dimension (as in multi-head attention), these masks are processed without expanding them to 4D. Instead, when iterating through the input, `Softmax.cpp::host_softmax` converts the index to match the mask dimensions, depending on the type.
- If softmax is applied along a dimension other than the last, `Softmax.cpp::masked_softmax_cpu` expands masks to 4D, converting them to `mask_type=2`. Theoretically one could also add special optimized cases for `dim=0, 1, 2` and process them without mask expansion, but I don't know how often that is used
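
As referenced above, a usage sketch (whether the fast path is actually taken also depends on eval/no-grad mode and other conditions; this only shows the call):

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=16, nhead=4, batch_first=True).eval()
x = torch.randn(2, 5, 16)
attn_mask = torch.zeros(5, 5, dtype=torch.bool)  # (L, L) src_mask, nothing masked here

with torch.inference_mode():
    y = layer(x, src_mask=attn_mask)
print(y.shape)  # torch.Size([2, 5, 16])
```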

## Tests:
- `test_transformerencoderlayer_fast_path` is extended to cover both attention mask and padding mask
- `test_masked_softmax_mask_types_0_1` is added to ensure results from CPU softmax with attention and padding masks match the explicit slow calculation
- `test_masked_softmax_devices_parity` is added to ensure results from masked softmax on CPU and CUDA match

## Note
I had to replace `float` with `torch.get_default_dtype()` in a couple of tests for the following reason:
- `test_nn.py` [sets the default type to `torch.double`](https://github.com/pytorch/pytorch/blob/master/test/test_nn.py#L24-L26)
- If I execute `test_nn.py` and `test_transformers.py` in one `pytest` run, this default still holds for transformer tests
- Some tests in `test_transformers.py` which were previously following the slow path now switched to fast path, and hard-coded `float` started clashing with default `double`

Let me know if there is a better way around it - or maybe I'm not supposed to run tests with `pytest` like this

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87377
Approved by: https://github.com/mikekgfb, https://github.com/weiwangmeta, https://github.com/malfet
2022-10-31 19:59:36 +00:00
Kshiteej K
6735bf21c7 [test_nn] split convolution tests from test_nn (#87474)
Ref #63085

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87474
Approved by: https://github.com/albanD
2022-10-31 04:42:45 +00:00
Eddie Yan
c5cb6ec066 Allow 64bit indexing for channels-last upsample2d on CUDA (#87901)
#81665

CC @ngimel @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87901
Approved by: https://github.com/ngimel
2022-10-28 19:33:42 +00:00
eqy
4c8e1a9829 Fix 64bit indexing in vol2col (#87527)
Surfaced from #87354

CC @ngimel @ptrblck @maybeLee
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87527
Approved by: https://github.com/ngimel
2022-10-23 21:17:12 +00:00
Antonio Kim
6b59d9b566 Fix registration hooks (#87369)
There is a bug in the implementation of the registration hooks introduced in https://github.com/pytorch/pytorch/pull/86148 whereby if the hook returns a tensor, then the short circuiting logic:
```
value = hook(self, name, value) or value
```
Raises an exception
```
RuntimeError: Boolean value of Tensor with more than one value is ambiguous
```
The fix changes the logic so that it only checks whether the returned value is `None` before overriding.
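
A runnable sketch of the corrected pattern (illustrative; it mirrors the description above rather than the exact diff):

```python
import torch

def apply_hook(hook, module, name, value):
    # Only override `value` when the hook returns something other than None,
    # instead of `hook(...) or value`, which calls bool() on a returned Tensor.
    out = hook(module, name, value)
    return value if out is None else out

hook = lambda module, name, value: value * 2  # a hook that returns a tensor
print(apply_hook(hook, None, "weight", torch.ones(3)))  # tensor([2., 2., 2.])
```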

Fixes #85837

CC: @albanD @jbschlosser
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87369
Approved by: https://github.com/albanD
2022-10-21 05:12:25 +00:00
Rui Zhu
4b757f4633 Assert if padding mask type is unexpected (#86353) (#87106)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86353

Fix the issue described in
https://github.com/pytorch/pytorch/issues/86120

Test Plan: buck test mode/opt caffe2/test:test_transformers -- test_train_with_long_type_pad

Differential Revision: D40129968

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87106
Approved by: https://github.com/malfet
2022-10-20 16:01:54 +00:00
Kshiteej K
54ee95c8ec [nn] module: full_backward_pre_hook (#86700)
Fixes https://github.com/pytorch/pytorch/issues/42824

* [x] Test
* [x] Doc
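
For illustration, a sketch of how the new pre-hook might be used, assuming the added method is `Module.register_full_backward_pre_hook` (analogous to `register_full_backward_hook`); not taken from the PR:

```python
import torch
import torch.nn as nn

def bw_pre_hook(module, grad_output):
    # Runs before this module's backward; may return a replacement grad_output or None.
    print("grad_output shapes:", [g.shape for g in grad_output])
    return None

lin = nn.Linear(4, 4)
handle = lin.register_full_backward_pre_hook(bw_pre_hook)
lin(torch.randn(2, 4)).sum().backward()
handle.remove()
```
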
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86700
Approved by: https://github.com/soulitzer
2022-10-13 17:36:39 +00:00
CaoE
b79bac0e4d Make the data types of output and input consistent for batchnorm (#84410)
The TTS model crashes due to the following issue: when the input of BN is not contiguous and its data type differs from that of the parameters, BN raises `RuntimeError: !needs_dynamic_casting<func_t>::check(iter) INTERNAL ASSERT FAILED at "xxx/pytorch/aten/src/ATen/native/cpu/Loops.h":311, please report a bug to PyTorch`.

Make the data types of output and input consistent for batchnorm to fix the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84410
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet
2022-10-13 00:42:46 +00:00
Antonio Kim
09a676f639 Add hooks for register_buffer/module/parameter (#86148)
As described in the issue, this PR adds hooks to be run when `register_parameter`, `register_buffer` and `register_module` are called.

Fixes #85837

cc @albanD @mruberry @jbschlosser @walterddr @kshitij12345 @saketh-are
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86148
Approved by: https://github.com/albanD
2022-10-12 20:57:22 +00:00
Nikita Karetnikov
d56017a14f [primTorch] Add ref for triplet_margin_loss, improve triplet_margin_with_distance_loss (#85614)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85614
Approved by: https://github.com/lezcano, https://github.com/mruberry
2022-10-12 18:37:58 +00:00
Nikita Shulga
9eb4f9dd17 Tweak test tolerances to be compatible with A10G (#86538)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86538
Approved by: https://github.com/ngimel
2022-10-11 23:31:48 +00:00
Jerry Zhang
c12f829cce [nn] Add remove_duplicate flag to named_buffers (#674) (#85903)
Summary:
X-link: https://github.com/pytorch/torchrec/pull/674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84984

this is to allow named_buffers to return the same buffer objects with different names multiple times, needed by internal use cases
ghstack-source-id: 168589597

Test Plan:
python test/test_nn.py -k test_buffers_and_named_buffers

Imported from OSS

Reviewed By: albanD

Differential Revision: D39493161

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85903
Approved by: https://github.com/albanD
2022-10-11 18:49:09 +00:00
Kshiteej K
e18d466f35 [test_nn] split lazy_modules from test_nn (#86526)
Ref: #63085

NOTE: We don't need an accompanying XLA PR as these tests run only on CPU.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86526
Approved by: https://github.com/albanD
2022-10-10 16:29:56 +00:00
Pearu Peterson
6b295cd046 Enable autograd on Linear with sparse COO weight (#86302)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86302
Approved by: https://github.com/cpuhrsch
2022-10-06 18:39:31 +00:00
Pearu Peterson
f104490d63 Support autograd on Linear with sparse compressed weight. (#86137)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86137
Approved by: https://github.com/cpuhrsch
2022-10-06 18:39:25 +00:00
Kshiteej K
6a5550fca4 [test_nn] split embedding tests from test_nn (#85892)
Ref https://github.com/pytorch/pytorch/issues/63085
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85892
Approved by: https://github.com/albanD
2022-09-30 21:45:40 +00:00
lezcano
787028cadb Implement col2im decomposition and fix im2col and add a few preconditions (#85541)
As per title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85541
Approved by: https://github.com/jansel
2022-09-30 09:31:53 +00:00
George Qi
85258ec17e Add mask_type=2 to masked_softmax for when mask.size() == input.size() (#85915)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85915
Approved by: https://github.com/cpuhrsch
2022-09-29 23:13:37 +00:00
Masaki Kozuki
ef0baba23f Use int64_t for nll_loss with cuda inputs (#85395)
Related #85005
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85395
Approved by: https://github.com/t-vi, https://github.com/lezcano
2022-09-29 17:02:04 +00:00
Mikayla Gawarecki
afaee00fec Add python nested_tensor and as_nested_tensor constructors in torch.nested (#85593)
Remove `torch.nested_tensor`, which has erroneous behavior wrt gradients (it could be either a leaf or not a leaf). Introduce `torch.nested.nested_tensor` and `torch.nested.as_nested_tensor` in the vein of `torch.tensor` and `torch.as_tensor`. Done in the nested `__init__.py` for now but can move to pybind in the future (when we want to load from numpy/nested lists).

Discussed offline with @cpuhrsch and pybind constructor (https://github.com/pytorch/pytorch/pull/85536) was more gnarly than expected, so we can move to that when we do need loading from numpy etc.
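
Illustrative use of the new constructors (a sketch, not from the PR):

```python
import torch

nt = torch.nested.nested_tensor([torch.randn(2, 3), torch.randn(4, 3)])
padded = torch.nested.to_padded_tensor(nt, 0.0)
print(nt.is_nested, padded.shape)  # True torch.Size([2, 4, 3])
```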

Differential Revision: [D39806622](https://our.internmc.facebook.com/intern/diff/D39806622)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85593
Approved by: https://github.com/drisspg, https://github.com/cpuhrsch
2022-09-28 20:15:02 +00:00
Weiyi Zheng
b2311192e6 [NN module] speed up _load_from_state_dict (#85743)
Fixes #61398

The original implementation is very slow when `state_dict` has many keys. This PR only passes the relevant keys to each child module.

Existing tests pass: `pytest test/test_nn.py -k state_dict`
I couldn't figure out a good way to write a new test for this behavior. I had a snippet (below), but it would be flaky in the main CI because it is a timing-based check.
I can verify that the test took 30s to run before this PR and only 0.5s after.

```python
    def test_load_state_dict_large(self):
        # construct a module with 4 levels of nesting, 10 modules per level: 10k Linears -> 20k state_dict entries
        import copy
        import time
        base_module = nn.Linear(1,1)
        model = base_module
        for level in range(4):
           model = nn.Sequential(*[copy.deepcopy(model) for _ in range(10)])
        state_dict = model.state_dict()
        self.assertEqual(len(state_dict), 20000)
        st = time.time()
        model.load_state_dict(state_dict, strict=True)
        strict_load_time = time.time() - st
        # after this PR the strict load takes ~0.5s (was ~30s); allow generous headroom
        self.assertLess(strict_load_time, 10)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85743
Approved by: https://github.com/albanD
2022-09-28 15:26:03 +00:00
Eddie Yan
2bc82163eb Reduce memory usage requirement of test_warp_softmax_64bit_indexing in test_nn.py (re-open of #85037) (#85373)
CC @ngimel @xwang233 @ptrblck

Adds fix for `get_tolerances`, tested locally on a dgx Volta.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85373
Approved by: https://github.com/ngimel
2022-09-22 07:34:47 +00:00
Mikayla Gawarecki
77f1f98479 Re-introduce torch.Tensor.to_padded_tensor (#85293)
Differential Revision: [D39629004](https://our.internmc.facebook.com/intern/diff/D39629004)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85293
Approved by: https://github.com/cpuhrsch
2022-09-21 18:45:56 +00:00
PyTorch MergeBot
53fdd60635 Revert "Reduce memory usage requirement of test_warp_softmax_64bit_indexing in test_nn.py (#85037)"
This reverts commit 66a9cba221.

Reverted https://github.com/pytorch/pytorch/pull/85037 on behalf of https://github.com/clee2000 due to broke test_warp_softmax_64bit_indexing_cuda_float32 and test_warp_softmax_64bit_indexing_cuda_float16 on rocm https://github.com/pytorch/pytorch/actions/runs/3085764744/jobs/4989643817
2022-09-20 00:13:41 +00:00
eqy
66a9cba221 Reduce memory usage requirement of test_warp_softmax_64bit_indexing in test_nn.py (#85037)
For reference: #84944

CC @xwang233 @ngimel @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85037
Approved by: https://github.com/ngimel, https://github.com/pmeier
2022-09-19 21:31:08 +00:00
Elias Ellison
f37069aac7 Re-enable fixed dynamo tests (#84969)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84969
Approved by: https://github.com/bdhirsh, https://github.com/ezyang
2022-09-16 15:36:52 +00:00
Michael Melesse
b6d6a78c12 [ROCM] test_batchnorm_cudnn_nhwc (#84603)
This PR enables test_batchnorm_cudnn_nhwc. It is a follow-up to https://github.com/pytorch/pytorch/pull/82512
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84603
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet
2022-09-14 15:50:14 +00:00
Mikayla Gawarecki
e217b30b0f Add torch.nested namespace (#84102)
First step towards #83775
- only `to_padded_tensor` is moved to the nested namespace for now
- following the schema used for `special`, `fft`, `linalg` and other namespaces, nested functions are registered in native_functions.yaml as `nested_{function_name}` and are bound to the desired Python name in
`torch/nested/__init__.py`, and the desired C++ name in `torch/csrc/api/include/torch/nested.h`.

~~**Question**: should we keep the documentation for `Tensor.to_padded_tensor` or can this deleted since it is shared by `torch.nested.to_padded_tensor`?~~

[generated nested docs](https://docs-preview.pytorch.org/84102/nested.html?highlight=nested#module-torch.nested)

Differential Revision: [D39361148](https://our.internmc.facebook.com/intern/diff/D39361148)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84102
Approved by: https://github.com/drisspg
2022-09-12 16:31:05 +00:00
Kshiteej K
6d6e04d6cc [test_nn] move dropout tests to test/nn/test_dropout.py (#84165)
Ref https://github.com/pytorch/pytorch/issues/63085
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84165
Approved by: https://github.com/albanD
2022-09-03 07:21:48 +00:00
Elias Ellison
f701cb04fb Test Dynamo CI w Fake Tensors (#84282)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84282
Approved by: https://github.com/anijain2305
2022-09-01 00:15:05 +00:00
lezcano
b106a04d76 Fix the edge case when y = 0 in kl_div (#82714)
Brought up in https://github.com/pytorch/pytorch/pull/80334#issuecomment-1193600883

We also prepare its opinfo to fix https://github.com/pytorch/pytorch/issues/80488
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82714
Approved by: https://github.com/albanD
2022-08-30 18:18:25 +00:00
Edward Z. Yang
ad44670fa1 Back out "Revert D38984222: Don't introduce new overload for SymInt (#83628)" (#84173)
Also Back out "Revert D39075159: [acc_tensor] Use SymIntArrayRef for overloaded empty.memory_format's signature"

Original commit changeset: dab4a9dba4fa
Original commit changeset: dcaf16c037a9

Original Phabricator Diff: D38984222
Original Phabricator Diff: D39075159

Also update Metal registrations for C++ registration changes.

Also update NNPI registration to account for tightened schema checking

Differential Revision: [D39084762](https://our.internmc.facebook.com/intern/diff/D39084762/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39084762/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84173
Approved by: https://github.com/Krovatkin
2022-08-29 18:01:07 +00:00
soulitzer
7088a98fba conv2d: require bias to have the same dtype as input and weight on cpu (#83686)
Fixes https://github.com/pytorch/pytorch/issues/83505

BC-breaking message:
- Previously we only required input and weight to have the same dtype on cpu (when input is non-complex). After this change, the bias is also expected to have the same dtype. This change was necessary to improve the error message for certain combinations of inputs. This behavior now also matches that of convolution on cuda.
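
A sketch of the newly rejected combination (illustrative; the exact error text may differ):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)
w = torch.randn(4, 3, 3, 3)
b = torch.randn(4, dtype=torch.float64)  # bias dtype differs from input/weight

try:
    F.conv2d(x, w, b)                    # expected to raise on CPU after this change
except RuntimeError as e:
    print(e)
```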

<details>
<summary>
Old plan
</summary>
Previously convolution (at least for slow_conv2d) did not perform type promotion, i.e. the output of `conv(int, int, float)` is an int, and that leads to the autograd assert.

This PR adds type promotion handling at the `at::native::conv2d` (this is a composite) level. We also need to correct or remove many tests that assume that conv errors when input types are mixed

Pros:
- Doing type promotion at this level avoids the complex path from having any special handling for mixed dtypes, and can potentially speed up mixed dtype inputs to now dispatch to faster kernels which are only capable of handling floats.

Cons:
- Doing type promotion at this level has the risk of introducing extra overhead when we would've dispatched to a kernel capable of handle mixed type anyway. I don't know if any of these exist at all though - it is possible that inputs with any non-float arguments are dispatched to the slow path.

If this approach is OK, we can proceed with the other convolutions as well:
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83686
Approved by: https://github.com/ngimel
2022-08-29 16:41:17 +00:00
Natalia Gimelshein
0ac2986d33 Fixes softmax indexing for large tensors (#84182)
Fixes #84144

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84182
Approved by: https://github.com/janeyx99
2022-08-29 04:29:09 +00:00
PyTorch MergeBot
c7edcd6968 Revert "Don't introduce new overload for SymInt (#83628)"
This reverts commit 9790d90e4b.

Reverted https://github.com/pytorch/pytorch/pull/83628 on behalf of https://github.com/malfet due to Breaks internal builds, see D39076487
2022-08-27 01:23:17 +00:00
Animesh Jain
6a58603956 Update Dynamo pin (#83829)
As title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83829
Approved by: https://github.com/ezyang
2022-08-26 20:49:43 +00:00
Edward Z. Yang
9790d90e4b Don't introduce new overload for SymInt (#83628)
Previously, we introduced new SymInt overloads for every function we wanted.  This led to a lot of boilerplate, and also a lot of confusion about how the overloads needed to be implemented.

This PR takes a simpler but more risky approach: just take the original function and changes its ints to SymInts.

This is BC-breaking in the following ways:

* The C++ API for registering implementations for aten operators will change from int64_t to SymInt whenever you make this change. Code generated registrations in PyTorch do not change as codegen handles the translation automatically, but manual registrations will need to follow the change.  Typically, if you now accept a SymInt where you previously only took int64_t, you have to convert it back manually.  This will definitely break XLA, see companion PR https://github.com/pytorch/xla/pull/3914 Note that not all dispatch keys get the automatic translation; all the composite keys and Meta keys are modified to take SymInt directly (because they should handle them directly), and so there are adjustments for this.

This is not BC-breaking in the following ways:

* The user facing C++ API remains compatible. Even if a function changes from int to SymInt, the default C++ binding still takes only ints (e.g., at::empty(IntArrayRef, ...)). To call with SymInts, you must call at::empty_symint instead. This involved adding two more signatures to CppSignatureGroup; in many cases I refactored code to iterate over all signatures in the group instead of hard-coding the two that previously existed.
* This is TorchScript compatible; internally we treat SymInts as ints so there is no change to what happens at runtime in TorchScript. In particular, it's OK to reference an empty schema by its old type (using int types): as long as you're not doing string equality (which you shouldn't be), these parse to the same underlying type.

Structure of the PR:

* The general strategy of this PR is that, even when you write `SymInt` inside `native_functions.yaml`, sometimes, we will treat it *as if* it were an `int`. This idea pervades the codegen changes, where we have a translation from SymInt to c10::SymInt or int64_t, and this is controlled by a symint kwarg which I added and then audited all call sites to decide which I wanted. Here are some of the major places where we pick one or the other:
  * The C++ FunctionSchema representation represents `SymInt` as `int`. There are a few places we do need to know that we actually have a SymInt and we consult `real_type()` to get the real type in this case. In particular:
    * When we do schema validation of C++ operator registration, we must compare against true schema (as the C++ API will provide `c10::SymInt`, and this will only be accepted if the schema is `SymInt`. This is handled with cloneWithRealTypes before we check for schema differences.
    * In `toIValue` argument parsing, we parse against the true schema value. For backwards compatibility reasons, I do still accept ints in many places where Layout/SymInt/etc were expected. (Well, accepting int where SymInt is expected is not BC, it's just the right logic!)
  * In particular, because SymInt never shows up as type() in FunctionSchema, this means that we no longer need a dedicated Tag::SymInt. This is good, because SymInts never show up in mobile anyway.
* Changes to functorch/aten are mostly about tracking changes to the C++ API registration convention. Additionally, since SymInt overloads no longer exist, registrations for SymInt implementations are deleted. In many cases, the old implementations did not properly support SymInts; I did not add any new functionality with this PR, but I did try to annotate with TODOs where there is work to do. Finally, because the signature of the `native::` API changed from int to SymInt, I need to find alternative APIs for people who were directly calling these functions to call. Typically, I insert a new dispatch call when perf doesn't matter, or use the `at::compositeexplicitautograd` namespace to handle other cases.
* The change to `make_boxed_from_unboxed_functor.h` is so that we accept a plain IntList IValue anywhere a SymIntList is expected; these are read-only arguments so covariant typing is OK.
* I change how unboxing logic works slightly. Previously, we interpret the C++ type for Layout/etc directly as IntType JIT type, which works well because the incoming IValue is tagged as an integer. Now, we interpret the C++ type for Layout as its true type, e.g., LayoutType (change to `jit_type.h`), but then we accept an int IValue for it anyway. This makes it symmetric with SymInt, where we interpret the C++ type as SymIntType, and then accept SymInt and int IValues for it.
* I renamed the `empty.names` overload to `empty_names` to make it less confusing (I kept mixing it up with the real empty overload)
* I deleted the `empty.SymInt` overload, which ended up killing a pile of functions. (This was originally a separate PR but the profiler expect test was giving me grief so I folded it in.)
* I deleted the LazyDynamicOpsTest tests. These were failing after these changes, and I couldn't figure out why they used to be passing: they make use of `narrow_copy` which didn't actually support SymInts; they were immediately converted to ints.
* I bashed LTC into working. The patches made here are not the end of the story. The big problem is that SymInt translates into Value, but what if you have a list of SymInt? This cannot be conveniently represented in the IR today, since variadic Values are not supported. To work around this, I translate SymInt[] into plain int[] (this is fine for tests because LTC dynamic shapes never actually worked); but this will need to be fixed for proper LTC SymInt support. The LTC codegen also looked somewhat questionable; I added comments based on my code reading.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83628
Approved by: https://github.com/albanD, https://github.com/bdhirsh
2022-08-26 01:35:40 +00:00
zaf
2f04ba2c7c [quant][ao_migration] torch.nn.qat → torch.ao.nn.qat (#78716)
Context: In order to avoid the cluttering of the `torch.nn` namespace
the quantized modules namespace is moved to `torch.ao.nn`.

The list of the `nn.quantized` files that are being migrated:

- [X] `torch.nn.quantized` → `torch.ao.nn.quantized`
    - [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional`
    - [X] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules`
    - [X] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic`
    - [X] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference`
- [X] `torch.nn.quantizable` → `torch.ao.nn.quantizable`
- [X] [Current PR] `torch.nn.qat` → `torch.ao.nn.qat`
    - [X] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules`
    - [X] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic`
- [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic`
    - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules`
    - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat`
    - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized`
        - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules`
        - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic`

Majority of the files are just moved to the new location.
However, specific files need to be double checked:

- None

Differential Revision: [D36861197](https://our.internmc.facebook.com/intern/diff/D36861197/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36861197/)!

Differential Revision: [D36861197](https://our.internmc.facebook.com/intern/diff/D36861197)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78716
Approved by: https://github.com/jerryzh168
2022-08-25 16:50:38 +00:00
XiaobingSuper
a013597b32 fix oneDNN channels_last path issue (#83653)
Fix #82060 (the N>1 case goes down the oneDNN path) and #80837. Both issues were introduced because the definition of channels last differs between the PyTorch FW side and the ideep side; this PR fixes that gap by having ideep use the format flag given by the FW side.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83653
Approved by: https://github.com/mingfeima, https://github.com/malfet
2022-08-25 03:58:11 +00:00
PyTorch MergeBot
a7edf71360 Revert "Don't introduce new overload for SymInt (#83628)"
This reverts commit 8fae7027b3.

Reverted https://github.com/pytorch/pytorch/pull/83628 on behalf of https://github.com/malfet due to breaking internal builds, see https://www.internalfb.com/diff/D38984222
2022-08-25 00:49:40 +00:00
kshitij12345
7a8152530d move pooling test from test_nn to test/nn/test_pooling (#83915)
Ref #63085

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83915
Approved by: https://github.com/albanD
2022-08-24 16:17:50 +00:00
Ishan-Rajgarhia
7fdc2f70c6 Task: T129772171 remove assertEqualIgnoreTypes from test/test_nn.py (#83870)
See https://github.com/pytorch/pytorch/issues/38095
Replaced assertEqualIgnoreType with assertEqual
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83870
Approved by: https://github.com/kit1980
2022-08-24 02:45:52 +00:00
Edward Z. Yang
8fae7027b3 Don't introduce new overload for SymInt (#83628)
Previously, we introduced new SymInt overloads for every function we wanted.  This led to a lot of boilerplate, and also a lot of confusion about how the overloads needed to be implemented.

This PR takes a simpler but more risky approach: just take the original function and changes its ints to SymInts.

This is BC-breaking in the following ways:

* The C++ API for registering implementations for aten operators will change from int64_t to SymInt whenever you make this change. Code generated registrations in PyTorch do not change as codegen handles the translation automatically, but manual registrations will need to follow the change.  Typically, if you now accept a SymInt where you previously only took int64_t, you have to convert it back manually.  This will definitely break XLA, see companion PR https://github.com/pytorch/xla/pull/3914 Note that not all dispatch keys get the automatic translation; all the composite keys and Meta keys are modified to take SymInt directly (because they should handle them directly), and so there are adjustments for this.

This is not BC-breaking in the following ways:

* The user facing C++ API remains compatible. Even if a function changes from int to SymInt, the default C++ binding still takes only ints (e.g., at::empty(IntArrayRef, ...)). To call with SymInts, you must call at::empty_symint instead. This involved adding two more signatures to CppSignatureGroup; in many cases I refactored code to iterate over all signatures in the group instead of hard-coding the two that previously existed.
* This is TorchScript compatible; internally we treat SymInts as ints so there is no change to what happens at runtime in TorchScript. In particular, it's OK to reference an empty schema by its old type (using int types): as long as you're not doing string equality (which you shouldn't be), these parse to the same underlying type.

Structure of the PR:

* The general strategy of this PR is that, even when you write `SymInt` inside `native_functions.yaml`, sometimes, we will treat it *as if* it were an `int`. This idea pervades the codegen changes, where we have a translation from SymInt to c10::SymInt or int64_t, and this is controlled by a symint kwarg which I added and then audited all call sites to decide which I wanted. Here are some of the major places where we pick one or the other:
  * The C++ FunctionSchema representation represents `SymInt` as `int`. There are a few places we do need to know that we actually have a SymInt and we consult `real_type()` to get the real type in this case. In particular:
    * When we do schema validation of C++ operator registration, we must compare against true schema (as the C++ API will provide `c10::SymInt`, and this will only be accepted if the schema is `SymInt`. This is handled with cloneWithRealTypes before we check for schema differences.
    * In `toIValue` argument parsing, we parse against the true schema value. For backwards compatibility reasons, I do still accept ints in many places where Layout/SymInt/etc were expected. (Well, accepting int where SymInt is expected is not BC, it's just the right logic!)
  * In particular, because SymInt never shows up as type() in FunctionSchema, this means that we no longer need a dedicated Tag::SymInt. This is good, because SymInts never show up in mobile anyway.
* Changes to functorch/aten are mostly about tracking changes to the C++ API registration convention. Additionally, since SymInt overloads no longer exist, registrations for SymInt implementations are deleted. In many cases, the old implementations did not properly support SymInts; I did not add any new functionality with this PR, but I did try to annotate with TODOs where there is work to do. Finally, because the signature of the `native::` API changed from int to SymInt, I need to find alternative APIs for people who were directly calling these functions to call. Typically, I insert a new dispatch call when perf doesn't matter, or use the `at::compositeexplicitautograd` namespace to handle other cases.
* The change to `make_boxed_from_unboxed_functor.h` is so that we accept a plain IntList IValue anywhere a SymIntList is expected; these are read-only arguments so covariant typing is OK.
* I change how unboxing logic works slightly. Previously, we interpret the C++ type for Layout/etc directly as IntType JIT type, which works well because the incoming IValue is tagged as an integer. Now, we interpret the C++ type for Layout as its true type, e.g., LayoutType (change to `jit_type.h`), but then we accept an int IValue for it anyway. This makes it symmetric with SymInt, where we interpret the C++ type as SymIntType, and then accept SymInt and int IValues for it.
* I renamed the `empty.names` overload to `empty_names` to make it less confusing (I kept mixing it up with the real empty overload)
* I deleted the `empty.SymInt` overload, which ended up killing a pile of functions. (This was originally a separate PR but the profiler expect test was giving me grief so I folded it in.)
* I deleted the LazyDynamicOpsTest tests. These were failing after these changes, and I couldn't figure out why they used to be passing: they make use of `narrow_copy` which didn't actually support SymInts; they were immediately converted to ints.
* I bashed LTC into working. The patches made here are not the end of the story. The big problem is that SymInt translates into Value, but what if you have a list of SymInt? This cannot be conveniently represented in the IR today, since variadic Values are not supported. To work around this, I translate SymInt[] into plain int[] (this is fine for tests because LTC dynamic shapes never actually worked); but this will need to be fixed for proper LTC SymInt support. The LTC codegen also looked somewhat questionable; I added comments based on my code reading.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83628
Approved by: https://github.com/albanD, https://github.com/bdhirsh
2022-08-23 22:04:07 +00:00
Khushi Agrawal
9095030239 [fix] edge case in MaxPool1d and add ErrorInputs (#83553)
Fixes #83224

cc @kshitij12345 @albanD!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83553
Approved by: https://github.com/albanD
2022-08-23 19:23:39 +00:00
Kshiteej K
dd67d52b57 [nn] split rnn_utils test from test_nn.py (#83675)
Ref: https://github.com/pytorch/pytorch/issues/63085
Proposed folder structure
```
-> test
  -> nn
    -> test_conv.py
    -> test_pooling.py
    -> .....
```

This PR: Moves test related RNN utilities to a different file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83675
Approved by: https://github.com/albanD
2022-08-23 08:34:39 +00:00
XiaobingSuper
658f958bc4 fix upsample bf16 issue for channels last path by using high precision to compute index (#83847)
Given the following case:
```
import torch
a = torch.ones(1, 3, 320, 480).bfloat16().to(memory_format=torch.channels_last)
out_bf16 = torch.nn.functional.interpolate(a, size = (640, 960), scale_factor = None, mode = 'bilinear', align_corners = False, recompute_scale_factor= None, antialias = False)
out_fp32= torch.nn.functional.interpolate(a.float(), size = (640, 960), scale_factor = None, mode = 'bilinear', align_corners = False, recompute_scale_factor= None, antialias = False)
print(out_bf16[0, 2, :, :])
print(out_fp32[0, 2, :, :])
```
the boundary of bfloat16 output gets a wrong value:
```
tensor([[1.0000e+00, 1.0000e+00, 1.0000e+00,  ..., 1.0000e+00, 1.0000e+00,
         1.0000e+00],
        [1.0000e+00, 1.0000e+00, 1.0000e+00,  ..., 1.0000e+00, 1.0000e+00,
         1.0000e+00],
        [1.0000e+00, 1.0000e+00, 1.0000e+00,  ..., 1.0000e+00, 1.0000e+00,
         1.0000e+00],
        ...,
        [1.0000e+00, 1.0000e+00, 1.0000e+00,  ..., 1.0000e+00, 1.0000e+00,
         1.0000e+00],
        [1.0000e+00, 1.0000e+00, 1.0000e+00,  ..., 1.0000e+00, 1.0000e+00,
         1.0000e+00],
        [0.0000e+00, 0.0000e+00, 1.8367e-40,  ..., 0.0000e+00, 0.0000e+00,
         0.0000e+00]], dtype=torch.bfloat16)
tensor([[1., 1., 1.,  ..., 1., 1., 1.],
        [1., 1., 1.,  ..., 1., 1., 1.],
        [1., 1., 1.,  ..., 1., 1., 1.],
        ...,
        [1., 1., 1.,  ..., 1., 1., 1.],
        [1., 1., 1.,  ..., 1., 1., 1.],
        [1., 1., 1.,  ..., 1., 1., 1.]])

```

The expected behavior is that the bfloat16 output values should also be one. The main reason is that we use low precision to compute the index, see
fcb124406b/aten/src/ATen/native/UpSample.h (L448); we should use high precision for this computation, as the GPU path does:
fcb124406b/aten/src/ATen/native/cuda/UpSample.cuh (L123)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83847
Approved by: https://github.com/frank-wei
2022-08-23 00:53:37 +00:00
PyTorch MergeBot
4cbb1986fe Revert "[quant][ao_migration] torch.nn.qat → torch.ao.nn.qat (#78716)"
This reverts commit 7cd2fa1d38.

Reverted https://github.com/pytorch/pytorch/pull/78716 on behalf of https://github.com/janeyx99 due to sorry, reverting so https://github.com/pytorch/pytorch/pull/78713 could be cleanly reverted
2022-08-22 07:23:24 +00:00
zaf
7cd2fa1d38 [quant][ao_migration] torch.nn.qat → torch.ao.nn.qat (#78716)
Context: In order to avoid the cluttering of the `torch.nn` namespace
the quantized modules namespace is moved to `torch.ao.nn`.

The list of the `nn.quantized` files that are being migrated:

- [X] `torch.nn.quantized` → `torch.ao.nn.quantized`
    - [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional`
    - [X] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules`
    - [X] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic`
    - [X] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference`
- [X] `torch.nn.quantizable` → `torch.ao.nn.quantizable`
- [X] [Current PR] `torch.nn.qat` → `torch.ao.nn.qat`
    - [X] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules`
    - [X] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic`
- [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic`
    - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules`
    - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat`
    - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized`
        - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules`
        - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic`

Majority of the files are just moved to the new location.
However, specific files need to be double checked:

- None

Differential Revision: [D36861197](https://our.internmc.facebook.com/intern/diff/D36861197/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36861197/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78716
Approved by: https://github.com/jerryzh168
2022-08-22 05:33:23 +00:00
Rui Zhu
e0f2eba93d Move odd num_head in TransformerEncoder to slow_path (#83483)
Summary: an odd nhead is not supported by masked softmax, so we move this case back to the old slow_path

Test Plan: CI

Differential Revision: D38720086

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83483
Approved by: https://github.com/erichan1
2022-08-20 10:02:08 +00:00
Jeff Daily
d52d2bd5a9 [ROCm] MIOpen fused convolution relu (#82002)
Adds MIOpen fused convolution relu for fp32 and contiguous memory format.  Adds fallbacks for conv + z + bias + relu, fp16, and channels last until MIOpen adds these features.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82002
Approved by: https://github.com/ngimel, https://github.com/malfet
2022-08-16 20:49:33 +00:00
Nicolas Macchioni
b236352036 Add mask identifier for multiplexed src_mask/src_key_padding_mask in BT (#81947)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81947

Transformer fastpath multiplexes two arguments, src_mask [seq_len x seq_len] and src_key_padding_mask [batch_size x seq_len], and later deduces the type based on mask shape.

In the event that batch_size == seq_len, any src_mask is wrongly interpreted as a src_key_padding_mask. This is fixed by requiring that a mask_type identifier be supplied whenever batch_size == seq_len.

Additionally, added support for src_mask in masked_softmax CPU path.

Test Plan: existing unit tests + new unit tests (batch_size == seq_len)

Differential Revision: D37932240

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81947
Approved by: https://github.com/zrphercule
2022-08-09 23:42:16 +00:00
Sergii Dymchenko
7390ae837c Resolve TODO for GroupNorm numerical issues (#82423)
Looks like the numerical issues are resolved now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82423
Approved by: https://github.com/ngimel
2022-08-03 19:42:26 +00:00
Jiayi Sun
15a284b09e optimize softmax backward and logsoftmax backward (#80114)
Currently, if we run softmax_backward/logsoftmax_backward along a dim that is not the last, the calculation falls back to a [scalar version](32593ef2dd/aten/src/ATen/native/SoftMax.cpp (L220-L287)). We find that we actually have the chance to vectorize the calculation along the inner_size dim.

Changes we made:

Use vectorized softmax_backward_kernel/log_softmax_backward_kernel instead of host_softmax_backward when not along the last dim.

We collected the benchmark data of softmax_backward and logsoftmax_backward for BFloat16 and Float32 data type by using the operator_benchmark tool of PyTorch on the platform of Intel(R) Xeon(R) Platinum 8260L CPU @ 2.40GHz.
Number of cores: 24 cores (1 socket)
[softmax_benchmark_32593ef.log](https://github.com/pytorch/pytorch/files/8962956/softmax_benchmark_32593ef.log)
[softmax_benchmark_the_pr.log](https://github.com/pytorch/pytorch/files/8962958/softmax_benchmark_the_pr.log)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80114
Approved by: https://github.com/frank-wei
2022-08-03 00:36:28 +00:00
mingfeima
b019a41674 fix bug for thnn_conv2d when input's C is 1 and weight is channels last (#82392)
To fix https://github.com/pytorch/pytorch/issues/82060

When `input` is not explicitly converted to channels last while `conv` has been, the output should still be in channels last. The root cause is that when the input has an IC of 1, `compute_columns2d` from `\aten\src\ATen\native\ConvolutionMM2d.cpp` would treat it as channels first:

We do have logic to make sure both input and weight have the same memory format even if they are given differently, like:
```
auto input = self.contiguous(memory_format);
auto weight = weight_.contiguous(memory_format);
```

But for an N1HW input, `.contiguous(MemoryFormat::ChannelsLast)` would not change its stride, and its `suggest_memory_format()` still returns `MemoryFormat::Contiguous`. That's how it went wrong.

Also updated the corresponding test cases; without this patch, the new test case would fail on the forward path and hit a runtime error on the backward path.

Attached is the old failure log for the forward path:
```
FAIL: test_conv_thnn_nhwc_cpu_float32 (__main__.TestNNDeviceTypeCPU)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/mingfeim/anaconda3/envs/pytorch-test-cpu/lib/python3.7/site-packages/torch/testing/_internal/common_device_type.py", line 377, in instantiated_test
    result = test(self, **param_kwargs)
  File "/home/mingfeim/anaconda3/envs/pytorch-test-cpu/lib/python3.7/site-packages/torch/testing/_internal/common_device_type.py", line 974, in only_fn
    return fn(slf, *args, **kwargs)
  File "test/test_nn.py", line 19487, in test_conv_thnn_nhwc
    input_format=torch.contiguous_format, weight_format=torch.channels_last)
  File "test/test_nn.py", line 19469, in helper
    self.assertEqual(out, ref_out, exact_dtype=False)
  File "/home/mingfeim/anaconda3/envs/pytorch-test-cpu/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 2376, in assertEqual
    msg=(lambda generated_msg: f"{generated_msg} : {msg}") if isinstance(msg, str) and self.longMessage else msg,
  File "/home/mingfeim/anaconda3/envs/pytorch-test-cpu/lib/python3.7/site-packages/torch/testing/_comparison.py", line 1093, in assert_equal
    raise error_metas[0].to_error(msg)
AssertionError: Tensor-likes are not close!

Mismatched elements: 988 / 1024 (96.5%)
Greatest absolute difference: 42.0 at index (1, 2, 6, 6) (up to 1e-05 allowed)
Greatest relative difference: inf at index (0, 0, 2, 1) (up to 1.3e-06 allowed)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82392
Approved by: https://github.com/jbschlosser
2022-07-28 14:20:52 +00:00
Khushi Agrawal
050aec1805 [nn] add pop to sequential and ModuleList (#81601)
Follows #71329
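
Illustrative usage of the new method (a sketch; the container keeps its remaining order):

```python
import torch.nn as nn

seq = nn.Sequential(nn.Linear(2, 4), nn.ReLU(), nn.Linear(4, 1))
removed = seq.pop(1)                     # removes and returns the ReLU
print(type(removed).__name__, len(seq))  # ReLU 2
```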

cc @kshitij12345!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81601
Approved by: https://github.com/albanD
2022-07-25 19:32:32 +00:00
Ansh Radhakrishnan
110cd724fc [nn] Add support for +=, * and *= operations for nn.Sequential objects (#81279)
Fixes #71329
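
Illustrative usage, assuming list-style semantics where `*` repeats references to the contained modules:

```python
import torch.nn as nn

a = nn.Sequential(nn.Linear(2, 2))
b = nn.Sequential(nn.ReLU())

a += b        # in-place extend with b's modules
c = a * 2     # new Sequential repeating a's modules (shared references)
print(len(a), len(c))  # 2 4
```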

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81279
Approved by: https://github.com/albanD
2022-07-25 15:48:47 +00:00
soulitzer
f595467e5c Reenable slow gradcheck and make it pass (#80514)
Context: For a while slow gradcheck CI was skipping nearly all tests and this hid the fact that it should've been failing and timing out (10+h runtime for TestGradients). The CI configuration has since been fixed to correct this, revealing the test failures. This PR reenables slow gradcheck CI and makes it pass again.

This PR:
- makes slow and failing tests run in fast gradcheck mode only (see the gradcheck sketch after this list)
- reduces the input size for unary/binary ufuncs in slow gradcheck only (as an alternative to skipping those tests entirely)
- skips entire test files on the slow gradcheck runner if they don't use gradcheck (test_ops, test_meta, test_decomp, test_ops_jit)
- reduces the input size for some ops
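
For reference, the two modes mentioned above differ only in the `fast_mode` flag of `torch.autograd.gradcheck`; a minimal standalone sketch (not from this PR):
```python
import torch
from torch.autograd import gradcheck

# Double precision is required for the numerical Jacobian comparison.
x = torch.randn(4, 4, dtype=torch.double, requires_grad=True)

def fn(t):
    return (t * t).sum()

# Slow mode builds the full Jacobian entry by entry; fast mode uses a
# cheaper randomized probe, which is what the slow/failing tests fall back to.
assert gradcheck(fn, (x,), fast_mode=False)
assert gradcheck(fn, (x,), fast_mode=True)
```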

Follow ups:
1. Investigate slow mode failures https://github.com/pytorch/pytorch/issues/80411
2. See if we can re-enable slow gradcheck tests for some of the slow tests by reducing the sizes of their inputs

The following tests fail in slow mode; they now run in fast mode only.
```
test_fn_fwgrad_bwgrad___rmod___cuda_float64
test_fn_fwgrad_bwgrad_linalg_householder_product_cuda_complex128
test_fn_fwgrad_bwgrad__masked_prod_cuda_complex128
test_fn_fwgrad_bwgrad__masked_prod_cuda_float64
test_fn_fwgrad_bwgrad_linalg_matrix_power_cuda_complex128
test_fn_fwgrad_bwgrad_cat_cuda_complex128
test_fn_fwgrad_bwgrad_linalg_lu_factor_ex_cuda_float64
test_fn_fwgrad_bwgrad_copysign_cuda_float64
test_fn_fwgrad_bwgrad_cholesky_inverse_cuda_complex128
test_fn_fwgrad_bwgrad_float_power_cuda_complex128
test_fn_fwgrad_bwgrad_fmod_cuda_float64
test_fn_fwgrad_bwgrad_float_power_cuda_float64
test_fn_fwgrad_bwgrad_linalg_lu_cuda_float64
test_fn_fwgrad_bwgrad_remainder_cuda_float64
test_fn_fwgrad_bwgrad_repeat_cuda_complex128
test_fn_fwgrad_bwgrad_prod_cuda_complex128
test_fn_fwgrad_bwgrad_slice_scatter_cuda_float64
test_fn_fwgrad_bwgrad_tile_cuda_complex128
test_fn_fwgrad_bwgrad_pow_cuda_float64
test_fn_fwgrad_bwgrad_pow_cuda_complex128
test_fn_fwgrad_bwgrad_fft_*
test_fn_fwgrad_bwgrad_zero__cuda_complex128
test_fn_gradgrad_linalg_lu_factor_cuda_float64
test_fn_grad_div_trunc_rounding_cuda_float64
test_fn_grad_div_floor_rounding_cuda_float64
```

Marks the OpInfos for the following ops that run slowly in slow gradcheck as `fast_gradcheck` only (the left column represents runtime in seconds):
```
0  918.722  test_fn_fwgrad_bwgrad_nn_functional_conv_transpose3d_cuda_float64
1  795.042  test_fn_fwgrad_bwgrad_nn_functional_unfold_cuda_complex128
2  583.63  test_fn_fwgrad_bwgrad_nn_functional_max_pool3d_cuda_float64
3  516.946  test_fn_fwgrad_bwgrad_svd_cuda_complex128
4  503.179  test_fn_fwgrad_bwgrad_linalg_svd_cuda_complex128
5  460.985  test_fn_fwgrad_bwgrad_linalg_lu_cuda_complex128
6  401.04  test_fn_fwgrad_bwgrad_linalg_lstsq_grad_oriented_cuda_complex128
7  353.671  test_fn_fwgrad_bwgrad_nn_functional_max_pool2d_cuda_float64
8  321.903  test_fn_fwgrad_bwgrad_nn_functional_gaussian_nll_loss_cuda_float64
9  307.951  test_fn_fwgrad_bwgrad_stft_cuda_complex128
10  266.104  test_fn_fwgrad_bwgrad_svd_lowrank_cuda_float64
11  221.032  test_fn_fwgrad_bwgrad_istft_cuda_complex128
12  183.741  test_fn_fwgrad_bwgrad_lu_unpack_cuda_complex128
13  132.019  test_fn_fwgrad_bwgrad_nn_functional_unfold_cuda_float64
14  125.343  test_fn_fwgrad_bwgrad_nn_functional_pad_constant_cuda_complex128
15  124.2  test_fn_fwgrad_bwgrad_kron_cuda_complex128
16  123.721  test_fn_fwgrad_bwgrad_pca_lowrank_cuda_float64
17  121.074  test_fn_fwgrad_bwgrad_nn_functional_max_unpool3d_cuda_float64
18  119.387  test_fn_fwgrad_bwgrad_rot90_cuda_complex128
19  112.889  test_fn_fwgrad_bwgrad__masked_normalize_cuda_complex128
20  107.541  test_fn_fwgrad_bwgrad_dist_cuda_complex128
21  106.727  test_fn_fwgrad_bwgrad_diff_cuda_complex128
22  104.588  test_fn_fwgrad_bwgrad__masked_cumprod_cuda_complex128
23  100.135  test_fn_fwgrad_bwgrad_nn_functional_feature_alpha_dropout_with_train_cuda_float64
24  88.359  test_fn_fwgrad_bwgrad_mH_cuda_complex128
25  86.214  test_fn_fwgrad_bwgrad_nn_functional_max_unpool2d_cuda_float64
26  83.037  test_fn_fwgrad_bwgrad_nn_functional_bilinear_cuda_float64
27  79.987  test_fn_fwgrad_bwgrad__masked_cumsum_cuda_complex128
28  77.822  test_fn_fwgrad_bwgrad_diag_embed_cuda_complex128
29  76.256  test_fn_fwgrad_bwgrad_mT_cuda_complex128
30  74.039  test_fn_fwgrad_bwgrad_linalg_lu_solve_cuda_complex128
```
```
0  334.142  test_fn_fwgrad_bwgrad_unfold_cuda_complex128
1  312.791  test_fn_fwgrad_bwgrad_linalg_lu_factor_cuda_complex128
2  121.963  test_fn_fwgrad_bwgrad_nn_functional_max_unpool3d_cuda_float64
3  108.085  test_fn_fwgrad_bwgrad_diff_cuda_complex128
4  89.418  test_fn_fwgrad_bwgrad_nn_functional_max_unpool2d_cuda_float64
5  72.231  test_fn_fwgrad_bwgrad___rdiv___cuda_complex128
6  69.433  test_fn_fwgrad_bwgrad___getitem___cuda_complex128
7  68.582  test_fn_fwgrad_bwgrad_ldexp_cuda_complex128
8  68.572  test_fn_fwgrad_bwgrad_linalg_pinv_cuda_complex128
9  67.585  test_fn_fwgrad_bwgrad_nn_functional_glu_cuda_float64
10  66.567  test_fn_fwgrad_bwgrad_lu_cuda_float64
```
```
0  630.13  test_fn_gradgrad_nn_functional_conv2d_cuda_complex128
1  81.086  test_fn_gradgrad_linalg_solve_triangular_cuda_complex128
2  71.332  test_fn_gradgrad_norm_cuda_complex128
3  64.308  test_fn_gradgrad__masked_std_cuda_complex128
4  59.519  test_fn_gradgrad_div_no_rounding_mode_cuda_complex128
5  58.836  test_fn_gradgrad_nn_functional_adaptive_avg_pool3
```

Reduces the sizes of the inputs for:
- diff
- diag_embed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80514
Approved by: https://github.com/albanD
2022-07-22 02:05:37 +00:00
Saketh Are
445ee5620e Simplify torch.nn.grad by calling into aten::convolution_backward (#81839)
`torch.nn.grad` has its own implementations of gradients for conv1d, conv2d, and conv3d. This PR simplifies them by calling into the unified `aten::convolution_backward` backend instead.

The existing implementation of conv2d_weight is incorrect for some inputs (see issue #51430). This PR fixes the issue.

This PR expands coverage in test_nn to include conv1d_weight, conv2d_weight, and conv3d_weight, which were previously untested. It also expands the cases for conv2d to cover issue #51430.

Fixes #51430
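
As a sanity-check sketch of the equivalence the expanded tests exercise (shapes here are illustrative, not from the PR):
```python
import torch
import torch.nn.functional as F
from torch.nn import grad as nn_grad

x = torch.randn(1, 3, 8, 8, dtype=torch.double, requires_grad=True)
w = torch.randn(4, 3, 3, 3, dtype=torch.double, requires_grad=True)
out = F.conv2d(x, w, padding=1)
grad_out = torch.randn_like(out)

# Manual gradients from torch.nn.grad...
gi = nn_grad.conv2d_input(x.shape, w, grad_out, padding=1)
gw = nn_grad.conv2d_weight(x, w.shape, grad_out, padding=1)

# ...should agree with what autograd computes.
gi_ref, gw_ref = torch.autograd.grad(out, (x, w), grad_out)
print(torch.allclose(gi, gi_ref), torch.allclose(gw, gw_ref))
```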

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81839
Approved by: https://github.com/albanD
2022-07-21 19:34:27 +00:00
Khushi Agrawal
dced803339 [nn] add insert method to sequential class (#81402)
Follows #71329

cc @kshitij12345
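
Illustrative usage of the list-style `insert` named in the title (not from the PR):
```python
import torch.nn as nn

seq = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 2))
seq.insert(1, nn.ReLU())                  # insert at position 1, like list.insert
print([type(m).__name__ for m in seq])    # ['Linear', 'ReLU', 'Linear']
```
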
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81402
Approved by: https://github.com/albanD
2022-07-20 14:45:52 +00:00
Khushi Agrawal
2c0b11b43b [nn] implement extend method to sequential class (#81179)
Follows #71329

cc @kshitij12345 :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81179
Approved by: https://github.com/albanD
2022-07-20 05:33:41 +00:00
PyTorch MergeBot
f82b19f15b Revert "Disable use_mkldnn when input is not contiguous for oneDNN (#80864)"
This reverts commit 4655c3bace.

Reverted https://github.com/pytorch/pytorch/pull/80864 on behalf of https://github.com/janeyx99 due to a perf regression (https://github.com/pytorch/benchmark/issues/1040)
2022-07-19 18:58:52 +00:00
yanbing-j
4655c3bace Disable use_mkldnn when input is not contiguous for oneDNN (#80864)
Fixes [#80837](https://github.com/pytorch/pytorch/issues/80837).
This PR disables use_mkldnn when the input is not contiguous, to satisfy a oneDNN requirement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80864
Approved by: https://github.com/malfet
2022-07-17 14:58:26 +00:00
Rui Zhu
b22166fd62 Add a small fastpath test for native mha (#81432)
Summary: We did not previously have a small passing fast-path test for MHA; this diff adds one for better coverage.

Test Plan: buck build mode/dev-nosan -c fbcode.platform=platform009 -c fbcode.enable_gpu_sections=true caffe2/test:nn && buck-out/dev/gen/caffe2/test/nn\#binary.par -r test_multihead_attn_fast_path_small_test

Differential Revision: D37834319

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81432
Approved by: https://github.com/erichan1
2022-07-15 23:54:40 +00:00
Eric Han
23088fcfdf disable src mask for transformer and multiheadattention fastpath (#81277)
Disable the fastpath if a src_mask is passed to TransformerEncoderLayer or MultiheadAttention.
- Refactored test_transformerencoder from test_nn.py to test_transformers.py. Added a src_mask test there.
- Added a specific src_mask test in test_transformers.py
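
For illustration, the kind of call that now takes the regular (non-fused) path; sizes and the causal mask below are chosen arbitrarily for the sketch:
```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=16, nhead=4, batch_first=True).eval()
src = torch.randn(2, 5, 16)

# A causal additive float mask; with this change, a src_mask being present
# means the fused fastpath is skipped and the reference code runs instead.
src_mask = torch.triu(torch.full((5, 5), float("-inf")), diagonal=1)
with torch.no_grad():
    out = layer(src, src_mask=src_mask)
print(out.shape)   # torch.Size([2, 5, 16])
```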

Fixes https://github.com/pytorch/pytorch/issues/81129

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81277
Approved by: https://github.com/zrphercule
2022-07-15 20:55:17 +00:00
n.zhuravlev
7af0200a46 Add deepcopy functionality to parametrized modules (#80811)
Fixes #69413

After applying parametrization to any `nn.Module` we lose the ability to create a deepcopy of it, which, for example, makes it impossible to wrap the module with an `AveragedModel`.
Specifically, the problem is that `deepcopy` tries to invoke `__getstate__` when an object has not implemented its own `__deepcopy__` magic method, but we don't allow serialization of parametrized modules: `__getstate__` raises an error.
My solution is to create a default `__deepcopy__` method when one does not already exist.
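
A sketch of the scenario this enables (the `Symmetric` parametrization below is just an example, not part of the PR):
```python
import copy
import torch
import torch.nn as nn
import torch.nn.utils.parametrize as parametrize

class Symmetric(nn.Module):
    def forward(self, X):
        # Build a symmetric matrix from the upper triangle.
        return X.triu() + X.triu(1).transpose(-1, -2)

linear = nn.Linear(4, 4)
parametrize.register_parametrization(linear, "weight", Symmetric())

# deepcopy used to fail here because __getstate__ raises for parametrized
# modules; with a default __deepcopy__ in place the copy succeeds.
clone = copy.deepcopy(linear)
print(torch.equal(clone.weight, linear.weight))   # True
```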

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80811
Approved by: https://github.com/pearu, https://github.com/albanD
2022-07-15 09:06:45 +00:00
Khushi Agrawal
3da8c909da [nn] add + operator for torch.nn.Sequential to concatenate (#81170)
Fixes #78512

#### TODO
- [x] add tests

cc @kshitij12345!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81170
Approved by: https://github.com/albanD
2022-07-11 17:49:58 +00:00
eqy
3b78c5682b Don't implicitly convert to channels-first in MaxPool3D on CUDA (#80748)
MaxPool3D currently converts inputs implicitly to channels-first (via `.contiguous()`), which may yield unexpected regressions in workloads that expect a full channels-last path. This PR preserves the channels-last format in MaxPool3D while attempting to avoid seriously regressing performance.

Currently, the typical case (kernel size == 2 == stride) looks good, but larger kernel sizes (>4) or the unusual case of stride 1 can sometimes be slower than converting to channels-first before running MaxPool3D.

Additionally, this PR adds a test for 64-bit-indexed backward, as testing these changes uncovered an illegal memory access (IMA) for large tensors in the MaxPool3D backward pass.
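
A minimal sketch of the layout behavior being tested (assumes a CUDA device is available; the shape is arbitrary):
```python
import torch
import torch.nn.functional as F

x = torch.randn(16, 64, 16, 16, 16, device="cuda")
x_cl = x.to(memory_format=torch.channels_last_3d)

out = F.max_pool3d(x_cl, kernel_size=2, stride=2)
# With this change the output stays channels-last instead of being
# implicitly converted back to a channels-first layout.
print(out.is_contiguous(memory_format=torch.channels_last_3d))
```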

Performance comparison on A6000:

```
[------------------------------------- max_pool3d ---------------------------------------------------------]
                                          |  channels_last=False  |   curr ch_last=True |  new ch_last=True
1 threads: ---------------------------------------------------------------------------- ---------------------
      [64, 256, 32, 32, 32] 4x4 stride 4  |        20093.5        |       34823.4       |       20640.0
      [64, 256, 32, 32, 32] 4x4 stride 2  |        28623.7        |       42625.6       |       27935.5
      [64, 256, 32, 32, 32] 4x4 stride 1  |        68177.5        |       79147.2       |       85604.8
      [64, 256, 32, 32, 32] 2x2 stride 4  |        17237.7        |       32071.3       |       16641.6
      [64, 256, 32, 32, 32] 2x2 stride 2  |        25252.5        |       39993.2       |       25054.8
      [64, 256, 32, 32, 32] 2x2 stride 1  |        43185.2        |       58164.6       |       48416.9
      [64, 256, 16, 16, 16] 4x4 stride 4  |         3017.7        |        3952.4       |        2593.8
      [64, 256, 16, 16, 16] 4x4 stride 2  |         4581.5        |        5384.3       |        3294.3
      [64, 256, 16, 16, 16] 4x4 stride 1  |        11334.1        |       11534.7       |        8651.1
      [64, 256, 16, 16, 16] 2x2 stride 4  |         2346.9        |        3304.6       |        2098.8
      [64, 256, 16, 16, 16] 2x2 stride 2  |         3550.8        |        4526.5       |        3143.6
      [64, 256, 16, 16, 16] 2x2 stride 1  |         6898.1        |        7816.0       |        5820.8
      [64, 256, 4, 4, 4] 4x4 stride 4     |          191.5        |         176.3       |          77.5
      [64, 256, 4, 4, 4] 4x4 stride 2     |          191.8        |         176.8       |          94.1
      [64, 256, 4, 4, 4] 4x4 stride 1     |          191.3        |         176.4       |          97.3
      [64, 256, 4, 4, 4] 2x2 stride 4     |           96.4        |         114.4       |          93.6
      [64, 256, 4, 4, 4] 2x2 stride 2     |          172.1        |         178.6       |          93.7
      [64, 256, 4, 4, 4] 2x2 stride 1     |          263.0        |         279.4       |          92.4
      [64, 64, 32, 32, 32] 4x4 stride 4   |         5033.2        |        7208.3       |        5167.5
      [64, 64, 32, 32, 32] 4x4 stride 2   |         7216.1        |        9218.7       |        6637.1
      [64, 64, 32, 32, 32] 4x4 stride 1   |        17192.1        |       18392.9       |       20489.0
      [64, 64, 32, 32, 32] 2x2 stride 4   |         4318.0        |        6511.2       |        4193.1
      [64, 64, 32, 32, 32] 2x2 stride 2   |         6324.4        |        8657.7       |        6263.6
      [64, 64, 32, 32, 32] 2x2 stride 1   |        10855.0        |       13040.2       |       12055.9
      [64, 64, 16, 16, 16] 4x4 stride 4   |          764.1        |         975.6       |         671.3
      [64, 64, 16, 16, 16] 4x4 stride 2   |         1163.1        |        1333.4       |         833.6
      [64, 64, 16, 16, 16] 4x4 stride 1   |         2890.0        |        2898.5       |        2209.8
      [64, 64, 16, 16, 16] 2x2 stride 4   |          593.5        |         811.2       |         536.3
      [64, 64, 16, 16, 16] 2x2 stride 2   |          895.9        |        1112.3       |         794.5
      [64, 64, 16, 16, 16] 2x2 stride 1   |         1742.5        |        1968.0       |        1475.2
      [64, 64, 4, 4, 4] 4x4 stride 4      |          101.1        |         112.2       |          93.4
      [64, 64, 4, 4, 4] 4x4 stride 2      |           96.7        |         114.6       |          92.5
      [64, 64, 4, 4, 4] 4x4 stride 1      |           98.9        |         111.9       |          96.5
      [64, 64, 4, 4, 4] 2x2 stride 4      |          100.1        |         107.1       |          94.2
      [64, 64, 4, 4, 4] 2x2 stride 2      |           96.6        |         108.0       |          94.5
      [64, 64, 4, 4, 4] 2x2 stride 1      |           96.7        |         107.9       |          95.2
      [64, 3, 32, 32, 32] 4x4 stride 4    |          250.1        |         326.6       |         278.0
      [64, 3, 32, 32, 32] 4x4 stride 2    |          350.4        |         414.0       |         323.2
      [64, 3, 32, 32, 32] 4x4 stride 1    |          825.6        |         846.9       |         982.5
      [64, 3, 32, 32, 32] 2x2 stride 4    |          213.3        |         289.8       |         219.9
      [64, 3, 32, 32, 32] 2x2 stride 2    |          308.2        |         384.9       |         305.9
      [64, 3, 32, 32, 32] 2x2 stride 1    |          523.5        |         594.7       |         589.9
      [64, 3, 16, 16, 16] 4x4 stride 4    |          103.8        |         116.7       |          93.0
      [64, 3, 16, 16, 16] 4x4 stride 2    |          100.9        |         108.3       |          93.3
      [64, 3, 16, 16, 16] 4x4 stride 1    |          139.4        |         140.7       |         104.8
      [64, 3, 16, 16, 16] 2x2 stride 4    |           97.5        |         114.7       |          92.7
      [64, 3, 16, 16, 16] 2x2 stride 2    |           97.4        |         108.8       |          91.7
      [64, 3, 16, 16, 16] 2x2 stride 1    |           99.9        |         108.0       |          94.1
      [64, 3, 4, 4, 4] 4x4 stride 4       |           97.2        |         110.2       |          94.7
      [64, 3, 4, 4, 4] 4x4 stride 2       |          105.7        |         107.4       |          92.8
      [64, 3, 4, 4, 4] 4x4 stride 1       |           98.0        |         110.0       |          93.7
      [64, 3, 4, 4, 4] 2x2 stride 4       |           98.3        |         116.7       |          93.0
      [64, 3, 4, 4, 4] 2x2 stride 2       |           98.6        |         107.5       |          92.8
      [64, 3, 4, 4, 4] 2x2 stride 1       |          100.6        |         110.3       |          94.0
      [16, 256, 32, 32, 32] 4x4 stride 4  |         5034.2        |        8838.0       |        5165.9
      [16, 256, 32, 32, 32] 4x4 stride 2  |         7236.3        |       10869.9       |        7038.2
      [16, 256, 32, 32, 32] 4x4 stride 1  |        17385.4        |       21401.6       |       21900.7
      [16, 256, 32, 32, 32] 2x2 stride 4  |         4318.7        |        8101.2       |        4172.9
      [16, 256, 32, 32, 32] 2x2 stride 2  |         6324.0        |       10147.5       |        6279.7
      [16, 256, 32, 32, 32] 2x2 stride 1  |        10899.7        |       14826.0       |       12256.3
      [16, 256, 16, 16, 16] 4x4 stride 4  |          765.4        |        1012.7       |         675.6
      [16, 256, 16, 16, 16] 4x4 stride 2  |         1162.8        |        1376.9       |         843.4
      [16, 256, 16, 16, 16] 4x4 stride 1  |         2928.9        |        2969.8       |        2222.5
      [16, 256, 16, 16, 16] 2x2 stride 4  |          593.5        |         845.8       |         534.2
      [16, 256, 16, 16, 16] 2x2 stride 2  |          896.9        |        1152.2       |         796.9
      [16, 256, 16, 16, 16] 2x2 stride 1  |         1750.2        |        2009.4       |        1481.8
      [16, 256, 4, 4, 4] 4x4 stride 4     |           96.6        |         107.1       |          92.7
      [16, 256, 4, 4, 4] 4x4 stride 2     |           97.9        |         114.9       |          93.8
      [16, 256, 4, 4, 4] 4x4 stride 1     |           98.2        |         115.6       |          94.0
      [16, 256, 4, 4, 4] 2x2 stride 4     |           97.0        |         106.7       |          93.8
      [16, 256, 4, 4, 4] 2x2 stride 2     |           96.8        |         108.1       |          93.3
      [16, 256, 4, 4, 4] 2x2 stride 1     |           95.8        |         120.9       |          95.7
      [16, 64, 32, 32, 32] 4x4 stride 4   |         1266.4        |        1815.4       |        1312.3
      [16, 64, 32, 32, 32] 4x4 stride 2   |         1818.5        |        2328.0       |        1678.9
      [16, 64, 32, 32, 32] 4x4 stride 1   |         4352.9        |        4649.3       |        5204.6
      [16, 64, 32, 32, 32] 2x2 stride 4   |         1090.0        |        1631.2       |        1060.8
      [16, 64, 32, 32, 32] 2x2 stride 2   |         1589.4        |        2141.1       |        1576.4
      [16, 64, 32, 32, 32] 2x2 stride 1   |         2733.5        |        3286.0       |        3041.6
      [16, 64, 16, 16, 16] 4x4 stride 4   |          201.7        |         259.6       |         175.0
      [16, 64, 16, 16, 16] 4x4 stride 2   |          301.0        |         350.1       |         226.3
      [16, 64, 16, 16, 16] 4x4 stride 1   |          740.1        |         748.7       |         570.6
      [16, 64, 16, 16, 16] 2x2 stride 4   |          156.0        |         214.8       |         140.8
      [16, 64, 16, 16, 16] 2x2 stride 2   |          232.3        |         292.3       |         208.7
      [16, 64, 16, 16, 16] 2x2 stride 1   |          449.1        |         504.0       |         382.1
      [16, 64, 4, 4, 4] 4x4 stride 4      |           97.5        |         111.4       |          94.5
      [16, 64, 4, 4, 4] 4x4 stride 2      |           98.8        |         111.9       |          94.4
      [16, 64, 4, 4, 4] 4x4 stride 1      |           98.2        |         112.0       |          95.2
      [16, 64, 4, 4, 4] 2x2 stride 4      |           99.7        |         111.0       |          94.0
      [16, 64, 4, 4, 4] 2x2 stride 2      |          100.3        |         110.0       |          93.2
      [16, 64, 4, 4, 4] 2x2 stride 1      |           97.5        |         107.6       |          93.5
      [16, 3, 32, 32, 32] 4x4 stride 4    |          100.5        |         117.1       |          95.7
      [16, 3, 32, 32, 32] 4x4 stride 2    |           97.5        |         121.3       |          92.5
      [16, 3, 32, 32, 32] 4x4 stride 1    |          216.0        |         227.4       |         258.4
      [16, 3, 32, 32, 32] 2x2 stride 4    |           97.1        |         109.0       |          91.9
      [16, 3, 32, 32, 32] 2x2 stride 2    |           95.8        |         108.5       |          92.9
      [16, 3, 32, 32, 32] 2x2 stride 1    |          139.4        |         161.2       |         157.8
      [16, 3, 16, 16, 16] 4x4 stride 4    |           96.4        |         113.6       |          91.9
      [16, 3, 16, 16, 16] 4x4 stride 2    |           97.4        |         108.1       |          93.5
      [16, 3, 16, 16, 16] 4x4 stride 1    |           99.0        |         107.5       |          92.1
      [16, 3, 16, 16, 16] 2x2 stride 4    |           96.9        |         118.1       |          93.4
      [16, 3, 16, 16, 16] 2x2 stride 2    |           97.3        |         106.7       |          95.8
      [16, 3, 16, 16, 16] 2x2 stride 1    |           98.8        |         109.2       |          93.8
      [16, 3, 4, 4, 4] 4x4 stride 4       |           97.8        |         108.0       |          94.2
      [16, 3, 4, 4, 4] 4x4 stride 2       |           92.7        |         108.0       |          93.9
      [16, 3, 4, 4, 4] 4x4 stride 1       |           97.8        |         107.6       |          93.5
      [16, 3, 4, 4, 4] 2x2 stride 4       |          100.3        |         107.7       |          94.3
      [16, 3, 4, 4, 4] 2x2 stride 2       |           97.2        |         107.5       |          96.1
      [16, 3, 4, 4, 4] 2x2 stride 1       |           98.1        |         111.1       |          93.8

Times are in microseconds (us).
```

Performance comparison on V100:
(these times have been updated after working around some noisy measurements in my setup)
```
[------------------------------------- max_pool3d ---------------------------------------------------------]
                                          |  channels_last=False  |  curr ch_last=True |  new ch_last=True
1 threads: -------------------------------------------------------------------------------------------------
      [64, 256, 32, 32, 32] 4x4 stride 4  |        15810.7        |       33807.7      |        16452.9
      [64, 256, 32, 32, 32] 4x4 stride 2  |        24422.7        |       42515.3      |        27700.3
      [64, 256, 32, 32, 32] 4x4 stride 1  |        71756.0        |       89916.5      |       106464.0
      [64, 256, 32, 32, 32] 2x2 stride 4  |        12102.9        |       30210.4      |        11319.8
      [64, 256, 32, 32, 32] 2x2 stride 2  |        19101.7        |       37210.8      |        20373.3
      [64, 256, 32, 32, 32] 2x2 stride 1  |        41418.0        |       59650.5      |        53009.2
      [64, 256, 16, 16, 16] 4x4 stride 4  |         2362.0        |        4210.3      |         2114.0
      [64, 256, 16, 16, 16] 4x4 stride 2  |         4102.4        |        5897.4      |         3179.7
      [64, 256, 16, 16, 16] 4x4 stride 1  |        11339.3        |       13116.6      |        10032.6
      [64, 256, 16, 16, 16] 2x2 stride 4  |         1709.7        |        3506.7      |         1423.6
      [64, 256, 16, 16, 16] 2x2 stride 2  |         2966.6        |        4760.8      |         2499.3
      [64, 256, 16, 16, 16] 2x2 stride 1  |         6998.4        |        8797.3      |         6152.0
      [64, 256, 4, 4, 4] 4x4 stride 4     |          173.0        |         176.3      |          127.9
      [64, 256, 4, 4, 4] 4x4 stride 2     |          149.1        |         176.3      |          125.5
      [64, 256, 4, 4, 4] 4x4 stride 1     |          150.0        |         177.2      |          125.6
      [64, 256, 4, 4, 4] 2x2 stride 4     |          158.0        |         192.7      |          127.9
      [64, 256, 4, 4, 4] 2x2 stride 2     |          169.7        |         199.2      |          125.3
      [64, 256, 4, 4, 4] 2x2 stride 1     |          289.6        |         318.2      |          116.5
      [64, 64, 32, 32, 32] 4x4 stride 4   |         3914.4        |        6993.3      |         4141.4
      [64, 64, 32, 32, 32] 4x4 stride 2   |         6107.4        |        9186.4      |         6378.5
      [64, 64, 32, 32, 32] 4x4 stride 1   |        17920.0        |       20993.5      |        23891.1
      [64, 64, 32, 32, 32] 2x2 stride 4   |         3029.7        |        6112.6      |         2895.6
      [64, 64, 32, 32, 32] 2x2 stride 2   |         4787.8        |        7870.6      |         4724.8
      [64, 64, 32, 32, 32] 2x2 stride 1   |        10366.4        |       13446.4      |        12603.8
      [64, 64, 16, 16, 16] 4x4 stride 4   |          605.8        |         962.9      |          499.7
      [64, 64, 16, 16, 16] 4x4 stride 2   |         1037.0        |        1394.8      |          791.6
      [64, 64, 16, 16, 16] 4x4 stride 1   |         2835.4        |        3191.8      |         2484.3
      [64, 64, 16, 16, 16] 2x2 stride 4   |          438.6        |         795.7      |          368.6
      [64, 64, 16, 16, 16] 2x2 stride 2   |          749.1        |        1108.0      |          612.0
      [64, 64, 16, 16, 16] 2x2 stride 1   |         1756.4        |        2112.2      |         1538.5
      [64, 64, 4, 4, 4] 4x4 stride 4      |          132.6        |         163.9      |          115.4
      [64, 64, 4, 4, 4] 4x4 stride 2      |          129.3        |         153.7      |          117.8
      [64, 64, 4, 4, 4] 4x4 stride 1      |          128.0        |         153.8      |          117.6
      [64, 64, 4, 4, 4] 2x2 stride 4      |          128.2        |         154.1      |          117.5
      [64, 64, 4, 4, 4] 2x2 stride 2      |          130.5        |         157.3      |          117.6
      [64, 64, 4, 4, 4] 2x2 stride 1      |          128.8        |         156.4      |          120.6
      [64, 3, 32, 32, 32] 4x4 stride 4    |          200.4        |         261.0      |          228.8
      [64, 3, 32, 32, 32] 4x4 stride 2    |          305.3        |         366.5      |          344.4
      [64, 3, 32, 32, 32] 4x4 stride 1    |          860.9        |         922.1      |         1136.0
      [64, 3, 32, 32, 32] 2x2 stride 4    |          157.0        |         216.9      |          158.1
      [64, 3, 32, 32, 32] 2x2 stride 2    |          240.5        |         300.9      |          247.7
      [64, 3, 32, 32, 32] 2x2 stride 1    |          503.5        |         565.1      |          609.8
      [64, 3, 16, 16, 16] 4x4 stride 4    |          136.0        |         159.0      |          120.3
      [64, 3, 16, 16, 16] 4x4 stride 2    |          131.2        |         156.9      |          120.0
      [64, 3, 16, 16, 16] 4x4 stride 1    |          146.6        |         158.5      |          123.8
      [64, 3, 16, 16, 16] 2x2 stride 4    |          133.8        |         158.4      |          117.1
      [64, 3, 16, 16, 16] 2x2 stride 2    |          132.1        |         160.8      |          117.9
      [64, 3, 16, 16, 16] 2x2 stride 1    |          133.7        |         174.4      |          118.0
      [64, 3, 4, 4, 4] 4x4 stride 4       |          156.8        |         166.2      |          119.4
      [64, 3, 4, 4, 4] 4x4 stride 2       |          126.8        |         150.4      |          118.2
      [64, 3, 4, 4, 4] 4x4 stride 1       |          125.2        |         151.7      |          117.8
      [64, 3, 4, 4, 4] 2x2 stride 4       |          127.3        |         152.7      |          116.2
      [64, 3, 4, 4, 4] 2x2 stride 2       |          128.6        |         153.3      |          114.6
      [64, 3, 4, 4, 4] 2x2 stride 1       |          128.6        |         153.5      |          114.7
      [16, 256, 32, 32, 32] 4x4 stride 4  |         3921.7        |        8445.7      |         4064.7
      [16, 256, 32, 32, 32] 4x4 stride 2  |         6111.7        |       10630.0      |         6944.4
      [16, 256, 32, 32, 32] 4x4 stride 1  |        17938.9        |       22896.8      |        26648.7
      [16, 256, 32, 32, 32] 2x2 stride 4  |         3029.6        |        7552.7      |         2840.9
      [16, 256, 32, 32, 32] 2x2 stride 2  |         4788.0        |        9322.1      |         5110.5
      [16, 256, 32, 32, 32] 2x2 stride 1  |        10363.7        |       14885.9      |        13213.6
      [16, 256, 16, 16, 16] 4x4 stride 4  |          606.0        |        1059.1      |          535.9
      [16, 256, 16, 16, 16] 4x4 stride 2  |         1037.5        |        1491.5      |          822.3
      [16, 256, 16, 16, 16] 4x4 stride 1  |         2835.4        |        3306.8      |         2522.8
      [16, 256, 16, 16, 16] 2x2 stride 4  |          438.6        |         892.3      |          369.0
      [16, 256, 16, 16, 16] 2x2 stride 2  |          749.2        |        1203.7      |          638.7
      [16, 256, 16, 16, 16] 2x2 stride 1  |         1756.1        |        2212.5      |         1547.0
      [16, 256, 4, 4, 4] 4x4 stride 4     |          159.6        |         187.6      |          117.6
      [16, 256, 4, 4, 4] 4x4 stride 2     |          161.1        |         185.5      |          117.3
      [16, 256, 4, 4, 4] 4x4 stride 1     |          160.0        |         148.1      |          117.8
      [16, 256, 4, 4, 4] 2x2 stride 4     |          123.9        |         148.3      |          117.6
      [16, 256, 4, 4, 4] 2x2 stride 2     |          126.0        |         151.7      |          117.4
      [16, 256, 4, 4, 4] 2x2 stride 1     |          127.1        |         152.3      |          117.9
      [16, 64, 32, 32, 32] 4x4 stride 4   |          983.5        |        1756.7      |         1067.8
      [16, 64, 32, 32, 32] 4x4 stride 2   |         1542.4        |        2315.2      |         1621.5
      [16, 64, 32, 32, 32] 4x4 stride 1   |         4498.7        |        5273.4      |         6006.7
      [16, 64, 32, 32, 32] 2x2 stride 4   |          767.2        |        1543.4      |          736.7
      [16, 64, 32, 32, 32] 2x2 stride 2   |         1207.8        |        1981.5      |         1197.0
      [16, 64, 32, 32, 32] 2x2 stride 1   |         2603.3        |        3367.5      |         3161.9
      [16, 64, 16, 16, 16] 4x4 stride 4   |          169.5        |         264.6      |          142.8
      [16, 64, 16, 16, 16] 4x4 stride 2   |          274.6        |         368.9      |          216.8
      [16, 64, 16, 16, 16] 4x4 stride 1   |          723.3        |         820.4      |          643.2
      [16, 64, 16, 16, 16] 2x2 stride 4   |          131.4        |         216.0      |          116.1
      [16, 64, 16, 16, 16] 2x2 stride 2   |          199.9        |         295.0      |          166.8
```
CC @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80748
Approved by: https://github.com/ngimel
2022-07-08 04:26:01 +00:00
Michael Gschwind
25449292a0 Run mask test with and without nested tensor (#81008)
Summary: Run mask test with and without nested tensor

Test Plan: sandcastle

Differential Revision: D37665532

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81008
Approved by: https://github.com/malfet
2022-07-07 23:54:37 +00:00
Animesh Jain
1d90d6ee60 Setup for running PyTorch tests with TorchDynamo and skips for known failing tests (#80106)
@ezyang I am going to keep adding more skips in this PR for now. Once we have the CI running, I will replace them with the appropriate decorators.

cc @mlazos, we should add those tests to test_ops.py in this PR as well

cc @jansel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80106
Approved by: https://github.com/ezyang, https://github.com/jansel
2022-07-07 18:57:33 +00:00
albanD
c8d64ba5ec Allow register float16 weight_norm on cpu and speed up test (#80600)
Fixes https://github.com/pytorch/pytorch/issues/80599
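
Roughly what this makes possible (sketch using the old-style `torch.nn.utils.weight_norm` hook; exact coverage follows the linked issue):
```python
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

# Registering weight_norm on a float16 module on CPU exercises the Half
# norm kernel; per the linked issue, this op was previously unavailable
# for Half on CPU.
m = weight_norm(nn.Linear(8, 4).to(torch.float16))
print(m.weight_g.dtype, m.weight_v.dtype)   # torch.float16 torch.float16
```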

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80600
Approved by: https://github.com/malfet
2022-06-30 13:50:39 +00:00
otaj
db52e4b7d9 Bugfix/weakref (#80139)
Fixes #78580

I'm back! :)

cc @albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80139
Approved by: https://github.com/albanD
2022-06-28 14:51:42 +00:00