Automatic fixes that replace certain list comprehensions with generator expressions where appropriate, so that they are immediately consumed. This is preview functionality in ruff for rule C419 and was applied automatically.
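For example, C419 targets comprehensions that eagerly build a throwaway list inside calls like `any()`/`all()`; the autofix turns them into generator expressions that are consumed directly:

```python
nums = [1, 2, 3]

# Before: C419 flags this -- the list comprehension materializes a full
# throwaway list before any() ever sees it
before = any([n > 2 for n in nums])

# After the autofix: the generator expression is consumed lazily, so
# any() can short-circuit without materializing a list
after = any(n > 2 for n in nums)
```

Both forms return the same result; the generator form just avoids the intermediate allocation.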
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123960
Approved by: https://github.com/malfet
Summary: To unblock training where `upsample_nearest2d` involves input or output tensors larger than 2^31 elements. This comes up frequently in image & video applications.
Test Plan:
```
buck2 test mode/opt //caffe2/test:test_nn_cuda -- test_upsamplingnearest2d_backward_64bit_indexing
```
Benchmarking (N5207417):
```
device_ms, cpu_ms, gb/device_ms*1000
# before changes
118.03993721008301 124.09385920000001 98.72685525972494
# after changes
118.05780944824218 124.10893509999994 98.71190944734577
```
Differential Revision: D55625666
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123682
Approved by: https://github.com/ezyang
This PR proposes to keep the original key order of the state_dict, as the issue creator proposed. It also fixes a bug in how ``_metadata`` is handled (see below), along with other small changes to properly remove the prefix when it is present.
In the original code, ``_metadata`` was handled as a ``key``.
```
# also strip the prefix in metadata if any.
if "_metadata" in state_dict:
```
This is not the case: ``_metadata`` is actually an ``attribute`` of the state_dict. Hence, the previous condition is changed to:
```
# also strip the prefix in metadata if any.
if hasattr(state_dict, "_metadata"):
```
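The overall behavior can be sketched as follows. This is a simplified stand-alone version (the real helper, `consume_prefix_in_state_dict_if_present`, mutates the dict in place); `OrderedDict` stands in for the state_dict type, which supports instance attributes such as `_metadata`:

```python
from collections import OrderedDict

def strip_prefix(state_dict, prefix):
    # Rebuild the dict in its original key order, removing the prefix
    # from keys where it is present.
    out = OrderedDict(
        (k[len(prefix):] if k.startswith(prefix) else k, v)
        for k, v in state_dict.items()
    )
    # _metadata is an attribute of the state_dict, not a key, so it must
    # be looked up with getattr and its own keys stripped as well.
    metadata = getattr(state_dict, "_metadata", None)
    if metadata is not None:
        out._metadata = OrderedDict(
            (k[len(prefix):] if k.startswith(prefix) else k, v)
            for k, v in metadata.items()
        )
    return out
```

Note that iterating `state_dict.items()` in insertion order is what preserves the original ordering.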
This PR also includes the necessary test.
Fixes #106942
Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117464
Approved by: https://github.com/mikaylagawarecki
Fixes #121093
Previously, calling the following functions with invalid padding dimensions would cause a segmentation fault:
```
torch._C._nn.replication_pad1d, torch._C._nn.replication_pad2d, torch._C._nn.replication_pad3d
```
To fix, added condition checks that raise a RuntimeError with a descriptive message, specifying the input dimensions required, instead of segfaulting.
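A minimal sketch of the kind of check added (the function name and exact message here are illustrative, not the actual ATen code): `replication_pad1d` accepts 2D `(C, W)` or 3D `(N, C, W)` inputs, so any other rank should raise rather than read out of bounds.

```python
def check_replication_pad1d_input(shape):
    # Hypothetical validation mirroring the fix: reject inputs whose
    # rank is neither 2 (C, W) nor 3 (N, C, W) with a clear error.
    if len(shape) not in (2, 3):
        raise RuntimeError(
            f"expected 2D or 3D input, but got input with {len(shape)} dims"
        )
    return True
```

The 2D/3D variants would apply the same pattern with their own expected ranks.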
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121298
Approved by: https://github.com/mikaylagawarecki
Added `torch.__future__.{get/set}_swap_module_params_on_conversion`, which defaults to `False` for now; we will probably want to override this to default to `True` in `nn.Module._apply` when the input is a tensor subclass.
From offline discussion, for now we are **not** allowing `swap_tensor` after the first module forward has been run*** if the autograd graph is still alive. The reason being that `torch.utils.swap_tensors(t1, t2)` requires the `use_count` of both `TensorImpl`s associated with `t1` and `t2` to be 1. The first forward pass will install `AccumulateGrad` nodes on each param, which [bump the refcount of the associated TensorImpl](6cf1fc66e3/torch/csrc/autograd/variable.cpp (L307)). **Future work might be to swap the refs that the `AccumulateGrad` nodes hold if it is necessary.**
***From this, it might seem like we don't need to handle gradients. However, I still handle the grads for the edge case that the grads are set via `p.grad = grad` OR the autograd graph is no longer alive because the output has been garbage collected.
If any `swap_tensors` fails on any of the parameters in the `nn.Module` we raise an error.
**`RNNBase` overrides `nn.Module._apply()` and installs weakrefs on some parameters. As a result, all modules that inherit from `RNNBase` (`RNN`, `GRU`, and `LSTM`) cannot use the `swap_tensors` path as of now.**
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117167
Approved by: https://github.com/albanD
ghstack dependencies: #118028
Description:
- Fixed error in bicubic upsampling aa=false for uint8 input. This is seen in the test suite:
```diff
- self.assertLess(diff.max(), 15)
+ self.assertLess(diff.max(), 5)
```
Reducing the input range does not fully remove the clipping effect, which is why the threshold is 5 and not around 1.
- Renamed methods
- The error is mostly visible for upsampling (smaller -> larger) mode on the boundary values
More details on the bug:
For uint8 input and antialiasing=False we use a separable algorithm (using temp buffers and interpolating dimensions one by one) where interpolation weights and input indices are computed and stored using index ranges: `index_min` and `index_size`; weights outside of `index_size` are zero. For example, for an output point we can have index_min=10, index_size=4, and 4 non-zero weights, so the output value is computed as
```
out_value = sum([src[i + index_min] * w for i, w in zip(range(4), weights) ])
```
When computing index ranges and weights for output points near the boundaries, we clamp `index_min` between 0 and input_size, and `index_size` becomes smaller than 4. This approach is OK for antialiasing=True but not for antialiasing=False, where the weights end up computed incorrectly:
```
-- output index i= 0
regular float32 approach:
source indices: [-2, -1, 0, 1] -> outbounded values are clamped to boundaries -> [0, 0, 0, 1]
interp weights: [-0.07200000000000006, 0.4600000000000001, 0.72, -0.1080000000000001]
separable uint8 approach:
source indices coming from index ranges (min, size): [0, 1]
incorrect interp weights computed with current implementation : [1.1764705882352944, -0.17647058823529432, 0.0, 0.0]
fixed interp weights in the PR: [1.108, -0.1080000000000001, 0.0, 0.0]
Note: weight value corresponding to source index 0 is 1.108 = -0.07200000000000006 + 0.4600000000000001 + 0.72 and weight value corresponding to source index 1 is -0.1080000000000001 is the same as in f32 approach.
```
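In other words, the fix folds the weights of clamped out-of-bounds source indices into the boundary index. A pure-Python sketch (not the actual C++ implementation) reproducing the numbers above:

```python
def fold_weights(indices, weights, input_size):
    # Clamp out-of-bounds source indices into [0, input_size - 1] and
    # accumulate their weights at the clamped (boundary) index, matching
    # what the regular float32 path does implicitly via index clamping.
    folded = {}
    for i, w in zip(indices, weights):
        i = min(max(i, 0), input_size - 1)
        folded[i] = folded.get(i, 0.0) + w
    return folded

src = [-2, -1, 0, 1]
w = [-0.07200000000000006, 0.4600000000000001, 0.72, -0.1080000000000001]
folded = fold_weights(src, w, 400)
# index 0 accumulates -0.072 + 0.46 + 0.72 = 1.108; index 1 keeps -0.108
```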
Quick benchmark to ensure no perf regression:
```
[------------------------------------------------------------------------------------ Resize ------------------------------------------------------------------------------------]
| torch (2.3.0a0+gitfda85a6) PR | torch (2.3.0a0+git0d1e705) Nightly | Speed-up: PR vs Nightly
1 threads: -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
3 torch.uint8 channels_first bilinear (400, 400) -> (224, 224) aa=False | 440.996 (+-2.044) | 470.824 (+-5.927) | 1.068 (+-0.000)
3 torch.uint8 channels_first bicubic (400, 400) -> (224, 224) aa=False | 463.565 (+-1.519) | 497.231 (+-10.825) | 1.073 (+-0.000)
3 torch.uint8 channels_first bilinear (400, 400) -> (700, 700) aa=False | 1717.000 (+-28.589) | 1915.570 (+-43.397) | 1.116 (+-0.000)
3 torch.uint8 channels_first bicubic (400, 400) -> (700, 700) aa=False | 1801.954 (+-22.391) | 1981.501 (+-37.034) | 1.100 (+-0.000)
3 torch.uint8 channels_last bilinear (400, 400) -> (224, 224) aa=False | 199.599 (+-0.851) | 196.535 (+-3.788) | 0.985 (+-0.000)
3 torch.uint8 channels_last bicubic (400, 400) -> (224, 224) aa=False | 243.126 (+-0.681) | 240.695 (+-2.306) | 0.990 (+-0.000)
3 torch.uint8 channels_last bilinear (400, 400) -> (700, 700) aa=False | 686.270 (+-2.870) | 687.769 (+-17.863) | 1.002 (+-0.000)
3 torch.uint8 channels_last bicubic (400, 400) -> (700, 700) aa=False | 899.509 (+-5.377) | 899.063 (+-9.001) | 1.000 (+-0.000)
Times are in microseconds (us).
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118389
Approved by: https://github.com/NicolasHug
ghstack dependencies: #118388
Description:
- Lowered error thresholds and added an input range for bicubic to expose the inconsistency in the implementation of upsampling (smaller -> larger) bicubic aa=false mode for uint8 input dtype
- Updated out-dated comments
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118388
Approved by: https://github.com/NicolasHug
These operators are not used and have been deprecated since #72690
(Feb 2022).
BC-breaking message:
`TorchScript` models that were exported with the deprecated
`torch.jit.quantized` API will no longer be loadable, as the required
internal operators have been removed.
Please re-export your models using the newer `torch.ao.quantization` API
instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112153
Approved by: https://github.com/jerryzh168
Note about the Updates:
This PR:
1. Skips more flash attention related UTs on MI200
2. Fixes additional ATen compilation errors after hipification
3. Fixes the author "root" of a specific commit
4. Includes the patch from Nikita in favor of block-level static initialization.
CAVEAT: This revised PR has a commit that modifies the CI to force its running on MI200 nodes. That specific commit must be reverted before merge.
Original PR (https://github.com/pytorch/pytorch/pull/114309) Note:
This pull request adds initial Flash Attention support for the AMD/ROCm platform. It adds a specialized Triton repository/branch as a compile-time dependency for the Flash Attention math library on AMD/ROCm. This Triton submodule is not used at runtime and will not be shipped in the final PyTorch package. We plan to release this specialized Triton as a separate project.
Known limitations:
- Only supports MI200 series GPUs (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`).
- Only supports power of two sequence lengths.
- No support for varlen APIs.
- Only supports head dimensions 16, 32, 64, and 128.
- Performance is still being optimized.
Fixes #112997
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115981
Approved by: https://github.com/malfet
Updates flake8 to v6.1.0 and fixes a few lints using sed and some ruff tooling.
- Replace `assert(0)` with `raise AssertionError()`
- Remove extraneous parentheses, e.g.
- `assert(a == b)` -> `assert a == b`
- `if(x > y or y < z):`->`if x > y or y < z:`
- And `return('...')` -> `return '...'`
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116591
Approved by: https://github.com/albanD, https://github.com/malfet
I approved https://github.com/pytorch/pytorch/pull/110850, which did the following:
Previously:
`num_batches_tracked` not in state_dict when doing `m.load_state_dict(state_dict)` --> always overwrite module's `num_batches_tracked` in `load_from_state_dict` with a 0 cpu tensor
Now:
`num_batches_tracked` not in state_dict loaded when doing `m.load_state_dict(state_dict)` --> only overwrite module's `num_batches_tracked` in `load_from_state_dict` with a 0 cpu tensor if module does not have `num_batches_tracked`
This causes the following issue:
```
with torch.device('meta'):
m = BatchNorm(...)
m.load_state_dict(state_dict, assign=True)
```
If `num_batches_tracked` is not in `state_dict`, since the module's `num_batches_tracked` is present on the meta device, it is not overwritten with a 0 CPU tensor. When compiling, this error is raised:
```
AssertionError: Does not support mixing cuda+meta
```
I am not sure whether the explicit check for meta device makes sense as a fix; I will add testing if this fix is acceptable.
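A minimal sketch of the proposed conditional (the names and the `DummyTensor` stand-in are hypothetical; the real logic lives in BatchNorm's `_load_from_state_dict`):

```python
class DummyTensor:
    """Toy stand-in for a tensor that only tracks its device (hypothetical)."""
    def __init__(self, device="cpu", value=0):
        self.device = device
        self.value = value

def resolve_num_batches_tracked(current, state_dict):
    # Checkpoint has the buffer: always use the loaded value.
    if "num_batches_tracked" in state_dict:
        return state_dict["num_batches_tracked"]
    # Checkpoint lacks the buffer: only reset to a fresh 0 CPU tensor when
    # the module's buffer is missing or still a meta-device placeholder,
    # so real materialized state survives but a meta placeholder does not.
    if current is None or current.device == "meta":
        return DummyTensor("cpu", 0)
    return current
```

Under this sketch, the meta-device buffer from the compile scenario above would be replaced by a CPU tensor, avoiding the cuda+meta mixing.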
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115285
Approved by: https://github.com/albanD
TestNNDeviceTypeCUDA.test_softmax_forward_64bit_indexing_cuda started failing for ROCm after #112096 with the message:
```
torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 13.35 GiB. GPU 0 has a total capacity of 31.98 GiB of which 3.89 GiB is free. Of the allocated memory 26.69 GiB is allocated by PyTorch, and 18.91 MiB is reserved by PyTorch but unallocated.
```
This amounts to approximately 41GB. The test is currently decorated with `largeTensorTest("30GB", "cuda")` but this is not sufficient for ROCm.
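The decorator's mechanism amounts to a memory-gated skip. A toy version for illustration (not PyTorch's actual implementation, which queries the device allocator; `free_gb_fn` is a hypothetical stand-in):

```python
import functools
import unittest

def large_tensor_test(required_gb, free_gb_fn):
    # Skip the wrapped test unless at least `required_gb` of memory is
    # reported free by `free_gb_fn` at call time.
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            free = free_gb_fn()
            if free < required_gb:
                raise unittest.SkipTest(
                    f"test requires {required_gb} GB, only {free} GB available"
                )
            return fn(*args, **kwargs)
        return wrapper
    return decorator
```

Bumping the declared requirement from 30 GB toward the observed ~41 GB makes the skip fire on the 32 GB ROCm runners instead of OOMing.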
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113093
Approved by: https://github.com/malfet
For #108345, #111484
Addresses the forward kernels implicated in the issues, but will take another look at the backward kernels (in follow-up PRs if necessary).
The spatial softmax kernel is changed to use signed integer indexing rather than unsigned as `ScalarType` only has signed integer types declared for now, but this should be a minor change.
CC @ptrblck @crcrpar (who landed a few related PRs recently).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112096
Approved by: https://github.com/mikaylagawarecki
This adds bfloat16 support to `torch.nn.functional.grid_sample`. This is particularly important when doing feature sampling, such as for rendering techniques used in PyTorch3D or for camera projections to voxel grids as in SimpleBEV.
Related to #57707
Test plan:
```
pytest test/test_nn.py -k grid_sample
pytest test/test_ops.py -k grid_sample
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112331
Approved by: https://github.com/zou3519
Running `python test_nn.py -v -k test_nll_loss_large_tensor` on a machine with a small host RAM availability (e.g. ~50GB) fails with a `SIGKILL` even though the currently specified memory requirements for CPU (and GPU) are set to 48GB and are thus met.
Profiling the peak memory usage via:
```
\time -v python test_nn.py -v -k test_nll_loss_large_tensor
```
and adding `print(torch.cuda.memory_summary())` at the end of the test shows a higher host RAM usage of >100GB and a device memory usage of ~32GB.
```
Command being timed: "python test_nn.py -v -k test_nll_loss_large_tensor"
User time (seconds): 81.66
System time (seconds): 229.02
Percent of CPU this job got: 671%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:46.30
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 118150096
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 90280839
Voluntary context switches: 1669
Involuntary context switches: 1214548
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
```
```
| PyTorch CUDA memory summary, device ID 0 |
|---------------------------------------------------------------------------|
| CUDA OOMs: 0 | cudaMalloc retries: 0 |
|===========================================================================|
| Metric | Cur Usage | Peak Usage | Tot Alloc | Tot Freed |
|---------------------------------------------------------------------------|
| Allocated memory | 32769 MiB | 32769 MiB | 81923 MiB | 49154 MiB |
| from large pool | 32768 MiB | 32768 MiB | 81921 MiB | 49152 MiB |
| from small pool | 0 MiB | 0 MiB | 1 MiB | 1 MiB |
|---------------------------------------------------------------------------|
| Active memory | 32769 MiB | 32769 MiB | 81923 MiB | 49154 MiB |
| from large pool | 32768 MiB | 32768 MiB | 81921 MiB | 49152 MiB |
| from small pool | 0 MiB | 0 MiB | 1 MiB | 1 MiB |
|---------------------------------------------------------------------------|
| Requested memory | 32769 MiB | 32769 MiB | 81923 MiB | 49154 MiB |
| from large pool | 32768 MiB | 32768 MiB | 81921 MiB | 49152 MiB |
| from small pool | 0 MiB | 0 MiB | 1 MiB | 1 MiB |
|---------------------------------------------------------------------------|
| GPU reserved memory | 32774 MiB | 32774 MiB | 81938 MiB | 49164 MiB |
| from large pool | 32772 MiB | 32772 MiB | 81930 MiB | 49158 MiB |
| from small pool | 2 MiB | 2 MiB | 8 MiB | 6 MiB |
|---------------------------------------------------------------------------|
...
```
We haven't seen this issue before as the majority of our runners have sufficient host RAM and I just ran into it by chance.
CC @atalman @malfet @crcrpar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110963
Approved by: https://github.com/mikaylagawarecki, https://github.com/eqy, https://github.com/malfet