pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Edward Z. Yang	f7c9ef88f5	Add masked_select abstract impl (#110103 ) Fixes https://github.com/pytorch/pytorch/issues/109871 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/110103 Approved by: https://github.com/bdhirsh	2023-09-27 04:07:58 +00:00
SS-JIA	dec140f1ea	[core IR] Add a core decomposition for aten.all (#110093 ) ## Context Change the ref implementation of `aten.all` to only use other `torch` operators such that we can use it for the core ATen decomposition table. This will replace the decomposition for `aten.all` that was used specifically by Inductor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110093 Approved by: https://github.com/manuelcandales, https://github.com/peterbell10, https://github.com/lezcano	2023-09-27 01:31:41 +00:00
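The commit message does not say which operators the new reference for `aten.all` is built from. One identity that lets an "all" reduction be expressed through other reductions is De Morgan's law, `all(x) == not any(not x)`. The plain-Python sketch below illustrates only that identity; the function name is hypothetical and this is an assumption about the approach, not the code from the PR.

```python
def all_via_any(values):
    """Return True when every element is truthy, using only `any` and negation."""
    return not any(not v for v in values)


print(all_via_any([1, 2, 3]))  # True
print(all_via_any([1, 0, 3]))  # False
print(all_via_any([]))         # True (vacuous truth, same as the built-in all)
```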
Richard Zou	bb9779ecd2	Revert D49640259: Revert D49615962: [optests] Test names in failure dicts should be prefixed with test class (#110094 ) Summary: Revert D49640259: Revert D49615962: [optests] Test names in failure dicts should Test Plan: revert-hammer Differential Revision: D49645397 Pull Request resolved: https://github.com/pytorch/pytorch/pull/110094 Approved by: https://github.com/izaitsevfb	2023-09-26 21:16:36 +00:00
Jane Xu	0a60219fe3	[foreach] Fix 0-size handling for real for real (#109402 ) @crcrpar's last attempt to fix the 0-size problem unfortunately did not pass all cases. See my comment in https://github.com/pytorch/pytorch/issues/100701. When we have a tail tensor of size 0, the old code would mess with the chunk logic to check the previous tensor's length. This is flawed because: 1. if the previous tensor was also 0 sized, (so a tensor list of [tensor, tensor, tensor, ..., 0-sized tensor, 0-sized tensor],) chunks would still be 0 and the nested for loop would be missed. 2. the nested forloop pronounces side effects on tensorListMeta that _shouldn't_ be there! This can mess up the compute in unexpected ways that I haven't really needed to reason through. We noticed that the problem had not been fixed due to an internal report. This PR solves the issue by: - removing the finagling of chunks when the tail tensor is 0-sized - adding a surefire way for the kernel to be launched in the case where the last tensor is 0-sized AND there's content in the metadata, signifying there is stuff to compute still. ## test plan As I went through the code, I also added some comments explaining what's up and modified our tensor inputs to ensure that this case is tested in the test_parity test in test_foreach.py. Yes, I do realize there is quite a bit of duplication and that this file could be due for a refactor. That said, the primary goal of this PR is to fix the pretty egregious bug and refactoring can be a followup. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109402 Approved by: https://github.com/albanD	2023-09-26 17:38:20 +00:00
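The failure mode described above can be modelled in a few lines of plain Python. This is a deliberately simplified sketch, not the CUDA multi-tensor code: `plan_launches`, `chunk_size` and `max_blocks_per_launch` are hypothetical names, and "launching" is modelled as appending a batch of `(tensor_index, chunk_index)` pairs to a list. The point it illustrates is the final flush: a 0-sized tail tensor contributes no chunks, so pending work must still be launched after the loop.

```python
def plan_launches(sizes, chunk_size, max_blocks_per_launch):
    """Group (tensor, chunk) work items into launches (simplified model)."""
    launches = []
    pending = []  # metadata accumulated for the next launch
    for tensor_index, numel in enumerate(sizes):
        # A 0-sized tensor yields zero chunks, so the inner loop never runs.
        chunks = (numel + chunk_size - 1) // chunk_size
        for chunk_index in range(chunks):
            pending.append((tensor_index, chunk_index))
            if len(pending) == max_blocks_per_launch:
                launches.append(pending)
                pending = []
    # Flush whatever is left, even when the last tensors were 0-sized.
    if pending:
        launches.append(pending)
    return launches


# Tail of two empty tensors: the work for tensor 0 is still launched.
print(plan_launches([5, 0, 0], chunk_size=4, max_blocks_per_launch=8))
```

Without the final flush, a launch decision tied to "the last tensor's last chunk" would never fire in the example, because the last tensor has no chunks at all.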
Li-Huai (Allan) Lin	d91492a7a4	[MPS] Fix sort with empty tensor. (#109584 ) Fixes #107284 Pull Request resolved: https://github.com/pytorch/pytorch/pull/109584 Approved by: https://github.com/kulinseth, https://github.com/albanD ghstack dependencies: #109557, #109574	2023-09-26 16:30:38 +00:00
PyTorch MergeBot	2393864070	Revert "[optests] Test names in failure dicts should be prefixed with test class (#110045 )" This reverts commit `76fcec74c4`. Reverted https://github.com/pytorch/pytorch/pull/110045 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/110045#issuecomment-1735711094))	2023-09-26 14:56:08 +00:00
rzou	ea20db8aa0	[optests] Excise unused operator_compile_check (#110011 ) The recommendation is to just use `opcheck`, which has superseded all uses of `operator_compile_check`. Test Plan: - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/110011 Approved by: https://github.com/ezyang ghstack dependencies: #109912	2023-09-26 13:24:21 +00:00
rzou	76fcec74c4	[optests] Test names in failure dicts should be prefixed with test class (#110045 ) We want to use the same failures dict for multiple TestCase classes. This is common in e.g. fbgemm. To move towards that, we need to prefix each test name with its test class to avoid ambiguity. Differential Revision: [D49615962](https://our.internmc.facebook.com/intern/diff/D49615962/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/110045 Approved by: https://github.com/williamwen42	2023-09-26 03:21:12 +00:00
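A small sketch of why the class prefix removes the ambiguity. The helper `qualified_test_name` and the flat dict layout are hypothetical and chosen for illustration; the commit does not show the real failures-dict format.

```python
import unittest


def qualified_test_name(test_case: unittest.TestCase) -> str:
    """Build a 'ClassName.test_method' key from a TestCase instance (hypothetical helper)."""
    # TestCase.id() returns 'module.ClassName.test_method'; keep the last two parts.
    return ".".join(test_case.id().rsplit(".", 2)[-2:])


class TestA(unittest.TestCase):
    def test_schema(self):
        pass


class TestB(unittest.TestCase):
    def test_schema(self):
        pass


# Both classes define test_schema; unprefixed keys would collide in one dict.
failures = {
    qualified_test_name(TestA("test_schema")): "xfail",
    qualified_test_name(TestB("test_schema")): "skip",
}
print(sorted(failures))  # ['TestA.test_schema', 'TestB.test_schema']
```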
CaoE	7c9052165a	add fp16 support for native conv and deconv on CPU (#99497 ) ### Testing Native conv vs. mkldnn conv on SPR (with avx512_fp16 support) Single core: Input \| Naïve impl / us \| oneDNN / us \| Speed up -- \| -- \| -- \| -- IC: 64, OC: 256, kernel: 1, stride: 1, N: 256, H: 56, W: 56, G: 1, pad: 0 \| 34676789 \| 524199.8 \| 66.15185 IC: 128, OC: 512, kernel: 1, stride: 1, N: 256, H: 28, W: 28, G: 1, pad: 0 \| 33454125 \| 349844.4 \| 95.62573 IC: 256, OC: 256, kernel: 3, stride: 1, N: 1, H: 16, W: 16, G: 1, pad: 0 \| 317650.1 \| 2317.677 \| 137.0554 IC: 128, OC: 256, kernel: 3, stride: 1, N: 1, L: 64 \| 15334.68 \| 167.264 \| 91.67952 56 cores: Input \| Naïve impl / us \| oneDNN / us \| Speed up -- \| -- \| -- \| -- IC: 64, OC: 256, kernel: 1, stride: 1, N: 256, H: 56, W: 56, G: 1, pad: 0 \| 1032064 \| 11073.58 \| 93.20061 IC: 128, OC: 512, kernel: 1, stride: 1, N: 256, H: 28, W: 28, G: 1, pad: 0 \| 1000097 \| 16371.19 \| 61.08883 IC: 256, OC: 1024, kernel: 1, stride: 1, N: 256, H: 14, W: 14, G: 1, pad: 0 \| 981813.4 \| 9008.908 \| 108.9825 IC: 1024, OC: 256, kernel: 1, stride: 1, N: 256, H: 14, W: 14, G: 1, pad: 0 \| 1082606 \| 10150.47 \| 106.6558 IC: 256, OC: 256, kernel: 3, stride: 1, N: 1, H: 16, W: 16, G: 1, pad: 0 \| 319980.6 \| 181.598 \| 1762.027 Pull Request resolved: https://github.com/pytorch/pytorch/pull/99497 Approved by: https://github.com/jgong5, https://github.com/cpuhrsch	2023-09-25 01:31:26 +00:00
jjsjann123	0d3db1048a	remove nvfuser test in upstream pytorch (#109918 ) Removing nvfuser related tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/109918 Approved by: https://github.com/msaroufim	2023-09-24 13:49:37 +00:00
Jijie Wei	334ead04a9	Back out "[decomp] Fix baddbmm decomposition (#109714 )" (#109855 ) Summary: Original commit changeset: 95c462a380c9 Original Phabricator Diff: D49484954 This diff causes a test failure for the deterministic ne test; see: https://www.internalfb.com/sandcastle/job/18014399565419856/ Test Plan: buck2 test 'fbcode//mode/opt' fbcode//aps_models/ads/icvr/tests:icvr_fm_e2e_deterministic_ne_test -- --exact 'aps_models/ads/icvr/tests:icvr_fm_e2e_deterministic_ne_test - aps_models.ads.icvr.tests.icvr_fm_e2e_deterministic_ne_test.ICVR_FM_E2EDeterministicNeTest: test_e2e_deterministic_icvr_fm_pt2_fsdp_multi_gpus' https://www.internalfb.com/intern/testinfra/testrun/16888498605839953 Differential Revision: D49527271 Pull Request resolved: https://github.com/pytorch/pytorch/pull/109855 Approved by: https://github.com/yanboliang	2023-09-22 22:01:38 +00:00
Oguz Ulgen	1df14f1bf8	Move has_triton to top level triton utils so that dynamo can also access (#109832 ) it without creating cyclic dependencies Pull Request resolved: https://github.com/pytorch/pytorch/pull/109832 Approved by: https://github.com/zou3519	2023-09-22 19:33:41 +00:00
Brian Hirsh	46b0b7bff7	_return_and_correct_aliasing: fix for schemas with mutable tensor in kwargs (#109662 ) I missed a few tests the first time around - this fixes out= op handling for `_return_and_correct_aliasing`, which failed a few tests in the python functionalization <> AOTAutograd PR above. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109662 Approved by: https://github.com/ezyang ghstack dependencies: #108654	2023-09-22 07:09:04 +00:00
Peter Bell	36a8105f54	[decomp] Fix baddbmm decomposition (#109714 ) The decomposition is currently registered without the pw_cast_for_opmath decorator, due to the ordering of decorators being meaningful. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109714 Approved by: https://github.com/lezcano	2023-09-20 18:40:21 +00:00
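Why decorator order matters for registration can be shown with plain Python. In the sketch below, `REGISTRY`, `register` and `cast_result_to_int` are hypothetical stand-ins (they are not PyTorch's decomposition table or `pw_cast_for_opmath`); the only point is that decorators apply bottom-up, so a registering decorator placed closest to the function stores the undecorated version.

```python
import functools

REGISTRY = {}  # hypothetical stand-in for a decomposition table


def register(name):
    """Store whatever function object this decorator receives."""
    def decorator(fn):
        REGISTRY[name] = fn
        return fn
    return decorator


def cast_result_to_int(fn):
    """Hypothetical wrapper standing in for a type-casting decorator."""
    @functools.wraps(fn)
    def wrapper(*args):
        return int(fn(*args))
    return wrapper


# `register` is innermost, so it runs first and stores the raw function.
@cast_result_to_int
@register("wrong_order")
def half_a(x):
    return x / 2


# `register` is outermost, so it runs last and stores the wrapped function.
@register("right_order")
@cast_result_to_int
def half_b(x):
    return x / 2


print(REGISTRY["wrong_order"](5))  # 2.5, the cast was skipped
print(REGISTRY["right_order"](5))  # 2
```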
Salil Desai	40b2c796dc	[Decomposition] baddbmm (#108534 ) Summary: Moving decomposition of baddbmm from _inductor/decomposition.py and include it in core_aten_decompositions `ff38c0e2f9/torch/_inductor/decomposition.py (L203)` Test Plan: Phabricator + OSS Tests Differential Revision: D48871741 Pull Request resolved: https://github.com/pytorch/pytorch/pull/108534 Approved by: https://github.com/SherlockNoMad	2023-09-20 12:49:32 +00:00
rzou	122264a0c0	[generate_opcheck_tests] tests should ignore meta/FakeTensors (#109641 ) These tests generally don't work on meta tensors because they need to compare the data of the Tensors. For example, SchemaCheckMode errors out if any inputs are meta or Fake because it needs to check their storages to see if any mutation occurred and those do not have storages. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/109641 Approved by: https://github.com/bdhirsh, https://github.com/soulitzer ghstack dependencies: #109637, #109638, #109639, #109640	2023-09-20 06:33:37 +00:00
rzou	d3d71367b9	[generate_opcheck_tests] Always print a repro (#109640 ) On failure of a test, we will always print a "repro". This repro isn't really runnable but gives the user a sense of how to actually reproduce the test without the test suite, because using the test suite is a bit convoluted. If the user passes PYTORCH_OPCHECK_PRINT_BETTER_REPRO, we will print a fuller repro that saves the exact problematic test inputs to disk and reads them back out. Test Plan: - expecttests on the generate_repro helper function - tried this out locally. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109640 Approved by: https://github.com/bdhirsh, https://github.com/soulitzer ghstack dependencies: #109637, #109638, #109639	2023-09-20 06:33:37 +00:00
rzou	af900fe228	[generate_opcheck_tests] flip unified_diff order (#109639 ) It was reversed. As written this is a bit difficult to test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109639 Approved by: https://github.com/bdhirsh, https://github.com/soulitzer ghstack dependencies: #109637, #109638	2023-09-20 06:33:37 +00:00
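For reference, the argument order of the standard library's `difflib.unified_diff(a, b)` determines the signs: lines only in `a` are marked `-` and lines only in `b` are marked `+`. The sketch below uses made-up `expected` and `actual` lists; which of the two the PR intended as the first argument is not stated in the commit message.

```python
import difflib

expected = ["a", "b"]
actual = ["a", "c"]

# Lines present only in the first sequence get '-', only in the second get '+'.
diff = list(
    difflib.unified_diff(
        expected, actual, fromfile="expected", tofile="actual", lineterm=""
    )
)
print("\n".join(diff))
```

Swapping the first two arguments flips every `+` and `-`, which is the kind of reversal the commit describes.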
rzou	7564f04389	[generate_opcheck_tests] add type checking (#109638 ) Test Plan: - lintrunner Pull Request resolved: https://github.com/pytorch/pytorch/pull/109638 Approved by: https://github.com/bdhirsh, https://github.com/soulitzer ghstack dependencies: #109637	2023-09-20 06:33:37 +00:00
rzou	10d575911e	[generate_opcheck_tests] rename "success" to "xsuccess" (#109637 ) Not BC breaking because no existing failures dict have "success" in them. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/109637 Approved by: https://github.com/bdhirsh, https://github.com/soulitzer	2023-09-20 06:33:37 +00:00
FFFrog	70f2adaec3	Setup_context does not contain default values of forward() (#108561 ) Fixes #108529 As the title shows. Pull Request resolved: https://github.com/pytorch/pytorch/pull/108561 Approved by: https://github.com/soulitzer	2023-09-19 16:23:52 +00:00
CaoE	54c28c564f	add Half support for BatchNorm on CPU (#102070 ) Fixes #106543 ### Testing Single core: shape \| fp32 forward / ms \| fp16 forward / ms \| bf16 forward / ms \| fp32 backward / ms \| fp16 backward / ms \| bf16 backward / ms -- \| -- \| -- \| -- \| -- \| -- \| -- (1, 4, 256, 256) \| 0.7116 \| 0.1427 \| 0.1744 \| 0.2638 \| 0.2002 \| 0.2556 (1, 32, 100, 100) \| 0.8579 \| 0.1725 \| 0.2077 \| 0.3023 \| 0.2399 \| 0.2995 (32, 16, 200, 200) \| 57.3466 \| 12.2179 \| 13.1320 \| 45.9524 \| 24.1526 \| 24.9882 28 cores: shape \| fp32 forward / ms \| fp16 forward / ms \| bf16 forward / ms \| fp32 backward / ms \| fp16 backward / ms \| bf16 backward / ms -- \| -- \| -- \| -- \| -- \| -- \| -- (1, 4, 256, 256) \| 0.2571 \| 0.0713 \| 0.0846 \| 0.1140 \| 0.0883 \| 0.1043 (1, 32, 100, 100) \| 0.1077 \| 0.0510 \| 0.0548 \| 0.0700 \| 0.0645 \| 0.0713 (32, 16, 200, 200) \| 5.5060 \| 1.4195 \| 1.4663 \| 6.773 \| 3.0886 \| 3.1343 Pull Request resolved: https://github.com/pytorch/pytorch/pull/102070 Approved by: https://github.com/jgong5, https://github.com/mikaylagawarecki, https://github.com/mingfeima	2023-09-19 10:43:33 +00:00
leslie-fang-intel	4a60bd22b2	[Quant][Inductor] Enable quantization dynamic batch size support (#108550 ) Summary This Diff enables dynamic batch size support for quantization use case in Inductor. Take the UT in this PR as example, after this PR, the generated code will have assumption of dynamic input batch size. ``` cpp_fused_quantize_per_tensor_0 = async_compile.cpp(''' #include "/tmp/torchinductor_root/ib/cibrnuq56cxamjj4krp4zpjvsirbmlolpbnmomodzyd46huzhdw7.h" extern "C" void kernel(const float* in_ptr0, unsigned char* out_ptr0, const long ks0, const long ks1) { { #pragma GCC ivdep for(long i0=static_cast<long>(0L); i0<static_cast<long>(ks0); i0+=static_cast<long>(1L)) { #pragma GCC ivdep for(long i1=static_cast<long>(0L); i1<static_cast<long>(3L); i1+=static_cast<long>(1L)) { #pragma GCC ivdep for(long i2=static_cast<long>(0L); i2<static_cast<long>(static_cast<long>(ks1*ks1)); i2+=static_cast<long>(1L)) { auto tmp0 = in_ptr0[static_cast<long>(i2 + (i1*(static_cast<long>(ks1*ks1))) + (3L*i0*(static_cast<long>(ks1*ks1))))]; auto tmp1 = static_cast<float>(40.36037717834931); auto tmp2 = decltype(tmp0)(tmp0 * tmp1); auto tmp3 = std::nearbyint(tmp2); auto tmp4 = static_cast<float>(97.0); auto tmp5 = tmp3 + tmp4; auto tmp6 = static_cast<float>(0.0); auto tmp7 = max_propagate_nan(tmp5, tmp6); auto tmp8 = static_cast<float>(255.0); auto tmp9 = min_propagate_nan(tmp7, tmp8); auto tmp10 = static_cast<unsigned char>(tmp9); out_ptr0[static_cast<long>(i1 + (3L*i2) + (3L*i0*(static_cast<long>(ks1*ks1))))] = tmp10; } } } } } ''') cpp_fused_dequantize_per_tensor_mean_quantize_per_tensor_1 = async_compile.cpp(''' #include "/tmp/torchinductor_root/ib/cibrnuq56cxamjj4krp4zpjvsirbmlolpbnmomodzyd46huzhdw7.h" extern "C" void kernel(const unsigned char* in_ptr0, float* out_ptr0, unsigned char* out_ptr1, const long ks0, const long ks1) { { #pragma GCC ivdep for(long i0=static_cast<long>(0L); i0<static_cast<long>(ks0); i0+=static_cast<long>(1L)) { for(long i1=static_cast<long>(0L); i1<static_cast<long>(16L); i1+=static_cast<long>(16L)) { { #pragma omp declare reduction(+:at::vec::Vectorized<float>:omp_out = omp_out + omp_in) initializer(omp_priv={at::vec::Vectorized<float>(0)}) float tmp_acc0 = 0; at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(0); for(long i2=static_cast<long>(0L); i2<static_cast<long>(1L + (static_cast<long>((at::native::div_floor_integer(ks1, 2L))*(at::native::div_floor_integer(ks1, 2L)))) + (2L*(at::native::div_floor_integer(ks1, 2L)))); i2+=static_cast<long>(1L)) { auto tmp0 = at::vec::Vectorized<uint8_t>::loadu_one_fourth(in_ptr0 + static_cast<long>(i1 + (16L*i0) + (16L*i2) + (16L*i0*(static_cast<long>((at::native::div_floor_integer(ks1, 2L))*(at::native::div_floor_integer(ks1, 2L))))) + (32L*i0*(at::native::div_floor_integer(ks1, 2L))))); auto tmp1 = at::vec::convert_uint8_to_float(tmp0); auto tmp2 = at::vec::Vectorized<float>(static_cast<float>(0.0)); auto tmp3 = tmp1 - tmp2; auto tmp4 = at::vec::Vectorized<float>(static_cast<float>(0.010429476387798786)); auto tmp5 = tmp3 * tmp4; tmp_acc0_vec = tmp_acc0_vec + tmp5; } tmp_acc0_vec.store(out_ptr0 + static_cast<long>(i1 + (16L*i0))); } } } } { #pragma GCC ivdep for(long i0=static_cast<long>(0L); i0<static_cast<long>(16L*ks0); i0+=static_cast<long>(1L)) { auto tmp0 = out_ptr0[static_cast<long>(i0)]; auto tmp1 = static_cast<float>(1L + (static_cast<long>((at::native::div_floor_integer(ks1, 2L))*(at::native::div_floor_integer(ks1, 2L)))) + (2L*(at::native::div_floor_integer(ks1, 2L)))); auto tmp2 = tmp0 / tmp1; auto tmp3 = static_cast<float>(168.09128392896545); auto tmp4 = decltype(tmp2)(tmp2 * tmp3); auto tmp5 = std::nearbyint(tmp4); auto tmp6 = static_cast<float>(0.0); auto tmp7 = tmp5 + tmp6; auto tmp8 = max_propagate_nan(tmp7, tmp6); auto tmp9 = static_cast<float>(255.0); auto tmp10 = min_propagate_nan(tmp8, tmp9); auto tmp11 = static_cast<unsigned char>(tmp10); out_ptr1[static_cast<long>(i0)] = tmp11; } } } ''') cpp_fused_dequantize_per_tensor_2 = async_compile.cpp(''' #include "/tmp/torchinductor_root/ib/cibrnuq56cxamjj4krp4zpjvsirbmlolpbnmomodzyd46huzhdw7.h" extern "C" void kernel(const unsigned char* in_ptr0, float* out_ptr0, const long ks0) { { for(long i0=static_cast<long>(0L); i0<static_cast<long>(16L*ks0); i0+=static_cast<long>(16L)) { auto tmp0 = at::vec::Vectorized<uint8_t>::loadu_one_fourth(in_ptr0 + static_cast<long>(i0)); auto tmp1 = at::vec::convert_uint8_to_float(tmp0); auto tmp2 = at::vec::Vectorized<float>(static_cast<float>(100.0)); auto tmp3 = tmp1 - tmp2; auto tmp4 = at::vec::Vectorized<float>(static_cast<float>(0.0056716203689575195)); auto tmp5 = tmp3 * tmp4; tmp5.store(out_ptr0 + static_cast<long>(i0)); } } } ''') async_compile.wait(globals()) del async_compile def call(args): arg8_1, arg9_1, arg10_1 = args args.clear() s0 = arg8_1 s2 = arg9_1 assert_size_stride(arg10_1, (s0, 3, s2, s2), (3*(s2*s2), s2*s2, s2, 1)) buf0 = empty_strided((s0, 3, s2, s2), (3*(s2*s2), 1, 3*s2, 3), device='cpu', dtype=torch.uint8) cpp_fused_quantize_per_tensor_0(c_void_p(arg10_1.data_ptr()), c_void_p(buf0.data_ptr()), c_long(s0), c_long(s2)) del arg10_1 buf1 = torch.ops.onednn.qconv2d_pointwise(buf0, 0.024776775389909744, 97, constant5, constant2, constant3, constant0, [1, 1], [1, 1], [1, 1], 1, 95.88209060714476, 0, False, 'relu', [], '') assert_size_stride(buf1, (s0, 16, 1 + s2, 1 + s2), (16 + (16*(s2*s2)) + (32*s2), 1, 16 + (16*s2), 16)) del buf0 # Source Nodes: [quantize_per_tensor_default_2], Original ATen: [quantized_decomposed.quantize_per_tensor] buf2 = torch.ops.quantized.max_pool2d(buf1, [3, 3], [2, 2], [1, 1], [1, 1], False) del buf1 buf3 = buf2 assert_size_stride(buf3, (s0, 16, 1 + (s2 // 2), 1 + (s2 // 2)), (16 + (16*((s2 // 2)*(s2 // 2))) + (32*(s2 // 2)), 1, 16 + (16*(s2 // 2)), 16)) del buf2 buf4 = empty_strided((s0, 16, 1, 1), (16, 1, 16*s0, 16*s0), device='cpu', dtype=torch.float32) buf5 = empty_strided((s0, 16), (16, 1), device='cpu', dtype=torch.uint8) cpp_fused_dequantize_per_tensor_mean_quantize_per_tensor_1(c_void_p(buf3.data_ptr()), c_void_p(buf4.data_ptr()), c_void_p(buf5.data_ptr()), c_long(s0), c_long(s2)) del buf3 buf6 = torch.ops.onednn.qlinear_pointwise(buf5, 0.005949148442596197, 0, constant6, constant4, constant3, constant1, 176.31645543014483, 100, False, 'none', [], '') assert_size_stride(buf6, (s0, 16), (16, 1)) del buf5 buf7 = reinterpret_tensor(buf4, (s0, 16), (16, 1)); del buf4 # reuse cpp_fused_dequantize_per_tensor_2(c_void_p(buf6.data_ptr()), c_void_p(buf7.data_ptr()), c_long(s0)) return (buf7, ) ``` TestPlan ``` python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_maxpool2d_linear_dynamic ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/108550 Approved by: https://github.com/jgong5, https://github.com/jansel	2023-09-19 08:30:16 +00:00
Andres Lugo-Reyes	9863286abf	[ROCM] Enable bwd cross_entropy on ROCM now that eps tolerance update (#109384 ) Follow up to https://github.com/pytorch/pytorch/pull/109038 The fix in the PR above also fixes this test on rocm Pull Request resolved: https://github.com/pytorch/pytorch/pull/109384 Approved by: https://github.com/jeffdaily, https://github.com/albanD	2023-09-18 22:20:38 +00:00
Aaron Gokaslan	6d725e7d66	[BE]: enable ruff rules PLR1722 and PLW3301 (#109461 ) Enables two ruff rules derived from pylint: * PLR1722 replaces any exit() calls with sys.exit(). exit() is only designed to be used in REPL contexts, as it may not always be imported by default. This always uses the version in the sys module, which is better. * PLW3301 replaces nested min / max calls with simplified versions (i.e. `min(a, min(b, c))` => `min(a, b, c)`). The new version is more idiomatic and more efficient. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109461 Approved by: https://github.com/ezyang	2023-09-18 02:07:21 +00:00
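The two rewrites can be illustrated with a short sketch. The functions `smallest` and `main` are made up for the example and are not code from the PR.

```python
import sys


def smallest(a, b, c):
    # PLW3301: instead of min(a, min(b, c)), pass all values to a single call.
    return min(a, b, c)


def main(argv):
    if not argv:
        # PLR1722: prefer sys.exit() over the site-provided exit() helper.
        sys.exit(2)
    return smallest(*map(int, argv))


print(smallest(3, 1, 2))  # 1
```

The flattened call returns the same value as the nested one; the builtin `exit()` is installed by the `site` module, so it can be missing when Python is started with `-S`.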
PyTorch MergeBot	be9f73f031	Revert "Add meta and OpInfo for _embedding_bag_dense_backward (#109211 )" This reverts commit `fe14e43d14`. Reverted https://github.com/pytorch/pytorch/pull/109211 on behalf of https://github.com/clee2000 due to Sorry I think the test_ops.py::TestCommonCUDA::test_compare_cpu__embedding_bag_dense_backward_cuda_float32 is failing `492a93d185` https://github.com/pytorch/pytorch/actions/runs/6190707847/job/16808644559 not sure why this is run in slow when it looks to be a new test ([comment](https://github.com/pytorch/pytorch/pull/109211#issuecomment-1720235918))	2023-09-14 22:29:12 +00:00
Andres Lugo-Reyes	ea94344821	[ROCm] Enable Lerp tests for complex32 (#108100 ) Enables previously disabled "lerp" opinfo tests for chalf on ROCm Pull Request resolved: https://github.com/pytorch/pytorch/pull/108100 Approved by: https://github.com/pruthvistony, https://github.com/jithunnair-amd, https://github.com/kit1980	2023-09-14 21:21:29 +00:00
Masaki Kozuki	602413a0a0	Refactor `test_foreach.py` (#107869 ) ## Summary - Change the default of `supports_autograd` and `supports_forward_ad` of `ForeachFuncInfo` to `True` - Add `test_zero_size_tensor_inputs` to make sure that foreach functions can handle 0-size Tensor inputs - Add `test_parity` to check the consistency between outputs of foreach and for-loop of native function. - Add `test_autodiff` to check forward-mode and reverse-mode AD - Keep the corner cases that are not covered by the newly introduced methods rel: - #58833 Pull Request resolved: https://github.com/pytorch/pytorch/pull/107869 Approved by: https://github.com/janeyx99	2023-09-14 19:39:26 +00:00
Edward Z. Yang	fe14e43d14	Add meta and OpInfo for _embedding_bag_dense_backward (#109211 ) The sample inputs is a bit involved because there are a lot of shenanigans in the derivative formula. Check comments. This is exercised in vdd, internal test `buck2 run '@fbcode//mode/opt' fbcode//pytorch/benchmark/fb/test_gpu:run_test_gpu -- 'pytorch.benchmark.fb.test_gpu.test_gpu.TestBenchmarkFbGpu.test_train_blue_reels_vdd_v3_inductor_speedup'` Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/109211 Approved by: https://github.com/albanD, https://github.com/zou3519	2023-09-14 18:49:32 +00:00
PyTorch MergeBot	b226373d16	Revert "add Half support for BatchNorm on CPU (#102070 )" This reverts commit `b6a1d3fb97`. Reverted https://github.com/pytorch/pytorch/pull/102070 on behalf of https://github.com/clee2000 due to I'm very sorry but it looks like #106543 was not fixed, I still see it failing on main `b6a1d3fb97` https://github.com/pytorch/pytorch/actions/runs/6185704949/job/16793975677 ([comment](https://github.com/pytorch/pytorch/pull/102070#issuecomment-1719747065))	2023-09-14 16:13:34 +00:00
David Watson	9b3f5823f3	Added test for interpolate nearest exact (#108558 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/108558 Approved by: https://github.com/mikaylagawarecki	2023-09-14 15:17:33 +00:00
Andres Lugo-Reyes	111b9ef390	[ROCM] Enable test_fn_fwgrad_..._functional_binary_cross_entropy on ROCM (#109038 ) Fixes #98431 This change addresses a hardware assertion that was triggered on ROCm only tests. Description of the problem: Assertion triggered ``` Device-side assertion `target_val >= zero && target_val <= one' failed. ``` The issue in question is due to a GPU side assertion in `binary_cross_entropy_out_cuda` where a `target_val` gets passed to the kernel that does not fall between 0 and 1. The value in question that triggers the assertion is -0.000000000810. The origin of this negative value comes from one of the tensors generated for the test. In this tensor, one of the values (on ROCM) is 0.000000999190 which adheres to the restriction that it is between 0 and 1. However, this value is eventually passed as a single entry tensor to gradcheck.py::_compute_numerical_gradient ( https://github.com/pytorch/pytorch/blob/main/torch/autograd/gradcheck.py#L347) This function perturbs the tensor value in-place by subtracting `v` from it and then adding it back. The value of `v` comes from the default `eps` value defined here https://github.com/pytorch/pytorch/blob/main/torch/autograd/gradcheck.py#L2119 Currently pegged at `1e-6`. So what occurs is when an input is less than the default eps (like 0.000000999190 ), the perturbation calculation causes an entry in the tensor to flip to negative, i.e. 0.000000999190 - 1e-6 = -0.000000000810 (due to the subtraction here: https://github.com/pytorch/pytorch/blob/main/torch/autograd/gradcheck.py#L364) which then triggers the device side assertion in `binary_cross_entropy_out_cuda`. This PR loosens the EPS by an order of magnitude to get around the error. Since this issue has not been caught in the field in any meaningful way, I find this to be an adequate solution, though am happy to hear opposing viewpoints. Important to mention, while this error was only occurring on ROCm platforms, the issue described is also present in CUDA based environments. The difference being that CUDA doesn't seem to generate a tensor with any values less than `1e-6`. When injecting the small value on an Nvidia box, the same device side assertion was triggered. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109038 Approved by: https://github.com/jeffdaily, https://github.com/albanD	2023-09-14 15:17:29 +00:00
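The arithmetic in the commit message above can be checked directly in plain Python. The helper `stays_in_unit_interval` is a hypothetical name introduced for illustration; it only tests whether the two finite-difference probes `x - eps` and `x + eps` remain valid targets in [0, 1], and it is not part of `gradcheck`.

```python
def stays_in_unit_interval(x: float, eps: float) -> bool:
    """Check that both probes x - eps and x + eps lie in [0, 1]."""
    return 0.0 <= x - eps and x + eps <= 1.0


x = 0.000000999190  # input value quoted in the commit message
eps = 1e-6          # default perturbation quoted in the commit message

print(f"{x - eps:.12f}")                 # -0.000000000810
print(stays_in_unit_interval(x, eps))    # False: x - eps is negative
print(stays_in_unit_interval(0.5, eps))  # True: far from either boundary
```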
Edward Z. Yang	7f1f5afc91	Run only one pytest parametrization when generating optest (#108936 ) Richard, I'm curious to see what you think of this. I'm trying to use optest on the torchvision test suite, and after hacking up pytest support in https://github.com/pytorch/pytorch/pull/108929 I noticed that this was 5x'ing the test time... for no good reason. * torchvision nms tests before optests: 60 passed, 4 skipped, 1206 deselected in 11.47s * after optests: 300 passed, 20 skipped, 1206 deselected in 49.85s It's no good reason because torchvision has parametrized the tests to get a spread of various random generation, but for checking schema or fake tensor, we don't actually need to test for different values. This PR hacks up the codegen to replace pytest parametrize markers so that, instead of sampling many values, we sample only one value if you mark it with `opcheck_only_one`. There's a carveout for device parametrization, where we always run all those variants. With this PR: * reduced optests: 88 passed, 4 skipped, 1206 deselected in 13.89s Companion torchvision PR which uses this at https://github.com/pytorch/vision/pull/7961 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/108936 Approved by: https://github.com/zou3519	2023-09-14 14:54:57 +00:00
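The idea of collapsing a parametrization to a single sample while keeping every device variant can be sketched as follows. This is not the PR's codegen: `reduce_parametrization`, the `keep_all` argument and the dict-of-lists input are hypothetical simplifications of pytest parametrize markers.

```python
def reduce_parametrization(params, keep_all=("device",)):
    """Keep one value per parameter, except for names listed in `keep_all`."""
    return {
        name: list(values) if name in keep_all else list(values)[:1]
        for name, values in params.items()
    }


params = {"seed": [0, 1, 2, 3, 4], "device": ["cpu", "cuda"]}
print(reduce_parametrization(params))
# {'seed': [0], 'device': ['cpu', 'cuda']}
```

In this toy case the five seeds collapse to one while both devices are kept, so ten parametrized cases become two.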
CaoE	b6a1d3fb97	add Half support for BatchNorm on CPU (#102070 ) Fixes #106543 ### Testing Single core: shape \| fp32 forward / ms \| fp16 forward / ms \| bf16 forward / ms \| fp32 backward / ms \| fp16 backward / ms \| bf16 backward / ms -- \| -- \| -- \| -- \| -- \| -- \| -- (1, 4, 256, 256) \| 0.7116 \| 0.1427 \| 0.1744 \| 0.2638 \| 0.2002 \| 0.2556 (1, 32, 100, 100) \| 0.8579 \| 0.1725 \| 0.2077 \| 0.3023 \| 0.2399 \| 0.2995 (32, 16, 200, 200) \| 57.3466 \| 12.2179 \| 13.1320 \| 45.9524 \| 24.1526 \| 24.9882 28 cores: shape \| fp32 forward / ms \| fp16 forward / ms \| bf16 forward / ms \| fp32 backward / ms \| fp16 backward / ms \| bf16 backward / ms -- \| -- \| -- \| -- \| -- \| -- \| -- (1, 4, 256, 256) \| 0.2571 \| 0.0713 \| 0.0846 \| 0.1140 \| 0.0883 \| 0.1043 (1, 32, 100, 100) \| 0.1077 \| 0.0510 \| 0.0548 \| 0.0700 \| 0.0645 \| 0.0713 (32, 16, 200, 200) \| 5.5060 \| 1.4195 \| 1.4663 \| 6.773 \| 3.0886 \| 3.1343 Pull Request resolved: https://github.com/pytorch/pytorch/pull/102070 Approved by: https://github.com/jgong5, https://github.com/mikaylagawarecki	2023-09-14 12:23:59 +00:00
Jerry Zhang	c914ca7577	[quant][be] Add TestPT2ERepresentation test case (#108923 ) Summary: att Test Plan: python test/test_quantization.py TestPT2ERepresentation Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/108923 Approved by: https://github.com/andrewor14	2023-09-14 02:01:38 +00:00
Guilherme Leobas	dbddf1816a	Remove include_0d from sample_inputs_gather (#109125 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/109125 Approved by: https://github.com/lezcano ghstack dependencies: #108879, #108880, #109120	2023-09-13 23:13:09 +00:00
Guilherme Leobas	d046376c4f	Dispatch `numpy.take_along_axis` to `torch.take_along_dim` (#108880 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/108880 Approved by: https://github.com/lezcano ghstack dependencies: #108879	2023-09-13 23:13:09 +00:00
PyTorch MergeBot	04a765f95d	Revert "add Half support for BatchNorm on CPU (#102070 )" This reverts commit `6065e7a97c`. Reverted https://github.com/pytorch/pytorch/pull/102070 on behalf of https://github.com/clee2000 due to sorry it looks like this is causing an unexpected success for `test_jit_fuser_te.py::TestNNCOpInfoCPU::test_nnc_correctness_nn_functional_batch_norm_cpu_float16` `6065e7a97c` https://github.com/pytorch/pytorch/actions/runs/6178069462/job/16770849782 ([comment](https://github.com/pytorch/pytorch/pull/102070#issuecomment-1718402208))	2023-09-13 22:38:42 +00:00
CaoE	6065e7a97c	add Half support for BatchNorm on CPU (#102070 ) Fixes #106543 ### Testing Single core: shape \| fp32 forward / ms \| fp16 forward / ms \| bf16 forward / ms \| fp32 backward / ms \| fp16 backward / ms \| bf16 backward / ms -- \| -- \| -- \| -- \| -- \| -- \| -- (1, 4, 256, 256) \| 0.7116 \| 0.1427 \| 0.1744 \| 0.2638 \| 0.2002 \| 0.2556 (1, 32, 100, 100) \| 0.8579 \| 0.1725 \| 0.2077 \| 0.3023 \| 0.2399 \| 0.2995 (32, 16, 200, 200) \| 57.3466 \| 12.2179 \| 13.1320 \| 45.9524 \| 24.1526 \| 24.9882 28 cores: shape \| fp32 forward / ms \| fp16 forward / ms \| bf16 forward / ms \| fp32 backward / ms \| fp16 backward / ms \| bf16 backward / ms -- \| -- \| -- \| -- \| -- \| -- \| -- (1, 4, 256, 256) \| 0.2571 \| 0.0713 \| 0.0846 \| 0.1140 \| 0.0883 \| 0.1043 (1, 32, 100, 100) \| 0.1077 \| 0.0510 \| 0.0548 \| 0.0700 \| 0.0645 \| 0.0713 (32, 16, 200, 200) \| 5.5060 \| 1.4195 \| 1.4663 \| 6.773 \| 3.0886 \| 3.1343 Pull Request resolved: https://github.com/pytorch/pytorch/pull/102070 Approved by: https://github.com/jgong5, https://github.com/mikaylagawarecki	2023-09-13 17:30:16 +00:00
drisspg	ad90ab31f2	Flash Attention v2 (#105602 ) # Summary ## PR Dependencies I don't use ghstack :( this is a PR where it would have been helpful. That being said I am going to peel off some PRs to make reviewing this easier: - [x] Separate build flags for Flash and MemEff: #107985 ### Description This pull request updates the version of _scaled_dot_product_flash_attention from version 1 to version 2. The changes are based on the flash attention code originally authored by @tridao ### Changes Made The majority of the changes in this pull request involve: - Copying over the flash_attention sources. - Updating header files. - Removing padding and slicing code from within the flash_attention kernel and relocating it to the composite implicit region of the SDPA. This was needed to make the kernel functional and appease autograd. - Introducing a simple kernel generator to generate different instantiations of the forward and backward flash templates. - Adding conditional compilation (ifdef) to prevent building when nvcc is invoked with gencode < sm80. - Introducing a separate dependent option for mem_eff_attention, as flash_attention v2 lacks support for Windows and cannot be built for sm50 generation codes. - Modifying build.sh to reduce parallelization on sm86 runners and to lower the maximum parallelization on the manywheel builds. This adjustment was made to address out-of-memory issues during the compilation of FlashAttentionV2 sources. - Adding/Updating tests. ### Notes for Reviewers This is not a fun review, and I apologize in advance. Most of the files-changed are in the flash_attn/ folder. The only files of interest here IMO: - aten/src/ATen/native/transformers/cuda/flash_attn/flash_api.cpp - aten/src/ATen/native/transformers/cuda/flash_attn/kernels/generate_kernels.py ( this has been incorporated upstream to flash-attention github) There are a number of files all related to avoiding OOMs in CI/CD. These are typically shell scripts. 
### Follow up items - Include the updates from `e07aa036db` and `9e5e8bc91e` \| https://github.com/pytorch/pytorch/issues/108108 ### Work Items - [x] I don't think Windows will be supported for 3.1.0 - Need to update cmake - [x] Let multi_query/attention pass through and test \| UPDATE: I have the fast path implemented here: https://github.com/pytorch/pytorch/pull/106730 but since this will require changes to semantics of math to call repeat_interleave, I think this should be done as a followup. - [x] Had to drop cutlass back to 3.0.0 to get it to compile. Need to figure out how to upgrade to 3.1.0 and later. Spoke with Tri and he is going to be taking a look. Note: compiling with clang currently errors for the cute headers. - [x] Update tests to exercise the above codepath - [x] Still need to disable on seq_len % 128 != 0 for backward (Tri beat me to it: `a4f148b6ab`) - [x] Add determinism warning to BWD, Tri got to this one as well: 1c41d2b - [x] Update dispatcher to universally prefer FlashV2 - [x] Update tests to exercise new head_dims - [x] Move the head_dim padding from kernel to top level composite implicit function in order to make it purely functional - [x] Create template generator script - [x] Initial cmake support for building kernels/ folder - [x] Replay CudaGraph changes ### Results #### Forward only The TFlops reported here are on an A100 that is underclocked. ![flashv2_tflops_vs_seq_len](https://github.com/pytorch/pytorch/assets/32754868/152de46d-8fa6-42f0-9a9c-ef1eb7ae29e7) #### Forward+Backward Ran a sweep and for large compute bound sizes we do see a ~2x performance increase for forw+back. <img width="1684" alt="Screenshot 2023-07-20 at 3 47 47 PM" src="https://github.com/pytorch/pytorch/assets/32754868/fdd26e07-0077-4878-a417-f3a418b6fb3b"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/105602 Approved by: https://github.com/huydhn, https://github.com/cpuhrsch	2023-09-13 13:59:05 +00:00
Edward Z. Yang	55f956f1d2	optests improvements based on torchvision usage on nms (#108929 ) - Update cross-ref FakeMode test to use ShapeEnv. Dynamic ops can now return an unbacked SymInt. We always accept this as equal to whatever the real value was. - Relax test so it works on all classes, not just unittest.TestCase - Properly wrap the original method, so things like pytest.mark.parametrize are carried over - Support dynamic shapes by default for make_fx `tracing_mode="fake"` without symbolifying everything else Fixes https://github.com/pytorch/pytorch/issues/108927 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/108929 Approved by: https://github.com/zou3519	2023-09-13 13:26:15 +00:00
Richard Zou	bfa8429c6a	[optests] Changed failures_dict format to json; automatic update of failures_dict (#109110 ) We changed the failures_dict format from .py to json and added a way to automatically update the failures dict (the user can set PYTORCH_OPCHECK_ACCEPT=1 to do so), assuming the tests don't crash in the process. Some details: - We introduced a FailuresDict class that handles save/load and from which one can query a test status ("xfail", "skip", etc). - PYTORCH_OPCHECK_ACCEPT=1 does not override everything. In particular: it doesn't try to update the failures dict for a test marked as "skip", but it will update it for tests marked as "xfail" or "success". - PYTORCH_OPCHECK_ACCEPT=1 also does not override the "comment" field, unless it is flipping an "xfail" into "success". - I'll update the gdoc linked in the comments with how to actually use PYTORCH_OPCHECK_ACCEPT=1 internally (it's not trivial). Note that this isn't multithreading-safe; the current recommendation is to run the tests sequentially if the user wants to use PYTORCH_OPCHECK_ACCEPT=1. Differential Revision: D49167181 Pull Request resolved: https://github.com/pytorch/pytorch/pull/109110 Approved by: https://github.com/ezyang	2023-09-13 13:24:15 +00:00
Ying Zhang	a2d5f13310	[Inductor CUTLASS backend] Step 5: Gemm CUTLASS templates (#108015 ) This is the step 5 to add cutlass as an alternative inductor backend. Feature request: https://github.com/pytorch/pytorch/issues/106991. Pull Request resolved: https://github.com/pytorch/pytorch/pull/108015 Approved by: https://github.com/kadeng, https://github.com/jansel, https://github.com/aakhundov ghstack dependencies: #107802, #107847, #107901, #107931	2023-09-12 17:44:38 +00:00
Iris	b6f9d4dbc4	[DCP] Enable nD device_mesh resharding DTensor in DCP and add associated tests (#106230 ) This PR: 1. Drop assert for 1D DeviceMesh check to allow DTensor with nD DeviceMesh when creating write_item. 2. Add tests for both placement changes and mesh changes for both 1D and 2D scenarios. cc. @kumpera @wanchaol @fegin Pull Request resolved: https://github.com/pytorch/pytorch/pull/106230 Approved by: https://github.com/kumpera	2023-09-12 00:47:58 +00:00
Li-Huai (Allan) Lin	b2cba439b4	Introduce Tensor overload to linspace and logspace (#104889 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/104889 Approved by: https://github.com/zou3519 ghstack dependencies: #107958	2023-09-11 23:30:40 +00:00
Li-Huai (Allan) Lin	293d3b89d8	Add Opinfos for the Tensor overload of linspace/logspace (#107958 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/107958 Approved by: https://github.com/zou3519	2023-09-11 22:30:19 +00:00
Kiarash Jamali	fb288aa99b	Add Bfloat16 support to CrossKernel.cu (#108941 ) Fixes #108940 Pull Request resolved: https://github.com/pytorch/pytorch/pull/108941 Approved by: https://github.com/mikaylagawarecki	2023-09-11 19:05:01 +00:00
PyTorch MergeBot	e276d70451	Revert "Add Opinfos for the Tensor overload of linspace/logspace (#107958 )" This reverts commit `106e0a0ef1`. Reverted https://github.com/pytorch/pytorch/pull/107958 on behalf of https://github.com/clee2000 due to I think the newly added test test_mps.py::TestConsistencyCPU::test_output_match_logspace_tensor_overload_cpu_complex64 is broken, probably a landrace since the mergebase seems to be 21 days old `106e0a0ef1` https://github.com/pytorch/pytorch/actions/runs/6149523234/job/16685849126 ([comment](https://github.com/pytorch/pytorch/pull/107958#issuecomment-1714309905))	2023-09-11 17:38:04 +00:00
PyTorch MergeBot	a7f5abeade	Revert "Introduce Tensor overload to linspace and logspace (#104889 )" This reverts commit `57e5239321`. Reverted https://github.com/pytorch/pytorch/pull/104889 on behalf of https://github.com/clee2000 due to sorry have to revert this to revert https://github.com/pytorch/pytorch/pull/107958 ([comment](https://github.com/pytorch/pytorch/pull/104889#issuecomment-1714305768))	2023-09-11 17:33:48 +00:00
Li-Huai (Allan) Lin	57e5239321	Introduce Tensor overload to linspace and logspace (#104889 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/104889 Approved by: https://github.com/zou3519 ghstack dependencies: #107958	2023-09-11 15:29:39 +00:00

4093 Commits