pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 00:21:07 +01:00

Author	SHA1	Message	Date
PyTorch MergeBot	6581063583	Revert "Dynamo, FX, Inductor Progress Bars (#88384 )" This reverts commit `db0ce4acf3`. Reverted https://github.com/pytorch/pytorch/pull/88384 on behalf of https://github.com/malfet due to Broke test_public_bindings across the board	2022-12-09 16:32:25 +00:00
Mark Saroufim	db0ce4acf3	Dynamo, FX, Inductor Progress Bars (#88384 ) There are 3 progress bars each gated behind their own config, all off by default for now 1. Dynamo: Macro level config for dynamo, AOT, inductor 2. FX: Progress bar for each pass, with their names 3. Inductor Pull Request resolved: https://github.com/pytorch/pytorch/pull/88384 Approved by: https://github.com/wconstab, https://github.com/mlazos	2022-12-09 04:32:31 +00:00
PyTorch MergeBot	22a249e44e	Revert "[Inductor] More robust stride and offset extraction from index expressions (#90184 )" This reverts commit `71f27f7688`. Reverted https://github.com/pytorch/pytorch/pull/90184 on behalf of https://github.com/ngimel due to catastrophically regresses performance	2022-12-08 05:04:15 +00:00
Peter Bell	71f27f7688	[Inductor] More robust stride and offset extraction from index expressions (#90184 ) Currently the stride and offset are determined by substituting 1 and 0 for different indices, which will fail for any expression that doesn't match the expected stride calculation. Instead, this uses `sympy.match` and returns `None` for any variables used in non-standard index calculation, e.g. `torch.roll` which uses `ModularIndexing`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/90184 Approved by: https://github.com/jansel	2022-12-07 01:43:21 +00:00
XiaobingSuper	2597d5d722	TorchDynamo: always convert flexiblelayout to be FixedLayout when given a stride_order (#89904 ) For convolution, we always call require_stride_order to convert the input to the target stride order. If the original input's layout is FlexibleLayout, there is always a memory copy, because is_stride_order_storage_and_layout only checks the initial stride order. Since a FlexibleLayout is one whose layout can still be changed, I think that when the user gives a stride order we should always convert the FlexibleLayout to a FixedLayout using the given stride order. Given a CV use case in which the max_pooling output is used by two convolutions, there are two memory copies:
```
kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/77/c7773nj5pwikpmm2pwa62rcudlf7p3if7eyqb5k4sjsvewwje4le.h"
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       float* __restrict__ out_ptr0,
                       float* __restrict__ out_ptr1,
                       float* __restrict__ out_ptr2)
{
    #pragma GCC ivdep
    for(long i0=0; i0<128; i0+=1)
    {
        #pragma GCC ivdep
        for(long i1=0; i1<3; i1+=1)
        {
            #pragma GCC ivdep
            for(long i2=0; i2<3; i2+=1)
            {
                #pragma GCC ivdep
                for(long i3=0; i3<3; i3+=1)
                {
                    {
                        {
                            auto tmp0 = in_ptr0[i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp1 = in_ptr0[3 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp3 = in_ptr0[6 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp5 = in_ptr0[21 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp7 = in_ptr0[24 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp9 = in_ptr0[27 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp11 = in_ptr0[42 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp13 = in_ptr0[45 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp15 = in_ptr0[48 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp2 = (tmp0 != tmp0) ? tmp0 : std::max(tmp1, tmp0);
                            auto tmp4 = (tmp2 != tmp2) ? tmp2 : std::max(tmp3, tmp2);
                            auto tmp6 = (tmp4 != tmp4) ? tmp4 : std::max(tmp5, tmp4);
                            auto tmp8 = (tmp6 != tmp6) ? tmp6 : std::max(tmp7, tmp6);
                            auto tmp10 = (tmp8 != tmp8) ? tmp8 : std::max(tmp9, tmp8);
                            auto tmp12 = (tmp10 != tmp10) ? tmp10 : std::max(tmp11, tmp10);
                            auto tmp14 = (tmp12 != tmp12) ? tmp12 : std::max(tmp13, tmp12);
                            auto tmp16 = (tmp14 != tmp14) ? tmp14 : std::max(tmp15, tmp14);
                            out_ptr0[i3 + (3*i2) + (9*i1) + (27*i0)] = tmp16;
                        }
                    }
                }
            }
        }
    }
    #pragma GCC ivdep
    for(long i0=0; i0<128; i0+=1)
    {
        #pragma GCC ivdep
        for(long i1=0; i1<3; i1+=1)
        {
            #pragma GCC ivdep
            for(long i2=0; i2<9; i2+=1)
            {
                {
                    {
                        auto tmp0 = out_ptr0[i1 + (3*i2) + (27*i0)];
                        out_ptr1[i1 + (3*i2) + (27*i0)] = tmp0;
                        out_ptr2[i1 + (3*i2) + (27*i0)] = tmp0;
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, arg1_1, arg2_1, arg3_1, arg4_1 = args
    args.clear()
    buf0 = empty_strided((128, 3, 3, 3), (27, 1, 9, 3), device='cpu', dtype=torch.float32)
    buf2 = empty_strided((128, 3, 3, 3), (27, 1, 9, 3), device='cpu', dtype=torch.float32)
    buf4 = empty_strided((128, 3, 3, 3), (27, 1, 9, 3), device='cpu', dtype=torch.float32)
    kernel_cpp_0(c_void_p(arg4_1.data_ptr()), c_void_p(buf0.data_ptr()), c_void_p(buf2.data_ptr()), c_void_p(buf4.data_ptr()))
    del arg4_1
    del buf0
    buf3 = torch.ops.mkldnn._convolution_pointwise(buf2, arg0_1, arg1_1, (0, 0), (1, 1), (1, 1), 1, 'none', [], '')
    assert_size_stride(buf3, (128, 3, 3, 3), (27, 1, 9, 3))
    del arg0_1
    del arg1_1
    del buf2
    buf5 = torch.ops.mkldnn._convolution_pointwise(buf4, arg2_1, arg3_1, (0, 0), (1, 1), (1, 1), 1, 'none', [], '')
    assert_size_stride(buf5, (128, 3, 3, 3), (27, 1, 9, 3))
    del arg2_1
    del arg3_1
    return (buf3, buf5, )
```
After this PR, the generated code no longer contains the redundant memory copy:
```
kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/77/c7773nj5pwikpmm2pwa62rcudlf7p3if7eyqb5k4sjsvewwje4le.h"
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       float* __restrict__ out_ptr0)
{
    #pragma GCC ivdep
    for(long i0=0; i0<128; i0+=1)
    {
        #pragma GCC ivdep
        for(long i1=0; i1<3; i1+=1)
        {
            #pragma GCC ivdep
            for(long i2=0; i2<3; i2+=1)
            {
                #pragma GCC ivdep
                for(long i3=0; i3<3; i3+=1)
                {
                    {
                        {
                            auto tmp0 = in_ptr0[i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp1 = in_ptr0[3 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp3 = in_ptr0[6 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp5 = in_ptr0[21 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp7 = in_ptr0[24 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp9 = in_ptr0[27 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp11 = in_ptr0[42 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp13 = in_ptr0[45 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp15 = in_ptr0[48 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp2 = (tmp0 != tmp0) ? tmp0 : std::max(tmp1, tmp0);
                            auto tmp4 = (tmp2 != tmp2) ? tmp2 : std::max(tmp3, tmp2);
                            auto tmp6 = (tmp4 != tmp4) ? tmp4 : std::max(tmp5, tmp4);
                            auto tmp8 = (tmp6 != tmp6) ? tmp6 : std::max(tmp7, tmp6);
                            auto tmp10 = (tmp8 != tmp8) ? tmp8 : std::max(tmp9, tmp8);
                            auto tmp12 = (tmp10 != tmp10) ? tmp10 : std::max(tmp11, tmp10);
                            auto tmp14 = (tmp12 != tmp12) ? tmp12 : std::max(tmp13, tmp12);
                            auto tmp16 = (tmp14 != tmp14) ? tmp14 : std::max(tmp15, tmp14);
                            out_ptr0[i3 + (3*i2) + (9*i1) + (27*i0)] = tmp16;
                        }
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, arg1_1, arg2_1, arg3_1, arg4_1 = args
    args.clear()
    buf0 = empty_strided((128, 3, 3, 3), (27, 1, 9, 3), device='cpu', dtype=torch.float32)
    kernel_cpp_0(c_void_p(arg4_1.data_ptr()), c_void_p(buf0.data_ptr()))
    del arg4_1
    buf2 = torch.ops.mkldnn._convolution_pointwise(buf0, arg0_1, arg1_1, (0, 0), (1, 1), (1, 1), 1, 'none', [], '')
    assert_size_stride(buf2, (128, 3, 3, 3), (27, 1, 9, 3))
    del arg0_1
    del arg1_1
    buf3 = torch.ops.mkldnn._convolution_pointwise(buf0, arg2_1, arg3_1, (0, 0), (1, 1), (1, 1), 1, 'none', [], '')
    assert_size_stride(buf3, (128, 3, 3, 3), (27, 1, 9, 3))
    del arg2_1
    del arg3_1
    return (buf2, buf3, )
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89904 Approved by: https://github.com/jansel	2022-12-06 03:07:53 +00:00
Natalia Gimelshein	1ea20cdb33	workaround for indexing formulas with negative terms (#89933 ) Fixes https://github.com/pytorch/torchdynamo/issues/1928 For `ModularIndexing` we generate indexing code with `//` and `%` operators. When `ModularIndexing` base is negative (that can happen after valid simplifications), `//` in triton produces wrong results https://github.com/openai/triton/issues/619/. For `//` op coming from pytorch, we have codegen workarounds, but I'm reluctant to apply these workarounds to very common indexing computation patterns, both for code readability and perf considerations. Similarly, we replace `ModularIndexing` with `IndexingDiv` when we can prove that base is small, but those assumptions break when `ModularIndexing` base is negative (`ModularIndexing` is always positive, `IndexingDiv` isn't). Pull Request resolved: https://github.com/pytorch/pytorch/pull/89933 Approved by: https://github.com/jansel	2022-12-05 19:12:29 +00:00
Jean Schmidt	f62e54df8f	Reland "Dynamo, FX, Inductor Progress Bars (#88384 )" … (#90055 ) This commit had inconsistent internal land and pr merged. This caused merge conflicts that required revert in both places, normalize the internal commit stack, and then re-land properly. Original commit: #88384 (`011452a2a1`) Inconsistent revert: #90018 (8566aa7c0b4bdca50bf85ca14705b4304de030b3) Revert of the inconsistent revert to restore healthy state (or re-land of the original commit): `cf3c3f2280` Landing the correct, internally congruent revert of the original commit: (This PR) #90055 (TBD) Pull Request resolved: https://github.com/pytorch/pytorch/pull/90055 Approved by: https://github.com/DanilBaibak, https://github.com/malfet	2022-12-02 13:28:00 +00:00
PyTorch MergeBot	cf3c3f2280	Revert "Revert "Dynamo, FX, Inductor Progress Bars (#88384 )" (#90018 )" This reverts commit `bcf4292f04`. Reverted https://github.com/pytorch/pytorch/pull/90018 on behalf of https://github.com/jeanschmidt due to landed internal commit does not match with this one, causing merge conflict and preventing import and land new commits	2022-12-02 09:57:31 +00:00
Eli Uriegas	bcf4292f04	Revert "Dynamo, FX, Inductor Progress Bars (#88384 )" (#90018 ) This breaks in environments that use the fake tqdm `015b05af18/torch/hub.py (L26)` which doesn't support the 'desc' kwarg and is not iterable Original try using pytorchbot did not go through because of a merge conflict: https://github.com/pytorch/pytorch/pull/88384#issuecomment-1334272489 This reverts commit `011452a2a1`. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/90018 Approved by: https://github.com/drisspg, https://github.com/dbort	2022-12-01 20:17:07 +00:00
Animesh Jain	68805b08d1	[benchmarks][dynamo] Trying CI - Set train() for TIMM models accuracy tests (#89780 ) Moving to train mode for TIMM models and also raising batch size for accuracy testing. Raising batch size seems to remove a lot of noise/instability coming from batch_norm decomposition. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89780 Approved by: https://github.com/ngimel	2022-11-30 12:57:35 +00:00
Mark Saroufim	011452a2a1	Dynamo, FX, Inductor Progress Bars (#88384 ) There are 3 progress bars each gated behind their own config, all off by default for now 1. Dynamo: Macro level config for dynamo, AOT, inductor 2. FX: Progress bar for each pass, with their names 3. Inductor Pull Request resolved: https://github.com/pytorch/pytorch/pull/88384 Approved by: https://github.com/wconstab, https://github.com/mlazos	2022-11-30 06:07:14 +00:00
Elias Ellison	1a33b7cbfa	Make fake tensors preserve dense strides in type conversion (#89803 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89803 Approved by: https://github.com/ngimel	2022-11-30 01:28:51 +00:00
XiaobingSuper	0c4f3db7bf	TorchDynamo: weight prepack for mkl linear (#89109 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89109 Approved by: https://github.com/jgong5, https://github.com/jansel	2022-11-25 01:20:19 +00:00
XiaobingSuper	07151a6bd6	TorchDynamo: weight prepack for onednn convolution external call (#88988 ) This PR enables weight prepacking using the MKLDNN tensor: 1. enable fake tensor mode for MKLDNN tensor input. 2. make the convolution fusion kernel support MKLDNN tensor input. 3. do the weight prepack at the FX fusion step. For better performance, we always use channels_last for the CPU convolution path, because our tests show that the channels_last path performs better than the block input path and also avoids the activation's layout conversions (plain to block, block to plain); currently only a plain-to-plain format conversion is needed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88988 Approved by: https://github.com/jgong5, https://github.com/jansel	2022-11-25 01:16:11 +00:00
Elias Ellison	72110d7833	Fix Upsample Decomp Striding For Small Channels (#89528 ) Fix for https://github.com/pytorch/torchdynamo/issues/623. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89528 Approved by: https://github.com/ngimel, https://github.com/anijain2305	2022-11-23 20:47:39 +00:00
Natalia Gimelshein	a188f05e8c	Reland #89031 Added conv constraint that infers layouts (#89530 ) Relands #89031 Per title. We now set strides from fx graph only for convolutions and mm, which is a hack, but bmm in some cases caused extra copy, and there is no obvious way to fix that, we should rethink the strides anyway. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89530 Approved by: https://github.com/Chillee	2022-11-23 20:18:54 +00:00
Brian Hirsh	57353c9608	first draft of input mutation handling for aot autograd (#88817 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88817 Approved by: https://github.com/ezyang, https://github.com/wconstab	2022-11-23 19:20:11 +00:00
Animesh Jain	120d200620	Revert "Added conv constraint that infers layouts (#89031 )" (#89451 ) This reverts commit `716f70f19a`. Fixes performance regression and compilation latency increase. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89451 Approved by: https://github.com/soumith, https://github.com/jansel	2022-11-22 02:20:50 +00:00
Horace He	716f70f19a	Added conv constraint that infers layouts (#89031 ) The core problem that we often have with contiguous/channels-last layouts and convolutions is that Inductor often doesn't do a great job of "preserving" the eager-mode layouts. So, for example, we'll often have something like
```
a: channels-last
b = foo(a)
c = convolution(a)
```
In eager-mode, `a` would stay channels-last, and we would avoid two transpose copies (one into NHWC and one back into NCHW) within the convolution kernel. However, Inductor currently sometimes loses the "correct" layout of `b` (not in this simple example, but others). Then, not only will we do a transpose within `foo`, but we'll then immediately transpose it back to do the convolution (and then again once the convolution is done). This is particularly egregious in `convnext_base`, where there's a lot of mixing of non-channels last tensors and channels-last tensors. The solution in this PR is to constrain the inputs to `aten.convolution`/`aten.convolution_backward` to match the layouts from eager-mode. This ensures that we'll never do extra transposes within `aten.convolution`, which are particularly bad (since Inductor can't fuse them). Pull Request resolved: https://github.com/pytorch/pytorch/pull/89031 Approved by: https://github.com/ngimel, https://github.com/jansel	2022-11-17 01:52:35 +00:00
Elias Ellison	73d71ae3d6	[WIP] Unwrap View in Reinterpret View (#89016 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/89016 Approved by: https://github.com/ngimel	2022-11-15 04:40:13 +00:00
XiaobingSuper	7a37bbed15	Take input striding for conv fusion op based on eager output (#88864 ) As https://github.com/pytorch/pytorch/pull/88706, we also change the input stride check using eager output. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88864 Approved by: https://github.com/jgong5, https://github.com/jansel	2022-11-15 00:55:07 +00:00
XiaobingSuper	15ef0660c5	Fake Tensor For (ConvFusion) Propagation (#88414 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88414 Approved by: https://github.com/jgong5, https://github.com/jansel	2022-11-14 12:35:09 +00:00
XiaobingSuper	4ad7b17fab	TorchDynamo: Add convolution binary(inplace) fusion for cpu in inference mode (#88403 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88403 Approved by: https://github.com/jgong5, https://github.com/jansel	2022-11-14 08:42:40 +00:00
Elias Ellison	8ff2e34ca6	Take input striding for conv forward based on eager output (#88706 ) From discussion with @Chillee and @ngimel we'll likely need further fixes to ensure that we hit channels last kernels but this is still worth landing in its own right. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88706 Approved by: https://github.com/ngimel	2022-11-11 17:29:15 +00:00
Nikolay Korovaiko	c961e45ee5	handle zero dims in reductions (#88280 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88280 Approved by: https://github.com/ngimel	2022-11-11 01:13:57 +00:00
Michael Lazos	c1553880de	Have kernel names include fused ops (#88624 ) - Propagates origin fx nodes through inlining during lowering - Concatenates op names into kernel name - Adds config to cap the number of ops in the kernel name so they don't get too long Caveats: - The ordering in the name may not match the order that the ops are executed in the kernel Pull Request resolved: https://github.com/pytorch/pytorch/pull/88624 Approved by: https://github.com/anijain2305, https://github.com/jansel	2022-11-10 21:38:06 +00:00
XiaobingSuper	3e43ff2794	torchdynamo: add convolution add(relu) inplace fusion kernel (#88048 ) This PR adds a convolution add(relu) inplace fusion kernel, which works for other.add_(conv). Pull Request resolved: https://github.com/pytorch/pytorch/pull/88048 Approved by: https://github.com/jgong5, https://github.com/jansel	2022-11-10 13:54:37 +00:00
Elias Ellison	2381548071	add stride constraints to fallbacks (#88534 ) Add stride/contiguity constraints to fallbacks so that inputs will be in the right stride permutation for the fallback kernel. Improves perf of coat_lite_mini from 1.48415536054865 -> 2.010956856330101. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88534 Approved by: https://github.com/ngimel	2022-11-10 01:13:44 +00:00
Bin Bao	f11f0e4a03	[inductor] Handle nested tuple/list output in fallback kernel (#88495 ) Summary: Currently fallback kernel in inductor assumes its output is either a tensor or a tuple/list of tensors. This PR makes it handle more generic output data structure. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88495 Approved by: https://github.com/jansel	2022-11-09 15:50:45 +00:00
Bin Bao	955cbe610b	[inductor] Handle the case where kwargs contains tensor (#88417 ) Summary: Fix https://github.com/pytorch/torchdynamo/issues/1805; currently inductor does not allow any tensor in kwargs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88417 Approved by: https://github.com/ngimel	2022-11-04 20:29:03 +00:00
XiaobingSuper	71f793d312	TorchDynamo: Add linear binary fusion for cpu in BF16 inference mode (#87066 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87066 Approved by: https://github.com/jgong5, https://github.com/jansel	2022-11-04 02:40:29 +00:00
Elias Ellison	7d95b1e344	Run all fallback kernels with FakeTensor (#88248 ) This improves the memory compression of resnet18 from .84 -> .94 on inductor no-cudagraphs. It does mean that any extern kernel which incorrectly computes strides will be a hard error at runtime, but that's an issue we are going to have to face with dynamic shapes anyway. CC @ezyang, @SherlockNoMad Pull Request resolved: https://github.com/pytorch/pytorch/pull/88248 Approved by: https://github.com/ezyang	2022-11-04 02:06:38 +00:00
XiaobingSuper	e4efea4f14	TorchDynamo: Add linear unary fusion for cpu in BF16 inference mode (#87065 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87065 Approved by: https://github.com/jgong5, https://github.com/jansel	2022-11-04 01:26:08 +00:00
XiaobingSuper	52173188ef	TorchDynamo: Add convolution binary fusion for cpu in inference mode (#87064 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87064 Approved by: https://github.com/jgong5, https://github.com/jansel	2022-11-04 01:10:05 +00:00
Natalia Gimelshein	b4fcfe77b2	reduce the number of autotuning iterations, don't autotune simple til… (#88386 ) …ed copies Partially fixes https://github.com/pytorch/torchdynamo/issues/1807, reduces compile time for me from 360 s to 90s. Kernels with multiple outputs sometimes autotune to unexpected configs, so I'm limiting the heuristic to relatively safe application. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88386 Approved by: https://github.com/jansel	2022-11-03 15:58:18 +00:00
PyTorch MergeBot	a8561c4571	Revert "[inductor] Handle the case where kwargs contains tensor (#88215 )" This reverts commit `983c0e7f31`. Reverted https://github.com/pytorch/pytorch/pull/88215 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but I think it breaks trunk https://github.com/pytorch/pytorch/actions/runs/3380662072/jobs/5613987333 with a failure in test_torchinductor_opinfo.py	2022-11-02 23:33:15 +00:00
Bin Bao	983c0e7f31	[inductor] Handle the case where kwargs contains tensor (#88215 ) Summary: Fix https://github.com/pytorch/torchdynamo/issues/1805; currently inductor does not allow any tensor in kwargs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88215 Approved by: https://github.com/ngimel	2022-11-02 19:50:16 +00:00
Elias Ellison	e6ea0a4a4b	Don't Require contiguous For Extern Kernels (#87650 ) cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @jansel @lezcano @fdrocha Pull Request resolved: https://github.com/pytorch/pytorch/pull/87650 Approved by: https://github.com/desertfire	2022-11-01 20:20:42 +00:00
Elias Ellison	9835413009	Fake Tensor For (Conv) Propagation (#87641 ) Resubmitting https://github.com/pytorch/pytorch/pull/87302 so it can be ghstack'd with the pr below. Incorrect strides in any meta impl would lead to runtime assertion errors for fallback kernels, so start by just enabling it for conv. Replaces https://github.com/pytorch/pytorch/pull/87588. cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87641 Approved by: https://github.com/jansel	2022-10-29 04:14:01 +00:00
XiaobingSuper	c36db82e12	TorchDynamo: Add convolution unary fusion for cpu in inference mode (#87063 ) cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang Pull Request resolved: https://github.com/pytorch/pytorch/pull/87063 Approved by: https://github.com/jgong5, https://github.com/jansel	2022-10-27 06:55:32 +00:00
Natalia Gimelshein	59aacc40ca	Couple fixes for argmax/argmin (#87758 ) Removes a wrong assert, makes min number of warps = 2 (1 for some reason generates invalid code, https://github.com/openai/triton/issues/802). Hopefully fixes https://github.com/pytorch/torchdynamo/issues/1743, cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @mreso Pull Request resolved: https://github.com/pytorch/pytorch/pull/87758 Approved by: https://github.com/Chillee, https://github.com/soumith	2022-10-26 06:33:43 +00:00
Animesh Jain	ebe5aad466	[inductor] Revert channels-last support (#87588 ) We witnessed slow compilation times last week. Earlier, I thought it was due to parallel compilation. But, after git bisect, I found the source of the extra time to be my PR - https://github.com/pytorch/pytorch/pull/87049 For 1x1 kernels, the current striding check incorrectly declares channels-first 1x1 convs to be channels-last. I am not sure why it caused such a jump in compilation time, or why it did not fail. There was no change in performance speedup. cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang @penguinwu to identify what could be the source of this compilation time increase, so that we can manually check that part of the stack. With this, `res2next50` compilation time went back to 96 seconds (it had risen to 900 seconds with my earlier PR) for a single thread, and parallel compilation brings it down to ~30 seconds. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87588 Approved by: https://github.com/soumith, https://github.com/jansel, https://github.com/ngimel	2022-10-25 19:58:25 +00:00
Animesh Jain	c4fecff97d	[inductor] Prevent aggressive fusion during inductor lowering (#87447 ) Fixes https://github.com/pytorch/torchdynamo/issues/1599 Inductor performs aggressive fusion of ops during the lowering of the FX graph into IR nodes. Note that this fusion is different from the fusion that we typically discuss in the context of Inductor, which refers to the fusion of SchedulerNodes (way after lowering). This PR, instead, ensures that we don't accumulate too many ops in the IR node to begin with. In the case of the hf_t5_large backward graph, earlier we would generate a kernel with 100s of operators, causing that kernel to take ~350 seconds of compilation time. With this PR, we get it down from 350 seconds to 50 seconds. Note that this could affect performance. I doubt that it will lead to a really large dip, though. In my toy examples, even if the lowering creates multiple IR nodes, if it's a simple fusion, later fusion still creates one node. I would like (1) test_torchinductor.py, (2) test_torchinductor_info.py, and (3) at least HF models to be enabled in CI before merging this one. @ngimel @jansel @Chillee cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang @penguinwu Pull Request resolved: https://github.com/pytorch/pytorch/pull/87447 Approved by: https://github.com/jansel	2022-10-24 21:53:17 +00:00
Soumith Chintala	7caeac1718	[inductor] Fix channels_last conv2d propagation when CuDNN is not found (#87266 ) Fixes https://github.com/pytorch/torchdynamo/issues/1701 cc @jansel @lezcano @fdrocha @mlazos @voznesenskym @yanboliang Pull Request resolved: https://github.com/pytorch/pytorch/pull/87266 Approved by: https://github.com/anijain2305, https://github.com/jansel, https://github.com/voznesenskym	2022-10-21 06:36:16 +00:00
Horace He	68e946b0c3	Fixed tune_layout to not do anything for non-2d convolutions (#87328 ) cc @jansel @lezcano @fdrocha Pull Request resolved: https://github.com/pytorch/pytorch/pull/87328 Approved by: https://github.com/ngimel	2022-10-20 18:02:51 +00:00
Horace He	2418ddb1ec	Unified symbolic shape variables between Inductor and AOTDispatcher (#87161 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87161 Approved by: https://github.com/jansel	2022-10-19 04:50:34 +00:00
Zachary DeVito	d36c284d14	[triton] allow cuda properties to be queried from workers (#87101 ) Fixes https://github.com/pytorch/pytorch/pull/87048 by saving the needed properties before fork. Actually attempting to get CUDA to load in the workers is probably not desired: cuda initialization takes O(seconds). Having multiple processes using the same device will slow things down. This just moves the needed properties from the main trainer process to the workers. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87101 Approved by: https://github.com/soumith	2022-10-18 04:48:29 +00:00
Animesh Jain	2b558138cf	[inductor] Set correct strides in fallback example run (#87049 ) Fixes #ISSUE_NUMBER Helps in resolving many issues seen in https://github.com/pytorch/torchdynamo/issues/1675 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87049 Approved by: https://github.com/jansel	2022-10-17 15:43:53 +00:00
Jason Ansel	30f6f6903c	[inductor] Move size asserts to C++, fix bug (#87028 ) Inductor internally models any `size=1` dimension as having `stride=0` to simplify indexing formulas (sympy will remove these terms from the expression). This caused a bug in our generated stride assert in detectron2_maskrcnn_r_50_fpn, where we asserted the wrong stride of a size==1 dimension. This fixes that bug, and moves size/stride assert logic to C++ which should be a small perf gain. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87028 Approved by: https://github.com/anijain2305	2022-10-16 20:17:22 +00:00
Jason Ansel	c7c09722ad	Move TorchDynamo into PyTorch core (#86461 ) Context: https://github.com/pytorch/torchdynamo/issues/1588 This PR moves [TorchDynamo](https://github.com/pytorch/torchdynamo) and TorchInductor into PyTorch core. - `torchdynamo` becomes `torch._dynamo` - `torchinductor` becomes `torch._inductor` This PR was generated by running `copy_to_core.sh` in https://github.com/pytorch/torchdynamo/pull/1538 Pull Request resolved: https://github.com/pytorch/pytorch/pull/86461 Approved by: https://github.com/voznesenskym	2022-10-13 23:18:06 +00:00
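
The `(27, 1, 9, 3)` strides passed to `empty_strided` for the `(128, 3, 3, 3)` buffers in commit 2597d5d722 are channels-last strides, the layout that several of the commits above try to preserve. As a minimal sketch (pure Python, no PyTorch required; the helper names `contiguous_strides` and `channels_last_strides` are made up for illustration and are not Inductor functions), the two layouts can be computed like this:

```python
def contiguous_strides(shape):
    """Strides (in elements) of a row-major, NCHW-contiguous tensor."""
    strides = []
    running = 1
    # Walk the dimensions from innermost to outermost.
    for size in reversed(shape):
        strides.append(running)
        running *= size
    return tuple(reversed(strides))


def channels_last_strides(shape):
    """Strides of a 4-D tensor with logical shape (N, C, H, W) stored as NHWC."""
    n, c, h, w = shape
    # In memory the channel index moves fastest, then W, then H, then N.
    return (h * w * c, 1, w * c, c)


shape = (128, 3, 3, 3)
print("contiguous   :", contiguous_strides(shape))
print("channels-last:", channels_last_strides(shape))
```

For the shape used in that commit, the channels-last result matches the stride tuple that appears in the generated `call` function.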

50 Commits
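
Commit 1ea20cdb33 notes that `//` applied to a negative `ModularIndexing` base gives wrong results in Triton. The commit message does not spell out the mechanism; assuming it is the usual difference between floor division (as in Python) and truncating division (as in C), the following sketch shows where the two disagree. The helper names are hypothetical and only serve this illustration:

```python
def floor_div(a: int, b: int) -> int:
    """Python semantics: round the quotient toward negative infinity."""
    return a // b


def trunc_div(a: int, b: int) -> int:
    """C-style semantics: round the quotient toward zero."""
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q


def trunc_mod(a: int, b: int) -> int:
    """Remainder paired with truncating division; it takes the sign of a."""
    return a - b * trunc_div(a, b)


for base in (7, -7):
    print(
        f"base={base:>2}: floor_div={floor_div(base, 2)}, "
        f"trunc_div={trunc_div(base, 2)}, "
        f"python_mod={base % 2}, trunc_mod={trunc_mod(base, 2)}"
    )
```

For non-negative operands the two conventions agree, which is consistent with the commit's remark that the problem only shows up once a simplification makes the base negative.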