Summary:
Makes two changes in NNC for intermediate buffer allocations:
1. Flattens dimensions of buffers allocated in LoopNest::prepareForCodegen() to match their flattened usages.
2. Adds support for tracking memory dependencies of Alloc/Free to the MemDependencyChecker, which will allow us to check safety of accesses to intermediate buffers (coming in a future diff).
I didn't add any new tests as the mem dependency checker tests already cover it pretty well, particularly the GEMM test.
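As a purely illustrative sketch (the buffer name and sizes here are made up, not output from this diff), the flattening change looks roughly like this:
```
// Before: the intermediate keeps its N-d shape even though its accesses are flattened.
Allocate(temp, float, {4, 8});
temp[8 * i + j] = ...;
Free(temp);
// After: the allocation is flattened to match its flattened usages.
Allocate(temp, float, {32});
temp[8 * i + j] = ...;
Free(temp);
```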
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49554
Reviewed By: VitalyFedyunin
Differential Revision: D25643133
Pulled By: nickgg
fbshipit-source-id: 66be3054eb36f0a4279d0c36562e63aa2dae371c
Summary:
In https://github.com/pytorch/pytorch/pull/48967/ we enabled output buffer inlining, which results in duplicate computation if one output depends on another. This was done to fix correctness for CUDA, but it is not needed for correctness on CPU, where it only causes a perf slowdown.
Output buffer inlining is intended as an interim solution for CUDA because it does not work with reductions.
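As a purely illustrative sketch (buffer names made up), the duplication looks like this: with `out1` and `out2` both graph outputs, inlining `out1` into `out2` recomputes its body:
```
out1[i] = expensive(x[i]);
out2[i] = out1[i] + 1;
```
becomes:
```
out1[i] = expensive(x[i]);
out2[i] = expensive(x[i]) + 1;
```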
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49488
Reviewed By: ezyang
Differential Revision: D25596071
Pulled By: eellison
fbshipit-source-id: bc3d987645da5ce3c603b4abac3586b169656cfd
Summary:
Adds the CompoundTensor, a specialisation of the NNC Tensor which allows arbitrary production statements. This lets us lower aten ops into specific NNC IR patterns (which don't need to be functional), shortcutting to the optimized form of common patterns.
This is part 1 of trying to clean up the lowering of aten::cat so it is easier to optimize.
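For context, here is an illustrative (not taken from this diff) non-functional production statement for a concatenation of two inputs, the kind of pattern a CompoundTensor makes expressible directly:
```
for (int i = 0; i < 10; i++) {
  cat_out[i] = a[i];
}
for (int i = 0; i < 5; i++) {
  cat_out[i + 10] = b[i];
}
```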
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48750
Reviewed By: tugsbayasgalan
Differential Revision: D25433517
Pulled By: nickgg
fbshipit-source-id: de13c4719f8f87619ab254e5f324f13b5be1c9da
Summary:
Adds a new optimization method to LoopNest which eliminates stores that do not contribute to any output. It's unlikely any of the lowerings of aten operators produce these stores yet, but this creates some wiggle room for transformations in the future.
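An illustrative before/after (buffer names invented here), assuming `temp` is not an output and is never read:
```
for (int i = 0; i < 10; i++) {
  temp[i] = a[i] * 2;
  out[i] = a[i] + 1;
}
```
becomes:
```
for (int i = 0; i < 10; i++) {
  out[i] = a[i] + 1;
}
```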
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49030
Reviewed By: tugsbayasgalan
Differential Revision: D25434538
Pulled By: nickgg
fbshipit-source-id: fa1ead82e6f7440cc783c6116b23d0b7a5b5db4b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48160
We no longer use the custom C++ test infra anyway, so move to pure
gtest.
Fixes #45703
ghstack-source-id: 116977283
Test Plan: `buck test //caffe2/test/cpp/tensorexpr`
Reviewed By: navahgar, nickgg
Differential Revision: D25046618
fbshipit-source-id: da34183d87465f410379048148c28e1623618553
Summary:
Adds a new transform to the NNC compiler that supports buffer access caching. All accesses within a provided scope are redirected to a cache, which is initialized or written back as necessary at the boundaries of that scope. For TVM fans, this is essentially a combination of cache_reads and cache_writes. E.g. it can do this kind of thing:
Before:
```
for (int i = 0; i < 64; i++) {
  for (int j = 0; j < 64; j++) {
    A[i, j] = i * j;
  }
}
for (int i_1 = 0; i_1 < 20; i_1++) {
  for (int j_1 = 0; j_1 < 10; j_1++) {
    B[i_1, j_1] = (A(i_1 + 30, j_1 + 40)) + (A(i_1 + 31, j_1 + 41));
  }
}
```
After `cacheAccesses(A->buf(), "A_local", j_loop);`
```
for (int i = 0; i < 64; i++) {
  for (int j = 0; j < 64; j++) {
    A[i, j] = i * j;
  }
}
for (int i_1 = 0; i_1 < 20; i_1++) {
  for (int i_2 = 0; i_2 < 2; i_2++) {
    for (int j_1 = 0; j_1 < 11; j_1++) {
      A_local[i_2, j_1] = A[(i_2 + i_1) + 30, j_1 + 40];
    }
  }
  for (int j_2 = 0; j_2 < 10; j_2++) {
    B[i_1, j_2] = (A_local[1, j_2 + 1]) + (A_local[0, j_2]);
  }
}
```
Or this reduction:
```
for (int l1 = 0; l1 < 4; l1++) {
  sum[l1] = 0.f;
  for (int n1_1 = 0; n1_1 < 3; n1_1++) {
    for (int m1_1 = 0; m1_1 < 2; m1_1++) {
      sum[l1] = (sum[l1]) + (scale[(6 * l1 + 2 * n1_1) + m1_1]);
    }
  }
}
```
After `l.cacheAccesses(d->buf(), "d_local", n_loop);`:
```
for (int l1 = 0; l1 < 4; l1++) {
  Allocate(d_local, float, {1});
  sum[l1] = 0.f;
  d_local[0] = 0.f;
  for (int n1_1 = 0; n1_1 < 3; n1_1++) {
    for (int m1_1 = 0; m1_1 < 2; m1_1++) {
      d_local[0] = (d_local[0]) + (scale[(6 * l1 + 2 * n1_1) + m1_1]);
    }
  }
  sum[l1] = (sum[l1]) + (d_local[0]);
  Free(d_local);
}
```
I had originally planned to write `cacheReads` and `cacheWrites` wrappers so we could use them just like their TVM cousins, but they ended up being big masses of checks that reads or writes weren't present. That didn't feel too useful, so I removed them, but let me know.
This is based on bounds inference and inherits a few bugs present in that functionality, which I will address in a followup.
While working on this I realized that it overlaps heavily with `computeAt`, which is really just `cacheReads` + `computeInline`. I'm considering refactoring computeAt to be a wrapper around those two transforms. ZolotukhinM, opinions on this?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45869
Reviewed By: mruberry
Differential Revision: D24195276
Pulled By: nickgg
fbshipit-source-id: 36a58ae265f346903187ebc4923637b628048155
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45520
With this change `Load`s and `Store`s no longer accept `Placeholder`s in
their constructors and `::make` functions, and can only be built with
`Buf`.
`Placeholder` gets its own `store`, `load`, `storeWithMask`, and
`loadWithMask` methods for more convenient construction.
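Roughly, construction moves in this direction (a minimal sketch; the exact signatures are my assumption, not copied from the diff):
```
Placeholder A("A", kFloat, {64});
VarHandle i("i", kInt);
// Loads and stores now go through the Buf, or through the new Placeholder helpers:
ExprHandle v = A.load(i);
auto s = A.store({i}, v + ExprHandle(1.0f));
```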
Test Plan: Imported from OSS
Reviewed By: glaringlee
Differential Revision: D23998789
Pulled By: ZolotukhinM
fbshipit-source-id: 3fe018e00c1529a563553b2b215f403b34aea912
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45388
Classes defined in these files are closely related, so it is reasonable
to have them all in one file. The change is purely a code move.
Differential Revision: D23952867
Test Plan: Imported from OSS
Reviewed By: nickgg
Pulled By: ZolotukhinM
fbshipit-source-id: 12cfaa968bdfc4dff00509e34310a497c7b59155
Summary:
When doing a splitWithMask we only mask if the loop extent is not cleanly divisible by the split factor. However, the logic does not simplify the extent expression, so any nontrivial loop extent (e.g. from a previous split) will always cause a mask to be added. Unlike splitWithTail, the masks added by splitWithMask are pure overhead, and we don't have the analysis to optimize them out when they are unnecessary, so it's good to avoid inserting them if we can.
The fix is just to simplify the loop extents before doing the extent calculation.
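An illustrative sketch of the problem (loop extents invented): after an earlier split, a loop's stop can be the unsimplified expression `64 / 8`, and a second splitWithMask with factor 4 could not see that it divides cleanly, so it inserted a redundant mask. Simplifying the extent to 8 first avoids the mask:
```
for (int i_inner = 0; i_inner < 64 / 8; i_inner++) {
  ...
}
```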
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45141
Reviewed By: ezyang
Differential Revision: D23869170
Pulled By: nickgg
fbshipit-source-id: 44686fd7b802965ca4f5097b0172a41cf837a1f5
Summary:
This is a reup of https://github.com/pytorch/pytorch/issues/43885 with an extra commit which should fix the bugs that caused it to be reverted. Read that PR for general context.
The issue here was that we were still using the side maps `tensor_to_stmt_` and `stmt_to_tensor_`, which get invalidated by any transform of the IR (not just by transforms other than computeInline). I had added a comment about this but didn't actually address our usages of the maps.
I've removed these maps and changed the `getLoopBodyFor` and `getLoopStatementsFor` helpers to search the root stmt directly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44231
Reviewed By: albanD
Differential Revision: D23689688
Pulled By: nickgg
fbshipit-source-id: 1c6009a880f8c0cebf2300fd06b5cc9322bffbf9
Summary:
When caller / callee pairs are inserted into the mapping, verify that
the arity of the buffer access is consistent with its declared rank.
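Illustratively (not taken from the diff), the kind of mismatch this detects:
```
A[i, j] = i + j;   // A is produced with rank 2
B[i] = A[i];       // but consumed with a single index; inlining now reports the mismatch
```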
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44561
Test Plan: CI, test_tensorexpr --gtest_filter=TensorExprTest.DetectInlineRankMismatch
Reviewed By: albanD
Differential Revision: D23684342
Pulled By: asuhan
fbshipit-source-id: dd3a0cdd4c2492853fa68381468e0ec037136cab
Summary:
Adds new transforms `sliceHead` and `sliceTail` to `LoopNest`. For example:
Before transformation:
```
for x in 0..10:
  A[x] = x*2
```
After `sliceHead(x, 4)`:
```
for x in 0..4:
  A[x] = x*2
for x in 4..10:
  A[x] = x*2
```
After `sliceTail(x, 1)`:
```
for x in 0..4:
  A[x] = x*2
for x in 4..9:
  A[x] = x*2
for x in 9..10:
  A[x] = x*2
```
`sliceHead(x, 10)` and `sliceTail(x, 10)` are no-ops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43854
Test Plan: Tests are added in `test_loopnest.cpp`. They cover the basic transformations and also test the combination with other transformations such as `splitWithTail`.
Reviewed By: nickgg
Differential Revision: D23417366
Pulled By: cheng-chang
fbshipit-source-id: 06c6348285f2bafb4be3286d1642bfbe1ea499bf
Summary:
A rework of `computeInline` which makes it work a bit better, particularly when combined with other transformations. Previously we stored Functions that were inlined and then deferred the actual inlining of the function body until prepareForCodegen was called. This has an issue when transformations are applied to the LoopNest: the function body can be different from what appears in the root_stmt and result in inlining that a) fails, b) reverses other transformations, or c) does a weird, unpredictable combination of the two.
This PR changes that behaviour so that the inlining occurs in the root stmt immediately, which means it reflects any previous transformations, and any future transformations have a true view of the internal IR. It also has the benefit that inspecting the root statement gives an accurate view of it without needing to call prepareForCodegen. I also removed the difference between `computeInline` and `computeInlineWithRand`, and we handle calls to `rand()` in all branches.
This is a rework of https://github.com/pytorch/pytorch/issues/38696, with the agreed changes from ZolotukhinM and zheng-xq: we should only inline if the dimensions are trivial (i.e. they are vars, not exprs).
This PR is mostly tests, and I fixed a bunch of bugs I found along the way. Partial list:
* When inlining an expression involving rand, we would create random vars equal to the dimensionality of the enclosing Tensor, not the produced Tensor - meaning we'd use an incorrect value if the inlined tensor was smaller. E.g. `X[i] = rand(); A[i, j] = X[i]` would incorrectly produce a tensor where `A[0, 0] != A[0, 1]`. This is fixed by inserting the Let binding of the random variable in the correct loop body.
* When inlining we'd replace all calls to `rand()` rather than just those present in the Tensor being inlined.
* `rand()` was treated symbolically by the simplifier, and we would aggregate or cancel calls to `rand()`. I've fixed the hasher to hash all calls to `rand()` distinctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43885
Reviewed By: gmagogsfm
Differential Revision: D23503636
Pulled By: nickgg
fbshipit-source-id: cdbdc902b7a14d269911d978a74a1c11eab004fa
Summary:
This diff normalizes for-loops that have non-zero loop starts to always start from 0. Given a for-loop, this normalization changes the loop start to 0 and adjusts the loop end and all accesses to the index variable within the loop body appropriately.
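For example (illustrative, using the same pseudo-IR as elsewhere in these notes):
```
for x in 5..10:
  A[x] = x*2
```
becomes:
```
for x in 0..5:
  A[x + 5] = (x + 5)*2
```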
This diff also adds tests for several cases of normalization, and also tests normalization in conjunction with the `splitWithTail` transformation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43179
Reviewed By: nickgg
Differential Revision: D23220534
Pulled By: navahgar
fbshipit-source-id: 64be0c72e4dbc76906084f7089dea81ae07d6020
Summary:
A while back, when commonizing the Let and LetStmt nodes, I ended up removing both and adding a separate VarBinding section to the Block. At the time I couldn't find a counter example, but I found one today: local Var and Allocation dependencies may go in either direction, so we need to support interleaving of those statements.
So, I've removed all the VarBinding logic and reimplemented Let statements. ZolotukhinM, I think you get to say "I told you so". No new tests; the existing tests should cover this.
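An illustrative counter example (names made up): the Let depends on a preceding allocation and the next Allocate depends on the Let, so neither a vars-first nor an allocs-first section of the Block can express it:
```
Allocate(A, int, {10});
int n = A[0];             // the Let depends on the earlier allocation
Allocate(B, float, {n});  // this allocation depends on the Let
```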
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42634
Reviewed By: mruberry
Differential Revision: D22969771
Pulled By: nickgg
fbshipit-source-id: a46c5193357902d0f59bf30ab103fe123b1503f1
Summary:
Unroll a loop with constant boundaries, replacing it with multiple
instances of the loop body. For example:
```
for x in 0..3:
  A[x] = x*2
```
becomes:
```
A[0] = 0
A[1] = 2
A[2] = 4
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42465
Test Plan: `test_tensorexpr` unit tests.
Reviewed By: agolynski
Differential Revision: D22914418
Pulled By: asuhan
fbshipit-source-id: 72ca10d7c0b1ac7f9a3688ac872bd94a1c53dc51
Summary:
A heavy refactor of bounds inference to fix some issues and bugs blocking its use for analyzing cross-thread interactions:
* We were merging all accesses to a Buf into a single bounds info entry, even if they did not overlap. E.g. if we accessed a[0:2] and a[5:6] we would merge that into a bound of a[0:6]. I've changed this behaviour to merge only overlapping bounds.
* We were not separating bounds of different kinds (e.g. Load vs Store) and would merge Store bounds into Load bounds, losing the information about what kind of access it was. E.g. this loop would previously produce the bounds [{Load, 0, 10}] and now produces [{Load, 0, 9}, {Store, 1, 10}]:
```
for i in 1 to 10...
  x[i] = x[i-1]
```
* Both ComputeAt and Rfactor relied on the overzealous merging and only used a single entry in the bounds list to determine the bounds of temporary buffers they created, which could result in temporary buffers being allocated smaller than the accesses to them. I've fixed Rfactor, but *not* ComputeAt - however, all ComputeAt tests still pass (loop fusion may be required to trigger the issue). I will come back to it.
Being more precise about bounds is more complex: rather than taking the minimum of starts and the maximum of stops, we now need to determine whether two bounds overlap or are adjacent. There are many edge cases, so I've added a bunch of test coverage for the merging method.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42185
Reviewed By: mruberry
Differential Revision: D22870391
Pulled By: nickgg
fbshipit-source-id: 3ee34fcbf0740a47259defeb44cba783b54d0baa
Summary:
Fixes a bug in SplitWithTail and SplitWithMask where loop_options such as Cuda block/thread bindings are overwritten by the split. This PR propagates the loop options to the outer loop, which for axis bindings should be equivalent.
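Illustratively (the loop-option annotation shown here is schematic, not exact NNC output), a loop bound to a block index keeps its binding on the outer loop after the split:
```
for (int i = 0; i < 1024; i++) /* blockIdx.x */ {
  ...
}
```
becomes, after a split with factor 32:
```
for (int i_outer = 0; i_outer < 32; i_outer++) /* blockIdx.x */ {
  for (int i_inner = 0; i_inner < 32; i_inner++) {
    ...
  }
}
```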
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40035
Reviewed By: ZolotukhinM
Differential Revision: D22080263
Pulled By: nickgg
fbshipit-source-id: b8a9583fd90f69319fc4bb4db644e91f6ffa8e67
Summary:
Fixes a bug in reorderAxis where we appended the new reordered loops to the end of the enclosing block, even if there were statements after the original loops. E.g. with 3 Computes:
```
for (int m1 ...
  for (int n1 ...
    for (int k1 ...
      Body 1
for (int m2 ...
  for (int n2 ...
    for (int k2 ...
      Body 2
for (int m3 ...
  for (int n3 ...
    for (int k3 ...
      Body 3
```
If we reordered loops m2 and k2, we would also move Body 2 after Body 3, like this:
```
for (int m1 ...
  for (int n1 ...
    for (int k1 ...
      Body 1
for (int m3 ...
  for (int n3 ...
    for (int k3 ...
      Body 3
for (int k2 ...
  for (int n2 ...
    for (int m2 ...
      Body 2
```
This is because we always appended the new loops to the end of their parent. This PR fixes the logic to replace the old root loop with the new loop in place, which keeps the order of statements consistent.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38841
Differential Revision: D21723670
Pulled By: nickgg
fbshipit-source-id: 1dee8bb153182fcaa2cabd948197577e8e80acd7
Summary:
Removes the requirement for the axes provided to reorderAxis to come from a Tensor. We were using the Tensor to determine the relevant loops, but we can instead determine them by traversing the parents of each provided For.
resistor, does this work for you?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37873
Differential Revision: D21428016
Pulled By: nickgg
fbshipit-source-id: b16b2f41cb443dfc2c6548b7980731d1e7d89a35
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36611
Buf represents the underlying storage, but until now it didn't have a dtype. That resulted in dtypes being specified in different places, and there was no mechanism to enforce their consistency: e.g. one could've created a kFloat expression and used a kInt buffer to store its result. Now we're centralizing where the logic regarding the storage is located, and we can start enforcing semantic rules.
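A minimal sketch of the mismatch that can now be caught (the constructor arguments here are my assumption):
```
BufHandle result("result", {64}, kInt);  // integer storage
ExprHandle val = FloatImm::make(1.0f);   // float-typed expression
// Storing val into result can now be flagged, since the dtype lives on the buffer itself.
```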
Follow-ups: we can merge Buffer and BufHandle classes as the former is
now a mere wrapper over the latter.
Test Plan: Imported from OSS
Differential Revision: D21027356
Pulled By: ZolotukhinM
fbshipit-source-id: c06aa2c4077fdcde3bb4ca622d324aece79b5a9c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37050
With this change, curly braces are printed as part of a Block rather than as part of the enclosing statement. This allows us, for instance, to see nested blocks more easily: each will now be printed in its own curly-braced scope.
As a side effect, I had to change how we print loop options. Previously
we did it like this:
```
for (...) { // <loop options>
  <loop body (Block)>
}
```
Now, since everything in between { and } is a part of the block, we have
to do it the following way:
```
for (...) /* <loop options> */ {
  <loop body (Block)>
}
```
Note the change from '//' to '/* .. */' for the loop option comments.
Test Plan: Imported from OSS
Differential Revision: D21171851
Pulled By: ZolotukhinM
fbshipit-source-id: 39f51a9e15aec03b6527b0634fd4b9e01a912cda
Summary:
Adds a capability for reordering axes in the LoopNest. This was fairly straightforward except for handling Reduction initializers, which required more changes. UPDATE: actually the complicated bit was preserving the ordering of statements in the loopnest which should not be reordered.
Usage looks something like this:
```
Tensor* tensor = Compute(
    "f", {{2, "x"}, {3, "y"}}, [](const VarHandle& x, const VarHandle& y) {
      return ExprHandle(1.0f) + cast<float>(x) * x + cast<float>(y) * y;
    });
LoopNest l({tensor});
/* LoopNest looks like:
for x in ...
  for y in ...
    f[x,y] = 1 + x * x + y * y;
*/
auto loops = l.getLoopStmtsFor(tensor);
l.reorderAxis(tensor, loops[0], loops[1]);
/* LoopNest looks like:
for y in ...
  for x in ...
    f[x,y] = 1 + x * x + y * y;
*/
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36540
Differential Revision: D21068143
Pulled By: nickgg
fbshipit-source-id: f02c29004376df4f5a9bedff366c075772726618
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35800
This PR includes the following changes:
* Introduce a new `Expr` type `Buf`: it plays a role similar to `Var`, but also has dimensions.
* Use the new `Buf` class in `Store` and `Load` instead of `Var` for specifying where to store to or load from. `Buf` contains the dimensions info of the buffer we're loading/storing to and hence we are able to keep N-d indexes without flattening them into a 1-d index ([x,y] vs [x+y*W]).
* Flattening of the indexes is now a separate pass that is executed in `LoopNest::prepareForCodegen` - backends still expect indexes to be flattened, and this PR preserves that.
* `Tensor` now contains a `Buf` instead of `Var`, and thus Tensor now has the dimensions info (previously it was a property of a `Function`, not a `Tensor`). This brings us closer to Tensor being a combination of Buffer + Function, where Buffer specifies iteration domain and the Function defines a computation.
TODOs:
* Consider merging `Buffer` with `Buf` or `BufHandle`. It seems that we don't need all of them.
* Harden the logic of how we create buffers in fuser pass. Currently it seems that sometimes we don't set dimensions.
* Use `Buf` in `Allocate` and `Free`.
* Make it clearer that `Function` doesn't "own" dimensions info and that dimensions are a property of a Tensor, not a Function.
Differential Revision: D20789005
Test Plan: Imported from OSS
Reviewed By: zheng-xq
Pulled By: ZolotukhinM
fbshipit-source-id: e04188d1d297f195f1c46669c614557d6bb6cde4