Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64887
BufHandle has exactly the same functionality and should be used instead.
Differential Revision: D30889483
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: 365fe8e396731b88920535a3de96bd3301aaa3f3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64077
We were assuming kernel dimensions fit in 32 bits (the old fuser made
this assumption too), but we should be able to support 64.
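For illustration only (this is not code from the patch, and the sizes are made up): once the flattened element count exceeds 2^31 - 1, a 32-bit size silently truncates, which is why 64-bit sizes are needed.
```
#include <cstdint>
#include <iostream>

int main() {
  // Hypothetical kernel sizes whose product exceeds 2^31 - 1.
  int64_t rows = 1 << 20;  // 1,048,576
  int64_t cols = 1 << 12;  // 4,096
  int64_t numel = rows * cols;                      // 4,294,967,296: needs 64 bits
  int32_t truncated = static_cast<int32_t>(numel);  // becomes 0 -> wrong indexing
  std::cout << numel << " vs " << truncated << "\n";
}
```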
ghstack-source-id: 136933272
Test Plan: unit tests; new IR level test with huge sizes
Reviewed By: ZolotukhinM
Differential Revision: D30596689
fbshipit-source-id: 23b7e393a2ebaecb0c391a6b1f0c4b05a98bcc94
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63587
Now that there are no classes using KernelArena for memory management, we
can remove it.
Differential Revision: D30429115
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: 375f6f9294d27790645eeb7cb5a8e87047a57544
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63586
This is another commit in the transition away from KernelArena memory management.
Tensor is essentially just a pair of <BufPtr, StmtPtr> and we don't need
to dynamically allocate it at all - it's cheap to pass it by value, and
that's what we're switching to in this commit.
After this change nothing uses KernelScope/KernelArena and they can be
safely removed.
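A minimal sketch of the idea (the aliases and class shape are assumed here, not copied from the source): a Tensor that is just a pair of shared pointers is trivially cheap to copy and needs no arena allocation.
```
#include <memory>
#include <utility>

// Assumed aliases matching the shared-pointer memory model this stack moves to.
class Buf;
class Stmt;
using BufPtr = std::shared_ptr<Buf>;
using StmtPtr = std::shared_ptr<Stmt>;

// Value-type Tensor: essentially just a pair of <BufPtr, StmtPtr>.
class Tensor {
 public:
  Tensor(BufPtr buf, StmtPtr stmt)
      : buf_(std::move(buf)), stmt_(std::move(stmt)) {}
  BufPtr buf() const { return buf_; }
  StmtPtr stmt() const { return stmt_; }

 private:
  BufPtr buf_;
  StmtPtr stmt_;
};

// Passing by value copies two shared_ptrs - no KernelArena allocation needed.
void consume(Tensor t);
```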
Differential Revision: D30429114
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: f90b859cfe863692b7beffbe9bd0e4143df1e819
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63778
This is preparation for a switch from raw pointers to shared pointers
as the memory model for TE expressions and statements.
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D30487425
Pulled By: ZolotukhinM
fbshipit-source-id: 9cbe817b7d4e5fc2f150b29bb9b3bf578868f20c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63197
This solves non-determinism from using hash values in sort methods.
Changes in tests are mostly mechanical.
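An illustration of the underlying problem (this is not the NNC code, and not necessarily the exact fix taken here): a hash derived from pointer identity changes between runs, so any order built from it is non-deterministic, whereas ordering by a stable key is not.
```
#include <algorithm>
#include <memory>
#include <string>
#include <vector>

struct Var {
  std::string name;
};
using VarPtr = std::shared_ptr<Var>;

// Non-deterministic: pointer values (and hashes built from them) differ
// from run to run, so the resulting order does too.
void sortByPointer(std::vector<VarPtr>& vars) {
  std::sort(vars.begin(), vars.end(),
            [](const VarPtr& a, const VarPtr& b) { return a.get() < b.get(); });
}

// Deterministic: order by a stable key, such as the variable's name.
void sortByName(std::vector<VarPtr>& vars) {
  std::sort(vars.begin(), vars.end(),
            [](const VarPtr& a, const VarPtr& b) { return a->name < b->name; });
}
```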
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D30292776
Pulled By: ZolotukhinM
fbshipit-source-id: 74f57b53c3afc9d4be45715fd74781271373e055
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63195
This helps us to later switch from using KernelArena with raw pointers
to shared pointers without having to change all our source files at
once.
The changes are mechanical and should not affect any functionality.
With this PR, we're changing the following:
* `Add*` --> `AddPtr`
* `new Add(...)` --> `alloc<Add>(...)`
* `dynamic_cast<Add*>` --> `to<Add>`
* `static_cast<Add*>` --> `static_to<Add>`
Due to some complications with args forwarding, some places became more
verbose, e.g.:
* `new Block({})` --> `new Block(std::vector<ExprPtr>())`
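For illustration, a hedged before/after sketch of what the mapping above looks like at a use site (this assumes the NNC tensorexpr headers are available; exact helper signatures may differ slightly):
```
// Before: raw pointers owned by KernelArena.
//   Add* sum = new Add(lhs, rhs);
//   if (Add* a = dynamic_cast<Add*>(expr)) { ... }
//   Block* b = new Block({});

// After: pointer aliases plus the alloc/to helpers, per the mapping above.
AddPtr sum = alloc<Add>(lhs, rhs);
if (AddPtr a = to<Add>(expr)) {
  // use `a` as an Add
}
BlockPtr b = alloc<Block>(std::vector<ExprPtr>());  // explicit vector due to arg forwarding
```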
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D30292779
Pulled By: ZolotukhinM
fbshipit-source-id: 150301c7d2df56b608b035827b6a9a87f5e2d9e9
Summary:
The GoogleTest `TEST` macro is non-compliant with the `cppcoreguidelines-avoid-non-const-global-variables` check, as is `DEFINE_DISPATCH`.
All changes except those to `.clang-tidy` were generated using the following script:
```
for i in `find . -type f -iname "*.c*" -or -iname "*.h"|xargs grep cppcoreguidelines-avoid-non-const-global-variables|cut -f1 -d:|sort|uniq`; do sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" $i; done
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008
Reviewed By: driazati, r-barnes
Differential Revision: D29838584
Pulled By: malfet
fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61725
Alloc/free inside a loop isn't really an optimization, and furthermore
it breaks an attempted optimization in the LLVM backend: we use alloca for
small allocations, which is efficient since alloca is on the stack, but there's
no corresponding free, so we leak a lot of stack. I hit this while building an
rfactor buffer inside a very deeply nested loop.
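A plain C++ analogy of the stack growth described above (this is not the NNC code; it only illustrates why `alloca` inside a loop keeps consuming stack until the function returns):
```
#include <alloca.h>  // non-standard header, available on the relevant platforms
#include <cstddef>

void allocaInLoop(int iters, size_t k) {
  for (int i = 0; i < iters; i++) {
    // Each iteration carves out fresh stack space; nothing releases it until
    // the function returns, so deep or long loops keep eating stack.
    float* tmp = static_cast<float*>(alloca(k * sizeof(float)));
    tmp[0] = static_cast<float>(i);
  }
}

void allocaHoisted(int iters, size_t k) {
  // One allocation reused across iterations - the effect of keeping
  // alloc/free out of the loop body.
  float* tmp = static_cast<float*>(alloca(k * sizeof(float)));
  for (int i = 0; i < iters; i++) {
    tmp[0] = static_cast<float>(i);
  }
}
```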
ghstack-source-id: 133627310
Test Plan:
Unit test which simulates use of a temp buffer in a deeply nested
loop.
Reviewed By: navahgar
Differential Revision: D29533364
fbshipit-source-id: c321f4cb05304cfb9146afe32edc4567b623412e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59508
An assert that was triggering in a previous version is now relaxed to
take 0-dim tensors into account.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D28918342
Pulled By: ZolotukhinM
fbshipit-source-id: c09b62c9725d1603b0ec11fcc051e7c932af06ae
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59279
There were some issues with how we handle 0-dim cases in lowerings and
also in how we generate reductions in that special case. This PR fixes
those issues and reenables a bunch of tests.
Differential Revision: D28819780
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: f3feff35a1ce11821ada2f8d04ae9d4be10dc736
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os


def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files


def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname, "-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit", "--all", "-m", f"NOLINT stubs for {fname}"])


def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)


if __name__ == "__main__":
    main()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892
Reviewed By: H-Huang
Differential Revision: D27991944
Pulled By: malfet
fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55324
With this change `rfactor` only affects the passed loop and its body,
never touching anything outside (that was the root cause of a bug in the
previous implementation). Also, we no longer have an `insertion_point`
parameter - its meaning was vague, and its effect should have been
achievable with other transformations anyway.
The new `rfactor` semantics are as follows:
```
Requirements:
* S is the reduction store
* S is the only statement in the innermost loop
* There are at least two reduction arguments in S
* OUTER_REDUCTION_FOR loop corresponds to the outermost reduction variable
used in the store and all other reduction variables are index variables of
children loops of OUTER_REDUCTION_FOR
* OUTER_REDUCTION_FOR is a perfect loop nest, i.e. it has only loops
corresponding to the other reduction variables and the store, nested into
each other
What it does:
* Introduce a new buffer with an extra dimension of a size equal to the
span of the loop OUTER_REDUCTION_FOR (the new buffer is returned via
RFAC_BUF_PTR)
* Insert an initialization store for the new buffer in
OUTER_REDUCTION_FOR before its nested loop
* Replace the reduction store to the original buffer with the reduction
store to the temp buffer, removing the index var of OUTER_REDUCTION_FOR
from reduction arguments
* Insert a final reduction store over the extra dimension of the new
buffer to the original buffer
* Returns TRUE if the transformation succeeded and FALSE otherwise
Example:
Original IR:
S1: for i # normal axis
S2: X[i] = 0
S3: for j # reduction axis
S4: for k # reduction axis
S5: X[i] = ReduceOp(X[i] + Y[i,j,k], reduce_axis={j,k})
After RFACTOR(S5, S3)
S1: for i # normal axis
S2: X[i] = 0
S3: for j # reduction axis for X, normal axis for X_rfac
X_rfac[i,j] = 0
S4: for k # reduction axis
X_rfac[i,j] = ReduceOp(X_rfac[i,j] + Y[i,j,k], reduce_axis={k})
X[i] = ReduceOp(X[i] + X_rfac[i,j], reduce_axis={j})
```
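A hedged usage sketch matching the description above (the exact signature is assumed from the text: the reduction store, the outer reduction loop, a boolean result, and the new buffer returned through an out-parameter):
```
// Sketch only - names and signature are inferred from the description,
// not copied from the source.
BufPtr rfac_buf = nullptr;
bool ok = loopnest.rfactor(
    /*reduction_store=*/S5,        // the ReduceOp store (S5 in the example)
    /*outer_reduction_for=*/S3,    // the outermost reduction loop (S3)
    /*rfac_buf_ptr=*/&rfac_buf);   // receives the new X_rfac buffer
if (!ok) {
  // One of the requirements above was not met; the IR is left untouched.
}
```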
Differential Revision: D27694960
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: 076fa6a1df2c23f5948302aa6b43e82cb222901c
Summary:
Switched to short forms of `splitWithTail` / `splitWithMask` for all tests in `test/cpp/tensorexpr/test_*.cpp` (except test_loopnest.cpp)
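For reference, a hedged sketch of the two forms (signatures assumed; the long form appears verbatim in a later commit message in this log):
```
For *outer, *inner, *tail;
// Long form: the newly created loops come back via out-parameters.
loopnest.splitWithTail(f, 8, &outer, &inner, &tail);
// Short form adopted by these tests: same split, no out-parameters.
loopnest.splitWithTail(f, 8);
```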
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55542
Reviewed By: mrshenli
Differential Revision: D27632033
Pulled By: jbschlosser
fbshipit-source-id: dc2ba134f99bff8951ae61e564cd1daea92c41df
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54997
DepTracker was used to automatically pull in dependent computations from
output ones. While that seems quite convenient, it led to several
architectural issues, which are fixed in this stack.
DepTracker worked on Tensors, where a Tensor is a pair of Buf and Stmt. However,
the Stmt could become stale and there was no way to reliably update the
corresponding tensor. We're now using Bufs and Stmts directly and moving
away from using Tensors to avoid these problems.
Removing DepTracker allowed us to unify Loads and FunctionCalls, which
were essentially duplicates of each other.
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D27446414
Pulled By: ZolotukhinM
fbshipit-source-id: a2a32749d5b28beed92a601da33d126c0a2cf399
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53751
Sometimes the initial value of a reduction expression needs to be
computed with reference to the loop axes; for example, adding bias can be
efficiently represented by initializing the accumulator from the bias tensor:
```
C[n, c, h, w] = bias[c]
for (...)
C[n, c, h, w] += ...
```
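A hedged sketch of what this enables (the overload shape is assumed from the description and the `Reductions.InitFunction` test name, not copied from the patch): the initializer becomes a function of the output axes, so the accumulator can start from `bias[c]` rather than a constant.
```
// Sketch only - argument order and lambda signatures are assumed.
Tensor* C = Reduce(
    "C",
    {{N, "n"}, {Co, "c"}, {H, "h"}, {W, "w"}},  // output axes
    Sum(),
    /*init=*/[&](const std::vector<VarHandle>& v) { return bias.load(v[1]); },
    /*body=*/[&](const std::vector<VarHandle>& v,
                 const std::vector<VarHandle>& r) {
      return input.load(v[0], v[1], r[0]);
    },
    {{K, "k"}});                                 // reduction axes
```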
ghstack-source-id: 123592861
Test Plan: `buck test //caffe2/test/cpp/tensorexpr -- Reductions.InitFunction`
Reviewed By: navahgar
Differential Revision: D26940321
fbshipit-source-id: 8a08e19e5d0b9ad453a07fab8b61e75dcd3d626b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52196
A reduction does not need to know the buffer into which its
result will be written. This change gets us closer to being able to
create reductions inside Compute, where we have access to the tensor
axes.
ghstack-source-id: 121813071
Test Plan: test_tensorexpr
Reviewed By: ZolotukhinM
Differential Revision: D26420107
Pulled By: bertmaher
fbshipit-source-id: c8d8a99649adfd6de56fe53a728f5aa034a84f13
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52187
ReduceOp doesn't need to track the indices that its result will be written into.
ghstack-source-id: 121813075
Test Plan:
test_tensorexpr, tensorexpr_bench
Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D26420575
fbshipit-source-id: 7afcfa611515334e36de8039722011687f3b61e4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50995
This change makes 'Tensor' a thin wrapper over 'Buf' and 'Stmt', and
merges it with the recently introduced 'CompoundTensor'. A statement for the
tensor is either passed directly to the Tensor constructor (akin to
'CompoundTensor'), or is built immediately in the constructor.
LoopNest is no longer responsible for constructing statements from
tensors - it simply stitches together the already constructed statements contained in
Tensors. A side effect is that we can no longer construct several
loopnests from the same tensors - we need to explicitly clone statements
if we want to do that. A special copy constructor was added to LoopNest
to make this more convenient (note: this only affects tests; we don't
usually create multiple loopnests in other places).
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D26038223
Pulled By: ZolotukhinM
fbshipit-source-id: 27a2e5900437cfb0c151e8f89815edec53608e17
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49697
Mostly a mechanical move. This refactoring helps hide unnecessary
details from the SimpleIREval interface and makes it more similar to a
pure 'codegen'.
Test Plan: Imported from OSS
Reviewed By: nickgg
Differential Revision: D25668696
Pulled By: ZolotukhinM
fbshipit-source-id: 423247bfcdfa88403e8ec92152f00110bb9da19c
Summary:
Makes two changes in NNC for intermediate buffer allocations:
1. Flattens dimensions of buffers allocated in LoopNest::prepareForCodegen() to match their flattened usages.
2. Adds support for tracking memory dependencies of Alloc/Free to the MemDependencyChecker, which will allow us to check safety of accesses to intermediate buffers (coming in a future diff).
I didn't add any new tests as the mem dependency checker tests already cover it pretty well, particularly the GEMM test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49554
Reviewed By: VitalyFedyunin
Differential Revision: D25643133
Pulled By: nickgg
fbshipit-source-id: 66be3054eb36f0a4279d0c36562e63aa2dae371c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48160
We no longer use the custom C++ test infra anyway, so move to pure
gtest.
Fixes #45703
ghstack-source-id: 116977283
Test Plan: `buck test //caffe2/test/cpp/tensorexpr`
Reviewed By: navahgar, nickgg
Differential Revision: D25046618
fbshipit-source-id: da34183d87465f410379048148c28e1623618553
Summary:
Add support for ReduceOp in the Vectorizer, which allows vectorization of reductions. Only non-reduce axes can be vectorized currently; to make vectorizing reduce axes work we'd need either to automatically pull out the RHS of reductions (better as a separate transform, I think) or special handling of vector reduce in the LLVM codegen (tricky, maybe not useful?).
There was a disabled LLVM test for this case which I reenabled with a bit of massaging, and added a few more.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47924
Reviewed By: bertmaher
Differential Revision: D24963464
Pulled By: nickgg
fbshipit-source-id: 91d91e9e2696555ab5690b154984b1ce48359d51
Summary:
This diff enables inlining producers into reductions. It also guards against inlining reductions themselves.
Prior to this diff, if there was a reduction in the loopnest, no inlining was happening. After this change, we will inline all non-output buffers that do not correspond to a reduction.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47020
Reviewed By: albanD
Differential Revision: D24644346
Pulled By: navahgar
fbshipit-source-id: ad234a6877b65be2457b734cbb7f3a1800baa6a5
Summary:
Adds a new transform to the NNC compiler that supports buffer access caching. All accesses within a provided scope are redirected to a cache, which is initialized or written back as necessary at the boundaries of that scope. For TVM fans, this is essentially a combination of cache_reads and cache_writes. E.g. it can do this kind of thing:
Before:
```
for (int i = 0; i < 64; i++) {
for (int j = 0; j < 64; j++) {
A[i, j] = i * j;
}
}
for (int i_1 = 0; i_1 < 20; i_1++) {
for (int j_1 = 0; j_1 < 10; j_1++) {
B[i_1, j_1] = (A(i_1 + 30, j_1 + 40)) + (A(i_1 + 31, j_1 + 41));
}
```
After `cacheAccesses(A->buf(), "A_local", j_loop);`
```
for (int i = 0; i < 64; i++) {
for (int j = 0; j < 64; j++) {
A[i, j] = i * j;
}
}
for (int i_1 = 0; i_1 < 20; i_1++) {
for (int i_2 = 0; i_2 < 2; i_2++) {
for (int j_1 = 0; j_1 < 11; j_1++) {
A_local[i_2, j_1] = A[(i_2 + i_1) + 30, j_1 + 40];
}
}
for (int j_2 = 0; j_2 < 10; j_2++) {
B[i_1, j_2] = (A_local[1, j_2 + 1]) + (A_local[0, j_2]);
}
}
```
Or this reduction:
```
for (int l1 = 0; l1 < 4; l1++) {
sum[l1] = 0.f;
for (int n1_1 = 0; n1_1 < 3; n1_1++) {
for (int m1_1 = 0; m1_1 < 2; m1_1++) {
sum[l1] = (sum[l1]) + (scale[(6 * l1 + 2 * n1_1) + m1_1]);
}
}
}
```
After `l.cacheAccesses(d->buf(), "d_local", n_loop);`:
```
for (int l1 = 0; l1 < 4; l1++) {
Allocate(d_local, float, {1});
sum[l1] = 0.f;
d_local[0] = 0.f;
for (int n1_1 = 0; n1_1 < 3; n1_1++) {
for (int m1_1 = 0; m1_1 < 2; m1_1++) {
d_local[0] = (d_local[0]) + (scale[(6 * l1 + 2 * n1_1) + m1_1]);
}
}
sum[l1] = (sum[l1]) + (d_local[0]);
Free(d_local);
}
```
I had originally planned to write `cacheReads` and `cacheWrites` wrappers so we could use them just like their TVM cousins, but they just ended up being big masses of checking that reads or writes weren't present. Didn't feel too useful so I removed them, but let me know.
This is based on bounds inference and inherits a few bugs present in that functionality, which I will address in a followup.
While working on this I realized that it overlaps heavily with `computeAt`, which is really just `cacheReads` + `computeInline`. I'm considering refactoring computeAt to be a wrapper around those two transforms. ZolotukhinM, opinions on this?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45869
Reviewed By: mruberry
Differential Revision: D24195276
Pulled By: nickgg
fbshipit-source-id: 36a58ae265f346903187ebc4923637b628048155
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45520
With this change `Load`s and `Store`s no longer accept `Placeholder`s in
their constructors and `::make` functions and can only be built with a
`Buf`.
`Placeholder` gets its own `store`, `load`, `storeWithMask`, and
`loadWithMask` methods for more convenient construction.
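A hedged sketch of the new convenience path (the method names come from the sentence above; exact signatures are assumed):
```
// Sketch only: Load/Store are now constructed from a Buf, while Placeholder
// offers convenience helpers so call sites stay short.
Placeholder A("A", kFloat, {64});
VarHandle i("i", kInt);

auto st = A.store({i}, ExprHandle(1.0f));  // instead of Store::make(A, ...)
auto ld = A.load(i);                       // instead of Load::make(A, ...)
```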
Test Plan: Imported from OSS
Reviewed By: glaringlee
Differential Revision: D23998789
Pulled By: ZolotukhinM
fbshipit-source-id: 3fe018e00c1529a563553b2b215f403b34aea912
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45390
Tensor objects should always refer to their Function's bufs. Currently
we never create a Tensor with a buffer different from that of its function,
but having it in two places seems incorrect and dangerous.
Differential Revision: D23952865
Test Plan: Imported from OSS
Reviewed By: nickgg
Pulled By: ZolotukhinM
fbshipit-source-id: e63fc26d7078427514649d9ce973b74ea635a94a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45388
Classes defined in these files are closely related, so it is reasonable
to have them all in one file. The change is purely a code move.
Differential Revision: D23952867
Test Plan: Imported from OSS
Reviewed By: nickgg
Pulled By: ZolotukhinM
fbshipit-source-id: 12cfaa968bdfc4dff00509e34310a497c7b59155
Summary:
This is a reup of https://github.com/pytorch/pytorch/issues/43885 with an extra commit that should fix the bugs that caused it to be reverted. Read that for general context.
The issue here was that we were still using the side maps `tensor_to_stmt_` and `stmt_to_tensor_` which get invalidated by any transform of the IR (rather than just any transform that isn't computeInline). I added a comment about this but didn't actually address our usages of it.
I've removed these maps and changed the `getLoopBodyFor` and `getLoopStatementsFor` helpers to search the root stmt directly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44231
Reviewed By: albanD
Differential Revision: D23689688
Pulled By: nickgg
fbshipit-source-id: 1c6009a880f8c0cebf2300fd06b5cc9322bffbf9
Summary:
A rework of `computeInline` which makes it work a bit better, particularly when combined with other transformations. Previously we stored Functions that were inlined and then deferred the actual inlining of the function body until prepareForCodegen was called. This has an issue when transformations are applied to the LoopNest: the function body can be different from what appears in the root_stmt and result in inlining that a) fails, b) reverses other transformations or c) a weird unpredictable combination of the two.
This PR changes that behaviour so that the inlining occurs in the root stmt immediately, which means it reflects any previous transformations and any future transformations have a true view of the internal IR. It also has the benefit that inspecting the root statement gives an accurate view of it without needing to call prepareForCodegen. I also removed the difference between `computeInline` and `computeInlineWithRand`, and we handle calls to `rand()` in all branches.
This is a rework of https://github.com/pytorch/pytorch/issues/38696, with the agreed changes from ZolotukhinM and zheng-xq: we should only inline if the dimensions are trivial (ie. they are vars not exprs).
This PR is mostly tests, and I fixed a bunch of bugs I found along the way. Partial list:
* When inlining an expression involving rand, we would create random vars equal to the dimensionality of the enclosing Tensor not the produced Tensor - meaning we'd use an incorrect value if the inlined tensor was smaller. E.g: `X[i] = rand(); A[i, j] = X[i]` would produce a tensor where `A[0, 0] != A[0, 1]`. This is fixed by inserting the Let binding of the random variable at the correct loop body.
* When inlining we'd replace all calls to `rand()` rather than just those present in the Tensor being inlined.
* `rand()` was treated symbolically by the simplifier and we would aggregate or cancel calls to `rand()`. Have fixed the hasher to hash all calls to `rand()` distinctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43885
Reviewed By: gmagogsfm
Differential Revision: D23503636
Pulled By: nickgg
fbshipit-source-id: cdbdc902b7a14d269911d978a74a1c11eab004fa
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44139
Also, make sure that we're checking that condition (that shapes are
specified) when we're starting a new fusion group, not only when we merge
a node into an existing fusion group. Oh, and one more: add a test
checking that we're rejecting graphs with unspecified shapes.
Differential Revision: D23507510
Test Plan: Imported from OSS
Reviewed By: bertmaher
Pulled By: ZolotukhinM
fbshipit-source-id: 9c268825ac785671d7c90faf2aff2a3e5985ac5b
Summary:
Auto-fuse the output loops of outer Rfactors so they are in a more convenient format for binding GPU axes.
An example:
```
Tensor* c = Reduce("sum", {}, Sum(), b, {{m, "m"}, {n, "n"}, {k, "k"}});
LoopNest loop({c});
std::vector<For*> loops = loop.getLoopStmtsFor(c);
auto v = loops.at(0)->var();
loop.rfactor(c->body(), v);
```
Before:
```
{
Allocate(tmp_buf, float, {m});
sum[0] = 0.f;
for (int m_1 = 0; m_1 < m; m_1++) {
tmp_buf[m_1] = 0.f;
}
for (int m_1 = 0; m_1 < m; m_1++) {
for (int n = 0; n < n_1; n++) {
for (int k = 0; k < k_1; k++) {
tmp_buf[m_1] = (tmp_buf[m_1]) + (b[((n_1 * m_1) * k_1 + k) + k_1 * n]);
}
}
}
for (int m_1 = 0; m_1 < m; m_1++) {
sum[0] = (sum[0]) + (tmp_buf[m_1]);
}
Free(tmp_buf);
}
```
After:
```
{
sum[0] = 0.f;
for (int m = 0; m < m_1; m++) {
Allocate(tmp_buf, float, {m_1});
tmp_buf[m] = 0.f;
for (int n = 0; n < n_1; n++) {
for (int k = 0; k < k_1; k++) {
tmp_buf[m] = (tmp_buf[m]) + (b[((n_1 * m) * k_1 + k) + k_1 * n]);
}
}
sum[0] = (sum[0]) + (tmp_buf[m]);
Free(tmp_buf);
}
}
```
The existing Rfactor tests cover this case, although I did rename a few for clarity. This change broke the LLVMRFactorVectorizedReduction test because it now does what it's intended to do (vectorize a loop with a reduction in it) rather than nothing, and since that doesn't work, it correctly fails. I've disabled it for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40050
Reviewed By: ZolotukhinM
Differential Revision: D22605639
Pulled By: nickgg
fbshipit-source-id: e359be53ea62d9106901cfbbc42d55d0e300e8e0
Summary:
The two bugs were:
* Non-reduction axes were not added when inserting the new ReduceOp, meaning that if a reduction with non-reduce axes was rfactored we'd produce bad outputs. There were no tests of Rfactor with non-reduce axes, so I modified a test to cover this.
* The new statements were always prepended to the block, meaning writes to a buffer could be reordered after the usage of that buffer. This mostly happened in the case where we rfactor a previously rfactored reduction. There was a test of this, but since it only tested rfactoring the outer reduction axis, there were never any other statements at the insertion point (the tests of the insertion point argument also do this). I added a new test which covers various rfactor-axis cases.
Also cleaned up tests, removed some helper code we don't need etc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39268
Differential Revision: D21864489
Pulled By: nickgg
fbshipit-source-id: d314d20997a8472ec96b72f7a9068d6da6d2399c
Summary:
In `LoopNest::rfactor` we assume that there is only a single reduction below the insertion point, and when replacing the reduction we recursively replace all reductions below that point. This is not a safe assumption, as a number of transformations can introduce additional ReduceOps - most directly a `splitWithTail` on the innermost reduce axis.
This PR fixes that bug, and adds some unit tests covering the case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38733
Differential Revision: D21723634
Pulled By: nickgg
fbshipit-source-id: 3ed6ffcdc2c15aef7504f9b2b91e8d827e0b5d88
Summary:
This PR removes the deferred initializer field from ReduceOp in favour of eagerly initializing buffers when they are created (either in the constructor of `LoopNest`, or in `rfactor()`). This allows a pretty good simplification of reduction logic, removing almost all of the reduction expander and the ReduceInitCleaner & unpopular NoOp node added in the last fix.
Eager initialization is better for us anyway because it allows more opportunities to transform the initialization loop.
Added a few more tests; testReduceOverSplitWithTail failed before this change due to a bug in splitWithTail that now can't happen.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38585
Differential Revision: D21621551
Pulled By: nickgg
fbshipit-source-id: 378137e5723b4a6d6e390239efb12adce22a8215
Summary:
Fixes a bug in the following code:
```
Tensor* c = Reduce("sum", {{10, "m"}}, Sum(), b, {{10, "n"}, {10, "k"}});
// split N loop with tail:
loop.splitWithTail(loop.getLoopStmtsFor(c)[1], 8, &outer, &inner, &tail);
```
When this is expanded there are two ReduceOps:
```
for (int m = 0; m < 10; m++) {
for (int n_outer = 0; n_outer < (10 - 0) / 8; n_outer++) {
for (int n_inner = 0; n_inner < 8; n_inner++) {
for (int k = 0; k < 10; k++) {
sum[m] = ReduceOp(sum, float(0), (sum[m]) + (b[m, n_outer * 8 + n_inner, k]), out_args={m}, reduce_args={n_inner, n_outer, k});
}
}
}
for (int n_tail = 0; n_tail < (10 - 0) % 8; n_tail++) {
for (int k = 0; k < 10; k++) {
sum[m] = ReduceOp(sum, float(0), (sum[m]) + (b[m, n_tail + ((10 - 0) / 8) * 8, k]), out_args={m}, reduce_args={n_tail, k});
}
}
}
```
But each ReduceOp will expand its initializer, which in this case will overwrite the sum of the split loop:
```
for (int m = 0; m < 10; m++) {
sum[m] = 0.f;
for (int n_inner = 0; n_inner < 8; n_inner++) {
for (int k = 0; k < 10; k++) {
sum[m] = (sum[m]) + (b[(100 * m + k) + 10 * n_inner]);
}
}
sum[m] = 0.f; <------- *HERE*
for (int n_tail = 0; n_tail < 2; n_tail++) {
for (int k = 0; k < 10; k++) {
sum[m] = (sum[m]) + (b[((100 * m + k) + 10 * n_tail) + 80]);
}
}
}
```
The simplest fix is to remove the initializer from the tail loop, which requires adding support for Reductions without an initializer (I did this by adding a NoOp Expr rather than handling nullptr). Also moved the ReductionExpander from loopnest.cpp to reduction.h, as loopnest is getting a bit heavy.
Added tests for all kinds of splits on a simple 3D reduction to verify no more problems of this type.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38420
Differential Revision: D21587583
Pulled By: nickgg
fbshipit-source-id: e0766934481917007119612eb60cc76c3242e44a
Summary:
Remove the requirement for the axes provided to reorderAxis to come from a Tensor. We were using that to determine the relevant loops, but we can alternatively determine it by traversing the parents of each provided For.
resistor, does this work for you?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37873
Differential Revision: D21428016
Pulled By: nickgg
fbshipit-source-id: b16b2f41cb443dfc2c6548b7980731d1e7d89a35