Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36745
As we hold a mutex for each custom C++ Node, calling reentrant
backward from a custom C++ function means we concurrently hold many
mutexes, up to MAX_DEPTH of them. TSAN only allows 65 mutexes held at
once and complains otherwise. This PR lowers the limit accordingly.
TSAN Reference: https://github.com/google/sanitizers/issues/950
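For context, a minimal sketch (names hypothetical, not the autograd engine itself) of how reentrant backward accumulates held locks up to the recursion depth:
```
#include <mutex>
#include <vector>

// Hypothetical sketch: each level of a reentrant backward locks its
// Node's mutex and keeps it held across the recursive call, so `depth`
// mutexes are held simultaneously.
void reentrant_backward(int depth, std::vector<std::mutex>& node_locks) {
  if (depth == 0) return;
  std::lock_guard<std::mutex> guard(node_locks[depth - 1]);
  reentrant_backward(depth - 1, node_locks); // still holding our lock
}
// With MAX_DEPTH levels of recursion, TSAN sees MAX_DEPTH concurrently
// held mutexes and aborts once its per-thread limit is exceeded.
```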
Test Plan: Imported from OSS
Differential Revision: D21072604
Pulled By: wanchaol
fbshipit-source-id: 99cd1acab41a203d834fa4947f4e6f0ffd2e70f2
Summary:
Adds a capability for reordering axes in the LoopNest. This was fairly straightforward except for handling Reduction initializers, which required more changes. UPDATE: actually, the complicated bit was preserving the ordering of statements in the loopnest which should not be reordered.
Usage looks something like this:
```
Tensor* tensor = Compute(
    "f", {{2, "x"}, {3, "y"}}, [](const VarHandle& x, const VarHandle& y) {
      return ExprHandle(1.0f) + cast<float>(x) * x + cast<float>(y) * y;
    });
LoopNest l({tensor});
/* LoopNest looks like:
   for x in ...
     for y in ...
       f[x,y] = 1 + x * x + y * y;
*/
auto loops = l.getLoopStmtsFor(tensor);
l.reorderAxis(tensor, loops[0], loops[1]);
/* LoopNest now looks like:
   for y in ...
     for x in ...
       f[x,y] = 1 + x * x + y * y;
*/
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36540
Differential Revision: D21068143
Pulled By: nickgg
fbshipit-source-id: f02c29004376df4f5a9bedff366c075772726618
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36729
`setenv` is not available on Windows.
Test Plan: CI green in ovrsource
Reviewed By: stepancheg
Differential Revision: D21067835
fbshipit-source-id: ddbc3285ef88f123dc6a200b661c48cfafc6bf00
Summary:
Unrolling support has been added in a way that generates well-performing code on GPUs. Not sure how long this link will last, but an example of a generated unrolled kernel is:
https://godbolt.org/z/i0uAv3
What can be seen there is multiple "ld.global.f32" instructions without "st.global.f32" instructions in between them (and vice versa). This means that we are issuing multiple loads that can run in parallel, as well as multiple stores that can run in parallel. This can be a crucial optimization for memory-bound kernels. This was generally a point of concern in TVM: an attempt at a similar kernel in TVM produces https://godbolt.org/z/Vu97vG, which surrounds load-store pairs in conditional branches, preventing the benefits of unrolling.
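As a plain C++ analogue of the unrolled pattern (illustrative only, not the generated kernel), grouping all loads before any store lets the memory operations overlap:
```
// Illustrative only: unrolling by 4 and hoisting the loads mirrors the
// back-to-back ld.global.f32 / st.global.f32 groups in the linked dump.
void scale_unrolled(const float* in, float* out, float a, int n) {
  for (int i = 0; i + 4 <= n; i += 4) {
    // Four independent loads, issued before any store.
    float r0 = in[i], r1 = in[i + 1], r2 = in[i + 2], r3 = in[i + 3];
    // Four independent stores.
    out[i] = a * r0;
    out[i + 1] = a * r1;
    out[i + 2] = a * r2;
    out[i + 3] = a * r3;
  }
}
```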
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36435
Reviewed By: ZolotukhinM
Differential Revision: D21024011
Pulled By: soumith
fbshipit-source-id: e852e282fa7a304aba962e1926f756098c011fe0
Summary:
Simplifies loops which can be collapsed down into a single block or removed entirely. E.g.
```
For 0..1 {
  Statements...
}
```
is now just `Block({Statements...})`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36348
Differential Revision: D21057959
Pulled By: nickgg
fbshipit-source-id: 2f95a19a965c4a6e023680e2cea9ea846e82d62e
Summary:
With https://github.com/pytorch/pytorch/pull/35562, we are running peephole optimization on inlining to reduce the number of nodes that are copied.
The tracer encodes the sizes in the graph like:
```
graph(%0 : Double(7)):
%1 : Function = prim::Constant[name="tensor_size"]()
%2 : Tensor = prim::CallFunction(%1, %0)
return (%2)
```
However, people would like to reuse the graph with different shapes, so running the size optimizations would invalidate that reuse. Long term it might be better for the tracer to not include shape information, but there are downstream users of it.
This separates out FuseAddMM from peephole so that there is now a single `disable_size_optimizations` parameter, and ONNX explicitly invokes FuseAddMM.
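A hedged sketch of the resulting call site; the exact signature is an assumption based on the parameter named above, not verbatim from the patch:
```
#include <torch/csrc/jit/passes/peephole.h>

// Assumed shape of the API after this PR (sketch only): size
// optimizations can be disabled so a traced graph stays reusable
// across input shapes, while ONNX export runs FuseAddMM separately.
void run_peephole(const std::shared_ptr<torch::jit::Graph>& graph) {
  torch::jit::PeepholeOptimize(graph, /*disable_size_optimizations=*/true);
}
```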
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36404
Differential Revision: D20968974
Pulled By: eellison
fbshipit-source-id: 56f8f1699e3b0adeeccdfd5a67bb975fd41a2913
Summary:
LLVM Codegen assumes that the kernel contains real statements, but that is not guaranteed, especially after IR Simplification. This PR adds a catch for the case where no value is generated after recursing the LLVMCodegen visitor through the kernel.
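A minimal sketch of such a catch (member names like `value_` and `context_` are hypothetical, not the verbatim patch):
```
// Hypothetical sketch: after the visitor has recursed through the
// kernel, fall back to a constant if no statement produced a value.
if (value_ == nullptr) {
  value_ = llvm::ConstantInt::get(llvm::Type::getInt32Ty(context_), 0);
}
```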
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36660
Differential Revision: D21044066
Pulled By: nickgg
fbshipit-source-id: e521c766286b1ff4e26befcec7ff4959db8181a4
Summary:
Second attempt at the reduction frontend for the TensorExpr compiler. It has two APIs: a simple version for common reduction types, and a customizable Reducer frontend which allows specifying the initializer, the reduction interaction, and the body (the latter two via lambdas).
Simple API looks like so:
```
Buffer b(BufHandle("b", {10}), kInt);
Tensor* c = Reduce("sum", {}, Sum(b), {{10, "m"}});
```
An example of specializing a Sum to do Matmul:
```
Buffer tA(BufHandle("tA", {M, K}), kFloat);
Buffer tB(BufHandle("tB", {K, N}), kFloat);
Sum matmul([&](ParameterList& v) {
  ExprHandle m = v[0];
  ExprHandle n = v[1];
  ExprHandle k = v[2];
  return tA(m, k) * tB(k, n);
});
Tensor* mm = Reduce("mm", {{M, "m"}, {N, "n"}}, matmul, {{K, "k"}});
```
A fully specialized Reduction:
```
VarHandle searchValue("searchValue", kInt);
Buffer b(BufHandle("b", {4, 10}), kInt);
Reducer anyEqSV(
    ExprHandle(0),
    [](ExprHandle a, ExprHandle b) {
      return CompareSelect::make(a, 1, 1, b, kEQ);
    },
    [&](ParameterList& v) {
      return CompareSelect::make(b.call(v), searchValue, kEQ);
    });
Tensor* any = Reduce("anyEqual", {{4, "i"}}, anyEqSV, {{10, "j"}});
```
---
Until lowering, Reductions are held in a compound form for easier optimization:
```
VarHandle m("m", kInt);
Buffer b(BufHandle("b", {2, 3, m}), kFloat);
Tensor* c = Reduce("sum", {{2, "l"}, {3, "n"}}, Sum(b), {{m, "m"}});
LoopNest loop({c});
std::cout << *loop.root_stmt() << "\n";
```
```
for (int l = 0; l < 2; l++) {
  for (int n = 0; n < 3; n++) {
    for (int m = 0; m < m_1; m++) {
      sum[l, n] = ReduceOp(sum[l, n] = float(0);, (sum[l, n]) + (b[l, n, m]), {m});
    }
  }
}
```
```
loop.prepareForCodegen();
std::cout << *loop.root_stmt() << "\n";
```
```
for (int l = 0; l < 2; l++) {
  for (int n = 0; n < 3; n++) {
    sum[(0 + l * (1 * 3)) + n * 1] = float(0);
    for (int m = 0; m < m_1; m++) {
      sum[(0 + l * (1 * 3)) + n * 1] = (sum[(0 + l * (1 * 3)) + n * 1]) + (b[((0 + l * ((1 * m_1) * 3)) + n * (1 * m_1)) + m * 1]);
    }
  }
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35866
Differential Revision: D20965577
Pulled By: nickgg
fbshipit-source-id: afe506c90db794447180056417013bcaf0e2c049
Summary:
Adds handling of constant branches to the TensorExpr IR Simplifier. This covers both IfThenElse and Cond when the condition expression is a known constant (e.g. `IfThenElse(1, X, Y) => X`), or when both arms of the branch are the same (e.g. `IfThenElse(Y, X, X) => X`).
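A minimal sketch of exercising this, with usage modeled on the tensorexpr tests (`IRSimplifier::simplify` and `IfThenElse::make`):
```
#include <torch/csrc/jit/tensorexpr/ir.h>
#include <torch/csrc/jit/tensorexpr/ir_simplifier.h>

using namespace torch::jit::tensorexpr;

VarHandle x("x", kInt), y("y", kInt);
// Constant condition: IfThenElse(1, x, y) simplifies to x.
ExprHandle byCond =
    IRSimplifier::simplify(IfThenElse::make(ExprHandle(1), x, y));
// Identical arms: IfThenElse(y, x, x) simplifies to x.
ExprHandle byArms = IRSimplifier::simplify(IfThenElse::make(y, x, x));
```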
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36257
Differential Revision: D20947777
Pulled By: nickgg
fbshipit-source-id: 974379e42a6d65ce3e7178622afb62d36ad4e380
Summary:
This PR completely refactors the code lowering process from our IR to CUDA. Before, we had one giant step that went from a relatively high-level IR straight to CUDA; now we first lower into concepts like ForLoop, IfThenElse, TensorIndex, and Allocate. This lowering will allow more complex code lowering like reductions and unrolling. Unrolling will quickly follow this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36199
Reviewed By: dzhulgakov
Differential Revision: D20925220
Pulled By: soumith
fbshipit-source-id: 8f621c694c68a1aad8653e625d7287fe2d8b35dc
Summary:
In the IR Simplifier we were not treating multiplication by zero specially, which meant some constant expressions were stored in forms that were not constant.
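A small hedged example of the kind of expression this affects (usage modeled on the tensorexpr tests):
```
#include <torch/csrc/jit/tensorexpr/ir_simplifier.h>

using namespace torch::jit::tensorexpr;

VarHandle x("x", kInt);
// Treating multiplication by zero specially lets the whole expression
// fold down to the constant 5 rather than keeping a non-constant form.
ExprHandle e = IRSimplifier::simplify(x * ExprHandle(0) + ExprHandle(5));
```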
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36287
Differential Revision: D20937497
Pulled By: nickgg
fbshipit-source-id: 528e430313ea048524d7a4a0256eef4a0297438b
Summary:
Add support for the TensorExpr IR Simplifier to factorize common terms on either side of a Div node. e.g. `(8 * x) / (4 * y) => (2 * x) / y`.
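A hedged sketch of the example above, with usage modeled on the tensorexpr tests:
```
#include <torch/csrc/jit/tensorexpr/ir_simplifier.h>

using namespace torch::jit::tensorexpr;

VarHandle x("x", kInt), y("y", kInt);
// Common factors on both sides of the Div are cancelled:
// (8 * x) / (4 * y) => (2 * x) / y.
ExprHandle e =
    IRSimplifier::simplify((ExprHandle(8) * x) / (ExprHandle(4) * y));
```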
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36154
Differential Revision: D20910580
Pulled By: nickgg
fbshipit-source-id: ee071d93bc4711b1e710be312de599d18ab506f3
Summary:
This supersedes https://github.com/pytorch/pytorch/pull/35698.
`abs` is a C-style function that takes only an integral argument.
`std::abs` is polymorphic and can be applied to both integral and floating-point types.
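A quick illustration of the difference (a sketch; which overloads the unqualified call resolves to depends on the headers in scope):
```
#include <cmath>   // std::abs floating-point overloads
#include <cstdlib> // C-style int abs(int)

double keepsFraction = std::abs(-2.5); // 2.5
// If only the C-style abs(int) is visible, a double argument is
// implicitly converted to int first, silently truncating the value:
// abs(-2.5) then yields 2 instead of 2.5.
```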
This PR also increases `kBatchSize` in `test_optimizer_xor` function in `test/cpp/api/optim.cpp` to fix `OptimTest.XORConvergence_LBFGS` failure under ASAN.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35974
Test Plan: CI
Reviewed By: pbelevich
Differential Revision: D20853570
Pulled By: yf225
fbshipit-source-id: 6135588df2426c5b974e4e097b416955d1907bd4
Summary:
Just run `./tools/clang_format.py --verbose` and `git commit --all`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35969
Test Plan: CI
Differential Revision: D20845626
Pulled By: malfet
fbshipit-source-id: 0ae9a91dfa33417a021e7e9d233baba4188daf81
Summary:
This enables the serialization part of this change (the deserialization part already landed in #33255).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35741
Pulled By: driazati
Differential Revision: D20758124
fbshipit-source-id: e2cdefa99c3bec991491e5e967e7f1661ca7ffd9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35800
This PR includes the following changes:
* Introduce a new `Expr` type `Buf`: it plays a role similar to `Var`, but also has dimensions.
* Use the new `Buf` class in `Store` and `Load` instead of `Var` for specifying where to store to or load from. `Buf` contains the dimensions info of the buffer we're loading from or storing to, and hence we are able to keep N-d indexes without flattening them into a 1-d index ([x,y] vs [x+y*W]).
* Flattening of the indexes is now a separate pass that is executed in `LoopNest::prepareForCodegen` (see the sketch after this list): backends still expect indexes to be flattened, and this PR preserves that.
* `Tensor` now contains a `Buf` instead of `Var`, and thus Tensor now has the dimensions info (previously it was a property of a `Function`, not a `Tensor`). This brings us closer to Tensor being a combination of Buffer + Function, where Buffer specifies iteration domain and the Function defines a computation.
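As a plain illustration of the index flattening mentioned above (not the pass itself), using the [x,y] vs [x+y*W] form from the list:
```
// Illustration only: a 2-D access [x, y] into a buffer whose x
// dimension has extent W flattens to the 1-D index x + y * W, which is
// what backends expect after LoopNest::prepareForCodegen.
int flat_index(int x, int y, int W) {
  return x + y * W;
}
```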
TODOs:
* Consider merging `Buffer` with `Buf` or `BufHandle`. It seems that we don't need all of them.
* Harden the logic of how we create buffers in the fuser pass. Currently it seems that sometimes we don't set dimensions.
* Use `Buf` in `Allocate` and `Free`.
* Make it clearer that `Function` doesn't "own" dimensions info and that dimensions are a property of a Tensor, not a Function.
Differential Revision: D20789005
Test Plan: Imported from OSS
Reviewed By: zheng-xq
Pulled By: ZolotukhinM
fbshipit-source-id: e04188d1d297f195f1c46669c614557d6bb6cde4
Summary:
**Summary:** This PR contains the infrastructure of a new CUDA fuser. This CUDA fuser is based on many of the same principles as TensorExpressions and Halide, but the implementation is built from the ground up. The fusion pass itself is similar to the default CUDA fuser; however, it has undergone some refactoring and is using the new code generation infrastructure. For those who are interested in how the code generation in this PR works, I would recommend reviewing _test/cpp/jit/test_gpu_fusion.cpp_ as well as the long comment section at the beginning of _torch/csrc/jit/codegen/cuda/transform_replay.h_.
One of the largest differences between our approach and that of TVM/Halide is the concept of "TensorView". At a high level, a TensorView should be thought of similarly to how we think of working with Tensors in PyTorch: it is an N-D object which can undergo transformations that change its dimensionality. Dimensionality changes are done through the operations split/merge/reorder/computeAt. These transformations are similar to split/fuse/reorder/compute_at in TVM; they modify how a tensor is iterated over to generate GPU code. Interestingly, in our scheme these transformations are applied to tensors and only impact how that tensor is generated.
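As a plain C++ illustration (not the fuser API) of what a split of the iteration domain does, conceptually:
```
// Before: a single loop over the iteration domain.
void before(float* out, const float* in, int N) {
  for (int i = 0; i < N; i++) out[i] = in[i] + 1.f;
}

// After a conceptual split(i, 4): an outer/inner loop pair over the
// same elements, with a guard for the tail when N is not a multiple
// of 4.
void after_split(float* out, const float* in, int N) {
  for (int io = 0; io < (N + 3) / 4; io++) {
    for (int ii = 0; ii < 4; ii++) {
      int i = io * 4 + ii;
      if (i < N) out[i] = in[i] + 1.f;
    }
  }
}
```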
**Warning:** This PR is purposefully not feature complete with the current fuser. We wanted to separate out the infrastructure from the fusion capabilities. Once in, smaller incremental PRs will be submitted to expand capabilities of the fuser.
**Short term goals:**
Parity with current CUDA fuser (including performance):
- Dynamic shapes (no recompilation)
- Implicit handling of broadcast (broadcasted tensors are treated as tensors of the broadcasted size in the generated code)
- Dropout
**Mid-term goals:**
- Transposes fused with pointwise operations where transpose involves only 2 axes (across the fused operation).
- 1-D reductions fused with pointwise operations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34785
Reviewed By: ZolotukhinM
Differential Revision: D20650977
Pulled By: soumith
fbshipit-source-id: ee39c95a880e1b9822e874ed4cc180971572bf63
Summary:
Adds capabilities to the TensorExpr IR Simplifier to simplify down Round + Mod patterns (e.g. `(x/y)*y + x%y => x`) via means of lifting integer rounding into a temporary `RoundOff` node.
This integrates with existing simplification mechanisms (folding, factorization, reordering, etc.) to allow simplification of compound expressions, e.g. `20 * (x / (16 / 2)) * 2 + (11 % 6) * (x % (7+1)) => 5 * x`.
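A hedged sketch of the basic pattern, with usage modeled on the tensorexpr tests:
```
#include <torch/csrc/jit/tensorexpr/ir_simplifier.h>

using namespace torch::jit::tensorexpr;

VarHandle x("x", kInt), y("y", kInt);
// The integer rounding in (x / y) * y is lifted into a RoundOff term,
// which then cancels against x % y: the whole expression folds to x.
ExprHandle e = IRSimplifier::simplify((x / y) * y + x % y);
```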
Tests: ran the tensorexpr C++ and Python tests, ran an HPC benchmark and verified that results and timing didn't regress.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35683
Differential Revision: D20811316
Pulled By: nickgg
fbshipit-source-id: 0cd6a517fb9548b3bc689768304b97375df5ac58
Summary: This diff fixes issues with the current handling of debug information passed along during execution of the model. (For example, it is possible for multiple calls to the debug guard to override each other.)
Test Plan: CI test/cpp/jit
Reviewed By: dzhulgakov
Differential Revision: D20602775
fbshipit-source-id: 4683957954028af81a1a0f1f12b243650230c9bb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34710
Extending RecordFunction API to support new recording scopes (such as TorchScript functions), as well as giving more flexibility to set sampling rate.
Test Plan: unit test (test_misc.cpp/testRecordFunction)
Reviewed By: gdankel, dzhulgakov
Differential Revision: D20158523
fbshipit-source-id: a9e0819d21cc06f4952d92d43246587c36137582
Summary:
https://github.com/pytorch/pytorch/pull/35127 was landed and reverted because I missed a test failure (oops). I have found and fixed the issue, which was due to zero terms being introduced after the point that filters them out (this usually requires NaN/INF, e.g. `x / INF => 0`).
See https://github.com/pytorch/pytorch/pull/35127 for more info.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35415
Reviewed By: ZolotukhinM
Differential Revision: D20702957
Pulled By: nickgg
fbshipit-source-id: 119eb41e9fa676bd78e3d1df99297a47ae312185
Summary:
Ignore mixed upper-case/lower-case style for now.
Fix violations of the space-between-a-function-and-its-arguments rule.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35574
Test Plan: CI
Differential Revision: D20712969
Pulled By: malfet
fbshipit-source-id: 0012d430aed916b4518599a0b535e82d15721f78
Summary:
1. Removed LossClosureOptimizer, and merged Optimizer into OptimizerBase (and renamed the merged class to Optimizer)
2. Merged the LBFGS-specific serialize test function and the generic test_serialize_optimizer function
3. Added a BC-compatibility serialization test for LBFGS
4. Removed mentions of `parameters_` in optimizer.cpp, de-virtualized all functions
5. Made `defaults_` an optional argument in all optimizers except SGD
**TODO**: add BC-breaking notes for this PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34957
Test Plan: Imported from GitHub, without a `Test Plan:` line.
Differential Revision: D20678162
Pulled By: yf225
fbshipit-source-id: 74e062e42d86dc118f0fbaddd794e438b2eaf35a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35115
This commit runs the newly added tools/clang_format.py on the JIT
codebase and includes all of the formatting changes thus produced.
Testing:
Ran the script, CI.
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D20568523
Pulled By: SplitInfinity
fbshipit-source-id: e09bdb982ccf090eecfb7c7b461b8d0681eef82b
Summary:
1. Removed LossClosureOptimizer, and merged Optimizer into OptimizerBase (and renamed the merged class to Optimizer)
2. Merged the LBFGS-specific serialize test function and the generic test_serialize_optimizer function
3. Added a BC-compatibility serialization test for LBFGS
4. Removed mentions of `parameters_` in optimizer.cpp, de-virtualized all functions
5. Made `defaults_` an optional argument in all optimizers except SGD
**TODO**: add BC-breaking notes for this PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34957
Differential Revision: D20645945
Pulled By: yf225
fbshipit-source-id: 383588065bf1859b38f0ad0a25d93d41e153c96e