Summary:
Slightly modified Adam, following the Python implementation, and the `ProducesPyTorchValues` tests pass. I had a problem with another test, though (see commit c1a6241676ab84fc531c1c3a10f964aa5704092e): it seems that optimizing for two steps with the same optimizer and optimizing for two steps using freshly initialized objects produce the same output.
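As a hedged illustration of what that test is getting at (a minimal sketch using the public C++ frontend optimizer API, not the test from this PR): Adam carries per-parameter state (moment estimates and a step count), so taking two steps with one optimizer and taking two steps with freshly constructed optimizers would normally not leave the parameter in the same place.
```
#include <torch/torch.h>

// Minimal sketch, not the actual test: two Adam steps on one parameter,
// either reusing the same optimizer or building a fresh one for the
// second step (which restarts from zeroed moment estimates).
torch::Tensor two_steps(bool reuse_optimizer) {
  torch::manual_seed(0);
  auto param = torch::randn({4}, torch::requires_grad());
  auto take_step = [&](torch::optim::Adam& opt) {
    opt.zero_grad();
    auto loss = (param * param).sum();
    loss.backward();
    opt.step();
  };
  torch::optim::Adam opt({param}, torch::optim::AdamOptions(0.1));
  take_step(opt);
  if (reuse_optimizer) {
    take_step(opt);
  } else {
    torch::optim::Adam fresh({param}, torch::optim::AdamOptions(0.1));
    take_step(fresh);
  }
  return param.detach().clone();
}
```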
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40009
Differential Revision: D22096053
Pulled By: glaringlee
fbshipit-source-id: a31a8f5488cb37c53752ddf15436efabdba67dc4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39950
Per the comment in the code, constValue() should only be used
when the future has completed and the value is not an error.
Add an assert to enforce this.
Also, add a hasValue() accessor for completeness.
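As a hedged sketch of the contract being asserted (usage only, not the code from this change):
```
#include <ATen/core/ivalue.h>

// constValue() is only legal once the future has completed without error;
// hasValue() reports exactly that "completed, no error" state.
void example() {
  auto fut = c10::make_intrusive<c10::ivalue::Future>(c10::IntType::get());
  fut->markCompleted(c10::IValue(42));
  if (fut->completed() && !fut->hasError()) {
    const c10::IValue& v = fut->constValue();  // asserts in any other state
    TORCH_CHECK(v.toInt() == 42);
  }
}
```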
ghstack-source-id: 105815597
Test Plan: buck test mode/dev-nosan caffe2/test/cpp/jit:
Differential Revision: D22021776
fbshipit-source-id: b59b6c775eab344068a76f4cd8c3a9dc1f2a174e
Summary:
- Adds a `torch.experimental.deterministic` flag to enforce deterministic algorithms across all of PyTorch.
- Adds `torch.experimental.deterministic_error_level` to let users choose between error/warning/silent when determinism for an operation is not available.
- Adds `torch.experimental.alert_not_deterministic()`, which should be called within operations that are not deterministic.
- Offers both Python and ATen interfaces.
Issue https://github.com/pytorch/pytorch/issues/15359
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38683
Differential Revision: D21998093
Pulled By: ezyang
fbshipit-source-id: 23aabbddd20f6199d846f97764ff24d728163737
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39867
Support a list of filters in the subgraph rewriter; the rewrite executes only
when the match passes all filter checks. This is useful for different matches
to share the same filter.
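A rough usage sketch under stated assumptions (the filter callback is assumed to receive the match and the value map; the filter bodies and pattern strings are placeholders):
```
#include <torch/csrc/jit/passes/subgraph_rewrite.h>

// Placeholder filters: each returns true to accept a match, false to veto it.
bool filter_a(
    const torch::jit::Match& match,
    const std::unordered_map<std::string, torch::jit::Value*>& vmap) {
  return true;  // e.g. check that a matched value is a constant
}

bool filter_b(
    const torch::jit::Match& match,
    const std::unordered_map<std::string, torch::jit::Value*>& vmap) {
  return true;  // e.g. check that a matched value has a single use
}

void run_rewrite(std::shared_ptr<torch::jit::Graph>& graph,
                 const std::string& pattern,
                 const std::string& replacement) {
  torch::jit::SubgraphRewriter rewriter;
  rewriter.RegisterRewritePattern(pattern, replacement);
  // The rewrite fires only when a match passes *all* filters; the same
  // filters can be shared across different registered patterns.
  rewriter.runOnGraph(graph, {filter_a, filter_b});
}
```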
Test Plan: Imported from OSS
Differential Revision: D22009855
fbshipit-source-id: 67aab8d6326b2011a9061397699dc62ee9ad4e2d
Summary:
We've got quite a few things going on, preparing a push back to upstream so we don't get too desynced.
- Major refactor of transform replay. It is now far more robust and fixes bugs discovered in reductions. This prepares for extension to explicit broadcast ops, which will be the last major memory pattern needed for op coverage. Broadcast ops will allow us to express up to and potentially beyond norms and gemms.
- Initial runtime expression evaluator. This allows us to evaluate expressions at runtime. It will be useful for determining our grid/block layout at runtime, so we don't have to manually compute it according to the code we're trying to generate.
- Moving to int64 and double for scalar representations to match PyTorch JIT.
- Improvements in the codegen interface, where we return a Tensor-like object instead of the parent class Val.
- Add `addcmul` and `lerp` ops.
- General updates, fixes, test additions, and test improvements.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39579
Differential Revision: D21974001
Pulled By: soumith
fbshipit-source-id: 7f7ccc91593466e948f3ce90f8f9b7fbc5c28de2
Summary:
Fix another simplification edge case: a Cond statement where one branch is nullptr and the other is a zero-statement block. This happens mostly with an if that has no else branch, where all statements inside the if have been removed (e.g. via inlining or simplification). A common case is SplitWithMask -> ComputeInline.
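A hedged sketch of the pattern, with tensorexpr names recalled from memory (treat the exact signatures as assumptions):
```
#include <torch/csrc/jit/tensorexpr/ir_simplifier.h>

using namespace torch::jit::tensorexpr;

// An if with no else branch whose body has become an empty block
// (e.g. after inlining) should simplify away entirely.
void example() {
  VarHandle x("x", kInt);
  Stmt* empty_body = Block::make({});
  Stmt* cond = Cond::make(x > 0, empty_body, nullptr);
  Stmt* simplified = IRSimplifier::simplify(cond);
  // After this fix, `simplified` no longer contains the Cond.
  (void)simplified;
}
```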
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39754
Differential Revision: D21962987
Pulled By: nickgg
fbshipit-source-id: 2461415466fbbab88d2329061f90fcfdfa85e243
Summary:
Clearly expressing that a type is inferred by PyTorch rather than explicitly annotated by the user makes many error messages more user-friendly.
Currently Type has two string conversion methods: str() for IR printing and python_str() for serialization and error message generation. If we want to include more information in type printing while maintaining serialization/deserialization correctness, we need to split python_str() into annotation_str() and repr_str().
annotation_str() is solely responsible for serialization and strictly matches the format of Python type annotations. repr_str() is responsible for generating human-readable error messages that include information like "this type is inferred, not explicitly annotated".
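A hedged example of the intended split (not taken from the diff):
```
#include <ATen/core/jit_type.h>

// annotation_str() is the strict Python-annotation form used for
// serialization; repr_str() is free to add human-oriented context
// (e.g. that a type was inferred) for error messages.
void example() {
  auto t = c10::ListType::ofTensors();
  std::string for_serialization = t->annotation_str();  // "List[Tensor]"
  std::string for_errors = t->repr_str();               // may carry extra info
}
```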
Closes https://github.com/pytorch/pytorch/issues/39449
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39544
Differential Revision: D21978759
Pulled By: gmagogsfm
fbshipit-source-id: 733566f5a62e748b5ca4bb3c5943ebb6d5b664d0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39497
Previously, we didn't consider side effects at all when moving nodes in alias analysis. It is never valid to reorder a node with a side effect. This has led to bugs when used with Bailouts.
Unfortunately this might cause regressions, but the prior behavior wasn't correct :/
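For illustration, a hedged sketch of the behavior this enforces (AliasDb names recalled from memory):
```
#include <torch/csrc/jit/ir/alias_analysis.h>

// A node with a side effect (e.g. prim::Print) must never be reordered,
// so topological-move queries should refuse to move it.
bool can_move(
    torch::jit::Node* side_effecting,
    torch::jit::Node* move_point,
    std::shared_ptr<torch::jit::Graph> graph) {
  torch::jit::AliasDb db(std::move(graph));
  // After this change this returns false whenever the node being moved
  // (or a node it would cross) has a side effect.
  return db.moveAfterTopologicallyValid(side_effecting, move_point);
}
```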
Test Plan: Imported from OSS
Differential Revision: D21963774
Pulled By: eellison
fbshipit-source-id: 656995d1b82534eca65437ed4e397b2bf08a4dec
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39597
To complement collectAll(), this change adds collectAny() and adds
relevant unit test coverage.
We also remove the vector-based helper version of collectAll(), added
in a previous change, which was of debatable usefulness.
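A hedged usage sketch (the exact way to construct the futures list and the function signatures are assumptions recalled from memory):
```
#include <ATen/core/ivalue.h>

// collectAll() yields a future that completes once every input future is
// complete; collectAny() completes as soon as the first one does.
void example(
    c10::intrusive_ptr<c10::ivalue::Future> f1,
    c10::intrusive_ptr<c10::ivalue::Future> f2) {
  c10::List<c10::intrusive_ptr<c10::ivalue::Future>> futs;
  futs.push_back(f1);
  futs.push_back(f2);
  auto all = c10::collectAll(futs);  // done when f1 and f2 are both done
  auto any = c10::collectAny(futs);  // done when either f1 or f2 is done
}
```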
ghstack-source-id: 105527180
Test Plan: buck test mode/dev-nosan caffe2/test/cpp/jit/...
Differential Revision: D21910311
fbshipit-source-id: dbb3ca404672a3d751b1b3cf016e6084a9ff8040
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39119
Add some base C++ unit test coverage for ivalue::Future, and in
the process, add a basic collectAll() primitive, per 38937.
While doing so, I realized that List<Future> is effectively
impossible to construct (since the Future's type is not templated
but rather passed in, getTypePtr_<T>::call() isn't defined),
so I added a workaround in List to make it possible.
ghstack-source-id: 105309650
Test Plan: buck test mode/dev-nosan caffe2/test/cpp/jit/...
Differential Revision: D21756884
fbshipit-source-id: 5d40c8d1c55098de5497655c7b887f4f56508a37
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39607
Add an overload name for the strcmp macro to prevent duplicated op names in the lite interpreter.
Also reformatted some other files.
Test Plan:
Verified that these op schemas are changed:
```
-aten::eq(str a, str b) -> (bool)
+aten::eq.str(str a, str b) -> (bool)
-aten::ne(str a, str b) -> (bool)
+aten::ne.str(str a, str b) -> (bool)
-aten::lt(str a, str b) -> (bool)
+aten::lt.str(str a, str b) -> (bool)
-aten::gt(str a, str b) -> (bool)
+aten::gt.str(str a, str b) -> (bool)
-aten::le(str a, str b) -> (bool)
+aten::le.str(str a, str b) -> (bool)
-aten::ge(str a, str b) -> (bool)
+aten::ge.str(str a, str b) -> (bool)
```
Reviewed By: iseeyuan
Differential Revision: D21913049
fbshipit-source-id: 518db068c8c5b0efd19223f0bd94fc3351335dc4
Summary:
Mainly, fix a bug in the HashProvider where it would not include LoopOptions in the hash, meaning two loops would be seen as identical even if they were bound to different thread/block axes. Also added symbolic names for the different axis options.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39408
Differential Revision: D21864494
Pulled By: nickgg
fbshipit-source-id: 9c28729984e7a3375e026c78294c9f75b9015123
Summary:
The two bugs were:
* Non-reduction axes were not added when inserting the new ReduceOp, meaning that if a reduction with non-reduce axes was rfactored we'd produce bad outputs. There were no tests of Rfactor with non-reduce axes, so I modified a test to cover this.
* The new statements were always prepended to the block, meaning writes to a buffer could be reordered after the usage of that buffer. This mostly happened in the case where we rfactor a previously rfactored reduction. There was a test of this, but since it only tested rfactoring the outer reduction axis, there were never any other statements at the insertion point (the tests of the insertion-point argument also do this). I added a new test which covers various rfactor-axis cases.
Also cleaned up tests, removed some helper code we don't need, etc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39268
Differential Revision: D21864489
Pulled By: nickgg
fbshipit-source-id: d314d20997a8472ec96b72f7a9068d6da6d2399c
Summary:
If the size of a temporary buffer is reduced to zero via binding of a dynamic variable, we still run the alloc, even though it is a no-op. It's easy to strip these out during simplification, so the expr:
```
{
  Allocate(x, int, {0});
  // Stuff...
  Free(x);
}
```
becomes
```
{
  // Stuff...
}
```
I am assuming here that if the allocation size is zero then any usage of the buffer is also eliminated, since there's no safe way to refer to a zero-size buffer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38794
Differential Revision: D21723656
Pulled By: nickgg
fbshipit-source-id: 3eaa8bd8974a13b0a351be04abe2348498b31b02
Summary:
Fixes a bug in reorder axis where we appended the new reordered loops to the enclosing block even if there were statements after them, e.g. with 3 Computes:
```
for (int m1 ...
  for (int n1 ...
    for (int k1 ...
      Body 1
for (int m2 ...
  for (int n2 ...
    for (int k2 ...
      Body 2
for (int m3 ...
  for (int n3 ...
    for (int k3 ...
      Body 3
```
If we reordered loops m2 and k2, we would also end up reordering the bodies relative to each other, like this:
```
for (int m1 ...
  for (int n1 ...
    for (int k1 ...
      Body 1
for (int m3 ...
  for (int n3 ...
    for (int k3 ...
      Body 3
for (int k2 ...
  for (int n2 ...
    for (int m2 ...
      Body 2
```
This is because we always append the new loops to their parent. This PR fixes the logic to replace the old loop root with the new loop, which keeps things consistent.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38841
Differential Revision: D21723670
Pulled By: nickgg
fbshipit-source-id: 1dee8bb153182fcaa2cabd948197577e8e80acd7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39265
In this PR we set the id of RecordFunction only when callbacks need it and when
there's at least one active callback.
Test Plan:
testRecordFunction unit test in test_misc.cpp
buck test mode/dev caffe2/test/cpp/jit:jit
https://our.intern.facebook.com/intern/testinfra/testrun/8725724291116413
Reviewed By: dzhulgakov
Differential Revision: D21790421
fbshipit-source-id: 016623d7f1a2a271921a71c0483061e232b40321
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39010
The initial version of the serialization for the TensorPipe RPC agent (i.e., the conversion from rpc::Message to tensorpipe::Message) worked around a limitation of TensorPipe of only allowing one payload per message by pickling each tensor separately and storing the pickles as metadata (which is a less efficient way of sending data over, as it goes through more copies). Having now lifted that limitation, we can improve the way we serialize. We now put the type and the id as their own payloads, we do a single pickling pass for all the tensors of the message (which allows us to deduplicate them), and we store the pickle as a payload. My impression is that pickling is a somewhat costly operation, so reducing the number of times we do it should be beneficial for performance. For this same reason, another change I've done here is to separate the allocation of the buffers from the deserialization. This will allow us (in the future) to perform the allocation on the I/O event loop but perform the unpickling in the worker thread, thus keeping the event loop more responsive.
ghstack-source-id: 104810740
Test Plan: RPC tests
Differential Revision: D21716067
fbshipit-source-id: c1475cc78afdcf0820a485ffd98c91abb35796c7
Summary:
This PR fixes https://github.com/pytorch/pytorch/issues/39020 by requiring users to type-hint default arguments of a TorchScript function when using the C++ frontend (the Python frontend inserts those automatically).
Since this is a bit of a niche use case, I opted for the simpler solution of making type-hints mandatory for default arguments, as opposed to trying to type-infer them. I left a comment in the code justifying this choice.
Test is included.
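A hedged sketch of what that looks like from the user side (API spelling assumed from the C++ frontend docs):
```
#include <torch/torch.h>

// When compiling TorchScript source from C++, the default argument must
// carry an explicit type hint; the Python frontend would infer `float` here.
void example() {
  auto cu = torch::jit::compile(R"JIT(
def scale(x, alpha: float = 2.0):
    return x * alpha
)JIT");
  auto out = cu->run_method("scale", torch::ones({2, 2}));
}
```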
/cc t-vi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39021
Differential Revision: D21755317
Pulled By: suo
fbshipit-source-id: e007650d3bfb3a4c58c25ad2c3a17759898f303b
Summary:
In `LoopNest::rfactor` we assume that there is only a single reduction below the insertion point, and when replacing the reduction we recursively replace all reductions below that point. This is not a safe assumption, as a number of transformations can introduce additional ReduceOps - most directly a `splitWithTail` on the innermost reduce axis.
This PR fixes that bug, and adds some unit tests covering the case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38733
Differential Revision: D21723634
Pulled By: nickgg
fbshipit-source-id: 3ed6ffcdc2c15aef7504f9b2b91e8d827e0b5d88
Summary:
We do try to eliminate empty For loops, but missed a case where the body Block exists but is empty. In that case we can eliminate the loop as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38883
Differential Revision: D21723680
Pulled By: nickgg
fbshipit-source-id: 49610b0524af5b9ec30ef3b4cc0c8461838259c3
Summary:
Adds reduction support for the code generator. Reductions are fully supported with split/merge/reorder/rfactor/computeAt/unroll operators. There is also cross thread (intra-block) reduction support.
The two remaining pieces missing for reduction support are:
- Safety: if cross-thread reduction was used, child operators shouldn't be able to bind that thread dim anymore.
- Cross-block reduction: we will want inter-block reduction support to match parity with TensorIterator.
The PR also provides FP16 support for fusions: we insert casts from FP16 inputs to FP32, and casts back to FP16 on FP16 outputs.
Also working towards reductions and shape inference for reductions in the fusion pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38627
Reviewed By: albanD
Differential Revision: D21663196
Pulled By: soumith
fbshipit-source-id: 3ff2df563f86c39cd5821ab9c1148149e5172a9e
Summary:
This PR removes the deferred initializer field from ReduceOp in favour of eagerly initializing buffers when they are created (either in the constructor of `LoopNest`, or in `rfactor()`). This allows a pretty good simplification of reduction logic, removing almost all of the reduction expander and the ReduceInitCleaner & unpopular NoOp node added in the last fix.
Eager initialization is better for us anyway because it allows more opportunities to transform the initialization loop.
Added a few more tests; testReduceOverSplitWithTail failed before this change due to a bug in splitWithTail which can no longer happen.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38585
Differential Revision: D21621551
Pulled By: nickgg
fbshipit-source-id: 378137e5723b4a6d6e390239efb12adce22a8215
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38592
I'm not sure that using couldMoveAfter was incorrect, but using
couldMoveBefore is more consistent with other subgraph-extraction
passes (old fuser, create autodiff graphs, etc.), so it would make it
easier to unify their implementations after this change.
Test Plan: Imported from OSS
Reviewed By: suo
Differential Revision: D21607856
Pulled By: ZolotukhinM
fbshipit-source-id: 970583af7859889d48aacf620ae028258e37a75f
Summary:
Fixes a bug in the following code:
```
Tensor* c = Reduce("sum", {{10, "m"}}, Sum(), b, {{10, "n"}, {10, "k"}});
// split N loop with tail:
loop.splitWithTail(loop.getLoopStmtsFor(c)[1], 8, &outer, &inner, &tail);
```
When this is expanded there are two ReduceOps:
```
for (int m = 0; m < 10; m++) {
  for (int n_outer = 0; n_outer < (10 - 0) / 8; n_outer++) {
    for (int n_inner = 0; n_inner < 8; n_inner++) {
      for (int k = 0; k < 10; k++) {
        sum[m] = ReduceOp(sum, float(0), (sum[m]) + (b[m, n_outer * 8 + n_inner, k]), out_args={m}, reduce_args={n_inner, n_outer, k});
      }
    }
  }
  for (int n_tail = 0; n_tail < (10 - 0) % 8; n_tail++) {
    for (int k = 0; k < 10; k++) {
      sum[m] = ReduceOp(sum, float(0), (sum[m]) + (b[m, n_tail + ((10 - 0) / 8) * 8, k]), out_args={m}, reduce_args={n_tail, k});
    }
  }
}
```
But each ReduceOp will expand its initializer, which in this case will overwrite the sum of the split loop:
```
for (int m = 0; m < 10; m++) {
  sum[m] = 0.f;
  for (int n_inner = 0; n_inner < 8; n_inner++) {
    for (int k = 0; k < 10; k++) {
      sum[m] = (sum[m]) + (b[(100 * m + k) + 10 * n_inner]);
    }
  }
  sum[m] = 0.f; <------- *HERE*
  for (int n_tail = 0; n_tail < 2; n_tail++) {
    for (int k = 0; k < 10; k++) {
      sum[m] = (sum[m]) + (b[((100 * m + k) + 10 * n_tail) + 80]);
    }
  }
}
```
The simplest fix is to remove the initializer from the tail loop, which requires adding support for Reductions without an initializer (I did this via adding a NoOp Expr rather than handling nullptr). Also moved the ReductionExpander from loopnest.cpp to reduction.h, as loopnest is getting a bit heavy.
Added tests for all kinds of splits on a simple 3D reduction to verify no more problems of this type.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38420
Differential Revision: D21587583
Pulled By: nickgg
fbshipit-source-id: e0766934481917007119612eb60cc76c3242e44a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37474
Previously we would segfault
Test Plan: Imported from OSS
Differential Revision: D21297542
Pulled By: suo
fbshipit-source-id: c7e2f828a250c490ec23fb51c6a4a642d3370e52
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37948
The input JIT graph has all the information we need to perform the
entire compilation at construction time. We don't need to postpone
any steps until execution time. Also, from the graph we always know
what device we will be executing on, and thus we don't need to have a
CodeGen cache in TensorExprKernel - we always have one and only one
CodeGen.
Test Plan: Imported from OSS
Reviewed By: protonu
Differential Revision: D21432145
Pulled By: ZolotukhinM
fbshipit-source-id: 8dc86b891713056b2c62f30170cd4a168912f027
Summary:
Implementation of the less popular proposal for eliminating overlap between LetStmt and Let: removing both and storing a mapping between Var and value Expr in the Block.
This complicates some tests but simplifies the IR by restricting where variable binding can occur.
I used the unit tests & Python integration tests to verify this is correct, but I'm unsure of coverage, particularly around the dependency checker in loopnest - ZolotukhinM, your review would be useful there.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37606
Differential Revision: D21467483
Pulled By: nickgg
fbshipit-source-id: b402d3fce4cacf35d75f300f0a7dca32a43b6688
Summary:
In the IR Simplifier, when doing partial factorization of Round+Mod patterns we divide by the lower number, which could be zero. Add in a quick check against zero to avoid the crash.
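A hedged illustration of the degenerate input (tensorexpr spelling assumed):
```
#include <torch/csrc/jit/tensorexpr/ir_simplifier.h>

using namespace torch::jit::tensorexpr;

// A Round+Mod pattern whose constant is zero: factorization divides by the
// smaller constant, so without the check this could divide by zero.
void example() {
  VarHandle x("x", kInt);
  ExprHandle e = (x / 0) * 0 + x % 0;
  ExprHandle simplified = IRSimplifier::simplify(e);  // must not crash
  (void)simplified;
}
```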
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38055
Differential Revision: D21478486
Pulled By: nickgg
fbshipit-source-id: c5083f672e91662b7d1271d817cade7fa6c39967
Summary:
The IR Simplifier exits early when working with dtypes that are not safe to reorder. There are some cases where we still want to simplify ops in these dtypes: x + 0, x - 0, x * 0, and x * 1. It's safe to eliminate the op here, and it reduces clutter in the expr.
Also added a quick simplification of casts which do nothing (their target type is the same as the underlying type).
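A hedged sketch of the newly allowed simplifications (tensorexpr spelling assumed):
```
#include <torch/csrc/jit/tensorexpr/ir_simplifier.h>

using namespace torch::jit::tensorexpr;

// Float is not safe to reorder, but eliminating these identity ops is still
// safe and removes clutter from the expression.
void example() {
  VarHandle x("x", kFloat);
  ExprHandle e = (x + 0.f) * 1.f;
  ExprHandle simplified = IRSimplifier::simplify(e);  // reduces to x
  // A cast whose target dtype equals the source dtype is likewise dropped.
  ExprHandle same_type_cast = Cast::make(kFloat, x);
  ExprHandle simplified_cast = IRSimplifier::simplify(same_type_cast);  // x
  (void)simplified;
  (void)simplified_cast;
}
```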
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37960
Differential Revision: D21457736
Pulled By: nickgg
fbshipit-source-id: 40e20a3b55fc1afb2ec50071812238a08bded2ac
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36291
Move profiler state to be a thread-local property and
reuse the existing thread-local propagation mechanism to ensure
correct profiling of async tasks. This also makes
push/pop callbacks thread safe and easier to use in e.g. the
distributed profiler.
Test Plan:
USE_BLAS=MKL USE_MKLDNN=0 USE_CUDA=0 python setup.py develop install
./build/bin/test_jit
python test/test_autograd.py
python test/test_jit.py
Differential Revision: D20938501
Pulled By: ilia-cher
fbshipit-source-id: c0c6c3eddcfea8fc7c14229534b7246a0ad25845
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37745
This PR makes it possible to set TLS callbacks and use
them transparently not only in the main thread but also
in any async tasks
Test Plan: Imported from OSS
Differential Revision: D21374873
Pulled By: ilia-cher
fbshipit-source-id: 3be2e121673b32d7694e17e794f3b474826dffe9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37548
Moving RecordFunction from torch::autograd::profiler into at namespace
Test Plan:
CI
Imported from OSS
Differential Revision: D21315852
fbshipit-source-id: 4a4dbabf116c162f9aef0da8606590ec3f3847aa