Summary:
For CUDA >= 10.2, the `CUBLAS_WORKSPACE_CONFIG` environment variable must be set to either `:4096:8` or `:16:8` to ensure deterministic CUDA stream usage. This PR adds some logic inside `torch.set_deterministic()` to raise an error if this environment variable is not set properly and CUDA >= 10.2.
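A minimal C++-side sketch of the same setup; the Context setter name used here is an assumption for this era of the codebase (the Python entry point is `torch.set_deterministic()` as described above):
```cpp
// Sketch only: set the workspace config before enabling deterministic mode.
#include <cstdlib>
#include <ATen/Context.h>

int main() {
  // Must be ":4096:8" or ":16:8" for deterministic cuBLAS on CUDA >= 10.2.
  setenv("CUBLAS_WORKSPACE_CONFIG", ":4096:8", /*overwrite=*/1);
  at::globalContext().setDeterministic(true);
  return 0;
}
```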
Issue https://github.com/pytorch/pytorch/issues/15359
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41377
Reviewed By: malfet
Differential Revision: D22758459
Pulled By: ezyang
fbshipit-source-id: 4b96f1e9abf85d94ba79140fd927bbd0c05c4522
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42494
Note that we're currently assuming that the dtypes of all the arguments and
the return value are the same.
Test Plan: Imported from OSS
Reviewed By: nickgg
Differential Revision: D22910755
Pulled By: ZolotukhinM
fbshipit-source-id: 7f899692065428fbf2ad05d22b4ca39cab788ae5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42137
This PR implements an SGD optimizer class similar to torch::optim::SGD, but it doesn't inherit from torch::optim::Optimizer, for use on mobile devices (or other lightweight use cases).
Adding Martin's comment for visibility: "SGD may be the only optimizer used in near future. If more client optimizers are needed, refactoring the full optim codes and reusing the existing code would be an option."
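For context, an illustrative sketch of the vanilla SGD update such a lightweight optimizer performs (plain step, no momentum or weight decay); the actual mobile class and its method names are not shown here:
```cpp
#include <torch/torch.h>
#include <vector>

void sgd_step(std::vector<torch::Tensor>& params, double lr) {
  torch::NoGradGuard no_grad;  // parameter updates must not be tracked by autograd
  for (auto& p : params) {
    if (p.grad().defined()) {
      p.add_(p.grad(), /*alpha=*/-lr);  // p <- p - lr * grad
    }
  }
}
```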
Test Plan: Imported from OSS
Reviewed By: iseeyuan
Differential Revision: D22846514
Pulled By: ann-ss
fbshipit-source-id: f5f46804aa021e7ada7c0cd3f16e24404d10c7eb
Summary:
A heavy refactor of bounds inference to fix some issues and bugs blocking using it to analyze cross thread interactions:
* We were merging all accesses to a Buf into a single bounds info entry, even if they did not overlap. E.g. if we accessed a[0:2] and a[5:6] we would merge that into a bound of a[0:6]. I've changed this behaviour to merge only overlapping bounds.
* We were not separating bounds of different kinds (e.g. Load vs Store) and would merge Store bounds into Load bounds, losing the information about what kind of access it was. E.g. this loop previously produced bounds [{Load, 0, 10}] and now produces [{Load, 0, 9}, {Store, 1, 10}]:
```
for i in 1 to 10...
x[i] = x[i-1]
```
* Both ComputeAt and Rfactor relied on the overzealous merging and only used a single entry in the bounds list to determine the bounds of temporary buffers they created, which could result in temporary buffers allocated smaller than accesses to them. I've fixed Rfactor, but *not* ComputeAt - however all ComputeAt tests still pass (may require loop fusion to trigger this issue) - I will come back to it.
Being more precise about bounds is more complex: rather than taking the minimum of starts and the maximum of stops, we now need to determine whether two bounds overlap or are adjacent. There are many edge cases, so I've added a bunch of test coverage of the merging method.
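A minimal sketch of the overlap-or-adjacent test this comes down to, over closed integer bounds [start, stop]; the types and names are illustrative, not the actual bounds-inference structures:
```cpp
#include <algorithm>
#include <optional>

struct Bound {
  int start;
  int stop;
};

// Merge only when the ranges overlap or touch; otherwise keep them separate,
// e.g. a[0:2] and a[5:6] stay distinct entries instead of becoming a[0:6].
std::optional<Bound> tryMerge(const Bound& a, const Bound& b) {
  bool overlapOrAdjacent = a.start <= b.stop + 1 && b.start <= a.stop + 1;
  if (!overlapOrAdjacent) {
    return std::nullopt;
  }
  return Bound{std::min(a.start, b.start), std::max(a.stop, b.stop)};
}
```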
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42185
Reviewed By: mruberry
Differential Revision: D22870391
Pulled By: nickgg
fbshipit-source-id: 3ee34fcbf0740a47259defeb44cba783b54d0baa
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41451
Since TE operates on a limited subset of ops with well-defined
semantics, we can easily infer shapes of intermediate and output tensors
given the shapes of the inputs.
There are a couple of ops that are not yet supported in shape
inference; once we add them we can relax the shape-info requirements
in the TE fuser: currently it requires all values in the fusion group to
have known shapes, and we can change it to require shapes only for the inputs.
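As a standalone illustration of the idea for elementwise ops, the output shape follows from the input shapes via ordinary broadcasting; this is not the fuser's actual shape-propagation code:
```cpp
#include <algorithm>
#include <cstdint>
#include <stdexcept>
#include <vector>

std::vector<int64_t> broadcastShapes(std::vector<int64_t> a, std::vector<int64_t> b) {
  if (a.size() < b.size()) {
    std::swap(a, b);  // make `a` the longer shape
  }
  std::vector<int64_t> out = a;
  for (size_t i = 0; i < b.size(); ++i) {
    int64_t x = a[a.size() - 1 - i];
    int64_t y = b[b.size() - 1 - i];
    if (x != y && x != 1 && y != 1) {
      throw std::runtime_error("shapes are not broadcastable");
    }
    out[out.size() - 1 - i] = std::max(x, y);
  }
  return out;
}
```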
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D22543470
Pulled By: ZolotukhinM
fbshipit-source-id: 256bae921028cb6ec3af91977f12bb870c385f40
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42121
This PR changes the Module API to allow registering a module with a module
interface type, and therefore allows Module::clone to work in the case
where a module interface type is shared by two submodules.
The interface type will be shared by the new cloned instance in the same
compilation unit because it only contains a list of FunctionSchema and,
unlike a ClassType, does not involve any attributes.
fixes https://github.com/pytorch/pytorch/issues/41882
Test Plan: Imported from OSS
Reviewed By: suo
Differential Revision: D22781205
Pulled By: wanchaol
fbshipit-source-id: f97f4b75970f0b434e38b5a1f778eda2c4e5109b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42225
Main changes:
- Consolidated CMake files to have a single entry point, rather than having a specialized one for PyTorch.
- Changed the way the preprocessor flags are provided, and changed their name.
There were a few instances in PyTorch's CMake files where we were directly adding TensorPipe's source directory as an include path, which does not contain the auto-generated header we now added. We fix that by adding the `tensorpipe` CMake target as a dependency, so that the include paths defined by TensorPipe (which do contain that auto-generated header) are used.
I'm turning off SHM and CMA for now because they have never been covered by the CI. I'll enable them in a separate PR so that if they turn out to be flaky we can revert that change without reverting this one.
Test Plan: CircleCI is all green.
Reviewed By: beauby
Differential Revision: D22812445
fbshipit-source-id: e6d824bb28f5afe75fd765de0430968174f3531f
Summary: The function `cross_kernel_scalar` in `ATen/native/cpu/CrossKernel.cpp` is not covered; add tests to cover it.
Test Plan:
1. Test locally to check new lines are covered
2. CI
https://pxl.cl/1fZjG
Reviewed By: malfet
Differential Revision: D22834122
fbshipit-source-id: 0d50f3a3e6aee52cb6fdee2b9f5883f542c7b6e2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42266
The functions `lerp_kernel_scalar` and `lerp_kernel_tensor` in `ATen/native/cpu/LerpKernel.cpp` are not covered; add tests to cover them.
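A sketch of the kind of coverage being added, exercising both the scalar-weight and tensor-weight paths on CPU and checking against the closed form start + weight * (end - start); illustrative only, not the actual test code:
```cpp
#include <torch/torch.h>
#include <cassert>

int main() {
  auto start = torch::randn({64});
  auto end = torch::randn({64});
  auto weight = torch::rand({64});

  auto out_scalar = torch::lerp(start, end, 0.25);    // lerp_kernel_scalar path
  auto out_tensor = torch::lerp(start, end, weight);  // lerp_kernel_tensor path

  assert(torch::allclose(out_scalar, start + 0.25 * (end - start)));
  assert(torch::allclose(out_tensor, start + weight * (end - start)));
  return 0;
}
```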
Test Plan:
1. Test locally to check new lines are covered
2. CI
https://pxl.cl/1fXPd
Reviewed By: malfet
Differential Revision: D22832164
fbshipit-source-id: b1eaabbf8bfa08b4dedc1a468abfdfb619a50e3c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42045
This PR changes the save_data() member function of torch::jit::mobile::Module, which was introduced in #41403, into the non-member function torch::jit::mobile::_save_parameters() (taking a mobile Module as its first argument).
In addition, this PR:
* adds a getter function _ivalue() for the mobile::Module object
* renames torch::jit::mobile::_load_mobile_data() to torch::jit::mobile::_load_parameters()
* refactors the import.h header file into import.h and import_data.h
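A hedged sketch of the resulting flow; the header locations, exact signatures, and return types here are assumptions based on the description above, not verified against the final headers:
```cpp
#include <string>
#include <torch/csrc/jit/mobile/import_data.h>
#include <torch/csrc/jit/mobile/module.h>

void roundtrip(torch::jit::mobile::Module& m, const std::string& path) {
  // Previously m.save_data(path); now a free function taking the Module first.
  torch::jit::mobile::_save_parameters(m, path);
  // Previously _load_mobile_data(); the renamed loader reads the parameters back.
  auto params = torch::jit::mobile::_load_parameters(path);
  (void)params;
}
```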
Test Plan: Imported from OSS
Reviewed By: kwanmacher, iseeyuan
Differential Revision: D22766781
Pulled By: ann-ss
fbshipit-source-id: 5cabae31927187753a958feede5e9a28d71d9e92
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42037
This is to fix #41951
Test Plan: Imported from OSS
Reviewed By: yf225
Differential Revision: D22764717
Pulled By: glaringlee
fbshipit-source-id: e6da0aeb05a2356f52446e6d5fad391f2cd1cf6f
Summary:
Currently constant pooling runs before constant propagation, which can create more constants that need pooling. This gets in the way of serialization/deserialization stability because runCleanUpPasses is called on a module each time a user serializes and deserializes it; doing so multiple times would lead to a different saved module each time.
This PR moves constant pooling after constant propagation, which may slow down constant propagation a little bit, but otherwise side-steps the aforementioned problem.
test_constant_insertion in test_jit.py is also updated because, after fixing the pass ordering, the number of constants is no longer fixed, and it is extremely difficult to get the exact number with the current convoluted test structure. So for now, the test only checks that CSE doesn't change the number of "prim::Constant" nodes rather than comparing against a known number. A TODO is left to improve this test.
The ConstantPropagation pass is replaced by ConstantPropagationImmutableTypes because the latter is used in runCleanUpPasses. If not replaced, the former would create new CSE opportunities by folding more constants, which would defeat the purpose of the test case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41891
Reviewed By: colesbury
Differential Revision: D22701540
Pulled By: gmagogsfm
fbshipit-source-id: 8e60dbdcc54a93dac111d81b8d88fb39387224f5
Summary:
Fix a bug in SplitWithTail and SplitWithMask where loop_options such as Cuda block/thread bindings are overwritten by the split. This PR fixes this bug by propagating the loop options to the outer loop, which for axis bindings should be equivalent.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40035
Reviewed By: ZolotukhinM
Differential Revision: D22080263
Pulled By: nickgg
fbshipit-source-id: b8a9583fd90f69319fc4bb4db644e91f6ffa8e67
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41806
Generally a good practice not to have tests spew output.
Test Plan:
`build/bin/test_tensorexpr`
Imported from OSS
Reviewed By: zheng-xq
Differential Revision: D22646833
fbshipit-source-id: 444e883307d058fe77e7550d436fa61b7d91a701
Summary:
Auto-fuse the output loops of outer Rfactors so they are in a more convenient form for binding GPU axes.
An example:
```
Tensor* c = Reduce("sum", {}, Sum(), b, {{m, "m"}, {n, "n"}, {k, "k"}});
LoopNest loop({c});
std::vector<For*> loops = loop.getLoopStmtsFor(c);
auto v = loops.at(0)->var();
loop.rfactor(c->body(), v);
```
Before:
```
{
Allocate(tmp_buf, float, {m});
sum[0] = 0.f;
for (int m_1 = 0; m_1 < m; m_1++) {
tmp_buf[m_1] = 0.f;
}
for (int m_1 = 0; m_1 < m; m_1++) {
for (int n = 0; n < n_1; n++) {
for (int k = 0; k < k_1; k++) {
tmp_buf[m_1] = (tmp_buf[m_1]) + (b[((n_1 * m_1) * k_1 + k) + k_1 * n]);
}
}
}
for (int m_1 = 0; m_1 < m; m_1++) {
sum[0] = (sum[0]) + (tmp_buf[m_1]);
}
Free(tmp_buf);
}
```
After:
```
{
sum[0] = 0.f;
for (int m = 0; m < m_1; m++) {
Allocate(tmp_buf, float, {m_1});
tmp_buf[m] = 0.f;
for (int n = 0; n < n_1; n++) {
for (int k = 0; k < k_1; k++) {
tmp_buf[m] = (tmp_buf[m]) + (b[((n_1 * m) * k_1 + k) + k_1 * n]);
}
}
sum[0] = (sum[0]) + (tmp_buf[m]);
Free(tmp_buf);
}
}
```
The existing Rfactor tests cover this case, although I did rename a few for clarity. This change broke the LLVMRFactorVectorizedReduction test because it now does what it's intended to (vectorize a loop with a reduction in it) rather than nothing, and since that doesn't work it correctly fails. I've disabled it for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40050
Reviewed By: ZolotukhinM
Differential Revision: D22605639
Pulled By: nickgg
fbshipit-source-id: e359be53ea62d9106901cfbbc42d55d0e300e8e0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41376
torch::jit::mobile::Module does not currently support accessing parameters via their attribute names, but torch::jit::Module does. This diff adds equivalent functionality to mobile::Module.
Test Plan: Imported from OSS
Reviewed By: iseeyuan
Differential Revision: D22609142
Pulled By: ann-ss
fbshipit-source-id: 1a5272ff336f99a3c0bb6194c6a6384754f47846
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37587
Lifting RecordFunction up into the dispatcher code
Test Plan: Imported from OSS
Differential Revision: D21374246
fbshipit-source-id: 19f9c1719e6fd3990e451c5bbd771121e91128f7
Summary:
Leave undefined tensors / None returned from custom backward functions as undefined/None instead of creating a tensor full of zeros. This change improves performance in some cases.
**This is BC-Breaking:** Custom backward functions that return None will now see it potentially being propagated all the way up to AccumulateGrad nodes. The potential impact is that the .grad field of leaf tensors, as well as the result of autograd.grad, may be undefined/None where it used to be a tensor full of zeros. Also, autograd.grad may raise an error; if so, consider using allow_unused=True ([see doc](https://pytorch.org/docs/stable/autograd.html?highlight=autograd%20grad#torch.autograd.grad)) if it applies to your case.
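A hedged sketch using the C++ custom-autograd API: an undefined Tensor (None in Python) returned from backward now propagates as undefined instead of being materialized as zeros.
```cpp
#include <torch/torch.h>

struct ScaleOnly : public torch::autograd::Function<ScaleOnly> {
  static torch::Tensor forward(
      torch::autograd::AutogradContext* ctx, torch::Tensor x, torch::Tensor unused) {
    return x * 2;
  }
  static torch::autograd::variable_list backward(
      torch::autograd::AutogradContext* ctx, torch::autograd::variable_list grad_out) {
    // Gradient for `x`, and an undefined Tensor (None) for `unused`.
    return {grad_out[0] * 2, torch::Tensor()};
  }
};

int main() {
  auto x = torch::randn({3}, torch::requires_grad());
  auto unused = torch::randn({3}, torch::requires_grad());
  ScaleOnly::apply(x, unused).sum().backward();
  // After this change, unused.grad() may be undefined rather than a tensor of zeros.
  return 0;
}
```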
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41490
Reviewed By: albanD
Differential Revision: D22578241
Pulled By: heitorschueroff
fbshipit-source-id: f4966f4cb520069294f8c5c1691eeea799cc0abe
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41507
These fields have always been a part of tensor types; this change just
makes them serializable through IR dumps.
Test Plan: Imported from OSS
Reviewed By: Krovatkin, ngimel
Differential Revision: D22563661
Pulled By: ZolotukhinM
fbshipit-source-id: f01aaa130b7e0005bf1ff21f65827fc24755b360
Summary:
Update the API for accessing grad in C++ to avoid unexpected thread-safety issues.
In particular, with the current API, a check like `t.grad().defined()` is not thread safe.
- This introduces `t.mutable_grad()` that should be used when getting a mutable version of the saved gradient. This function is **not** thread safe.
- The `Tensor& grad()` API is now removed. We could not do a deprecation cycle because most of our call sites use non-const Tensors and would hit the non-const overload, so most calls would trigger the warning, which would be too verbose for users.
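A minimal sketch of the intended split between the two accessors: const reads stay on grad(), explicit mutation goes through mutable_grad().
```cpp
#include <torch/torch.h>

void scale_grad(torch::Tensor& t, double factor) {
  // Read-only check via the const accessor.
  if (!t.grad().defined()) {
    return;
  }
  // Mutation uses mutable_grad(); note this accessor is not thread safe.
  t.mutable_grad() = t.grad() * factor;
}
```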
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40887
Reviewed By: ezyang
Differential Revision: D22343932
Pulled By: albanD
fbshipit-source-id: d5eb909bb743bc20caaf2098196e18ca4110c5d2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36893
Adding an end-to-end test for running a simple training loop in C++
for the distributed RPC framework.
The goal of this change is to enable LeakSanitizer and potentially catch memory
leaks in the Future. Enabling LSAN with python multiprocessing is tricky and we
haven't found a solution for this. As a result, adding a C++ test that triggers
most of the critical codepaths would be good for now.
As an example, this unit test would've caught the memory leak fixed by:
https://github.com/pytorch/pytorch/pull/31030
ghstack-source-id: 107781167
Test Plan:
1) Verify the test catches memory leaks.
2) waitforbuildbot
Reviewed By: mrshenli
Differential Revision: D21112208
fbshipit-source-id: 4eb2a6b409253108f6b6e14352e593d250c7a64d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41145
**Summary**
This commit adds out-of-source-tree tests for `to_backend`. These tests check
that a Module can be lowered to a backend, exported, loaded (in both
Python and C++) and executed.
**Fixes**
This commit fixes #40067.
Test Plan: Imported from OSS
Reviewed By: jamesr66a
Differential Revision: D22510076
Pulled By: SplitInfinity
fbshipit-source-id: f65964ef3092a095740f06636ed5b1eb0884492d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40718
Currently every constant except tensors must be inlined during serialization;
tensors are stored in the constant table. This patch generalizes this capability
to any IValue. This is particularly useful for non-ASCII string literals that
cannot be inlined.
Test Plan: Imported from OSS
Differential Revision: D22298169
Pulled By: bzinodev
fbshipit-source-id: 88cc59af9cc45e426ca8002175593b9e431f4bac
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40326
Adds a helper function `addCallbackWithTLSState` to both
torch/csrc/utils/future.h (which is used internally by the RPC framework) and the JIT
future. This helper function is used to avoid having to manually pass in TLS state where it is needed for RPC and `record_function_ops.cpp`. For example, the following:
```
at::ThreadLocalState tls_state;
fut->addCallback([tls_state = std::move(tls_state)]() {
at::ThreadLocalStateGuard g(tls_state);
some_cb_that_requires_tls_state();
}
```
becomes
```
fut->addCallbackWithTLSState(some_cb_that_requires_tls_state);
```
ghstack-source-id: 107383961
Test Plan: RPC Tests and added a test in test_misc.cpp
Differential Revision: D22147634
fbshipit-source-id: 46c02337b90ee58ca5a0861e932413c40d06ed4c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40842
**Summary**
This commit adds out-of-source-tree tests for `to_backend`. These tests check
that a Module can be lowered to a backend, exported, loaded (in both
Python and C++) and executed.
**Fixes**
This commit fixes #40067.
Test Plan: Imported from OSS
Differential Revision: D22418731
Pulled By: SplitInfinity
fbshipit-source-id: 621ba4efc1b121fa76c9c7ca377792ac7440d250
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40840
**Summary**
This commit moves the TestBackend used for the JIT backend
extension to the tests directory. It was temporarily placed
in the source directory while figuring out some details of
the user experience for this feature.
**Test Plan**
`python test/test_jit.py TestBackends`
**Fixes**
This commit fixes#40067.
Test Plan: Imported from OSS
Differential Revision: D22418682
Pulled By: SplitInfinity
fbshipit-source-id: 9356af1341ec4d552a41c2a8929b327bc8b56057
Summary:
Have basic reduction fusion working, and have improved the code generator to approach the performance of eager-mode reductions. Coming soon will be pointwise-reduction fusions in a way that should prevent the possibility of hitting regressions. Also working on performant softmax kernels in the code generator, which may be our next fusion target.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40864
Reviewed By: ngimel
Differential Revision: D22392877
Pulled By: soumith
fbshipit-source-id: 457448a807d628b1035f6d90bc0abe8a87bf8447
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37034
c10 takes a Stack* in boxed functions while JIT took a Stack&.
c10 doesn't return anything while JIT returned an int that was always zero.
This changes JIT to follow the c10 behavior.
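A sketch of the signature change for a hypothetical boxed operation; the operation itself is illustrative, not actual PyTorch code:
```cpp
#include <ATen/core/stack.h>

using torch::jit::Stack;

// Old JIT convention: take Stack&, return an int that was always zero.
int myBoxedOpOld(Stack& stack) {
  // pop inputs from the stack, push outputs back
  return 0;
}

// New convention, matching c10: take Stack*, return void.
void myBoxedOpNew(Stack* stack) {
  // pop inputs from the stack, push outputs back
}
```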
ghstack-source-id: 106834069
Test Plan: unit tests
Differential Revision: D20567950
fbshipit-source-id: 1a7aea291023afc52ae706957e9a5ca576fbb53b
Summary:
Fixes https://github.com/pytorch/pytorch/issues/38716, fixes https://github.com/pytorch/pytorch/issues/37234
This algorithm does the summation along a single axis with multiple "levels" of accumulator, each of which is designed to hold the sum of an order of magnitude more values than the previous.
e.g. if there are 2^16 elements, the first level will hold the sum of 2^4 elements, and so on in increasing powers of 2: 2^4, 2^8, 2^12 and finally 2^16.
This limits the differences in magnitude of the partial results being added together, and so we don't lose accuracy as the axis length increases.
WIP to write a vectorized version.
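A simplified two-level sketch of the cascaded-accumulator idea (the real kernel uses more levels and is vectorized; this only illustrates why the magnitude gap between addends stays bounded):
```cpp
#include <cstddef>

float cascadeSum(const float* data, size_t n) {
  const size_t kChunk = 16;  // elements per low-level accumulator
  float total = 0.0f;        // higher-level accumulator
  size_t i = 0;
  while (i < n) {
    float partial = 0.0f;    // low-level accumulator: at most 16 additions
    for (size_t j = 0; j < kChunk && i < n; ++j, ++i) {
      partial += data[i];
    }
    total += partial;
  }
  return total;
}
```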
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39516
Reviewed By: ezyang
Differential Revision: D22106251
Pulled By: ngimel
fbshipit-source-id: b56de4773292439dbda62b91f44ff37715850ae9
Summary:
BC NOTE:
This change makes it so modules saved with torch.jit.save in PyTorch 1.6 can be loaded by previous versions of PyTorch unless they use torch.div or (soon) torch.full. It also lets tensors saved using torch.save be loaded by previous versions. So this is the opposite of BC-breaking, but I'm using that label to highlight this issue since we don't have a "BC-improving" label.
PR NOTE:
When an operator's semantics change in PyTorch we want to do two things:
1) Preserve the semantics of older serialized Torchscript programs that use the operator
2) Ensure the new semantics are respected
Historically, this meant writing a Versioned Symbol that would remap older versions of the operator into current PyTorch code (1), and bumping the produced file format version (2). Unfortunately, bumping the produced file format version is a nuclear option for ensuring semantics are respected, since it also prevents older versions of PyTorch from loading anything (even tensors!) from newer versions.
Dynamic versioning addresses the nuclear consequences of bumping the produced file format version by only bumping it when necessary. That is, when an operator with changed semantics is detected in the serialized Torchscript. This will prevent Torchscript programs that use the changed operator from loading on earlier versions of PyTorch, as desired, but will have no impact on programs that don't use the changed operator.
Note that this change only applies when using torch.jit.save and torch.jit.load. torch.save pickles the given object using pickle (by default), which saves the object's Python state directly.
No new tests for this behavior are added since the existing tests for versioned division in test_save_load already validate that models with div are loaded correctly at version 4.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40279
Reviewed By: dzhulgakov
Differential Revision: D22168291
Pulled By: mruberry
fbshipit-source-id: e71d6380e727e25123c7eedf6d80e5d7f1fe9f95
Summary:
BC-breaking NOTE:
In PyTorch 1.6, bool and integral fill values given to torch.full must set the dtype or out keyword arguments. In prior versions of PyTorch these fill values would return float tensors by default, but in PyTorch 1.7 they will return a bool or long tensor, respectively. The documentation for torch.full has been updated to reflect this.
PR NOTE:
This PR causes torch.full to throw a runtime error when it would have inferred a float dtype by being given a boolean or integer value. A versioned symbol for torch.full is added to preserve the behavior of already serialized Torchscript programs. Existing tests for this behavior being deprecated have been updated to reflect it now being unsupported, and a couple new tests have been added to validate the versioned symbol behavior. The documentation of torch.full has also been updated to reflect this change.
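A small libtorch-side illustration of the same rule (assuming the ATen-level behavior matches the Python note above): pass the dtype explicitly for bool and integral fill values instead of relying on the old float default.
```cpp
#include <torch/torch.h>

int main() {
  auto longs = torch::full({2, 2}, /*fill_value=*/7, torch::kLong);
  auto bools = torch::full({2, 2}, /*fill_value=*/true, torch::kBool);
  auto flts  = torch::full({2, 2}, /*fill_value=*/1.5);  // float fill needs no explicit dtype
  return 0;
}
```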
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40364
Differential Revision: D22176640
Pulled By: mruberry
fbshipit-source-id: b20158ebbcb4f6bf269d05a688bcf4f6c853a965
Summary:
Slightly modified Adam, following the Python implementation, and the `ProducesPyTorchValues` tests pass. I had a problem with another test though (see commit c1a6241676ab84fc531c1c3a10f964aa5704092e): it seems that optimizing for two steps with the same optimizer vs. optimizing for two steps using freshly initialized objects will produce the same output.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40009
Differential Revision: D22096053
Pulled By: glaringlee
fbshipit-source-id: a31a8f5488cb37c53752ddf15436efabdba67dc4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39950
Per the comment in the code, constValue() should only be used in
the case where the future is complete and the value is not an error.
Add an assert to enforce this.
Also, add hasValue() accessor for completeness.
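A hedged sketch of the contract, assuming this refers to c10::ivalue::Future: constValue() is only read after the future completes successfully, which the new hasValue() accessor lets callers check.
```cpp
#include <ATen/core/ivalue.h>
#include <ATen/core/jit_type.h>

void example() {
  auto fut = c10::make_intrusive<c10::ivalue::Future>(c10::IntType::get());
  fut->markCompleted(c10::IValue(42));
  if (fut->hasValue()) {
    c10::IValue v = fut->constValue();  // safe: completed and not an error
  }
}
```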
ghstack-source-id: 105815597
Test Plan: buck test mode/dev-nosan caffe2/test/cpp/jit:
Differential Revision: D22021776
fbshipit-source-id: b59b6c775eab344068a76f4cd8c3a9dc1f2a174e
Summary:
Adds a `torch.experimental.deterministic` flag to enforce deterministic algorithms across all of PyTorch.
Adds `torch.experimental.deterministic_error_level` to allow users to choose between error/warning/silent if determinism for an operation is not available.
Adds `torch.experimental.alert_not_deterministic()` which should be called within operations that are not deterministic.
Offers both Python and ATen interfaces.
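A hedged sketch of the alerting pattern on the ATen side; the exact Context method name for this experimental API is an assumption:
```cpp
#include <ATen/Context.h>
#include <torch/torch.h>

// A hypothetical non-deterministic kernel wrapper: it alerts before running,
// so the outcome (error/warning/silence) follows the configured error level.
torch::Tensor my_nondeterministic_op(const torch::Tensor& x) {
  at::globalContext().alertNotDeterministic("my_nondeterministic_op");
  return x.clone();
}
```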
Issue https://github.com/pytorch/pytorch/issues/15359
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38683
Differential Revision: D21998093
Pulled By: ezyang
fbshipit-source-id: 23aabbddd20f6199d846f97764ff24d728163737