Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43684
This PR attempts to address #42560 by capturing the appropriate
exception_ptr in the autograd engine and passing it over to the Future.
As part of this change, there is a significant change to the Future API: setError now only accepts an exception_ptr.
For the example in #42560, the exception trace would now look like:
```
> Traceback (most recent call last):
> File "test_autograd.py", line 6914, in test_preserve_backtrace
> Foo.apply(t).sum().backward()
> File "torch/tensor.py", line 214, in backward
> torch.autograd.backward(self, gradient, retain_graph, create_graph)
> File "torch/autograd/__init__.py", line 127, in backward
> allow_unreachable=True) # allow_unreachable flag
> File "torch/autograd/function.py", line 87, in apply
> return self._forward_cls.backward(self, *args)
> File "test_autograd.py", line 6910, in backward
> raise ValueError("something")
> ValueError: something
```
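A minimal reproduction of the test in the trace, reconstructed from the backtrace above (the exact test body is an assumption):
```
import torch

class Foo(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # Identity-style forward; the interesting part is the failing backward.
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        raise ValueError("something")

t = torch.rand(10, requires_grad=True)
# The ValueError raised inside backward() now surfaces with the Python frames
# of the custom backward preserved, as shown in the trace above.
Foo.apply(t).sum().backward()
```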
ghstack-source-id: 111109637
Test Plan: waitforbuildbot
Reviewed By: albanD
Differential Revision: D23365408
fbshipit-source-id: 1470c4776ec8053ea92a6ee1663460a3bae6edc5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43635
Intern the symbol; no functional changes. Aliasing needs to be looked at, but that should be done in a separate PR; this PR only changes the symbol.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D23358806
Pulled By: eellison
fbshipit-source-id: f18bcd142a0daf514136f019ae607e4c3f45d9f8
Summary:
Insert the registerizer into the Cuda Codegen pass list, to enable scalar replacement and close the gap in simple reduction performance.
First up the good stuff, benchmark before:
```
Column sum Caffe2 NNC Simple Better
(10, 100) 5.7917 9.7037 6.9386 6.0448
(100, 100) 5.9338 14.972 7.1139 6.3254
(100, 10000) 21.453 741.54 145.74 12.555
(1000, 1000) 8.0678 122.75 22.833 9.0778
Row sum Caffe2 NNC Simple Better
(10, 100) 5.4502 7.9661 6.1469 5.5587
(100, 100) 5.7613 13.897 21.49 5.5808
(100, 10000) 21.702 82.398 75.462 22.793
(1000, 1000) 22.527 129 176.51 22.517
```
After:
```
Column sum Caffe2 NNC Simple Better
(10, 100) 6.0458 9.4966 7.1094 6.056
(100, 100) 5.9299 9.1482 7.1693 6.593
(100, 10000) 21.739 121.97 162.63 14.376
(1000, 1000) 9.2374 29.01 26.883 10.127
Row sum Caffe2 NNC Simple Better
(10, 100) 5.9773 8.1792 7.2307 5.8941
(100, 100) 6.1456 9.3155 24.563 5.8163
(100, 10000) 25.384 30.212 88.531 27.185
(1000, 1000) 26.517 32.702 209.31 26.537
```
Speedup is about 3-8x depending on the size of the data (increasing with bigger inputs).
The gap between NNC and Simple is closed or nearly eliminated; the remaining issue appears to be kernel launch overhead. Next up is getting us closer to the _Better_ kernel.
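For reference, a minimal sketch of how a column-sum timing like the one in the tables could be collected in eager mode; the actual benchmark harness and the Caffe2 / NNC / Simple / Better kernels it drives are not part of this diff, so this only illustrates the measurement, not the kernels being compared:
```
import torch

def time_colsum(rows, cols, iters=100):
    # Time a plain eager column sum on the GPU; the real benchmark drives the
    # Caffe2 / NNC / Simple / Better kernels, which are not shown here.
    x = torch.randn(rows, cols, device="cuda")
    for _ in range(10):          # warm-up so launch/compile overhead is excluded
        x.sum(dim=0)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        x.sum(dim=0)             # column sum
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) * 1000.0 / iters  # microseconds per call

for shape in [(10, 100), (100, 100), (100, 10000), (1000, 1000)]:
    print(shape, time_colsum(*shape))
```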
Integrating the registerizer required a lot of refactoring and bug fixes along the way:
* Refactored flattening of parallelized loops out of the CudaPrinter and into its own stage, so we can transform the graph in the stage between flattening and printing (where registerization occurs).
* Made AtomicAddFuser less pessimistic: it now recognizes that if an Add to a buffer depends on all used Block and Thread vars, it has no overlap and does not need to be atomic. This allows registerization to apply to these stores.
* Fixed PrioritizeLoad mutator so that it does not attempt to separate the Store and Load to the same buffer (i.e. reduction case).
* Moved CudaAnalysis earlier in the process, allowing later stages to use the analyzed bufs.
* Fixed a bug in the Registerizer where when adding a default initializer statement it would use the dtype of the underlying var (which is always kHandle) instead of the dtype of the Buf.
* Fixed a bug in the IRMutator where the logic for Allocate statements was inverted, so they were replaced only if they did not change.
* Added simplification of simple Division patterns to the IRSimplifier.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42878
Reviewed By: glaringlee
Differential Revision: D23382499
Pulled By: nickgg
fbshipit-source-id: 3640a98fd843723abad9f54e67070d48c96fe949
Summary:
This PR adds an API to package unoptimized/fallback blocks as function calls. It's mainly meant to be used by the TensorExpressionsFuser and SpecializeAutogradZero passes, as both specialize the original graph but would also like to provide a fallback path in case the assumptions under which the graph was specialized do not hold for some inputs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43274
Reviewed By: malfet
Differential Revision: D23406961
Pulled By: Krovatkin
fbshipit-source-id: ef21fc9ad886953461b09418d02c75c58375490c
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43732.
Requires importing the fft namespace in the C++ API, just like the Python API does, to avoid clobbering the torch::fft function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43749
Reviewed By: glaringlee
Differential Revision: D23391544
Pulled By: mruberry
fbshipit-source-id: d477d0b6d9a689d5c154ad6c31213a7d96fdf271
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43456
Introduce the OperatorGenerator template, which returns an optional Operator. It's null if the templated bool value is false.
RegisterOperators() is updated to take the optional Operator. A null Operator will not be registered.
With this update, selective operator registration can be done at compile time. Tests are added to show that an operator is registered if it's in a whitelist and is not registered if it's not.
Test Plan: Imported from OSS
Reviewed By: ljk53
Differential Revision: D23283563
Pulled By: iseeyuan
fbshipit-source-id: 456e0c72b2f335256be800aeabb797bd83bcf0b3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43173
With this change the fuser starts to generate typechecks for the inputs of
a fusion group. For each fusion group we generate a typecheck and an if
node: the true block contains the fused subgraph, and the false block
contains the unoptimized original subgraph.
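As a rough way to see the resulting guard structure from Python, one can script a small pointwise function, run it a few times so the profiling executor specializes it, and dump the optimized graph. The internal toggles and the exact shape of the output (a type check guarding an if-node with a fallback branch) are assumptions for this sketch, not something this diff documents:
```
import torch

torch._C._jit_override_can_fuse_on_cpu(True)  # let the fuser run on CPU for this sketch

@torch.jit.script
def fused(a, b):
    # A simple pointwise expression the tensor-expression fuser can turn into a fusion group.
    return (a * b + b).relu()

a, b = torch.randn(4, 256), torch.randn(4, 256)
for _ in range(5):   # run enough times for the profiling executor to specialize
    fused(a, b)

# The optimized graph should show the fusion group guarded by a type check and an
# if-node whose false branch holds the unoptimized fallback subgraph.
print(torch.jit.last_executed_optimized_graph())
```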
Differential Revision: D23178230
Test Plan: Imported from OSS
Reviewed By: eellison
Pulled By: ZolotukhinM
fbshipit-source-id: f56e9529613263fb3e6575869fdb49973c7a520b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43235
This functionality is needed when we do not want to lose track of
nodes/values as we merge and unmerge them into other nodes. For
instance, if we have a side data structure with some meta information
about values or nodes, this new functionality allows us to keep that
metadata up to date after merging and unmerging nodes.
Differential Revision: D23202648
Test Plan: Imported from OSS
Reviewed By: eellison
Pulled By: ZolotukhinM
fbshipit-source-id: 350d21a5d462454166f8a61b51d833551c49fcc9
Summary:
This diff normalizes for-loops that have non-zero loop starts to always start from 0. Given a for-loop, this normalization changes the loop start to be 0 and adjusts the loop end and all accesses to the index variable within the loop body appropriately.
This diff also adds tests for several cases of normalization and also tests normalization in conjunction with `splitwithTail` transformation.
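A plain-Python illustration of the transformation (not the NNC API itself): the loop start becomes 0 and every use of the index is shifted by the old start; the array and bounds below are made up for the example:
```
# Original loop: index runs over [5, 15).
A = [0] * 20
for x in range(5, 15):
    A[x] = x * 2

# Normalized loop: index runs over [0, 10); every use of the index is shifted by the old start.
B = [0] * 20
for x in range(10):
    B[x + 5] = (x + 5) * 2

assert A == B
```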
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43179
Reviewed By: nickgg
Differential Revision: D23220534
Pulled By: navahgar
fbshipit-source-id: 64be0c72e4dbc76906084f7089dea81ae07d6020
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43341
This removes the empty pretty_print(), since it overrides the implementation in the Module base class, which is not the intended behavior here.
Test Plan: Imported from OSS
Reviewed By: pbelevich
Differential Revision: D23244616
Pulled By: glaringlee
fbshipit-source-id: 94b8dfd3697dfc450f53b3b4eee6e9c13cafba7b
Summary:
Had a bunch of merged commits that shouldn't have been there; reverted them to prevent conflicts. Lots of new features; highlights are listed below.
**Overall:**
- Enables pointwise fusion; single (but N-D) broadcast -> pointwise fusion; and single (but N-D) broadcast -> pointwise -> single (but N-D) reduction fusion.
**Integration:**
- Separate "magic scheduler" logic that takes a fusion and generates code generator schedule
- Reduction fusion scheduling with heuristics closely matching eagermode (unrolling supported, but no vectorize support)
- 2-Stage caching mechanism, one on contiguity, device, type, and operations, the other one is input size->reduction heuristic
**Code Generation:**
- More generic support in code generation for computeAt
- Full rework of loop nest generation and Indexing to more generically handle broadcast operations
- Code generator has automatic kernel launch configuration (including automatic allocation of grid reduction buffers)
- Symbolic (runtime) tiling on grid/block dimensions is supported
- Simplified index generation based on user-defined input contiguity
- Automatic broadcast support (similar to numpy/pytorch semantics)
- Support for compile time constant shared memory buffers
- Parallelized broadcast support (i.e. block reduction -> block broadcast support)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43129
Reviewed By: mrshenli
Differential Revision: D23162207
Pulled By: soumith
fbshipit-source-id: 16deee4074c64de877eed7c271d6a359927111b2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43069
The transformer C++ impl needs to put TransformerEncoderLayer/DecoderLayer and TransformerEncoder/TransformerDecoder in different headers, since TransformerEncoder/Decoder's options class needs TransformerEncoderLayer/DecoderLayer as an input parameter. The header files are split to avoid cyclic inclusion.
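The dependency forcing the split mirrors the Python API, where the encoder is constructed from an already-built layer; a small sketch of the standard torch.nn usage (nothing new in this diff):
```
import torch
from torch import nn

# TransformerEncoder is configured with an already-built TransformerEncoderLayer,
# mirroring the C++ options class that takes the layer as an input parameter.
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4)
encoder = nn.TransformerEncoder(layer, num_layers=2)

src = torch.randn(10, 32, 64)  # (sequence, batch, feature)
print(encoder(src).shape)      # torch.Size([10, 32, 64])
```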
Test Plan: Imported from OSS
Reviewed By: yf225
Differential Revision: D23139437
Pulled By: glaringlee
fbshipit-source-id: 3c752ed7702ba18a9742e4d47d049e62d2813de0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42880
Enable switching between and checking for training and eval mode for torch::jit::mobile::Module using train(), eval(), and is_training(), as already exists for torch::jit::Module.
Test Plan: Imported from OSS
Reviewed By: iseeyuan
Differential Revision: D23063006
Pulled By: ann-ss
fbshipit-source-id: b79002148c46146b6e961cbef8aaf738bbd53cb2
Summary:
This changes profiled types from being represented as:
`%23 : Float(4:256, 256:1, requires_grad=0, device=cpu) = prim::profile(%0)`
->
`%23 : Tensor = prim::profile[profiled_type=Float(4:256, 256:1, requires_grad=0, device=cpu)](%0)`
Previously, representing the profiled type directly in the IR made it very easy for optimizations to accidentally use profiled types without inserting the proper guards that ensure the specialized type is actually seen.
It would be a nice follow-up to extend this to prim::Guard as well; however, we have short-term plans to get rid of prim::Guard.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43035
Reviewed By: ZolotukhinM
Differential Revision: D23120226
Pulled By: eellison
fbshipit-source-id: c78d7904edf314dd65d1a343f2c3a947cb721b32
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42637
This commit enables sending non-CPU tensors through RPC using the
TensorPipe backend. Users can configure device mappings by calling
set_map_location on `TensorPipeRpcBackendOptions`. Internally,
the `init_rpc` API verifies the correctness of the device mappings. It
will shut down RPC if the check fails, or proceed and pass the global
mappings to `TensorPipeAgent` if the check succeeds. For serde,
we added a device indices field to the TensorPipe read and write buffers,
which should be either empty (all tensors must be on CPU) or match
the tensors of the RPC message in order and number. This commit
does not yet achieve zero-copy: the tensor is always moved to CPU
on the sender and then moved to the specified device on the receiver.
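A sketch of what configuring the mapping might look like on the caller side, going by the description above; the `set_map_location` name and its signature are taken from this summary and may differ from the final API, and the worker names and mapping are hypothetical:
```
import torch.distributed.rpc as rpc

opts = rpc.TensorPipeRpcBackendOptions()
# Map this worker's cuda:0 to cuda:1 on "worker1" (hypothetical worker name and mapping).
opts.set_map_location("worker1", {0: 1})

rpc.init_rpc(
    "worker0",
    rank=0,
    world_size=2,
    rpc_backend_options=opts,
)
```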
Test Plan: Imported from OSS
Reviewed By: izdeby
Differential Revision: D23011572
Pulled By: mrshenli
fbshipit-source-id: 62b617eed91237d4e9926bc8551db78b822a1187
Summary:
test_e2e_tensorpipe depends on ProcessGroupGloo and therefore cannot be tested with Gloo disabled;
otherwise, it re-introduces https://github.com/pytorch/pytorch/issues/42776.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43041
Reviewed By: lw
Differential Revision: D23122101
Pulled By: malfet
fbshipit-source-id: a8a088b6522a3bc888238ede5c2d589b83c6ea94
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42133
Test Plan:
We save a module with module debugging information as follows.
```
import torch
m = torch.jit.load('./detect.pt')
# Save module without debug info
m._save_for_lite_interpreter('./detect.bc')
# Save module with debug info
m._save_for_lite_interpreter('./detect.bc', _save_debug_info_in_bytecode=True)
```
Size of the file without module debugging information: 4.508 MB
Size of the file with module debugging information: 4.512 MB
Reviewed By: kimishpatel
Differential Revision: D22803740
Pulled By: taivu1998
fbshipit-source-id: c82ea62498fde36a1cfc5b073e2cea510d3b7edb
Summary:
When working on the Cuda Codegen, I found that running the IRSimplifier before generating code led to test failures. This was due to a bug in Round+Mod simplification (e.g. (x / y * y) + (x % y) => x) related to the order in which the terms appeared. After fixing it and writing a few tests around those cases, I found another bug in simplification of the same pattern and fixed it as well (with some more test coverage).
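The identity being simplified can be sanity-checked in plain Python, using floor division to stand in for the IR's integer division; this is only an illustration of the pattern, not the simplifier's code:
```
# (x / y) * y + (x % y) == x for integer division, in either term order.
for x in range(-20, 20):
    for y in range(1, 7):
        assert (x // y) * y + (x % y) == x
        assert (x % y) + (x // y) * y == x  # same identity with the terms swapped
```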
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42934
Reviewed By: zhangguanheng66
Differential Revision: D23085548
Pulled By: nickgg
fbshipit-source-id: e780967dcaa7a5fda9f6d7d19a6b7e7b4e94374b
Summary:
A simple differentiable abstraction to allow testing of full training graphs.
Included in this 1st PR is an example of trivial differentiation.
If approved, I can add a full MLP and demonstrate convergence using purely NNC (for performance testing) in the next PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42548
Reviewed By: ZolotukhinM
Differential Revision: D23057920
Pulled By: bwasti
fbshipit-source-id: 4a239852c5479bf6bd20094c6c35f066a81a832e
Summary:
Adds a new optimization pass, the Registerizer, which looks for common Stores and Loads to a single item in a buffer and replaces them with a local temporary scalar, which is cheaper to write to.
For example it can replace:
```
A[0] = 0;
for (int x = 0; x < 10; x++) {
A[0] = (A[0]) + x;
}
```
with:
```
int A_ = 0;
for (int x = 0; x < 10; x++) {
A_ = x + A_;
}
A[0] = A_;
```
This is particularly useful on GPUs when parallelizing, since after replacing loops with metavars we have a lot of accesses like this. Early tests of simple reductions on a V100 indicate this can speed them up by ~5x.
This diff got a bit unwieldy with the integration code, so that will come in a follow-up.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42606
Reviewed By: bertmaher
Differential Revision: D22970969
Pulled By: nickgg
fbshipit-source-id: 831fd213f486968624b9a4899a331ea9aeb40180
Summary:
Added a new option in AutogradContext to tell autograd not to materialize output grad tensors, that is, not to expand undefined/None tensors into tensors full of zeros before passing them as input to the backward function.
This PR is the second part that closes https://github.com/pytorch/pytorch/issues/41359. The first PR is https://github.com/pytorch/pytorch/pull/41490.
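On the Python side the corresponding switch is `ctx.set_materialize_grads(False)` from the first PR; a minimal sketch of a custom function opting out of materialization (the function body is made up for illustration):
```
import torch

class MulByTwo(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.set_materialize_grads(False)
        return x * 2, x + 0   # the second output goes unused downstream

    @staticmethod
    def backward(ctx, grad_out, grad_unused):
        # With materialization disabled, the grad for the unused output arrives
        # as None instead of a zero-filled tensor.
        assert grad_unused is None
        return grad_out * 2

x = torch.randn(3, requires_grad=True)
y, _ = MulByTwo.apply(x)
y.sum().backward()
print(x.grad)  # tensor of twos
```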
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41821
Reviewed By: albanD
Differential Revision: D22693163
Pulled By: heitorschueroff
fbshipit-source-id: a8d060405a17ab1280a8506a06a2bbd85cb86461
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42714
Change two unit tests for the lite trainer to register two instances/objects of the same submodule type instead of the same submodule object twice.
Test Plan: Imported from OSS
Reviewed By: iseeyuan
Differential Revision: D22990736
Pulled By: ann-ss
fbshipit-source-id: 2bf56b5cc438b5a5fc3db90d3f30c5c431d3ae77
Summary:
A while back, when commonizing the Let and LetStmt nodes, I ended up removing both and adding a separate VarBinding section to the Block. At the time I couldn't find a counterexample, but I found one today: dependencies between local Vars and Allocations may go in either direction, so we need to support interleaving of those statements.
So, I've removed all the VarBinding logic and reimplemented Let statements. ZolotukhinM, I think you get to say "I told you so". No new tests; existing tests should cover this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42634
Reviewed By: mruberry
Differential Revision: D22969771
Pulled By: nickgg
fbshipit-source-id: a46c5193357902d0f59bf30ab103fe123b1503f1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42522
Main changes:
- Consolidated CMake files to have a single entry point, rather than having a specialized one for PyTorch.
- Changed the way the preprocessor flags are provided, and changed their name.
There were a few instances in PyTorch's CMake files where we were directly adding TensorPipe's source directory as an include path, which doesn't contain the auto-generated header we now added. We fix that by linking those targets against the `tensorpipe` CMake target, so that they pick up the include directories defined by TensorPipe, which do contain that auto-generated header.
I'm turning off SHM and CMA for now because they have never been covered by the CI. I'll enable them in a separate PR so that if they turn out to be flaky we can revert that change without reverting this one.
Test Plan: CI
Reviewed By: malfet
Differential Revision: D22959472
fbshipit-source-id: 1959a41c4a66ef78bf0f3bd5e3964969a2a1bf67
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42570
ProfiledType doesn't do anything and is not used at the moment; removing it.
Test Plan: CI
Reviewed By: ezyang
Differential Revision: D22938664
Pulled By: ilia-cher
fbshipit-source-id: 037c512938028f44258b702bbcde3f8c144f4aa0
Summary:
This PR creates a new namespace, torch.fft (torch::fft), and puts a single function, fft, in it. This function is a simplified version of NumPy's [numpy.fft.fft](https://numpy.org/doc/1.18/reference/generated/numpy.fft.fft.html?highlight=fft#numpy.fft.fft) that accepts no optional arguments. It is intended to demonstrate how to add and document functions in the namespace, and is not intended to deprecate the existing torch.fft function.
Adding this namespace was complicated by the existence of the torch.fft function in Python. Creating a torch.fft Python module makes this name ambiguous: does it refer to a function or module? If the JIT didn't exist, a solution to this problem would have been to make torch.fft refer to a callable class that mimicked both the function and module. The JIT, however, cannot understand this pattern. As a workaround it's required to explicitly `import torch.fft` to access the torch.fft.fft function in Python:
```
import torch.fft
t = torch.randn(128, dtype=torch.cdouble)
torch.fft.fft(t)
```
See https://github.com/pytorch/pytorch/issues/42175 for future work. Another possible future PR is to get the JIT to understand torch.fft as a callable class so it need not be imported explicitly to be used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41911
Reviewed By: glaringlee
Differential Revision: D22941894
Pulled By: mruberry
fbshipit-source-id: c8e0b44cbe90d21e998ca3832cf3a533f28dbe8d
Summary:
Unroll a loop with constant bounds, replacing it with multiple
instances of the loop body. For example:
```
for x in 0..3:
A[x] = x*2
```
becomes:
```
A[0] = 0
A[1] = 2
A[2] = 4
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42465
Test Plan: `test_tensorexpr` unit tests.
Reviewed By: agolynski
Differential Revision: D22914418
Pulled By: asuhan
fbshipit-source-id: 72ca10d7c0b1ac7f9a3688ac872bd94a1c53dc51
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42567
Before this change we didn't expand arguments, and thus in an expression like
`sigmoid(sigmoid(x))` only the outer call was expanded.
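For reference, the kind of expression in question as it would appear in a scripted function handed to the TE kernel; plain TorchScript, nothing new in this diff:
```
import torch

@torch.jit.script
def nested(x):
    # Before this change, only the outer sigmoid call was expanded by the
    # tensor-expression lowering; the inner one was not.
    return torch.sigmoid(torch.sigmoid(x))

print(nested(torch.randn(4)))
```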
Test Plan: Imported from OSS
Reviewed By: gmagogsfm
Differential Revision: D22936177
Pulled By: ZolotukhinM
fbshipit-source-id: 9c05dc96561225bab9a90a407d7bcf9a89b078a1