Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74768
As commented in code:
```
// Empty List Literals that are not assigned to variables
// may match to any list type in schema matching,
// but still default to List[Tensor] if assigned to a variable
// or returned from a function
// Restricting empty list matching to temporary values
// avoids difficult to handle cases such as
// a = []
// b = a
// if cond:
// b.append(2)
// else:
// a.append("hi")
// This is also the same behavior that C++ allows with {}
// (cannot assign to a variable typed as auto)
```
Fix for https://github.com/facebookresearch/torchdynamo/issues/95
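A minimal TorchScript sketch of the rule described in the comment (the function names here are illustrative):
```python
import torch
from typing import List

@torch.jit.script
def assigned_empty_list() -> List[torch.Tensor]:
    # Assigned to a variable (or returned), an empty list literal still
    # defaults to List[Tensor].
    xs = []
    xs.append(torch.ones(2))
    return xs

@torch.jit.script
def temporary_empty_list() -> torch.Tensor:
    # Used directly as an argument, the same literal may match any list
    # type during schema matching, e.g. the int[] size argument here.
    return torch.zeros([])
```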
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D35362760
Pulled By: eellison
fbshipit-source-id: da23e8889312001b60d64a1758da5c578b6fe5ea
(cherry picked from commit 75682f17204d6d444e7e7144472c6e833150c601)
Summary:
## Description
Preview4 PR of this [RFC](https://github.com/pytorch/pytorch/issues/49444).
On the basis of https://github.com/pytorch/pytorch/pull/50256, the below improvements are included:
- The [preview4 release branch](https://github.com/oneapi-src/oneDNN/releases/tag/graph-v0.4.1) of the oneDNN Graph API is used
- The fuser now works with the profiling graph executor. We have inserted type check nodes to guard the profiled tensor properties.
### User API:
The optimization pass is disabled by default. Users could enable it by:
```
torch.jit.enable_onednn_fusion(True)
```
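A minimal usage sketch (the torchvision model and the freezing step are illustrative; the pass targets inference graphs run under the profiling executor):
```python
import torch
import torchvision  # example model source only

torch.jit.enable_onednn_fusion(True)  # turn on the oneDNN Graph fusion pass

model = torchvision.models.resnet50().eval()
with torch.no_grad():
    traced = torch.jit.trace(model, torch.rand(1, 3, 224, 224))
    frozen = torch.jit.freeze(traced)
    # Warm-up runs let the profiling executor record tensor properties
    # before the fused graph is executed.
    x = torch.rand(1, 3, 224, 224)
    frozen(x)
    frozen(x)
    out = frozen(x)
```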
### Performance:
The [pytorch/benchmark](https://github.com/pytorch/benchmark) tool is used to compare performance:
- SkyLake 8180 (1 socket of 28 cores):

- SkyLake 8180 (single thread):

\* By mapping hardswish to oneDNN Graph, it’s 8% faster than PyTorch JIT (NNC + OFI)
\** We expect performance gain after mapping transpose, contiguous & view to oneDNN graph ops
### Directory structure of the integration code
Fuser-related code is placed under:
```
torch/csrc/jit/codegen/onednn/
```
Optimization pass registration is done in:
```
torch/csrc/jit/passes/onednn_graph_fuser.h
```
CMake for the integration code is:
```
caffe2/CMakeLists.txt
```
## Limitations
- In this PR, the optimization is supported only on the Linux platform. Support for Windows and macOS will be enabled as the next step.
- We have only optimized the inference use case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68111
Reviewed By: eellison
Differential Revision: D34584878
Pulled By: malfet
fbshipit-source-id: ce817aa8cc9052ee9ed930c9cf66be83449e61a4
(cherry picked from commit cd17683aa7d9c0947df45a1ab53627feff795587)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73329
There is a quantization use case for having better alias analysis with function calls remaining. This takes the straightforward approach of getting the inlined graph of each function call and then analyzing that subgraph. Since we need a single, unique analysis of every `Value*`, we make a copy of the function's graph for every analysis past the first. This is relatively slow, but given the limited use case here it should work well enough (and is no slower than calling the inlining pass).
cc vkuzo
Test Plan: Imported from OSS
Reviewed By: davidberard98
Differential Revision: D34451424
Pulled By: eellison
fbshipit-source-id: b7c7e54679d723f5ded1e11ffb32eb6d2176431d
(cherry picked from commit 81a42b31522b890311a3f512448b372c4ebbefd1)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69854
ghstack-source-id: 148315147
Test Plan: Time reported to start up static runtime on ctr_mobile_feed local_ro net is 8.8s instead of 9.5s
Reviewed By: suo, d1jang
Differential Revision: D33039733
fbshipit-source-id: 218dc7ff9aa421a352b71952ec77757368095860
(cherry picked from commit 7586712948)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69853
We can implement this overload more efficiently.
ghstack-source-id: 146924693
Test Plan:
patched alias_analysis tests
Time reported to initialize a predictor by static runtime when given ctr_mobile_feed local_ro net is 9.5s instead of 10.5s.
Reviewed By: mikeiovine
Differential Revision: D33039731
fbshipit-source-id: 52559d678e9eb00e335b9e0db304e7a5840ea397
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71170
As with an output returned in a tuple, an output returned in a list will not have any further uses that would make adding it directly to the list's contained elements incorrect. This unblocks a use case in op authoring.
cc Chillee
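For illustration, this is the kind of scripted function affected (hypothetical example):
```python
import torch
from typing import List

@torch.jit.script
def returns_list(x: torch.Tensor, y: torch.Tensor) -> List[torch.Tensor]:
    # The outputs are created only to be placed in the returned list, so
    # there are no later uses that aliasing them with the list's contained
    # elements could break.
    return [x + 1, y * 2]
```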
Test Plan: Imported from OSS
Reviewed By: d1jang
Differential Revision: D33535608
Pulled By: eellison
fbshipit-source-id: 2066d28e98c2f5d1b3d7e0206c7e39a27b3884b1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69476
This diff adds a new op, `TensorExprDynamicGroup`, that composes all the logic behind running a dynamically shaped fused node. This includes a guard instruction that checks the conditions, and a conditional that calls either the fused node or the fallback graph depending on the guard.
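A conceptual Python sketch of the control flow the new op encapsulates (not the actual implementation; all names here are illustrative):
```python
def tensor_expr_dynamic_group(inputs, guard, fused_kernel, fallback_graph):
    # The guard re-checks the conditions recorded when the node was fused
    # (e.g. ranks, dtypes, devices of the inputs).
    if guard(inputs):
        return fused_kernel(inputs)    # run the dynamically shaped fused node
    return fallback_graph(inputs)      # otherwise run the original fallback graph
```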
ghstack-source-id: 146107006
Test Plan:
```
buck test mode/dev-nosan //caffe2/test/cpp/jit:jit
```
Reviewed By: eellison
Differential Revision: D32320082
fbshipit-source-id: 2bd1a43391ca559837d78ddb892d931abe9ebb73
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68099
When an op in the graph cannot be matched to any known ops, alias_analysis.cpp throws an error.
Before:
```
RuntimeError: 0INTERNAL ASSERT FAILED at "../torch/csrc/jit/ir/alias_analysis.cpp":612, please report a bug to PyTorch. We don't have an op for aten::add but it isn't a special case. Argument types: Tensor, float, Tensor,
```
After:
```
RuntimeError: 0INTERNAL ASSERT FAILED at "../torch/csrc/jit/ir/alias_analysis.cpp":612, please report a bug to PyTorch. We don't have an op for aten::add but it isn't a special case. Argument types: Tensor, float, Tensor,
Candidates:
aten::add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> (Tensor)
aten::add.Scalar(Tensor self, Scalar other, Scalar alpha=1) -> (Tensor)
aten::add.out(Tensor self, Tensor other, *, Scalar alpha=1, Tensor(a!) out) -> (Tensor(a!))
aten::add.t(t[] a, t[] b) -> (t[])
aten::add.str(str a, str b) -> (str)
aten::add.int(int a, int b) -> (int)
aten::add.complex(complex a, complex b) -> (complex)
aten::add.float(float a, float b) -> (float)
aten::add.int_complex(int a, complex b) -> (complex)
aten::add.complex_int(complex a, int b) -> (complex)
aten::add.float_complex(float a, complex b) -> (complex)
aten::add.complex_float(complex a, float b) -> (complex)
aten::add.int_float(int a, float b) -> (float)
aten::add.float_int(float a, int b) -> (float)
aten::add(Scalar a, Scalar b) -> (Scalar)
```
Test Plan:
Run
```
import torch

if __name__ == '__main__':
    ir = """
    graph(%x : Tensor,
          %y : Tensor):
      %2 : float = prim::Constant[value=1.2]()
      %result : Tensor = aten::add(%x, %2, %y)
      return (%result)
    """
    x = torch.tensor([[1., 2.], [3., 4.]])
    y = torch.tensor([[2., 1.], [2., 1.]])
    graph = torch._C.parse_ir(ir)
    print(graph)
    graph.alias_db().analyze()
    # print(script(x, y))
```
to get the results above
Imported from OSS
Reviewed By: anjali411
Differential Revision: D32339639
fbshipit-source-id: a79a3c2f157154b5fb1e3f33a23e43b7884e8e38
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66554
In native_functions.yaml, the schemas for batch_norm and instance_norm
are incorrect: the inputs `running_mean` and `running_var` are mutated,
but are not marked as such in the function schema. Since `(a!)?`
annotations are currently not working (see #65760), this instead adds a
special case to `alias_analysis.cpp`. If the value of `training` or
`use_input_stats` is known to be `false`, then `alias_analysis` will
mark the input as _not_ being written to.
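For context, a small eager-mode sketch of the mutation in question:
```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 3, 4, 4)
running_mean = torch.zeros(3)
running_var = torch.ones(3)

# With training=True the running stats are updated in place, so alias
# analysis must treat them as written to.
F.batch_norm(x, running_mean, running_var, training=True)

# With training=False they are only read; this is the case the new
# special case in alias_analysis.cpp can take advantage of.
F.batch_norm(x, running_mean, running_var, training=False)
```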
Test Plan:
Removed the `skip` annotation on the following test, and added a special
exception in `check_alias_annotations`:
```
python test/test_ops.py -k test_variant_consistency_jit_nn_functional_batch_norm
```
Also:
```
./build/bin/test_jit --gtest_filter="*BatchAndInstanceNormFixture*"
```
Imported from OSS
Reviewed By: eellison
Differential Revision: D31612339
fbshipit-source-id: 12ca61b782b9e41e06883ba080a276209dc435bb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66295
Tidying up the top sources of reference count decrements seen during static runtime startup in alias_analysis.cpp specifically.
ghstack-source-id: 140484160
Test Plan:
CI
perf now shows under 2% of time spent in ~__shared_count instead of about 5%.
Reviewed By: suo
Differential Revision: D31490761
fbshipit-source-id: bbdcb7f9065c3aafa7fff7bfea9cea6dbc41f9d9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65344
Callsites that know they are using a cache can borrow AliasTypeSets from the cache instead of copying them.
ghstack-source-id: 140484162
Test Plan: Running perf on static runtime startup seems to show less inclusive time spent in AliasDb::getElements
Reviewed By: ejguan
Differential Revision: D31027363
fbshipit-source-id: b7a1473f4f9e9f14566f56f4b3b4e6317076beeb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65177
There is no need to heap-allocate any vectors in this case.
ghstack-source-id: 140052520
Test Plan:
CI
Startup for static runtime on ctr_mobile_feed local net decreased from 7.8s to about 7.0s
Reviewed By: malfet
Differential Revision: D30984194
fbshipit-source-id: 85091e55445f653ec728b27da4b459a2f1873013
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66025
This change adds an option to selectively enable precise alias analysis for `prim::TupleConstruct` (introduced by D30437737 (cd458fe092)), limiting its exposure to `StaticRuntime` only as of now.
Test Plan: Modified existing unit tests whose behavior depends on D30437737 (cd458fe092).
Reviewed By: eellison
Differential Revision: D31350285
fbshipit-source-id: 3ce777f07f99650d74634481ad0805192dce55c6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64879
This change makes the output of `prim::TupleConstruct` alias only with its inputs *when* the created tuple is directly returned from the graph.
The same treatment could be applied to any tuple newly constructed by `prim::TupleConstruct` whose elements do not escape. However, this change focuses on only the simplest, but most frequently used, case: tuples constructed solely to be returned from a graph.
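For illustration, the affected pattern looks like this (hypothetical example):
```python
import torch
from typing import Tuple

@torch.jit.script
def returns_tuple(x: torch.Tensor, y: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
    # The tuple is constructed only to be returned from the graph, so its
    # output needs to alias only its inputs instead of becoming a wildcard.
    return (x + 1, y * 2)
```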
Test Plan:
Added
- `AliasMoveForTupleConstructWithSingleUseAsGraphOutput`
- `WildcardAliasForTupleConstructWithUses`
to cover the newly added code.
Reviewed By: eellison
Differential Revision: D30437737
fbshipit-source-id: 417fbc6bc348062e60e7acdddd340d4754d090eb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64824
See comment in function_schema.h for explanation. I claim that this is a good tradeoff because the aliasing information seems to be used only in compiler-ish code paths, where performance isn't as critical as actual execution. If performance is important there too, perhaps we should hoist isWrite into the Argument itself since there are several paths that only care about isWrite.
ghstack-source-id: 138958896
Test Plan: CI, profile schema parsing on startup and see far fewer page faults in createArgumentVector.
Reviewed By: suo
Differential Revision: D30860719
fbshipit-source-id: 1d4d2328f2b8e34f5ddf9d82083fd4dd7b7f738f
Summary:
This PR replaces the https://github.com/pytorch/pytorch/pull/53180 PR stack, which has all the review discussions. A replacement was needed due to a messy Sandcastle issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64234
Reviewed By: gmagogsfm
Differential Revision: D30656444
Pulled By: ansley
fbshipit-source-id: 77536c8bcc88162e2c72636026ca3c16891d669a
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os
def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files


def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname, "-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit", "--all", "-m", f"NOLINT stubs for {fname}"])


def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)


if __name__ == "__main__":
    main()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892
Reviewed By: H-Huang
Differential Revision: D27991944
Pulled By: malfet
fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51600
Looking for notes on implementation first, will post more notes on benchmarks and overall thoughts/implementation and solicit more input soon.
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D26696702
Pulled By: eellison
fbshipit-source-id: cd612f093fe3859e42fb0b77560ebd1b44fccff7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51483
This PR moves the conv weights of a frozen model to MKLDNN and AOT-reorders the weights. When the weights are already in MKLDNN, computing even a single conv by converting the input and output from/to MKLDNN provides large speedups. I benchmarked the results of the top 200 shapes in predictor [here](https://www.internalfb.com/phabricator/paste/view/P171537938), and verified that it sped up popular models in torchvision.
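A rough sketch of the intended inference flow, assuming the pass is applied to a frozen scripted model (the small model is illustrative; the exact entry point of the weight-conversion pass is internal to the freezing optimizations):
```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU()).eval()
scripted = torch.jit.script(model)
# Freezing turns the conv weights into constants, which is what allows the
# pass to reorder them into the MKLDNN layout ahead of time.
frozen = torch.jit.freeze(scripted)
with torch.no_grad():
    out = frozen(torch.randn(1, 3, 32, 32))
```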
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D26696703
Pulled By: eellison
fbshipit-source-id: 0b4441bee4f6e0890a4540fbca3bb5e58b8c5adf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51999
as in title
Test Plan: waiting on CI for now
Reviewed By: eellison
Differential Revision: D26349297
fbshipit-source-id: bd5574ed1f8448ba18a6fda4bdc45f45d8b158e9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50229
`fastmod -m 'cast(<((at|c10)::)?\w+Type>\(\)\s*)->' 'castRaw${1}->'`
Presuming it builds, this is a safe change: the result of `cast()` wasn't being saved anywhere, so we didn't need it, and we can use a raw pointer instead of a new `shared_ptr`.
ghstack-source-id: 120769170
Test Plan: CI
Reviewed By: SplitInfinity
Differential Revision: D25837494
fbshipit-source-id: 46319100dc0dfc78f6d2b45148207f83481f2ada
Summary:
This PR adds a simple debugging helper which exports the AliasDb state as a [GraphViz](http://www.graphviz.org/) graph definition. The generated files can be viewed with any Graphviz viewer (including online based, for example http://viz-js.com)
Usage:
1. Call `AliasDb::dumpToGraphvizFile()` from a debugger. Using gdb for example:
`call aliasDb_->dumpToGraphvizFile("alias.dot")`
2. Add explicit calls to `AliasDb::dumpToGraphvizFile()`, which returns `true` if it succeeds.
An example output file is attached: [example.zip](https://github.com/pytorch/pytorch/files/5805840/example.zip)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50452
Reviewed By: ngimel
Differential Revision: D25980222
Pulled By: eellison
fbshipit-source-id: 47805a0a81ce73c6ba859340d37b9a806f9000d5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50228
`fastmod -m 'expect(<((at|c10)::)?\w+Type>\(\)\s*)->' 'expectRef${1}.'`
Presuming it builds, this is a safe change: the result of `expect()` wasn't being saved anywhere, so we didn't need it, and we can take a reference instead of a new `shared_ptr`.
ghstack-source-id: 119782961
Test Plan: CI
Reviewed By: SplitInfinity
Differential Revision: D25837374
fbshipit-source-id: 86757b70b1520e3dbaa141001e7976400cdd3b08
Summary:
=======
This PR addresses the following:
* Adds JIT support for CUDA Streams
* Adds JIT support for CUDA Events
* Adds JIT support for the CUDA Stream context manager (see the sketch below)
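A rough sketch of the kind of scripted program these features enable (assumes a CUDA build; the exact scripted signatures may differ slightly from this eager-style sketch):
```python
import torch

@torch.jit.script
def run_on_side_stream(x: torch.Tensor) -> torch.Tensor:
    s = torch.cuda.Stream()
    with torch.cuda.stream(s):  # the stream context manager, now scriptable
        y = x * 2
    # Make the default stream wait for the side-stream work to finish.
    torch.cuda.current_stream().wait_stream(s)
    return y
```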
Testing:
======
python test/test_jit.py -v TestCUDA
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48020
Reviewed By: navahgar
Differential Revision: D25725749
Pulled By: nikithamalgifb
fbshipit-source-id: b0addeb49630f8f0c430ed7badeca43bb9d2535c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46686
I was trying to page this code back in after a while and some things
stuck out as unnecessarily confusing.
1. Improve documentation of closures and fork stuff to be more accurate
to how we use them today.
2. Change `prim::LocalVariableScope` to `prim::ListComprehension`. It is
only ever used for list comprehensions, and in general the nodes
emitted by `ir_emitter` should correspond to concrete operations or
language features rather than semantic constraints.
3. Change the somewhat mysterious "inputs" and "attributes" argument
names throughout the codebase to be the more obvious "args" and "kwargs"
that they generally represent (I think "inputs" and "attributes" come
from the AST naming).
Test Plan: Imported from OSS
Reviewed By: navahgar, jamesr66a
Differential Revision: D24464197
Pulled By: suo
fbshipit-source-id: 1f4b1475b58b5690a0b204e705caceff969533b4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39111
In our present alias analysis, we consider any Value that enters another container as entering the heap, and thus aliasing all other heap values of the same type. There are a number of advantages to this approach:
- it is not too hard to maintain the aliasDb implementation
- it is much easier from an op schema perspective - there are many composite list ops registered internally and externally that would be tricky to register and get right if we did something more complicated
- It limits the size of the AliasDb, because a container of size 10 only contains a single memory dag element instead of 10 elements.
The downside is that we are unable to handle the simple and extremely common case of a list of tensors being used in an ATen op.
In an example like:
```
def foo(input):
    x = torch.tensor([1, 2, 3, 4])
    y = [x, x]
    input.add_(1)
    return torch.cat(y)
```
we will consider x to be written to. Any write to any wildcard element (an element that enters a tuple, or an element that is taken from a list) will mark x as written to. This limits our ability to create a functional subset and fuse graphs - as a result, 4 of the TorchVision classification models could not be functionalized.
Test Plan: Imported from OSS
Reviewed By: SplitInfinity
Differential Revision: D23828003
Pulled By: eellison
fbshipit-source-id: 9109fcb6f2ca20ca897cae71683530285da9d537
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43043
This adds support for rpc_sync in TorchScript, in a way similar to rpc_async.
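A rough sketch of the scripted usage, mirroring the existing rpc_async pattern (assumes an initialized RPC agent and a peer named "worker1"; names are illustrative):
```python
import torch
import torch.distributed.rpc as rpc

@torch.jit.script
def add_tensors(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    return x + y

@torch.jit.script
def call_remote_sync(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Unlike rpc_async, rpc_sync blocks until the remote result is available
    # and returns the value directly rather than a future.
    return rpc.rpc_sync("worker1", add_tensors, (x, y))
```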
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D23252039
Pulled By: wanchaol
fbshipit-source-id: 8a05329cb8a24079b2863178b73087d47273914c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43635
Intern the symbol; no functional changes. Aliasing needs to be looked at, but that should be done in a separate PR; this PR just changes the symbol.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D23358806
Pulled By: eellison
fbshipit-source-id: f18bcd142a0daf514136f019ae607e4c3f45d9f8
Summary:
This PR adds an API to package unoptimized/fallback blocks as function calls. It's mainly meant to be used by the TensorExpressionsFuser and SpecializeAutogradZero passes, as both specialize the original graph but would also like to provide a fallback path in case the assumptions under which the graph was specialized do not hold for some inputs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43274
Reviewed By: malfet
Differential Revision: D23406961
Pulled By: Krovatkin
fbshipit-source-id: ef21fc9ad886953461b09418d02c75c58375490c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43633
In the backward graph, _grad_sum_to_size is inserted whenever a possibly broadcasting op is called:
`"aten::_grad_sum_to_size(Tensor(a) self, int[]? size) -> Tensor(a)"`
If a broadcast occurred, a sum is called, otherwise the second input is None and it is a no-op. Most of the time, it's a no-op (in the fast RNNs benchmark > 90% of the time).
We can get rid of this op by profiling the optionality of the second input. I added `prim::profile_optional` to do this, which counts the number of times it saw a None value and the number of times it saw a value present. When specializing the backward graph, we insert checks for values we profiled as None, and in the optimized block can remove the grad_sum_to_size calls that use those values.
In the future we may revisit this when NNC supports reductions and we want to replace grad_sum_to_size with sums as well, but I think this is worth landing now.
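A conceptual Python sketch of what the op does (not the actual implementation):
```python
import torch
from typing import List, Optional

def grad_sum_to_size(grad: torch.Tensor, size: Optional[List[int]]) -> torch.Tensor:
    # No broadcast happened in the forward op: size is None and this is a no-op.
    if size is None:
        return grad
    out = grad
    # Collapse leading dimensions added by broadcasting.
    while out.dim() > len(size):
        out = out.sum(0)
    # Collapse dimensions that were broadcast up from size 1.
    for i, s in enumerate(size):
        if s == 1 and out.size(i) != 1:
            out = out.sum(i, keepdim=True)
    return out
```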
Test Plan: Imported from OSS
Reviewed By: bwasti, ZolotukhinM
Differential Revision: D23358809
Pulled By: eellison
fbshipit-source-id: a30a148ca581370789d57ba082d23cbf7ef2cd4d