Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66337
Right now, the assembly code generated for a given method from the model is named `wrapper` or `func` by default. The function name is then replaced with the proper `kernel_func_name` after target-specific assembly is generated.
This PR propagates the desired `kernel_func_name` right from the aotCompiler API so that the generated function gets the needed name and doesn't need to be renamed later.
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D31514095
Pulled By: priyaramani
fbshipit-source-id: b70c8e2c733600a435cd4e8b32092d37b7bf7de5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65554
We're relying on JIT based shape inference and not using the TE
implementation.
Question to the audience: we set `hasBroadcasts_` in that function, but
this function was almost never invoked. Do we behave correctly in the
presence of rand-calls and broadcasts?
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D31148925
Pulled By: ZolotukhinM
fbshipit-source-id: 2898a57e389ea0950163122089d0fec3d92701c4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65345
FooType::get() can return a const reference. Inconveniently, converting shared_ptr<FooType> to shared_ptr<Type> requires a copy & refcount bump, so to properly take advantage of this in unshapedType() we need to take a const Type& in isSubtypeOf(), which is good practice anyway -- don't require a shared_ptr if you don't need to take ownership.
ghstack-source-id: 140044165
Test Plan:
CI
`perf` says `c10::unshapedType` time decreased from 2.8% to 2.2% during static runtime startup, though I expect this to be generally beneficial.
Reviewed By: hlu1
Differential Revision: D31027361
fbshipit-source-id: 676feb81db9f74ad7b8651d8774f4ecb4cfa6ab8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65552
This PR is mostly a verbatim move of several functions to different
files. The goal is to have more consistency in what resides where.
With this PR:
* All `compute*` functions defining how a given operator needs to be
lowered to TE IR will reside in `operators/*.{cpp,h}`.
* Auxiliary functions for these functions will reside in
`operators/misc.cpp`. `compute*` functions for ops not belonging
anywhere else can also go to that file.
* `operators/unary.*` is renamed to `operators/pointwise.*` and now
includes functions like `computeTwoOperands`.
* `kernel.*` now contains *only JIT-related* logic and implementations of
`TensorExprKernel` methods.
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D31148923
Pulled By: ZolotukhinM
fbshipit-source-id: e36ad8e779b8d30a33b49ea4ebf6d6a7438989f4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65551
Previously we had a big switch on Op kind to decide how to lower a given
JIT operator to NNC. This PR changes this switch to a hash table lookup.
Why? This helps us with at least two things:
1) With this approach we can easily check if we know how to handle a
given node in advance - i.e. we can inspect the entire graph and tell
whether it's possible to compile it or not without actually trying to do
that and dying in the middle. This would allow us to, say, provide
user-friendly error messages in the AOT workflow.
2) We can switch to using schema instead of op kind to determine the correct
lowering. Unlike op schema, op kind might be ambiguous (see e.g. #64963),
and using it instead of schema can lead to bugs.
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D31148926
Pulled By: ZolotukhinM
fbshipit-source-id: ac12684e2126c899426ef5e4cc1e3f70fa01f704
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65550
This PR adds the source files and the class for the registry, subsequent
PRs actually port existing lowerings to this mechanism.
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D31148922
Pulled By: ZolotukhinM
fbshipit-source-id: 4c087b22ee898d5a5a18a5d2a4bb795aa2ffd655
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65549
Previously it had a special handling, with this change it follows the
same mechanism as other ops.
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D31148924
Pulled By: ZolotukhinM
fbshipit-source-id: 572d8ae5e123e7a0e2a656154d7bd0f73c785a06
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64750
conv2d bias is optional, so it will be ArgNone when processing the graph.
This bias is a prim::Constant of NoneType, so we do not know its shape at the moment of constant binding.
This PR adds it as a constant zeros Tensor at the moment of graph processing. For that, `std::vector<TensorExprKernel::ConstantDescr>& constants` and `std::vector<at::Tensor>& constant_tensors` are added as parameters to `computeOperandValue`, since it is not a member of `TensorExprKernel`.
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D30842101
Pulled By: IvanKobzarev
fbshipit-source-id: 88020f6934e43fe606f8eae928b7e21b7c3f15f6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64887
BufHandle has exactly the same functionality and should be used instead.
Differential Revision: D30889483
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: 365fe8e396731b88920535a3de96bd3301aaa3f3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64828
Also, make `removeUnusedSelfArgument` more consistent with other passes
by mutating the graph in-place rather than returning a copy.
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D30870776
Pulled By: ZolotukhinM
fbshipit-source-id: 4873f01b013921143a5aa43746d655a2d8d620c9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64717
This also exposed several bugs, which are fixed in this PR.
Differential Revision: D30826408
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: a67ec5739aceed9ffdf0d24f77eb3787cefe4560
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64627
This fixes the root cause of S242719
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D30801686
Pulled By: navahgar
fbshipit-source-id: b6d3ebdc7eb57116eaced53c2f35c7798bb17e80
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64589
Adding a softplus operator lowering for NNC. Enabling element-wise fusion as well.
Test Plan: Added a test in test_jit_fuser.py
Reviewed By: bertmaher
Differential Revision: D30736449
fbshipit-source-id: 6c5fc3bceb5cef2322ecd4449f827e4af018ea93
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64516
If fuser compilation fails due to a bug (which should be highly
unlikely at this point), we want to tell the user how to unblock themselves by
disabling fusion, in addition to requesting that they report a bug.
ghstack-source-id: 137398537
Test Plan: existing tests
Reviewed By: ZolotukhinM
Differential Revision: D30758051
fbshipit-source-id: 98be89f1b1d4fb3bc816f5b2634c618b9297930e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64077
We were assuming kernel dimensions fit in 32 bits (the old fuser made
this assumption too), but we should be able to support 64.
ghstack-source-id: 136933272
Test Plan: unit tests; new IR level test with huge sizes
Reviewed By: ZolotukhinM
Differential Revision: D30596689
fbshipit-source-id: 23b7e393a2ebaecb0c391a6b1f0c4b05a98bcc94
Summary:
Fixes https://github.com/pytorch/pytorch/issues/63923
The input graph can contain constants whose names contain special characters. So, all names of constants in the input graph need to be sanitized.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63990
Reviewed By: ZolotukhinM
Differential Revision: D30558432
Pulled By: navahgar
fbshipit-source-id: de5b0c23d50ee8997f40f2c0fc605dda3719186f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63776
I reverted this out of an abundance of caution because some test
failures occurred, but they were all due to precision issues fixed lower in
this stack. Let's try again.
I've rolled the elimination of the allow-parallelism-in-fusions toggle into
this diff since they're pretty tightly coupled.
ghstack-source-id: 136529847
Test Plan: CI
Reviewed By: huiguoo
Differential Revision: D30484555
fbshipit-source-id: 38fd33520f710585d1130c365a8c60c9ce794a59
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63587
Now that there is no classes using KernelArena for memory management we
can remove it.
Differential Revision: D30429115
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: 375f6f9294d27790645eeb7cb5a8e87047a57544
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63586
This is another commit in transition from KernelArena memory management.
Tensor is essentially just a pair of <BufPtr, StmtPtr> and we don't need
to dynamically allocate it at all - it's cheap to pass it by value, and
that's what we're switching to in this commit.
After this change nothing uses KernelScope/KernelArena and they can be
safely removed.
Differential Revision: D30429114
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: f90b859cfe863692b7beffbe9bd0e4143df1e819
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63195
This helps us to later switch from using KernelArena with raw pointers
to shared pointers without having to change all our source files at
once.
The changes are mechanical and should not affect any functionality.
With this PR, we're changing the following:
* `Add*` --> `AddPtr`
* `new Add(...)` --> `alloc<Add>(...)`
* `dynamic_cast<Add*>` --> `to<Add>`
* `static_cast<Add*>` --> `static_to<Add>`
Due to some complications with args forwarding, some places became more
verbose, e.g.:
* `new Block({})` --> `new Block(std::vector<ExprPtr>())`
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D30292779
Pulled By: ZolotukhinM
fbshipit-source-id: 150301c7d2df56b608b035827b6a9a87f5e2d9e9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62336
This PR was generated by removing `const` for all types of nodes in NNC IR, and fixing compilation errors that were the result of this change.
This is the first step in making all NNC mutations in-place.
Test Plan: Imported from OSS
Reviewed By: iramazanli
Differential Revision: D30049829
Pulled By: navahgar
fbshipit-source-id: ed14e2d2ca0559ffc0b92ac371f405579c85dd63
Summary:
As the GoogleTest `TEST` macro, as well as `DEFINE_DISPATCH`, is non-compliant with the `cppcoreguidelines-avoid-non-const-global-variables` check, this check is disabled and the corresponding `NOLINTNEXTLINE` suppressions are removed.
All changes except the ones to `.clang-tidy` are generated using the following script:
```
for i in `find . -type f -iname "*.c*" -or -iname "*.h"|xargs grep cppcoreguidelines-avoid-non-const-global-variables|cut -f1 -d:|sort|uniq`; do sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" $i; done
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008
Reviewed By: driazati, r-barnes
Differential Revision: D29838584
Pulled By: malfet
fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
Summary:
This PR suppresses clang-tidy warnings in the codebase (for now) so that we can re-enable clang-tidy checks on master.
I ran this script to add the `NOLINTNEXTLINE` comments (on a devserver):
```bash
python3 setup.py develop
# Uses same script that's run on CI and adds the -j (parallel), -s (add comments), -k (continue if diagnostic errors are found) options
python3 tools/clang_tidy.py \
-j \
-s \
-k \
-v \
--paths torch/csrc/ \
-g"-torch/csrc/jit/passes/onnx/helper.cpp" \
-g"-torch/csrc/jit/passes/onnx/shape_type_inference.cpp" \
-g"-torch/csrc/jit/serialization/onnx.cpp" \
-g"-torch/csrc/jit/serialization/export.cpp" \
-g"-torch/csrc/jit/serialization/import.cpp" \
-g"-torch/csrc/jit/serialization/import_legacy.cpp" \
-g"-torch/csrc/onnx/init.cpp" \
-g"-torch/csrc/cuda/nccl.*" \
-g"-torch/csrc/cuda/python_nccl.cpp" \
-g"-torch/csrc/autograd/FunctionsManual.cpp" \
-g"-torch/csrc/generic/*.cpp" \
-g"-torch/csrc/jit/codegen/cuda/runtime/*" \
-g"-torch/csrc/deploy/interpreter/interpreter.cpp" \
-g"-torch/csrc/deploy/interpreter/interpreter.h" \
-g"-torch/csrc/deploy/interpreter/interpreter_impl.h" \
-g"-torch/csrc/deploy/interpreter/test_main.cpp"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60649
Test Plan: Verified changes by re-running the script (without the `-s` option) and seeing no warnings/errors.
Reviewed By: walterddr, janeyx99
Differential Revision: D29504258
Pulled By: 1ntEgr8
fbshipit-source-id: 78310b30ee8213b73ddb4771ad874665323e7a4e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60804
The lowerings are stored as a map c10::Symbol -> std::function, and the
signatures of those functions match the signature of
`computeOperandValue`. Custom lowerings have higher priority than the
standard ones, i.e. we can redefine how a given op is lowered.
In general, this feature is aimed at unblocking users whose models
contain ops that are not yet supported by NNC - it allows them to
quickly add a custom lowering for a given op.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D29409580
Pulled By: ZolotukhinM
fbshipit-source-id: e8e8dc9d3cb9155cfbf5c08a4216ba1b5b791a60
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59347
We had external call wrappers for them, but they were not used in NNC.
This PR adds lowerings using these ext calls and fixes some bugs in
them.
Test Plan: Imported from OSS
Reviewed By: jbschlosser
Differential Revision: D28853832
Pulled By: ZolotukhinM
fbshipit-source-id: 1718400368e1a9cf3f19180ee2290a4ed9c99d41
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59346
Currently JIT has a pass to propagate shapes, but doesn't have the
capability to fill in strides and dtypes. This PR works around that by
assuming default dtype to be Float and strides corresponding to
contiguous layout, unless otherwise specified. Ideally, we won't need
this, and this is done simply as a workaround unless the corresponding
features are implemented on JIT side.
This is required for AOT compilation of mobilenet v3 with NNC.
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D28853831
Pulled By: ZolotukhinM
fbshipit-source-id: 81adb59409684f39b444909ab8ec58ee4a39d496
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59988
As we broaden operator support, putting all the implementations into
kernel.cpp is getting unwieldy. Let's factor them out into the "operators"
subdirectory.
This diff is big but it's entirely code movement; I didn't change anything,
other than to expose a few utilities in kernel.h.
ghstack-source-id: 131405139
Test Plan: CI
Reviewed By: ZolotukhinM
Differential Revision: D29115916
fbshipit-source-id: ba0df1d8dd4a108b584da3baf168407e966b2c78
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59754
Also, if inputs are contiguous, use their Placeholders
directly rather than generating contiguous Tensors from them.
The rationale for this change is that aten::matmul and aten::conv2d
support transposed inputs; if NNC generates a physical transpose to
perform an external call, performance will be strictly worse than not
fusing (sometimes dramatically so, as in the attached benchmark).
Test Plan: benchmark
Reviewed By: ZolotukhinM
Differential Revision: D29010209
fbshipit-source-id: da6d71b155c83e8d6e306089042b6b0af8f80900
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59508
An assert that was triggering in a previous version is now relaxed to
take 0-dim tensors into account.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D28918342
Pulled By: ZolotukhinM
fbshipit-source-id: c09b62c9725d1603b0ec11fcc051e7c932af06ae