Commit Graph

73 Commits

Author SHA1 Message Date
Edward Yang
a894fff265 Back out "Revert D21089648: Put TORCH_LIBRARY in torch/library.h; add custom class API"
Summary: Original commit changeset: 636e8a11afc6

Test Plan: export to OSS

Reviewed By: malfet

Differential Revision: D21170502

fbshipit-source-id: e8f35f103c4924aedbcaaf868475008d24bdeeab
2020-04-22 09:18:23 -07:00
James Reed
2ccdc39dce Revert D21089648: Put TORCH_LIBRARY in torch/library.h; add custom class API
Test Plan: revert-hammer

Differential Revision:
D21089648

Original commit changeset: 8d54329c1252

fbshipit-source-id: 636e8a11afc628a4cdae9d44824985c10c70555e
2020-04-21 12:21:45 -07:00
Edward Yang
01100cb477 Put TORCH_LIBRARY in torch/library.h; add custom class API (#36742)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36742

Now, you can define a custom class inside a TORCH_LIBRARY block.
It looks very similar to what you did before.  Instead of

```
static auto m = torch::class_<Class>("Namespace", "Class").def("foo", foo);
```

you write

```
TORCH_LIBRARY(Namespace, m) {
  m.class_<Class>("Class")
    .def("foo", foo);
}
```

All the old usages still work, but at some point we should start
updating the tutorials when we're ready to go 100% live with the
new pybind11 style API.

The custom class API previously lived in the torch/ folder and in the torch
namespace, so for consistency, the new TORCH_LIBRARY also got
moved to torch/library.h. The definition of Library::class_ is at the
bottom of that header because I need all of the class_ constructors
available, but there is a circular dependency between the two headers.
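
For illustration, a complete definition under the new API might look roughly
like the following sketch (MyStack, its methods, and the my_classes namespace
are hypothetical names, not from this commit):

```
#include <torch/custom_class.h>
#include <torch/library.h>

#include <vector>

// Hypothetical custom class used only for illustration.
struct MyStack : torch::CustomClassHolder {
  std::vector<int64_t> data_;
  void push(int64_t x) { data_.push_back(x); }
  int64_t pop() {
    int64_t v = data_.back();
    data_.pop_back();
    return v;
  }
};

TORCH_LIBRARY(my_classes, m) {
  // The class is defined inside the library block instead of via a
  // standalone static torch::class_<...> initializer.
  m.class_<MyStack>("MyStack")
      .def(torch::init<>())
      .def("push", &MyStack::push)
      .def("pop", &MyStack::pop);
}
```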

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: D21089648

Test Plan: Imported from OSS

Pulled By: ezyang

fbshipit-source-id: 8d54329c125242605336c22fa1642aae6940b507
2020-04-21 10:05:21 -07:00
Edward Yang
e29348f828 Switch to pybind11 style registration function API. (#36258)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36258

Previously we had a && chaining style API.  There are some downsides to
this API:

- It's easy to forget the 'static' qualifier in front, leading to
  subtle ODR bugs.
- It is not compatible with torchbind class_ definitions, as these
  need multiple levels of chaining.  So in practice people end
  up having to define multiple static initializers, one per class.
- It's not like pybind11.
- There's no way to conveniently get the file and line number of
  the registration, as there is no macro point in the API.
- The old API doesn't really encourage people to put all of their
  definitions for a library in one place, and to give a custom
  namespace for it.  Similarly, the old API wasn't very DRY, because
  you had to keep repeating the namespace/dispatch key you
  were writing implementations for.

The new API is modeled exactly off of the PYBIND11_MODULE macro:
you write:

```
TORCH_LIBRARY(aten, m) {
  m.def("aten::add(Tensor self, Tensor other) -> Tensor");
  ...
}
```

in a non-chaining fashion, and under the hood the macro expands to
define a function, and define a static initializer that allocates
c10::Library (previously called c10::Module, but we renamed it
to avoid confusion with the existing NN module concept), passes
it to your function, and then retains it for the rest of the lifetime
of the program.  Specification of the namespace is mandatory,
and in a later commit I plan to make it a hard error to TORCH_LIBRARY
the same library name twice.

If you are specifying an implementation for an existing operator
(e.g., you're the XLA backend, or even if you're just putting
registrations for implementations at the implementation site),
you should use TORCH_LIBRARY_IMPL, which takes a backend
argument in addition to the namespace and can be used to specify an
implementation for that backend.  Unlike TORCH_LIBRARY, you can do
as many of these as you want for a backend.
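
As a hedged sketch (the namespace, schema, and kernel below are hypothetical),
the two macros fit together roughly like this:

```
#include <ATen/ATen.h>
#include <torch/library.h>

// Hypothetical CPU kernel for a hypothetical operator "myops::my_relu".
at::Tensor my_relu_cpu(const at::Tensor& self) {
  return self.clamp_min(0);
}

// Define the operator schema once for the "myops" namespace.
TORCH_LIBRARY(myops, m) {
  m.def("my_relu(Tensor self) -> Tensor");
}

// Register a backend-specific implementation; blocks like this can appear in
// as many translation units as needed.
TORCH_LIBRARY_IMPL(myops, CPU, m) {
  m.impl("my_relu", my_relu_cpu);
}
```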

This needs updates to the mobile code analyzer.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20929257

Pulled By: ezyang

fbshipit-source-id: ba04d78492e8c93ae7190165fb936f6872896ada
2020-04-16 10:44:21 -07:00
Jiakai Liu
f98e0a099a [pytorch] handle pybind11 style registration API with code analyzer (#36607)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36607

PR #36258 and subsequent PRs in the stack switch c10 registrations to
the new pybind11 style registration API. One notable difference from the old
c10 registration API is that the operator's namespace is no longer part of
the op schema string, e.g. "aten::" is factored out of "aten::conv",
"aten::empty", etc. The namespace string is instead declared at the
beginning of the registrations with the TORCH_LIBRARY / TORCH_LIBRARY_IMPL
macro.

A rather simple fix is to extract the namespace string from the name of the
function enclosing the registrations, as the TORCH_LIBRARY macro always
creates an init function (one per namespace) by appending the namespace
string to a common prefix, as sketched below.
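
As a rough sketch of this assumption (illustrative shape only, not the exact
macro expansion), a block like TORCH_LIBRARY(aten, m) { ... } ends up as the
body of a generated init function whose symbol name is a common prefix plus
the namespace, which is what the analyzer reads the namespace back out of:

```
// Illustrative only; the real generated names may differ.
void TORCH_LIBRARY_init_aten(torch::Library& m) {  // "aten" recoverable from the symbol name
  // User-written registrations; note the schema string no longer carries "aten::".
  m.def("my_op(Tensor self) -> Tensor");
}
// A static initializer (not shown here) constructs the torch::Library for
// "aten" and calls TORCH_LIBRARY_init_aten with it.
```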

Another side effect of the API change is that it adds some debug string
constants to the registration API, and because the namespace part is
factored out of the op name, there is no longer an effective way to
differentiate between the real op name and debug strings. A simple
workaround is to keep only the first string constant encountered
while BFSing the LLVM IR - the real op name is passed directly into the
registration call while the debug string is passed indirectly via
CppFunction.

These new assumptions might be broken by future changes, but this is
simple to implement and unblocks the API work.

Test Plan: Imported from OSS

Differential Revision: D21026008

Pulled By: ljk53

fbshipit-source-id: c8c171d23aaba6d6b7985d342e8797525126a713
2020-04-15 11:03:41 -07:00
Jiakai Liu
9cac2b83d9 [pytorch] improve code analyzer to dump ops called from c++ functions (#35941)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35941

The key step of mobile custom build is finding out the ops used by a specific
model, with which it can produce a tailored build of optimal size.

However, ops are not only called from a TorchScript model; they can also
be called from C++ code directly, e.g. via torch::jit:: APIs. With
static dispatch, ops called this way are statically linked into client
code. With dynamic dispatch, we need to obtain & keep these ops explicitly.

This PR improves the static code analyzer to dump ops that are called from
visible C++ symbols matching a specific regex. This provides a mechanism
to solve the custom build problem with dynamic dispatch.
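
For instance (a hedged sketch, not code from this PR), an op can be reached
from plain C++ through torch::jit:: APIs, which is exactly the kind of usage
this analysis has to account for:

```
#include <torch/script.h>

// A C++ caller that appears in no TorchScript model, yet pulls in whatever
// ops torch::jit::load and Module::forward call internally.
at::Tensor run_model(const std::string& path, const at::Tensor& input) {
  torch::jit::Module module = torch::jit::load(path);
  return module.forward({input}).toTensor();
}
```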

It starts by dumping ops that are callable from functions in the torch::jit
namespace and including them in the custom build with dynamic dispatch. We can
extend it to analyze custom code, to refine the set of relevant JIT APIs, etc.
This is just a preliminary version; we need to improve its usability for more
general purposes.

Test Plan: Imported from OSS

Differential Revision: D20835166

Pulled By: ljk53

fbshipit-source-id: a87cfb22b34f89545edd0674a5dfca6b7cff2b0c
2020-04-14 23:21:19 -07:00
Edward Yang
2de3e491a8 [RELAND] Add temporary impl_UNBOXED syntax sugar for unboxed-only defs. (#36223)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36223

Previously #35714

There are a lot of unboxed-only defs.  We're committed to removing
them at the end of the half, but as I am about to do a lot of porting
to the new API, let's get them into a form where they're easy to
remove.  This adds a new overload, impl_UNBOXED, that passes
the function pointer straight to CppFunction::makeUnboxedOnly.

I don't attempt to make the _UNBOXED API complete; in particular,
catchall declarations don't get this sugar (as there are very few
of them).
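
A hedged sketch of what a use site might look like (the namespace, operator,
and kernel below are hypothetical, not from this commit):

```
#include <ATen/ATen.h>
#include <torch/library.h>

// Hypothetical unboxed-only kernel.
at::Tensor my_op_cpu(const at::Tensor& self) {
  return self.clone();
}

TORCH_LIBRARY_IMPL(myops, CPU, m) {
  // Before: wrap the function pointer by hand.
  //   m.impl("my_op", torch::CppFunction::makeUnboxedOnly(&my_op_cpu));
  // After: impl_UNBOXED forwards it to CppFunction::makeUnboxedOnly.
  m.impl_UNBOXED("my_op", &my_op_cpu);
}
```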

To get some coverage of the _UNBOXED API for code analysis, I switched
one of our unboxed tests to be an impl rather than a def.  This
shouldn't materially affect coverage.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20929259

Pulled By: ezyang

fbshipit-source-id: 72d2061b6c8a6afbcd392b47f53ade18de2f9184
2020-04-09 14:58:33 -07:00
Edward Yang
ef07bb65e9 [RELAND] Add DispatchKey impl overload; remove use of torch::dispatch (#36222)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36222

Reland of #35706, with fixes to code analyzer.

It is extremely common to define implementations of operators at a
specific dispatch key, so we add an overload to impl specifically for
this case.  I then delete most uses of torch::dispatch.

dispatch_autograd call sites can't make use of this overload.  So
instead the new preferred way to specify something as autograd is to
pass kAutograd as the dispatch key (a short form, analogous to kCPU/kCUDA,
which we support today).

I flip-flopped about whether or not kAutograd should have the type
DispatchKey or some other type (to better encapsulate the
DispatchKey enum); this is more direct and I can't think of any
BC problems from this usage.
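
A hedged sketch of the resulting call sites (all names below are hypothetical):

```
#include <ATen/ATen.h>
#include <torch/library.h>

at::Tensor my_op_cpu(const at::Tensor& self) { return self.clone(); }
at::Tensor my_op_autograd(const at::Tensor& self) { return self.clone(); }

TORCH_LIBRARY(myops, m) {
  m.def("my_op(Tensor self) -> Tensor");
  // New overload: pass the dispatch key directly instead of torch::dispatch(...).
  m.impl("my_op", c10::DispatchKey::CPU, my_op_cpu);
  // Autograd registrations use the short-form kAutograd key.
  m.impl("my_op", torch::kAutograd, my_op_autograd);
}
```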

Some other reorganization I did:
- I renamed all of the worker functions in op_registration to have
  a leading underscore and made them private, just to make it clearer
  what the public versus private API is (the private API shouldn't be
  used by users because it doesn't come with && overloads).
  Note that this means I needed to adjust the regex in the
  code analyzer.
- In a few places where I was touching lines already, I replaced
  fully typed-out DispatchKey enums with shorter kFoo names, similar
  to kAutograd, but I didn't publish these globally.
- The code analyzer now prints a unified diff, and in the other order
  (because I tend to think of the diff as reporting how the /new/ result
  is different).

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20929256

Pulled By: ezyang

fbshipit-source-id: c69b803d2b3a1a8aff70e14da33d3adec5239f13
2020-04-09 14:56:55 -07:00
Edward Yang
4c8e38c6d7 Minor doc improvement for code_analyzer (#36177)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36177

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20904241

Pulled By: ezyang

fbshipit-source-id: b13584dfdb1f852e451b1295c0d4cd4a7f53712f
2020-04-08 08:14:50 -07:00
Jiakai Liu
1783ea43e7 [pytorch] deprecate code analyzer -closure option (#35179)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35179

Transitive dependencies are calculated in the Python script for both the OSS custom build and BUCK selective build, so change the C++ analyzer to take -closure=false by default and remove the param from call sites.
ghstack-source-id: 100637068

Test Plan: CI

Differential Revision: D20586462

fbshipit-source-id: 195849b71cda6228a49ecd2215d3fb8b4da7f708
2020-03-22 14:36:42 -07:00
Jiakai Liu
064c478453 [pytorch] register c10 ops for static dispatch to unblock c10 boxing
Summary:
PR #32521 broke static dispatch because some ops are no longer
registered in register_aten_ops_*.cpp - it expects the c10 registrations in
TypeDefault.cpp / CPUType.cpp / etc. to register these ops. However, all
c10 registrations are inside the `#ifndef USE_STATIC_DISPATCH` section.

To measure the OSS mobile build size impact of this PR:
```
 # default build: scripts/build_pytorch_android.sh armeabi-v7a
 # mobilenetv2 custom build: SELECTED_OP_LIST=MobileNetV2.yaml scripts/build_pytorch_android.sh armeabi-v7a
```

- Before this PR, Android AAR size for arm-v7:
* default build: 5.5M;
* mobilenetv2 custom build: 3.2M;

- After this PR:
* default build: 6.4M;
* mobilenetv2 custom build: 3.3M;

It regressed the default build size by ~1M because more root ops are
registered by the c10 registrations, e.g. backward ops, which are filtered
out by gen_jit_dispatch.py for the inference-only mobile build.

The mobilenetv2 custom build size regressed by ~100k, presumably because
the op whitelist is not yet applied to things like BackendSelectRegister.

Differential Revision: D20266240

Test Plan: Imported from OSS

Pulled By: ljk53

fbshipit-source-id: 97a9a06779f8c62fe3ff5cce089aa7fa9dee3c4a
2020-03-20 20:07:15 -07:00
Jiakai Liu
3c042a6ab9 [pytorch][mobile] support for custom mobile build with dynamic dispatch (#34055)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34055

Enable custom mobile build with dynamic dispatch for OSS build.

It calls a Python util script to calculate transitive dependencies from
the op dependency graph and the list of used root ops, then passes the
result as the op registration whitelist to aten codegen, so that only
these used ops are registered and kept at link time.

For custom build with dynamic dispatch to work correctly, it's critical
to have an accurate list of used ops. The current assumption is that only
those ops referenced by the TorchScript model are used. It works well if
client code doesn't call the libtorch API (e.g. tensor methods) directly;
otherwise the extra used ops need to be added to the whitelist manually,
as shown by the HACK in prepare_model.py.

Also, if the JIT starts calling extra ops independent of a specific model,
then those extra ops need to be added to the whitelist as well.

Verified the correctness of the whole process with MobileNetV2:
```
TEST_CUSTOM_BUILD_DYNAMIC=1 test/mobile/custom_build/build.sh
```

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D20193327

Pulled By: ljk53

fbshipit-source-id: 9d369b8864856b098342aea79e0ac8eec04149aa
2020-03-03 19:25:16 -08:00
Jiakai Liu
c596ec7eb3 [pytorch] update code analyzer script to cover new c10::Module::def API (#33975)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33975

Currently the code analysis script doesn't go beyond the scope of the
registration API call, i.e. calling registration via a wrapper is not
covered by the analysis - and the new API is currently essentially a
wrapper around the old API.

Simply adding the new API signature to the registration API pattern
solves the problem for now. We might need to change the analyzer code if
things change significantly in the future.

Test Plan:
- update test project to use the new API;
- run analyzer against pytorch codebase;

Differential Revision: D20169549

Pulled By: ljk53

fbshipit-source-id: c7925fa0486eee18f07e791a38c32152fee59004
2020-02-29 10:29:45 -08:00
Jiakai Liu
495c1df510 [pytorch] convert code analyzer to a binary (#33102)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33102

Add a simple main() to build the code analyzer as a binary. This enables
easier integration with the FB internal build environment.
ghstack-source-id: 97958658

Test Plan: - CI

Differential Revision: D19798560

Pulled By: ljk53

fbshipit-source-id: 126230e3bf7568046a309e8a6785230f820e0222
2020-02-10 14:46:29 -08:00
Jiakai Liu
9e0ce72e9e [pytorch] change op dependency output to use double-quoted strings (#32464)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32464

Changed to double-quoted strings to make the FB linter happy.

Test Plan: Imported from OSS

Differential Revision: D19507859

Pulled By: ljk53

fbshipit-source-id: fa70535c7fbea73214b3b0efb0532184b5ee6854
2020-01-24 15:27:28 -08:00
Jiakai Liu
fd1a4f18ee [pytorch] update code analyzer build.sh to handle srcs with same name (#32525)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32525

Before calling the static code analyzer we need to link all bitcode files
into a single module. The current approach is a bit hacky: cmake still calls
"ar" to pack bitcode files into archives, then we manually unpack these
archives and call llvm-link.

It turns out libtorch_cpu.a contains a few files with the same name, e.g.:
```
aten/src/ATen/native/SoftMax.cpp
aten/src/ATen/native/mkldnn/SoftMax.cpp
```

"ar x" will only keep one of them and cause inaccurate analysis result.

Use this temporary hack to work around the problem. Ideally this step should
be merged into cmake (e.g. by directly calling llvm-link to produce the
target output?).

Differential Revision: D19530533

Pulled By: ljk53

fbshipit-source-id: 94b292c241abaaa0ff4a23059882abdc3522971e
2020-01-24 12:37:30 -08:00
Jiakai Liu
0ac31a99be run code analysis against mobile interpreter (#32276)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32276

Include the mobile interpreter in the mobile code analysis pass; it has some
manually registered ops in temporary namespaces.

The mobile interpreter is still under development and these ops will be
removed in the future. This is a temporary step for an internal build
experiment.

Test Plan: Imported from OSS

Differential Revision: D19426818

Pulled By: ljk53

fbshipit-source-id: 507453dc801e5f93208f1baea12400beccda9ca5
2020-01-17 17:21:28 -08:00
Jiakai Liu
346005d3ed integrate op dependency analysis process into CI
Summary:
The custom build and internal build will depend on the analysis result, so
let's make sure it doesn't break.

Tested locally with LLVM-5.0, LLVM-7 and LLVM-8.

Test Plan: - check CI result

Differential Revision: D18894637

Pulled By: ljk53

fbshipit-source-id: 657854e4bed85a84907e3b6638d158823a56ec80
2020-01-10 11:37:37 -08:00
Jiakai Liu
fc598f9023 generate op dependency graph as python code
Summary:
Add support to print the op dependency graph as Python code so that both the
custom build script and BUCK can import it without a YAML parser.

Test Plan:
- generate the file:
```
ANALYZE_TORCH=1 FORMAT=py DEPLOY=1 tools/code_analyzer/build.sh -closure=false
```

- load the file in python:
```
python
>>> from tools.code_analyzer.generated.torch import TORCH_DEPS
>>> print(TORCH_DEPS)
```

Differential Revision: D18894639

Pulled By: ljk53

fbshipit-source-id: e304d0525a07a13cf6e8a9317cd22637200d044c
2020-01-02 20:26:28 -08:00
Jiakai Liu
baccd26df7 update code analyzer script to handle splitted torch libraries (#30864)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30864

Change it to handle all archive files under install folder.

Test Plan:
```
ANALYZE_TEST=1 CHECK_RESULT=1 tools/code_analyzer/build.sh
ANALYZE_TORCH=1 tools/code_analyzer/build.sh
```

Differential Revision: D18850317

Pulled By: ljk53

fbshipit-source-id: 7c57ae16c82b6ded53aa7df385f3b6074190fc04
2019-12-06 14:38:30 -08:00
Jiakai Liu
be55874f2c style fixes to code analyzer (#30808)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30808

Addressed some comments on #29550 after it landed.

Test Plan:
```
LLVM_DIR=... ANALYZE_TEST=1 CHECK_RESULT=1 tools/code_analyzer/build.sh
LLVM_DIR=... ANALYZE_TORCH=1 tools/code_analyzer/build.sh -closure=false -debug_path=true
```

Differential Revision: D18835100

Pulled By: ljk53

fbshipit-source-id: 991d292ddc0211a88b04d0bdc24719f471c7786e
2019-12-05 11:25:37 -08:00
Jiakai Liu
c0299d2707 add LLVM code analyzer in order to replace static dispatch
Summary:
[Why static dispatch]
Static dispatch was introduced to allow stripping out unused ops at link
time (with the “gc-sections” linker flag) for mobile builds.

The alternative approaches to do "non-static" dispatch are:
* virtual methods - old ATen dispatcher, which has already been deprecated;
* registry pattern - used by caffe2, c10 and JIT;

However, none of them are “gc-sections” friendly. Global registration
objects are root symbols - the linker cannot strip out any op if we use the
registry pattern for mobile.

[Why static dispatch isn’t great]
* One more code path to maintain;
* Need to recompile the framework to add new backends/ops;
* Doesn’t support AutoGrad yet, thus blocking on-device training;

[Static Code Analysis]
This PR introduces an LLVM analysis pass. It takes LLVM bitcode /
assembly as input and generates a dependency graph among aten ops. From the
set of root ops used by a model, we can calculate the transitive closure of
all dependent ops, then ask codegen to register only these ops.

[Approach]
To generate the dependency graph it searches for 3 types of connections in
LLVM bitcode / assembly:
 1) op registration: op name (schema string literal) -> registered function;
 2) regular function call: function -> function;
 3) op invocation: function -> op name (schema string literal)

For 2) it uses a similar algorithm to llvm::LazyCallGraph - it not only looks
into call/invoke instructions but also recursively searches for function
pointers in each instruction's operands.

For 1) and 3) it searches for connections between operator name string
literals / function pointers and c10 op registration/invocation API calls in
the LLVM IR graph via "use" edges (bi-directional), as sketched below:
 1. llvm::Value has a "users()" method to get the other llvm::Value nodes
    that use the value;
 2. most types derive from llvm::User, which has an "operands()" method to
    get the other llvm::Value nodes used by the value.
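
Roughly (an illustrative sketch, not the analyzer's actual code), walking one
hop along those bi-directional use edges looks like:

```
#include <llvm/ADT/SmallVector.h>
#include <llvm/IR/User.h>
#include <llvm/IR/Value.h>
#include <llvm/Support/Casting.h>

// Collect neighbors of `v` along "use" edges in both directions: the values
// that use `v` (users) and the values that `v` uses (operands).
void collectUseNeighbors(llvm::Value* v,
                         llvm::SmallVectorImpl<llvm::Value*>& out) {
  for (llvm::User* u : v->users()) {
    out.push_back(u);
  }
  if (auto* user = llvm::dyn_cast<llvm::User>(v)) {
    for (llvm::Value* op : user->operands()) {
      out.push_back(op);
    }
  }
}
```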

[Limitation]
For now the search doesn't go beyond the function boundary because the
references to op name string literals and the c10 op registration/invocation
APIs are almost always in the same function.

The script uses regular expressions to identify c10 API calls:
* op_schema_pattern="^(aten|quantized|profiler|_test)::[^ ]+"
* op_register_pattern="c10::RegisterOperators::(op|checkSchemaAndRegisterOp_)"
* op_invoke_pattern="c10::Dispatcher::findSchema|callOp"

If we create helper functions around the c10 API (e.g. the "callOp" method
defined in aten/native), we can simply add them to the regular expressions
used to identify c10 API calls.

[Example]
In the following example, it finds out:
 1) the registered function for the "quantized::add" operator;
 2) one possible call path to the at::empty() function;
 3) the called operator name "aten::empty":

- "quantized::add"
- c10::detail::wrap_kernel_functor_unboxed_<at::native::(anonymous namespace)::QAdd<false>, at::Tensor (at::Tensor, at::Tensor, double, long)>::call(c10::OperatorKernel*, at::Tensor, at::Tensor, double, long)
- at::native::(anonymous namespace)::QAdd<false>::operator()(at::Tensor, at::Tensor, double, long)
- void at::native::DispatchStub<void (*)(at::Tensor&, at::Tensor const&, at::Tensor const&), at::native::qadd_stub>::operator()<at::Tensor&, at::Tensor const&, at::Tensor const&>(c10::DeviceType, at::Tensor&, at::Tensor const&, at::Tensor const&)
- at::native::DispatchStub<void (*)(at::Tensor&, at::Tensor const&, at::Tensor const&), at::native::qadd_stub>::choose_cpu_impl()
- void at::native::(anonymous namespace)::qadd_kernel<false>(at::Tensor&, at::Tensor const&, at::Tensor const&)
- at::TensorIterator::binary_op(at::Tensor&, at::Tensor const&, at::Tensor const&, bool)
- at::TensorIterator::build()
- at::TensorIterator::fast_set_up()
- at::empty(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>)
- "aten::empty"

[How do we know it’s correct?]
* Built a test project that contains the different op registration/invocation
  patterns found in the pytorch codebase, including both codegen and
  non-codegen cases.
* Tried different optimization flags “-O0”, “-O3” - the result seems to
  be stable.
* Filtered by common patterns: “aten::”, “at::”, “at::native”,
  “at::CPUType”, “at::TypeDefault” - manually checked that the relationships
  between function schema strings and corresponding implementations were
  captured.
* It can print instruction-level data flow and show a warning message if it
  encounters unexpected cases (e.g. found 0 or multiple op names per
  registration/invocation API call, found 0 registered functions, etc.).
* Verified consistent results on different Linux / macOS hosts. It can
  handle different STL library ABIs reliably, including rare corner cases
  for short string literals.

[Known issues]
* Doesn’t handle C code yet;
* Doesn’t handle overload names yet (all variants are collapsed into the
  main op name);

Test Plan:
```
LLVM_DIR=... ANALYZE_TEST=1 CHECK_RESULT=1 scripts/build_code_analyzer.sh
```

Differential Revision: D18428118

Pulled By: ljk53

fbshipit-source-id: d505363fa0cbbcdae87492c1f2c29464f6df2fed
2019-12-04 01:02:33 -08:00
Jiakai Liu
d456a538f9 op dependency analysis bash driver
Summary:
Move the shell script into this separate PR to make the original PR
smaller and less scary.

Test Plan:
- With stacked PRs:
1. analyze test project and compare with expected results:
```
ANALYZE_TEST=1 CHECK_RESULT=1 tools/code_analyzer/build.sh
```

2. analyze LibTorch:
```
ANALYZE_TORCH=1 tools/code_analyzer/build.sh
```

Differential Revision: D18474749

Pulled By: ljk53

fbshipit-source-id: 55c5cae3636cf2b1c4928fd2dc615d01f287076a
2019-12-04 00:12:24 -08:00