* Factor python dependency out of interpreter
* Remove NO_PYTHON for the autograd engine
If there is no python bindings, then a default Engine is constructed
the first time it is requested.
If the python libraries are loaded, then they override the default
accessor and the default engine becomes a python Engine.
Note: it is possible for two engines to be generated if a non-python
one gets created before the python bindings are loaded. This case
is rare, and just results in additional threads being spawned.
* Fixing AlexNet test which is skipped in CI
Improve script builtin checking using schema
* This add aten_schema.h which provides a barebones amount of type and
argument information about each builtin operator
* emitBuiltinCall is updated to use this information rather than
aten_dispatch to ensure the operator is correct.
* handling of keyword and position arguments now matches python behavior
* There is no longer a requirement that kwargs be constant or that the
attributes of an op must be entirely constant or non-constant
* compiler now constructs a non-attributed version of the op first and
then turns it into the constant-attribute version if all attributes
are constants.
* default arguments for builtins now work
* SugaredValue::call and similar functions now have SourceRange information
for their arguments so that error reporting is more accurate
Notes:
* This does not try to merge the builtin checking with python arg parser.
Given that we will eventually have C10 schema which will replace aten_schema,
we will eventually have a C++ description of the schema and working of that
description directly will be the easiest form to understand.
* python function calls and script method calls do not support keyword arguments yet.
When we add this support we should refactor the handling in tryEmitSchema
that resolves keywords into a common function.
* default arguments work
* keyword arguments to builtins work (still need to extend to calling python and other script methods)
* much better error reporting for incorrect builtins
Lift any constants to attributes on nodes when possible
* Schema is usable internally in the compiler as
the function signatures of script functions as well as for builtin
operators.
* Adds a List[T] class to better represent the arguments to cat/stack
as a type rather than with custom checking.
* Support kwargs for calls of script methods
A future commit will be needed to add support for:
* calls to script _functions_ which are currently are GraphExecutors without schema info.
* kwargs to python functions, which will require refactoring python op
* this removes the flag controlling whether the interpreter works on variables.
* now the interpreter _always_ works on variables
* constants in the IR are still _always_ non-variables, and an assert was added to ensure this.
* as_tensor was split into as_variable and as_tensor since it is sometimes used
to construct constants in the IR
* I tried changing the IR to also always use variables but that change was much more
cross cutting and fragile and I never got it working
The long-term fix is to remove the handling-creating pathways and
remove all the modes from PythonOp making it into an op that simply
calls a PyObject. Right now ONNX expects PythonOp to hold a
nn.Function, not a generic callable, so completely removing the legacy
pathway will also require changes to how ONNX symbolics are found.
* Implement range for loop in script
* Fix handling of boolean constants
* Use WithInsertPoint
* Allow dynamic max trip count
* fix symbols
* Fix argument order
* fix test
* Add insert{Input,Output} APIs and use them
* Factor out condition stuff
* clang-format
* Address remaining comments
* Fix tests
* Implement script in AST frontend
* Namespaced symbols
- Our interned strings now have structure, "ns::symname" rather than just
"symname" before. We support efficient namespace testing for uniques
by encoding the namespace in one byte in the Symbol internal representation.
See torch/csrc/jit/interned_strings.h for a more in-depth implementation
discussion.
- All uses of ksymbol are now attr::symbol (or some appropriate namespace).
The valid namespaces are prim, attr, onnx and aten.
- Symbol is bound in Python as a qualified string "attr::symbol", EXCEPT for the
attribute setting/getting API, whose symbols must always be attr
symbols; they get special cased to assume strings are passed.
There's a little bit of naughtiness in the implementation, maybe you know
how to solve it.
- However, the g.op() convenience function assumes that you're generating
ONNX operators, unless you explicitly qualify.
- All ATen operators and nodes have built-in interned strings generated
for them, so you should never have to write a string literal ever again.
The tracing code is adjusted to use it.
- ONNX exporter now properly tests to see that all operators are in
onnx namespace before accepting the export. This is way more
robust than the previous exporter, which would be willing to
export capitalized operators which were not actually ONNX operators.
- A slight organizational change for symbolic.py; this module now ONLY
contains aten operators. In particular, the exporter for Constant
has moved into utils.py (along with Undefined, from the C++ side),
since primitive ops get "special treatment."
- The un-inplacing logic in recording is more robust, so that we don't
delete a trailing underscore from __and__. This never affected us
before because we didn't have any tests for it.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Add script::Module C++ class to represent script modules
switch AST -> IR conversion to work on Modules/Methods rather than raw graphs
function-only AST -> IR conversion is just a simplified case where there is
only one module with a single method and no parameters.
introduce SugaredValue in compiler.h to represent values in scope in a script
function that are not first-class and that get desugared. This is used to
represent the module's self parameter, as well as python function calls,
and method calls on tensor
provide a Python ScriptModule that provides a nice API on top of script::Module
allowing for the definition of script modules with methods, parameters,
and submodules
Not in this PR but intended for the future:
ScriptModule actually subclasses nn.Module, with most methods implemented
Unification of tracedmodule and script module functionality into one container class.
Detailed changelog:
* Switch compiler over to using Module, but don't
use them yet.
* Remove intermediate attribute encoding in compiler
* Create SugaredValue object to handle resolution
of compiled module.
* switch to_ir to modules, implement Select
* hacky python wrappers
* Private ScriptModule
* Add `define` to script module
* Attributes use TK_LIST_LITERAL
this anticipates adding a real list literal expression to the language.
* Add a metaclass to make sure script stubs are registered
* Add a test
* Doc createResolutionCallback
* Docs and minor editing
* Address PR comments
* Document
* Fix unicode issue
This PR adds the possibility to build the C++ parts of autograd and jit, with no dependency on Python.
The goal is to allow taking a PyTorch IR representation (a tree s-expr) and running it with provided inputs.
Prerequisite: build PyTorch so that codegen runs once.
Instructions:
cd tools/cpp_build
bash build_all.sh
This will build libtorchjit and torchjit_test in tools/cpp_build/build/torchjit-build. The latter basically runs the code in test_jit.cpp for now.
While writing the PR, it turned out that a few of Python.h includes were redundant. They were removed here (PyTorch tests still pass on my machine, we'll see CI).
* Introduce Python-free builds of autograd and jit
* Remove NO_PYTHON ifdef in functions/special
* Fix a bug gen_jit_dispatch.py
The `fromLast` function is confusing to understand since `fromLast(stack, 0)`
was actually invalid whereas `fromLast(stack, 1)` was the last element.
This created off-by-one bugs in gen_jit_dispatch for some operators.
This changes it to `peek(stack, i, N)` which treats the last `N`
elements of the stack as a list, and extracts element `i` of that list.
This usage reflects how `fromLast` was actually being used in the code.
`peekSlice(stack, i, len, N)` similarly treats the last N elements
as a list but extracts a slice. This enables use to get rid of
drop calls and simplify the dispatch logic.
* Add Python function calls to script
* Script compiler gains a `Resolver` object that runs when it does not understand a function call. This decouples the python resolution from the conversion to IR.
* Add source information to IR nodes
SourceRange information from the script is not propagated to IR nodes.
This information is only used in two places now: the interpreter
wraps errors that occur when an instruction executions and shape
propagation now reports errors on the line where it fails:
Traceback (most recent call last):
File "test/test_jit.py", line 1655, in test_script_error
bar(Variable(torch.rand(10), requires_grad=True), Variable(torch.rand(9), requires_grad=True))
RuntimeError:
The size of tensor a (10) must match the size of tensor b (9) at non-singleton dimension 0:
@torch.jit.script
def bar(c, b):
return c / b
~~~~~ <--- HERE
In the future, shape propagation should really not report any size
errors and instead just not propagate shapes and let the actual
execution fail. However, this is hard to accomplish while we still
depend on running the op to do shape propagation.
Support shape propagation with control-flow
* This allows us to enable optimization in the GraphExecutor for most
script tests.
* Changes Type to always be present (non-null) on a Value, removing `hasType()`
and `typeOption()`. A new type kind 'DynamicType' now represents when
a specific type has not been determined.
* If/Loop nodes propagate shapes/types in the simple cases where types of
outputs do not change depending on where control flows. In other
cases, we propagate DynamicType to indicate we do not know what
the shape will be.
* Remove the `cond` input to the body of Loop to simplify handling in
interpreter and shape propagation.
* Bugfix for zero-dim contiguousStridesOf
Additionally:
- add support for calling functions that are not methods in the Python frontend
- add an end-to-end test for the Python frontend
- add a capture_stdout helper for checking that `print` actually works
* Use stacks in the interpreter/aten_dispatch
Rather than have separate input/output lists,
the interpreter now works using a single stack.
Operators in the interpreter push/pop from the stack.
This allows ownership of tensors to transfer directly to an operator,
and an operator can drop the reference to a tensors as soon as it is
no longer needed. This is important for the GraphExecutor op,
which recursively runs the interpreter.
Once autograd is updated to pass variables to Function by value,
we will be able to ensure that we release ownership as soon as possible.
This commit also switches the interpreter to use a fake
tensor 'ContainerTensor' rather than at::Retainable to hold non-tensor
data in the interpreter. This allows us to use std::vector<at::Tensor>
for all registers, which is significantly less confusing than the
OwnedRetainables struct it was replacing.
* Add If and Loop to interpreter
* Preprocess loop to calculate where references to tensor should be dropped
* Add control instructions JumpZ/JumpNZ/Jump
* Switch from explicitly having stage structs to having a single list
of instructions with Store/Load instructions to take values off the
initial stack
* Make the interpreter tests executable rather than use expect files
* add a flag to interpreter code so that constants are variables
if the interpreter is running on variables.
* Add tensor_as to its own file
* Improve Function interface
* Undo tracer changes
* Fix bug in VariableType.set_history
* Rename function_counter and sequence_number to sequence_nr
* Clarify Function documentation
* Replace swap_next_edges with next_edges() getter
* Bring back set_gradient_edge
* Simplify special.cpp
* add_gradient_edge -> create_gradient_edge
* Add mutable getters for pre/post hooks
* Use make_variable with Edge
* Remove remove_gradient_edge in favor of detach_
* Fix documentation and remove create_gradient_edge friend method
* Canonicalize some includes
* Improve Variable interface
* Address comments from @apaszke and @colesbury
* string ::operator= is not noexcept
* Remove ir.h from tracer_state.h to improve build times
* Make Variable a struct and pack SavedVariable fields
* Implement as_variable_ref
* grad_fn_ptr() -> grad_fn_unsafe()
* Reduce hackiness of set_type hack
* Include variable.h and edge.h in tracer_state.h because it uses them
* class Variable -> struct Variable because Windows cant even
* Make Variable::output_nr uint32_t instead of int
* Add comment about tracing state
* Replaced more static_cast<Variable&> and improve docs
* Remove SavedVariable destructor and construct members in init list
* Clarify docs for Variable
* Variable::set_version -> set_version_counter
This adds the initial implementation of graph executor for the new JIT design. It includes a few python tests ensuring that nograd, backward, and double-backward cases work for simple examples and some corner cases. More work needs to be done to performance optimize as there are many extra copies and places where we hold onto variables longer than we should. These are noted in the comments.
Previously the side-effect free grad calculation was performed
using callbacks that could also override the decision to run a
function. However this had a few problems e.g. it forced us to iterate
over pretty much all functions in the graph and drop their buffers.
This patch improves the mechanism, by adding explicit support for this
kind of evaluation in execute(). It's safer, and the algorithm used to
decide which nodes have to be evaluated was replaced with a faster one.
* Further relax VariableFlags
* Allow a requires_grad=True trace to be used for a requires_grad=False
input by computing the gradient but they not connecting it to the
input.
* Enable CSE to de-duplicate WLM backwards pass code which calls sum twice.
* Fix a bug in the interpreter that frees a register too early when
it appears twice in a use list.
* [fuser] Follow all outputs to check if fusion is safe
This bug was introduced when we allowed fusion groups
to fuse together. Previously producers were forced to have a single
output, but now producers that are fusion groups can have multiple outputs.
So now we check the uses of all the outputs of a producer.
* [JIT] Fix handling of undefined inputs
It is not legal to call .data() on variable objects whose tensors
are undefined.
This removes volatile from Variable. The functionality is mostly
replaced by a global (thread-local) flag, which is controlled by
torch.set_grad_enabled() and the context manager torch.no_grad().
In C++, the flag is exposed through GradMode::is_enabled() and GradMode::set_enabled()
Fixes#3627
* Fix another leak in pybind11 code.
This time caused by an upstream pybind11 bug:
https://github.com/pybind/pybind11/pull/1216
This changes causes the code to go down a non-buggy pathway.
* Relax verify of VariableFlags
If we trace with a defined tensor, but see a run with a undefined
tensors we now allow that run to happen, replacing the tensor with
zeros.
This also fixes a bug where stage 0 tensors were not
checked against their verify flags.
This change does _not_ handle all bad situations that can happen.
For instance if the first thing traced has a undefined tensor but
a later tensor is defined, then it will fail because the graph itself
does not contain the trace for the derivative of the tensor.
However it is possible to work around this later case by
dry-running the function:
z = Variable(...,requires_grad=True)
x,y = f(z)
(x.sum() + y.sum()).backward()
This adds a simple fusion backend for the CPU.
* Refactors CompiledFusionFunction to have two subclasses that handle
the compilation details of each backend.
* emit-compile-link-run cycle for the CPU
* simple single core loop to run the operation
* lift CUDA-only restrictions in the fuser, checks that fusion groups
are only on a single backend.
* Add interpreter support for Handles/PythonOp/CppOp
This treats Handles as a first-class type in the interpreter
since this turned out to be conceptually simpler than treating
them as a separate concept, which requires a second channel for
register allocating and moving data from one op to the next.
Notes:
* The refcounting nature of tensors is factored into its own base type
so that it can be shared with other refcounted types such as handle.
* Some methods redundant with TensorBase have been deleted from Tensor
* The interpreter uses raw refcounted handles. In addition to being
able to treat Tensors and Handles as the same base object, it removes
a lot of redundant refcounting as objects moved from tensors to input/
output lists.
* aten_dispatch has been updated to work directly on the raw refcounted
lists to avoid refcounting and duplicate lists.
* Removing jit_closure.cpp, The interpreter can now handle all pathways.
* Functions like `unsafeToTensorShare` describe how
ownership transfers in the interpreter. The `Steal` variants
take rvalue references as arguments, and invalidate those
arguments to prevent potential problems.
* Make TensorTemporary is not a subtype relationship because it is too easy to
do something horribly unsafe:
```
void foo(at::Tensor bar) {
// bar destructor call release on a temporary!
}
foo(TensorTemporary(retainable)); // structure slicing!
```
This commit adds a Value type similar to the one @ezyang suggested a while
ago for handling multi-return nodes.
Previously if we had a graph like:
a = op1(b)
c, d = op2(a)
Then its in-memory format would look like:
%0 = op1(b)
%1 = op2(%0)
%2 = select(%1, 0)
%2 = select(%1, 1)
Select nodes were used only to handle the multi-output case. In the
single-output case ops referred directly to their uses.
This required special handling for the single- and multi- output cases,
and was confusing when used with ONNX which distinguishes values (the
inputs/outputs of a node) from the nodes themselves (e.g. a Conv).
This commit adds the Node/Value distinction to the IR. In the example
above, `a`, `b`, `c`, and `d` are now Value objects, while `op1` and
`op2` are now Node objects. Inputs/Outputs to the graph are values.
* Nodes now always have multiple outputs, accessible through their `output()`
method.
* Methods exist for adding/removing outputs from a node.
* Nodes own their output Values, destroying a node destroys its outputs and it
is only valid to destroy a node when no uses of its outputs remain.
* Unlike select, Values do not appear in the nodes list.
* The method `node()` on `Value` retrieves its defining node. Calling it
is always valid. For inputs, its kind is "Param". Like "Return" there is a single Param
node representing all inputs.
* For single-output Nodes, the method `output()` retrieves the single
output Value, asserting that the node is in-fact single output.
* Functions are the same, but some functions like `type()` have moved to
Value.
* `replaceAllUsesWith` is now sanely defined for both Values and Nodes.
In the case of Nodes, it replaces all outputs of the node with the outputs
of the replacement node.
* stage is defined both on Node/Value. This is because Inputs require a stage.
* Apart from changing data types from Node->Value most passes remain the same.
Things that previously assumed single-output nodes now have to call output()
to get the node.
* This removes the uses = [...] field in the outputs because it was
getting confusing even before this commit when uses would refer to nodes,
but we print the names of Values. The lint pass validates the use list,
so printing it out seems less necessary.
* Add a JIT interpreter
The separate interpreter is used to graphs with a lower overhead than
converting them to autograd graphs. Some notes:
* does not support Handles/PythonOp/CppOp, these will be in a future commit
* jit_closure.cpp still exists and we fall back to it for now when
cannot handle something because of PythonOp/CppOp
* In order to support retain_graph=True, the interpreter can be cloned,
creating a copy that can be run with different arguments. This is
assumed to be the non-standard case so cloning is not particularly optimized.
No tensor _data_ is copied, but the at::Tensor list in the interpreter is.
If we hit problems, there is a lot we could do (such as register allocation)
to minimize the stuff that needs to be copied.
* Uses a pImpl pattern to keep implementation details out of its header file.
* Modifies the way getTensorOp works so that it reads/writes to already-existing
vectors, this prevents needing to realloc these buffers each time.
* Timings are here: https://gist.github.com/zdevito/5a20ac29fb1b9e449e693b67dc478127
This reduces overhead to about the same as running it in python.
It is about 10us faster to run the same thing using ATen directly.
* Code Mod
Interpreter -> InterpreterState
Function -> Code
Add other requested comments.
* RegList -> ListHandle<T>
Change the RegList functions to be safer by identifying the type of
each argument list, and checking that list insert does not try
to add to two different lists at once.
* Use exactly equal for interp tests