* Allow `__constant__` values in a ScriptModule to be used as attributes for builtin functions
* Fix bugs in @script loops
1. While loops run shape propagation multiple times until the shapes have converged.
There were two bugs here. (a) First, the 'changed' condition was not checking whether it actually
changed the output; instead it would mark changed = true whenever the two inputs were different.
This is incorrect because the output of the block and the input of the block may always have different shapes.
Now it actually checks whether it is about to change the output entry that it is writing to.
(b) Expand nodes were being inserted into the graph even inside the while loop body. However, if
we iteratively discover that the input shape to one of these expands is actually dynamic, then
it was incorrect to insert the expand in the first place. This changes it so that we only insert expands
after we have converged on the shapes.
2. The way deleteExtraInputs removed loop-carried dependencies was unsafe because it would look up
Value* elements in the loop body's environment that were previously invalidated when deleteExtraInputs
removed another input to the loop. This changes the way deleteExtraInputs works so that it never has to
read a value out of the loop body's environment, avoiding the invalidated pointers.
* Workaround in onnx to get transposes into init_nets
This adds a pass to ONNX that speculates Transpose
operators so that ONNX's split pass can put them into an init_net.
Also fixes a potential bug in onnx peephole where an optimization
across blocks might move a Value and violate scoping.
* Perform shape propagation when embedding a program into a trace.
This ensures the trace still has type information specific to that trace, which will help onnx export succeed in more cases.
The long-term fix is to remove the handle-creating pathways and
remove all the modes from PythonOp, making it into an op that simply
calls a PyObject. Right now ONNX expects PythonOp to hold a
nn.Function, not a generic callable, so completely removing the legacy
pathway will also require changes to how ONNX symbolics are found.
* [jit][script] Fix a bug combining sized/unsized tensors
This adds an isSubtypeOf method to reflect that sized tensors are a subtype
of Dynamic tensors. It updates the typechecking code to reflect this
relationship.
* Add index_select to shape prop
* Support list and tuple literals: adds support for `[a, b]`, `(a, b)`, and `a,`
* Allow non-tensors to reach emitBuiltinCall; each SugaredValue::call
is now responsible for checking the types of its inputs.
Add support for calling cat with a tuple to emitBuiltinOp
Previously we would see errors like:
variable 'states' previously has type (Tensor, Tensor, Tensor, Tensor, Tensor, Tensor) but is now being assigned to a value of type (Tensor, Tensor, Tensor, Tensor, Tensor, Tensor):
since the default case in the diagnostic printout was "Tensor". This adds a virtual member function to each Type class that returns a human-readable string for better error reporting
* Improve error reporting for tuple type mismatch
* Add better Tensor printout
* Fix performance regression on simple cases of indexing
Dispatches to the old kernels
* Adapt JIT test
The test was expected to fail, but due to the change in the previous diff, it would now dispatch to index_select, which succeeds. I modified the function to go through the advanced indexing codepath
* Only do checks once, properly AutoNoGil, AutoGPU.
We allow variables defined inside of if statements to be defined after
if statements as long as they will be defined unconditionally. This
supports a larger subset of python programs than we supported before.
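A small illustration of the rule (an illustrative sketch, not from the test suite):
```
import torch

@torch.jit.script
def clamp_or_zero(x):
    if bool(x.sum() > 0):
        y = x
    else:
        y = torch.zeros_like(x)
    # `y` is defined on both branches, so it may be used after the if.
    return y + 1
```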
* Eliminate handle_zero_dim when broadcasting is applied earlier.
This ends up not actually doing anything unless all the broadcasted tensors are scalars,
in which case the behavior was inconsistent because the type promotion rules are different.
This is better solved with real type promotion logic.
* Change type of script comparison to long.
* Fix jit tests.
* Fix cpp jit test by being consistent about long-vs-float.
* Consistent float and long.
* Use int64_t rather than long.
This adds the ability to trace script functions while preserving their
control flow. When the trace encounters a script function it inlines
the graph of the function into the trace rather than tracing the
function itself.
This modifies the registration process so that all script methods
in a ScriptModule are defined at once.
Method gains a `method_creator` callback that gets invoked when the
method is first called to define it if it has not already been defined.
Recursive cycles in this `method_creator` are checked.
This approach was chosen over first creating all the graphs and then
inlining the call sites because it will combine better with type
propagation for non-tensor types like tuples. e.g.
```
a = foo(b)
return bar(*a)
```
* Switch JIT passes to take a graph rather than TracingState
* Add pybind11 binding for ONNX pass from graph
* Fix canonicalize pass
* address comment
* Switch ToONNX to explicitly return new graph
* optimize_graph instead of optimize_trace
* Allow tuples to be re-assigned
This commit improves our support of tuples by making them more first-class.
In particular, it allows tuples to be re-assigned across loops and ifs.
It does this by making them first-class values in the Graph IR, and then
removing the tuples in a LowerTuples pass.
An alternative approach would have added more support for desugaring tuples
in the Environment object as they were emitted. Instead,
the current approach was chosen anticipating a future when tuples are
fully supported (including the interpreter). In that future, the current
code can be completely reused, with the LowerTuples pass just becoming
an optimization that removes unneeded tuple allocations.
* Fixes to the way script handles multiple values, and other minor fixes.
This commit improves our handling of operators that return multiple values.
Builtins are now checked so that they return the right number of values,
and support for TupleValue is extended to all things that can return
multiple values.
This resolves issues where the compiler accepted things like:
a, b = c + c
This would cause the interpreter to crash. Now each operator knows
how many results it will produce and can check it against the number
of requested outputs.
Notes:
* Allow True/False literals in constant expressions
* make handling of keyword constants more consistent to support True/False
* make parsing constants match the way we construct constants from python
* improve the error messages when accessing bad graph attributes.
* switch findTensorOp to return an optional.
* check that attribute types are correct in findTensorOp
* Check the correct number of outputs for builtins
This also changes emitExpr to return a single SugaredValue
Rather than possibly returning multiple values, emitExpr now
always returns a single value, which _might_ be a tuple. This approach
more closely follows python making the code easier to follow.
Checks for returning the right number of values are now located in
the assignment operator, and occur when unpacking the tuple.
We still pass `n_binders` to function calls so that calls into python
know how many values they should return.
* Unit test for pack_padded tracing
* Move monkeypatching stuff
* Switch symbolic
* Fix stack traces and update test
* Fixup and confirm e2e working
* lint
* Move monkeypatch back to onnx
* Address comments
* remove extraneous import
* Add gradient checking
* lint
* Address comments
* improve test case
* Something that works
* Tuple sugared value
* Works with commenting out input size check
* support string frontend
* Initial starred assignment
* Fix parser
* Fixup tests
* clang-format
* fix rebase error
* lint
* move star assign test to string frontend to make py2 happy
* Py2 fix: parse starargs from Call node
* Address some comments
* Fixup merge
* Remove overloaded unary operators
* Bugfix and test case
* Address a few more comments
* asValues -> asTuple
* Remove unrolledFor stuff
* Fixup getValues
* Pass CallsiteDescriptor struct and have different behavior for different call types
* Address comments and lint
* some type checks
* Address comments
* lint
* Fix mistake
Like `__slots__`, the `__constants__` property changes the set/getattr behavior of a script module for the keys listed so that they behave as constants.
This enables script methods to use them in ways that are otherwise not allowed.
* Python numbers/bools can be inlined as constants in script code.
* Lists of numbers can be iterated over using for loops
* nn.ModuleLists can be used in for loops as well, unrolling their content.
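A sketch of how these features combine (the module here is a made-up example following the description above):
```
import torch
import torch.nn as nn

class MLP(torch.jit.ScriptModule):
    __constants__ = ['scale', 'layers']   # looked up as constants inside script methods

    def __init__(self):
        super(MLP, self).__init__()
        self.scale = 0.5                  # a plain Python number, inlined as a constant
        self.layers = nn.ModuleList([nn.Linear(10, 10) for _ in range(3)])

    @torch.jit.script_method
    def forward(self, x):
        for layer in self.layers:         # loop over the ModuleList is unrolled
            x = layer(x)
        return x * self.scale
```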
Script functions can now have no return statements, empty
return statements, or return one or more values.
Additionally fix the lexer to always emit TK_NEWLINE before
TK_DEDENT, which simplifies the parser.
Allows you to export an ONNX model as:
- Protobuf file (this is what we have now)
- Uncompressed zip archive
- Compressed zip archive
- Directory
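A hedged sketch of selecting one of these containers; the `export_type` keyword and `torch.onnx.ExportTypes` names are assumptions based on this entry rather than a documented guarantee:
```
import torch
import torch.nn as nn
import torch.onnx

model = nn.Linear(4, 2)
dummy = torch.randn(1, 4)

# Default: a single protobuf file.
torch.onnx.export(model, (dummy,), "linear.onnx")

# Assumed: pick another container format via export_type.
torch.onnx._export(model, (dummy,), "linear_dir",
                   export_type=torch.onnx.ExportTypes.DIRECTORY)
```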
* Experimental support for different ONNX export types
* Remove a copy
* Add comment
* Add test cases
* lint
* fix bug
* address comments
Before, using an unknown binary operator like `@`:
```
import torch

@torch.jit.script
def mm(x, y):
    return x @ y

x = torch.randn(4, 3)
y = torch.randn(3, 2)
mm(x, y)
```
resulted in [this not-so-readable trace](https://gist.github.com/zou3519/052b8998108c4bc0fe0e7c85c6f5758e).
Now, it tells the user that the problem is an unknown binary operator:
```
NotSupportedError: unsupported binary operator: MatMult
@torch.jit.script
def mm(x, y):
    return x @ y
           ~~~ <--- HERE
```
* allow calls to non-script methods, allow python non-script attributes in methods (see the sketch after this list)
* add test to make sure submodules are not reassigned
* Test that we can change python attributes
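A sketch of what this enables (a made-up module; newer TorchScript releases may require extra annotations for the same pattern):
```
import torch

class MyModule(torch.jit.ScriptModule):
    def __init__(self):
        super(MyModule, self).__init__()
        self.offset = 1.0                # plain (non-script) Python attribute

    def python_helper(self, x):
        # ordinary Python method, not compiled as script
        return x + self.offset

    @torch.jit.script_method
    def forward(self, x):
        # script method calling back into the non-script method above
        return self.python_helper(x) * 2
```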
* Deprecate ctx.saved_variables via python warning.
Advises replacing saved_variables with saved_tensors.
Also replaces all instances of ctx.saved_variables with ctx.saved_tensors in the
codebase.
Test by running:
```
import torch
from torch.autograd import Function

class MyFunction(Function):
    @staticmethod
    def forward(ctx, tensor1, tensor2):
        ctx.save_for_backward(tensor1, tensor2)
        return tensor1 + tensor2

    @staticmethod
    def backward(ctx, grad_output):
        var1, var2 = ctx.saved_variables
        return (grad_output, grad_output)

x = torch.randn((3, 3), requires_grad=True)
y = torch.randn((3, 3), requires_grad=True)
MyFunction.apply(x, y).sum().backward()
```
and assert the warning shows up.
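For comparison, the same Function written against the preferred attribute; no warning is emitted:
```
import torch
from torch.autograd import Function

class MyFunction(Function):
    @staticmethod
    def forward(ctx, tensor1, tensor2):
        ctx.save_for_backward(tensor1, tensor2)
        return tensor1 + tensor2

    @staticmethod
    def backward(ctx, grad_output):
        var1, var2 = ctx.saved_tensors   # preferred spelling
        return grad_output, grad_output

x = torch.randn(3, 3, requires_grad=True)
y = torch.randn(3, 3, requires_grad=True)
MyFunction.apply(x, y).sum().backward()
```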
* Address comments
* Add deprecation test for saved_variables
* Implement range for loop in script (see the sketch after this list)
* Fix handling of boolean constants
* Use WithInsertPoint
* Allow dynamic max trip count
* fix symbols
* Fix argument order
* fix test
* Add insert{Input,Output} APIs and use them
* Factor out condition stuff
* clang-format
* Address remaining comments
* Fix tests
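A minimal example of the range-based loop (illustrative; the dynamic trip count comes from a runtime tensor size):
```
import torch

@torch.jit.script
def row_sums(x):
    total = torch.zeros([x.size(1)])
    for i in range(x.size(0)):        # trip count may depend on the input shape
        total = total + x[i]
    return total

print(row_sums(torch.ones(4, 3)))     # tensor([4., 4., 4.])
```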
* Implement script in AST frontend
* Have ScriptModule inherit from Module
This is accomplished by creating replacement _parameters, _buffers,
and _modules which implement the OrderedDict APIs but which
actually get/set their members inside script::Module
* Merge TracedModule with ScriptModule
* Move logic of attribute handling into Python bindings rather than
making script::Module handle it. This was redundant with nn.Module,
which already handles attributes.
* Make TracedModule a subclass of ScriptModule
* Move handling of attribute kind logic into bindings.
* Allow ScriptModule to contain non-script module submodules.
* Namespaced symbols
- Our interned strings now have structure, "ns::symname" rather than just
"symname" before. We support efficient namespace testing for uniques
by encoding the namespace in one byte in the Symbol internal representation.
See torch/csrc/jit/interned_strings.h for a more in-depth implementation
discussion.
- All uses of the old ksymbol-style constants are now attr::symbol (or some appropriate namespace).
The valid namespaces are prim, attr, onnx and aten.
- Symbol is bound in Python as a qualified string "attr::symbol", EXCEPT for the
attribute setting/getting API, whose symbols must always be attr
symbols; they get special cased to assume strings are passed.
There's a little bit of naughtiness in the implementation, maybe you know
how to solve it.
- However, the g.op() convenience function assumes that you're generating
ONNX operators, unless you explicitly qualify.
- All ATen operators and nodes have built-in interned strings generated
for them, so you should never have to write a string literal ever again.
The tracing code is adjusted to use it.
- ONNX exporter now properly tests to see that all operators are in
onnx namespace before accepting the export. This is way more
robust than the previous exporter, which would be willing to
export capitalized operators which were not actually ONNX operators.
- A slight organizational change for symbolic.py; this module now ONLY
contains aten operators. In particular, the exporter for Constant
has moved into utils.py (along with Undefined, from the C++ side),
since primitive ops get "special treatment."
- The un-inplacing logic in recording is more robust, so that we don't
delete a trailing underscore from __and__. This never affected us
before because we didn't have any tests for it.
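The qualified form is visible when inspecting a graph from Python; a small illustration (assuming a build where script graphs are inspectable, as they are today):
```
import torch

@torch.jit.script
def add(x, y):
    return x + y

# Node kinds are namespace-qualified strings such as "prim::Constant"
# or "aten::add" rather than bare names.
for node in add.graph.nodes():
    print(node.kind())
```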
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Add script::Module C++ class to represent script modules
Switch AST -> IR conversion to work on Modules/Methods rather than raw graphs.
Function-only AST -> IR conversion is just a simplified case where there is
only one module with a single method and no parameters.
Introduce SugaredValue in compiler.h to represent values in scope in a script
function that are not first-class and that get desugared. This is used to
represent the module's self parameter, as well as python function calls
and method calls on tensors.
Provide a Python ScriptModule that offers a nice API on top of script::Module,
allowing for the definition of script modules with methods, parameters,
and submodules.
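A sketch of the Python-side API this describes, following the ScriptModule/script_method style used elsewhere in this changelog (details are illustrative):
```
import torch
import torch.nn as nn

class Affine(torch.jit.ScriptModule):
    def __init__(self):
        super(Affine, self).__init__()
        self.weight = nn.Parameter(torch.randn(3, 3))   # parameters work as usual
        self.act = nn.ReLU()                            # submodules too

    @torch.jit.script_method
    def forward(self, x):
        # attribute lookups on `self` are desugared (via SugaredValue) into
        # parameter/submodule accesses during compilation
        return self.act(torch.mm(x, self.weight))
```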
Not in this PR but intended for the future:
- ScriptModule actually subclasses nn.Module, with most methods implemented
- Unification of TracedModule and ScriptModule functionality into one container class.
Detailed changelog:
* Switch compiler over to using Module, but don't
use them yet.
* Remove intermediate attribute encoding in compiler
* Create SugaredValue object to handle resolution
of compiled module.
* switch to_ir to modules, implement Select
* hacky python wrappers
* Private ScriptModule
* Add `define` to script module
* Attributes use TK_LIST_LITERAL
this anticipates adding a real list literal expression to the language.
* Add a metaclass to make sure script stubs are registered
* Add a test
* Doc createResolutionCallback
* Docs and minor editing
* Address PR comments
* Document
* Fix unicode issue
* Check if node output matches in shape propagation
* Fix list attributes and view shape propagation
* fix inferred shapes for view
* Fix shape inference for integrally typed tensors
* Fixes for concat in control flow
* Fix print
* Add Python function calls to script
* Script compiler gains a `Resolver` object that runs when it does not understand a function call. This decouples the python resolution from the conversion to IR.
* Add source information to IR nodes
SourceRange information from the script is now propagated to IR nodes.
This information is only used in two places for now: the interpreter
wraps errors that occur when an instruction executes, and shape
propagation now reports errors on the line where it fails:
Traceback (most recent call last):
  File "test/test_jit.py", line 1655, in test_script_error
    bar(Variable(torch.rand(10), requires_grad=True), Variable(torch.rand(9), requires_grad=True))
RuntimeError:
The size of tensor a (10) must match the size of tensor b (9) at non-singleton dimension 0:
@torch.jit.script
def bar(c, b):
    return c / b
           ~~~~~ <--- HERE
In the future, shape propagation should really not report any size
errors and instead just not propagate shapes and let the actual
execution fail. However, this is hard to accomplish while we still
depend on running the op to do shape propagation.
Support shape propagation with control-flow
* This allows us to enable optimization in the GraphExecutor for most
script tests.
* Changes Type to always be present (non-null) on a Value, removing `hasType()`
and `typeOption()`. A new type kind 'DynamicType' now represents when
a specific type has not been determined.
* If/Loop nodes propagate shapes/types in the simple cases where types of
outputs do not change depending on where control flows. In other
cases, we propagate DynamicType to indicate we do not know what
the shape will be.
* Remove the `cond` input to the body of Loop to simplify handling in
interpreter and shape propagation.
* Bugfix for zero-dim contiguousStridesOf
* torch.jit.trace annotation now creates a GraphExecutor
The other torch.jit.trace, which was used for testing purposes and for onnx to get the trace graph, is now called torch.jit.get_trace_graph.
* @script annotation, and compilation unit for strings (see the sketch after this list)
Additionally:
- add support for calling functions that are not methods in the Python frontend
- add an end-to-end test for the Python frontend
- add a capture_stdout helper for checking that `print` actually works
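A small example of the string-based compilation unit (the function body here is made up):
```
import torch

cu = torch.jit.CompilationUnit('''
def scale_and_shift(x, w, b):
    return x * w + b
''')

out = cu.scale_and_shift(torch.ones(3), torch.full((3,), 2.0), torch.zeros(3))
print(out)   # tensor([2., 2., 2.])
```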
This replaces the torch.Tensor constructors with factories that produce
Variables. Similarly, functions on the torch module (e.g. torch.randn)
now return Variables.
To keep the PR to a reasonable size, I've left most of the unused tensor
code. Subsequent PRs will remove the dead code, clean up calls to
torch.autograd.Variable, and rename Variable to Tensor everywhere.
There are some breaking changes because Variable and Tensors had
slightly different semantics. There's a list of those changes here:
https://github.com/pytorch/pytorch/wiki/Breaking-Changes-from-Variable-and-Tensor-merge
* Use stacks in the interpreter/aten_dispatch
Rather than have separate input/output lists,
the interpreter now works using a single stack.
Operators in the interpreter push/pop from the stack.
This allows ownership of tensors to transfer directly to an operator,
and an operator can drop the reference to a tensor as soon as it is
no longer needed. This is important for the GraphExecutor op,
which recursively runs the interpreter.
Once autograd is updated to pass variables to Function by value,
we will be able to ensure that we release ownership as soon as possible.
This commit also switches the interpreter to use a fake
tensor 'ContainerTensor' rather than at::Retainable to hold non-tensor
data in the interpreter. This allows us to use std::vector<at::Tensor>
for all registers, which is significantly less confusing than the
OwnedRetainables struct it was replacing.
* Add If and Loop to interpreter
* Preprocess loop to calculate where references to tensors should be dropped
* Add control instructions JumpZ/JumpNZ/Jump
* Switch from explicitly having stage structs to having a single list
of instructions with Store/Load instructions to take values off the
initial stack
* Make the interpreter tests executable rather than use expect files
* add a flag to interpreter code so that constants are variables
if the interpreter is running on variables.
* Add tensor_as to its own file
* at::maybe_data_ptr and Check.h => TensorUtils.h
* THNN support for optional BN running_*
* ATen support for optional BN running_*
* Python nn.* support for optional BN running_*; Improve IN and BN doc
* Add tests for IN and BN new option
* Layer Norm
* Fix LRN doc
* functional interface for LN and IN
* Layer norm tests
* fix BN double backward returning undefined tensors
* fix jit test using wrong dim inputs for BN
* add/improve BN, IN and LN GPU tests with half type
* Update docs to be consistent with Conv notation
Fix onnx
Clarified onnx symbolic wrapper
* fix typo
* Address comments
The Tensor and Variable classes are being merged.
autograd.Function.forward is now called on Variables, but with "no-grad"
mode (torch.no_grad()) enabled.
One benefit is that we no longer have to explicitly track shared
storages.
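A quick check of the behavior described above; the assert reflects the claim that forward now runs with grad disabled:
```
import torch
from torch.autograd import Function

class Double(Function):
    @staticmethod
    def forward(ctx, x):
        # forward receives Variables/Tensors but runs in no-grad mode
        assert not torch.is_grad_enabled()
        return x * 2

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output * 2

x = torch.randn(3, requires_grad=True)
Double.apply(x).sum().backward()
print(x.grad)   # tensor([2., 2., 2.])
```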
This adds the initial implementation of graph executor for the new JIT design. It includes a few python tests ensuring that nograd, backward, and double-backward cases work for simple examples and some corner cases. More work needs to be done to optimize performance, as there are many extra copies and places where we hold onto variables longer than we should. These are noted in the comments.
This pass splits differentiable subgraphs into their own Node,
similar to a fusion group.
This initial implementation does not create optimal subgraphs, but
it works well in the case where most things are differentiable,
and has the building blocks (`mergeNodes`) to extend to the
better implementation.
* Fix #4480 by tracing inputs before running the function.
The DCE trick says that if I have y = f(x), and f is internally implemented as
g, it's OK to trace both g and f. Recall the tracing algorithm is:
enter f(x)
compute its result y
trace y = f(x)
return from f
So when you run the example above, you'll do this:
# suppose x is mapped to %1
enter f(x)
enter g(x)
result of g is y
trace y = g(x a.k.a. %1) (mapping y to %2)
return from g
result of f is y
trace y = f(x a.k.a. %1) (remapping y to %3)
return from f
and end up with a trace like this:
%2 = g(%1)
%3 = f(%1)
... only %3 is live, because %2 was killed from the mapping... Subsequent DCE
will eliminate the invocation of g and you'll only see f in the final trace.
However, if f and g are inplace functions, the machinery breaks:
# suppose x is mapped to %1
enter f(x)
enter g(x)
result of g is x
trace x = g(x a.k.a. %1) (remapping x to %2)
return from g
result of f is x
trace x = f(x a.k.a. %2) (remapping x to %3)
return from f
resulting in:
%2 = g(%1)
%3 = f(%2) # OOPS
This commit changes the strategy so we instead do this:
enter f(x)
trace f(x)
compute its result y
trace y = f(x) (computed above)
return from f
Now we get the correct Value before it is overwritten.
Here is what the new trace code looks like:
jit::tracer::PreTraceInfo trace_info;
if (jit::tracer::isTracing( self, index )) {
  trace_info = jit::tracer::preRecordTrace( "index_fill", { self, index } );
  setattr(trace_info.n, jit::Symbol("dim"), dim);
  setattr(trace_info.n, jit::Symbol("value"), value);
}
baseType->index_fill_(self_, dim, index_, value);
increment_version(self);
rebase_history(self, grad_fn);
if (trace_info.state != nullptr) {
  jit::tracer::postRecordTrace( trace_info, { self } );
}
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Revert "Hot patch ONNX _run_symbolic_function"
This reverts commit d1c973fee1.
* lintfix
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Add missing expect file
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Implement MM fusion (MM with add reduction tree)
A tree where leaves are matrix multiplies and inner
vertices are adds can be computed as a single mm.
Such subgraphs often appear in backward if a single weight
is reused multiple times (e.g. in RNNs).
NOTE: this seems to be slightly slower on the GPU than the
naive implementation, but it's a huge win on the CPU
(think 100x lower overhead)
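The identity being exploited: an add tree of matmuls that share the output dimension can be rewritten as one mm over concatenated operands. A quick numerical check (illustrative only; the pass itself rewrites the IR in C++):
```
import torch

x  = torch.randn(5, 3)
y  = torch.randn(5, 4)
W1 = torch.randn(3, 2)
W2 = torch.randn(4, 2)

tree  = x.mm(W1) + y.mm(W2)                                      # add tree of two mms
fused = torch.cat([x, y], dim=1).mm(torch.cat([W1, W2], dim=0))  # single mm

print(torch.allclose(tree, fused, atol=1e-6))   # True
```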
cuModuleLoad is only valid for a single device so we need to
compile for the particular device that the fusion group will run on.
CompiledFunction already specializes different traces for tensors,
so we just need to have fusion_compiler produce the cuFunction on
the right device.
* Fix tracking of tracing scopes during ONNX pass
* Use ResourceGuard to manage setting a temporary current scope in Graph
* Add tests for ONNX pass scopes
* Remove unused num_classes argument
* Delete obsolete basic ops.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* More deletion.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Delete some unused utilities.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Delete dead apply_fn
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Delete CppFunction symbolic support.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Delete ForwardFunction
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Batchnorm is 'working'
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
- Rename THNN convolution to have thnn_ prefix.
- Propagate CuDNN benchmark and deterministic to at::Context
- Add 'convolution', 'convNd' and 'conv_transposeNd' native wrappers, with defaults
The conv_transposeNd wrappers are updated to have the same argument
order as Python.
- torch.nn.functional directly dispatches to the native wrappers
- Make it possible to turn off tracing for some native wrappers, so I don't
have to write symbolics for all the functions above
- Spectral ops can now make use of CuDNN convolution if possible
- Better commentary on cudnn_batch_norm
- Turn on DCE for all JIT tests.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Further relax VariableFlags
* Allow a requires_grad=True trace to be used for a requires_grad=False
input by computing the gradient but then not connecting it to the
input.
* Enable CSE to de-duplicate WLM backwards pass code which calls sum twice.
* Fix a bug in the interpreter that frees a register too early when
it appears twice in a use list.
* [fuser] Follow all outputs to check if fusion is safe
This bug was introduced when we allowed fusion groups
to fuse together. Previously producers were forced to have a single
output, but now producers that are fusion groups can have multiple outputs.
So now we check the uses of all the outputs of a producer.
* [JIT] Fix handling of undefined inputs
It is not legal to call .data() on variable objects whose tensors
are undefined.
This removes volatile from Variable. The functionality is mostly
replaced by a global (thread-local) flag, which is controlled by
torch.set_grad_enabled() and the context manager torch.no_grad().
In C++, the flag is exposed through GradMode::is_enabled() and GradMode::set_enabled()
Fixes #3627
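The replacement API in use (a small illustration):
```
import torch

x = torch.randn(3, requires_grad=True)

with torch.no_grad():            # context-manager form
    y = x * 2
print(y.requires_grad)           # False

torch.set_grad_enabled(False)    # global (thread-local) flag form
z = x * 2
torch.set_grad_enabled(True)
print(z.requires_grad)           # False
```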
This method prints a bunch of useful debug information including
the traces that have been recorded, their shapes, and the traced
graphs associated with them.
* Trace ATen non-primitive functions as themselves, not their implementations.
Previously, if I invoked an ATen non-primitive function foo, which in turn
called subfoo, I would always see 'subfoo' in the trace (e.g., tracing
'inlines' all of these operations.) Such inlining is bad for ONNX
(and can be bad for optimization) as it prevents high-level
optimizations from taking advantage of the structure. It might
be right to inline, but give the optimizer a chance to work before
inlining happens!
The implementation here is surprisingly simple, because it uses
the "DCE trick". Essentially, it doesn't matter if the constituent
calls perform tracing, because you can always trace it again, and
override the trace nodes associated with the returned variables.
The original trace becomes dead and can be DCE'd.
While implementing this, I also refactored how 'isTracing' and
'trace_outputs' works:
- isTracing was previously a single function with overloads for
both Tensor and Variable arguments. Unfortunately, such overloads
are not safe, because of how C++ implicit conversions work. You
would think that C++ should never confuse an overload for
Variable with ArrayRef<Tensor>, but this is exactly what can
happen: Tensor is convertible to both Variable and ArrayRef<Tensor>,
thus it's ambiguous and C++ doesn't like it. The last time I ran
into this problem, I applied initializer lists to everything and
called it a day. A more robust fix is to separate out the
Variable and Tensor overloads, which I have done in this patch.
- trace_outputs was fed as an initializer list, which doesn't work
when you have heterogenous inputs. So instead we first feed
everything through 'flatten', which has overloads for each of the
argument patterns in ATen, which then goes on to the recordTrace
(which takes an ArrayRef). This is *no less efficient*, because
we were allocating a vector anyway (to do the conversion from
vector of Tensor to vector of Variable).
These fixes mean that 'index' can properly be traced... although the
JIT still does not support it. A failing test case has been added to
this effect.
Some knock-on effects:
- The fuser now knows about chunk as well as split. They're pretty
similar so there is no problem.
- There is a new 'canonicalize' pass in the JIT which renumbers a graph
so that all structurally equivalent graphs render the same.
- We run DCE before the fuser tests, to make sure dead nodes don't
block fusion.
- There are new ONNX exports for the newly introduced higher level ATen
operations. This includes type_as (no-op case only), chunk, select.
Zach didn't like the extra use of 'native' in the new codegen, so
we've introduced a new concept, 'abstract'. An abstract function
is one that is implemented in derived types (e.g., CPUDoubleType),
where as a concrete one is implemented in the base type (Type).
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
This adds a simple fusion backend for the CPU.
* Refactors CompiledFusionFunction to have two subclasses that handle
the compilation details of each backend.
* emit-compile-link-run cycle for the CPU
* simple single core loop to run the operation
* lift CUDA-only restrictions in the fuser, checks that fusion groups
are only on a single backend.
* Add interpreter support for Handles/PythonOp/CppOp
This treats Handles as a first-class type in the interpreter
since this turned out to be conceptually simpler than treating
them as a separate concept, which requires a second channel for
register allocating and moving data from one op to the next.
Notes:
* The refcounting nature of tensors is factored into its own base type
so that it can be shared with other refcounted types such as handle.
* Some methods redundant with TensorBase have been deleted from Tensor
* The interpreter uses raw refcounted handles. In addition to being
able to treat Tensors and Handles as the same base object, it removes
a lot of redundant refcounting as objects moved from tensors to input/
output lists.
* aten_dispatch has been updated to work directly on the raw refcounted
lists to avoid refcounting and duplicate lists.
* Removing jit_closure.cpp; the interpreter can now handle all pathways.
* Functions like `unsafeToTensorShare` describe how
ownership transfers in the interpreter. The `Steal` variants
take rvalue references as arguments, and invalidate those
arguments to prevent potential problems.
* TensorTemporary is not a subtype of Tensor because it is too easy to
do something horribly unsafe:
```
void foo(at::Tensor bar) {
  // bar's destructor calls release on a temporary!
}
foo(TensorTemporary(retainable)); // structure slicing!
```
Occasionally Travis builds would fail on these two tests.
It's not entirely clear where this nondeterminism is coming
from.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
This commit adds a Value type similar to the one @ezyang suggested a while
ago for handling multi-return nodes.
Previously if we had a graph like:
a = op1(b)
c, d = op2(a)
Then its in-memory format would look like:
%0 = op1(b)
%1 = op2(%0)
%2 = select(%1, 0)
%3 = select(%1, 1)
Select nodes were used only to handle the multi-output case. In the
single-output case ops referred directly to their uses.
This required special handling for the single- and multi- output cases,
and was confusing when used with ONNX which distinguishes values (the
inputs/outputs of a node) from the nodes themselves (e.g. a Conv).
This commit adds the Node/Value distinction to the IR. In the example
above, `a`, `b`, `c`, and `d` are now Value objects, while `op1` and
`op2` are now Node objects. Inputs/Outputs to the graph are values.
* Nodes now always have multiple outputs, accessible through their `output()`
method.
* Methods exist for adding/removing outputs from a node.
* Nodes own their output Values, destroying a node destroys its outputs and it
is only valid to destroy a node when no uses of its outputs remain.
* Unlike select, Values do not appear in the nodes list.
* The method `node()` on `Value` retrieves its defining node. Calling it
is always valid. For inputs, its kind is "Param". Like "Return" there is a single Param
node representing all inputs.
* For single-output Nodes, the method `output()` retrieves the single
output Value, asserting that the node is in-fact single output.
* Functions are the same, but some functions like `type()` have moved to
Value.
* `replaceAllUsesWith` is now sanely defined for both Values and Nodes.
In the case of Nodes, it replaces all outputs of the node with the outputs
of the replacement node.
* stage is defined both on Node/Value. This is because Inputs require a stage.
* Apart from changing data types from Node->Value most passes remain the same.
Things that previously assumed single-output nodes now have to call output()
to get the Value.
* This removes the uses = [...] field in the outputs because it was
getting confusing even before this commit when uses would refer to nodes,
but we print the names of Values. The lint pass validates the use list,
so printing it out seems less necessary.
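From Python the distinction shows up when walking a graph; a small illustration (assuming a recent build where debugName() is available; older builds named it differently):
```
import torch

@torch.jit.script
def f(x):
    return x * x + x

g = f.graph
for node in g.nodes():
    # Nodes own their output Values; each output is a Value with its own name.
    outs = [v.debugName() for v in node.outputs()]
    print(node.kind(), '->', outs)

# Every Value knows its defining Node; all graph inputs hang off a single Param node.
first_input = list(g.inputs())[0]
print(first_input.node().kind())   # prim::Param
```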
This started off as a minor fix based on Adam's question, "why is printing
a graph not const" and snowballed into a giant yak shaving exercise.
- The Graph and Node APIs now uniformly enforce deep constness; e.g., if you
get a const Node* or const Graph*, it is not possible to get a non-const
Node*/Graph* somewhere else in the graph (even though the member variables
of these are non-const. Hooray for private access specifier.)
- A big pile of functions got const versions, most notably the printing
functions, and functions for accessing inputs().
- REALLY IMPORTANT, BC-BREAKING CHANGE: inputs() now returns a COPY of the
inputs, rather than a reference to the underlying. I was forced to do this
because there is no way to portably turn a std::vector<Node*> into a
std::vector<const Node*>, which is necessary to provide a const-correct
version of inputs() that enforces deep const-correctness. I then justified
this choice to myself with the observation that outputs() returned a
copy (by necessity), so this makes the API more uniform.
But making this change uncovered two very subtle bugs:
1. If you change functions from returning a reference to returning a copy,
the idiom node->inputs().begin() is no longer valid, because the memory
the iterator points to immediately becomes invalid. THIS SUCKS.
Honestly, we should add a lint rule rejecting calling begin()/end() on
temporaries because this is very dangerous. To excise this pattern from
the codebase, I added begin() and end() methods to Graph, so that we got
rid of the graph->nodes().begin() idiom, which happens to be sound,
despite not returning a reference, because graph_node_list is a
non-owning reference.
2. pybind11 doesn't handle std::vector<Node*> cast out of the box.
Fortunately, I found a simple fix in the GitHub issues tracker
that involved adding an extra type converter. And yes, this
does mean that outputs() in Python never worked correctly.
- New const_graph_node_list, which is a graph_node_list that gives you const
Node*
There are some more miscellaneous improvements:
- Applied CR comment fixes on export.cpp; using replaceInput, and renaming
variables for clarity.
- assertValidInput helper method added, and applied to replaceInput
- Use an explicit function to print THPObjectPtr, otherwise we get
the wrong overload.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* update fuser to match ATen-formatted JIT ops
* fix concat optimizations and add test
* allow onnx export to work with single-export functions
* fix onnx handling of multi-return nodes.
* nits, format, vision test update
* fix add constant
* fix driver init issues
* Add missing Neg symbolic.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
The general strategy is there is a new module, torch.onnx.symbolic, which
contains a function for every ATen method name with the ONNX translation.
While implementing this, I took the opportunity to expunge all references
of 'g' from the public API; instead, it is managed by a global variable in
torch.onnx which tracks the "current graph".
Other changes:
- If you pass a Tensor to op as an argument, it will now automatically be
converted into a Constant ONNX node. This lets us remove needing to
implement ONNX
- Rename value to other, wherever there is both a Scalar and Tensor overload.
This way, keyword dispatch can work uniformly in both cases.
- Deleted any autograd Function classes that both had a symbolic and were ported
to the new C++ autograd implementation. There may still be some straggling
classes that didn't have symbolic.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
It's pretty easy to accidentally fail to actually compile
a JITed region, which means that we have accidentally failed
to have test coverage for a number of features. This adds
a secret _assert_compiled kwarg, which will raise an error
if we don't actually hit the compiled codepath.
This is not intended to be user visible; we have some other
ideas for handling this case.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
A few notes about the implementation:
- Need to plumb 'devices' through to the 'fork_rng' calls. You definitely
want these; it makes verify run A LOT faster
- New keyword argument for compiled model execution, '_force_trace', which
forces us to retrace a model.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
The alpha/beta naming in addmm was flipped; this commit fixes that
problem. It also fixes the ONNX export of alpha/beta parameters.
Finally, it supports executing matmul in the JIT.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
torch.jit now contains two user-facing functions: compile and trace
(corresponding to what was previously trace/traced and record_trace).
The non-curried versions of these functions have been eliminated, so
that there is only one function in the API (we *must* have the
curried versions, since these enable their use as decorators). There is
detailed usage documentation in the docblocks for these methods.
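Usage of the curried decorator form might look like this (a sketch of the historical API described here; the nderivs keyword is an assumption from that era and is not part of today's torch.jit):
```
import torch

# Historical sketch: torch.jit.compile used as a curried decorator.
@torch.jit.compile(nderivs=1)
def f(x, y):
    return x * y + y

x = torch.randn(4, requires_grad=True)
y = torch.randn(4, requires_grad=True)
out = f(x, y)            # first call records a trace keyed on the inputs
out.sum().backward()
out2 = f(x, y)           # later matching calls can hit the compiled path
```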
This comes with a complete rewrite of the internals of torch.jit, in the process
fixing a number of bugs. Key points of the new implementation:
- compile and trace both always return a Module representing the underlying
function/module wrapped with compilation/tracing. This makes handling
of the function/module cases more uniform, as we can think of the function
case as creating an on-the-fly module with the parameters explicitly
specified by the user. For technical reasons, we now *require* any parameters
in the function case to be honest-to-goodness Parameters (gory details:
you can't register a Variable as a Parameter to a Module, but you can't
create a Parameter from a Variable while sharing the same underlying
identity.)
- Flattening and unflattening is done a lot more uniformly. We now have
a _flatten and _unflatten function which are inverses of each other:
_flatten always returns both the flat, tuple of Variables, *as well as*
the "proto" (now referred in the code as the "struct") from which we
can unflatten the variables. Low level functions like 'raw_trace'
always work with the flattened inputs/outputs, which keeps their logic
simple.
- JIT trace keying now also includes the "struct" of the input arguments.
This is a step towards accepting non-Variable arguments in functions,
although flatten/unflatten don't currently support it.
- TraceForKey (previously TraceInfo) has had its API reworked to have
fewer degrees of freedom when you are interacting with it.
TODO: Verify, timing, and trace dumping have been temporarily excised. I
plan on adding them back.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Plus a test for Eval nodes in the IR, since we hadn't actually
covered this case now that some nodes are transparently traceable.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
To be honest, this was the whole point of this refactor set.
I noticed that in a lot of code, we were repeatedly copying lots of metadata
from old nodes to new nodes. This was quite concerning because I wanted to
add some more metadata (alias information) and I didn't want to have to
get it right in all cases. Plus, in a lot of cases we were forgetting
to set more optional properties like debug names when we "copied".
To solve this, I first made cloneFrom() copy all of this metadata. Then,
I searched for all occurrences of setType() (a proxy for "I'm cloning this
node"), looked for cases where we really were morally doing a copy, and rewrote
the code to use cloneFrom() instead, allowing us to drop explicit setType()
(and getting more metadata preservation in the process.)
Finally, I refactored tryToMoveChunk. The code is modestly longer,
but the new version has the nice property that the initialization of
selects for input_chunk are next to the creation of the node (as opposed
to delayed for later.) I also added a lot more comments for invariants
I noticed when I was working on the code.
One minor extra change: TensorType grew a new constructor and a withSizesStride
"immutable setter" which returns a new copy of TensorType with different info.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Variables now hold a list of ValueTracingStates and can participate
in multiple traces.
* Refactored Traceable to maintain a list of traces, and only stop
tracing once it records all stages
- BC BREAKING: export now also takes a mandatory file-ish argument, specifying
the file to export the protobuf to. I rewrote the tests to use BytesIO to
get out the string so they could parse it again.
- BC BREAKING: export no longer returns the tensors that were computed. To
get these, use the internal _export function.
- Multiple inputs to models are now supported by passing a tuple to input.
(Old API of a single Variable still works.)
- Keyword arguments to models are now supported via kwargs keyword arg.
- Renamed embed_params to export_params, and it now defaults to True.
- Toffee tests now live in their own test_toffee.py file. I had to
rename a pile of expect files for this.
- Removed defunct torch.toffee imports from autograd to solve module import
cycle.
- Helper function _with_file_like to abstract over opening file-ish arguments,
taken from torch.save()
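The updated call then looks roughly like this (a small sketch; the model is a stand-in):
```
import io
import torch
import torch.nn as nn
import torch.onnx

model = nn.Linear(4, 2)
dummy = torch.randn(1, 4)

buf = io.BytesIO()                       # any file-like object (or a path) works
torch.onnx.export(model, (dummy,), buf,  # inputs passed as a tuple
                  export_params=True)    # renamed from embed_params; defaults to True
proto_bytes = buf.getvalue()
```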
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
- Reduce setup.py diff.
- Expunge WITH_TOFFEE from codebase.
- Elaborate on a comment.
- Move gen_toffee.sh to tools
- Delete densenet test.
- Use 'using' to inherit a constructor.
- Delete outdated comment.
- Comment about why primspecs can return fewer outputs.
- Remove dead, commented out includes.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Along the way I added converters for Variable and TracingInput. Variable should
probably be moved to a more widely known spot.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
I realized we weren't running the linter after ToToffeeIR, so
I added a lint call. It thus emerged that the current implementation
was using "Unused" nodes that were not added to the graph,
which was tripping the lint. I fixed this a few ways:
- BatchNorm and Conv primspecs were returning dead "unused" nodes
for their (implicit) handle parameters. I removed them because
setOutputs handles this already, and a dead unused node which
is not attached to the graph violates the "no dead nodes"
invariant.
- OK, but MaxPool actually needs to return an unused node for
the output which is supported by PyTorch but not Toffee; we need
to error if subsequently in the trace this output is used.
The new strategy is to have MaxPool's primspec return a None
at the unused position, and then immediately *check* if there
are any uses of that output. If there are, that's an error!
- I needed to adjust the Select invariant in the exporter loop:
only if a Select node has *uses* is it mandatory for it to be
defined in env.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Basic idea:
- Pass buffers (marked as non-Variable tensors) as input variables to
the trace. Every buffer gets represented as an input variable
to the trace, and we remember a correspondence of the underlying
TH pointer and an input variable in the trace.
- When we initially trace a function, we DO NOT record the buffers
as edges. This is so autograd doesn't have to know anything about buffers.
If we ever turn buffers into requires_grad=False parameters, then
this problem goes away.
- When we primspec the buffer, NOW we reach into the cached buffers
(now appropriately named) and gin up the buffer information we need.
Other things:
- CppOp execution is now supported (but lightly tested) using
SimpleEval (thanks @apaszke!)
Todo:
- E2E tests need to have their hacks removed.
- Figure out what is going on with backwards
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
General strategy:
- nanopb is statically linked into PyTorch. It must be built
with -fPIC.
- Generated nanopb files for toffee.proto are checked into
our repo.
- Because nanopb generated protobufs are C only, we wrote a
wrapper around it to give a Google C++ style interface.
More on this shortly.
How does the wrapper work?
- It's called "micropb" because it is less small than nanopb :)
- nanopb requires all variable-length fields to be written out
using a "callbacks" mechanism.
- We wrote pre-canned callbacks for all of the types ToffeeIR
writes out and lists; these are micropb_callback and
micropb_callback_list. These operate simply by dynamically
allocating and storing the data to be written out in
data (this defeats the purpose of the callback mechanism,
but it's easy to implement)
- Finally some boilerplate to actually implement the wrapper
classes and have owning pointers to the actual data.
Testing strategy:
- Take the serialized protobuf from nanopb, parse it again
with ToffeeIR and print it. Worked with all of test_jit.py!
These tests don't run without 'toffee' being installed.
TODO:
- Update CI to install ToffeeIR, so we can run the Toffee tests
in CI
- Update E2E with Caffe2 tests so that they work with new stuff.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
This addresses the case where bias is disabled, which occurs in torchvision's
alexnet and densenet.
The general strategy is this:
- When we encounter a null variable, we turn this into a Constant
node with an undefined at::Tensor
- Toffee exports for BatchNorm and Conv have special cases for bias,
checking if they are provided by a Constant node with undefined
value, and just omit the input if so.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
The general strategy:
- We put all the toffee files in torch/csrc/toffee; they will only be
added when toffee is enabled
- Toffee is enabled if torch/lib/ToffeeIR is present (since we
don't have a submodule/subtree thing going on)
- The most prevalent place you will need to use WITH_TOFFEE is for
primspec definitions on C++ autograd functions. There is a
macro HAS_PRIMSPEC to ameliorate optionally defining primspec()
virtual overrides on Function classes. HasPrimspec is always
available but will be a zero field class when Toffee is disabled.
NB: We might revert this commit in the future if we figure out a way
to unconditionally enable Toffee that everyone likes.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
This commit adds a new exporter pass which takes a graph and returns
a string of the human-readable protobuf representation of a model.
We have two strategies for how conversions are implemented:
- If a Python autograd function has a primspec static method, we invoke
it to get the Toffee conversion. Use torch.toffee.op to generate the
format expected to be returned. The particular data representation is opaque
and subject to change in the future.
- Otherwise, there's a giant if statement in the exporter, which manually
uses the JIT IR C++ API and Toffee IR C++ protobuf API to convert.
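A hedged sketch of the first strategy: an autograd Function that carries its own converter. The primspec hook and torch.toffee.op helper follow this entry's description; exact signatures are illustrative (the later API renames the hook to `symbolic`):
```
import torch
from torch.autograd import Function

class MyRelu(Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        return grad_output * (x > 0).type_as(grad_output)

    # Converter consulted by the exporter; returns the Toffee node(s) that
    # replace this op in the exported graph (illustrative signature).
    @staticmethod
    def primspec(x):
        return torch.toffee.op("Relu", x)
```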
You must check out a copy of the ToffeeIR repo
https://github.com/ProjectToffee/ToffeeIR at torch/lib; at the moment
we don't have a subtree/submodule set up.
Technical debt in this commit:
- To get protobuf headers in scope, we unconditionally add $CONDA_PREFIX/include
to the include path. This needs to be replaced with a more robust mechanism.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
The API works on either functions or models, taking an extra parameter argument
so that functions can pass in additional variables to trace.
Other behavior is folded into boolean options:
- time: collect stats for our own perf debugging
- verify: run the original code, and check it is within threshold
- optimize: run optimization (currently off until fusiongroups pr is accepted).
- enabled: flag to turn off tracing so you can check timing of stuff that cannot be traced.
The approach is based on THC's pointwiseApply{1,2,3} family of kernels,
but doesn't have any dependencies on that code.
Adjacent contiguous dimensions of input tensors are compressed to reduce the complexity of indexing math.
For the completely contiguous case, the indexing logic simplifies to just the linear index.
In simple tests, this code matched or beat the equivalent from THC.
- To test whether or not a multiline string matches some expected
value, you can use assertExpected. This tests that the string
matches the content stored at a file based on the name of the
test (and an optional subname parameter you can pass if you
want to call assertExpected multiple times.)
- Suppose you make a change that modifies the output in a big way.
Instead of manually going through and updating each test, you instead
run python test/test_jit.py --accept. This updates all of the expected
outputs. You can now review them one-by-one and make sure your
changes make sense.
We can add more features later (e.g., munging the output to make it
more stable, more sanity checking) but this is just to get us started
testing. One thing to watch out for is that accept tests on intermediate
representation can be a bit wobbly: it is *extremely* important that
people be able to read the IR. It may be worth introducing niceties
to the printer in order to ensure this is the case.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>