* Add cudaEvent support to the profiler
This adds the ability to record cuda timings using cudaEventRecord
in the profiler. Since it doesn't require nvprof it is easier
to run than the nvprof path.
This also records a thread id for each event, which will make
tracing results easier to understand.
* Add flow arrows from cpu to cuda event
* Fix no cuda build
* Review comments
* Move CUDA checks to one place
* Add a JIT interpreter
The separate interpreter is used to run graphs with a lower overhead than
converting them to autograd graphs. Some notes:
* does not support Handles/PythonOp/CppOp, these will be in a future commit
* jit_closure.cpp still exists and we fall back to it for now when the
interpreter cannot handle something because of PythonOp/CppOp
* In order to support retain_graph=True, the interpreter can be cloned,
creating a copy that can be run with different arguments. This is
assumed to be the non-standard case so cloning is not particularly optimized.
No tensor _data_ is copied, but the at::Tensor list in the interpreter is.
If we hit problems, there is a lot we could do (such as register allocation)
to minimize the stuff that needs to be copied.
* Uses a pImpl pattern to keep implementation details out of its header file.
* Modifies the way getTensorOp works so that it reads/writes to already-existing
vectors; this avoids reallocating these buffers on every call.
* Timings are here: https://gist.github.com/zdevito/5a20ac29fb1b9e449e693b67dc478127
This reduces overhead to about the same as running it in python.
It is about 10us faster to run the same thing using ATen directly.
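The cloning behaviour described in the notes above can be sketched in a few lines of Python. This is only an illustration: `InterpreterStateSketch` and its `registers` list are hypothetical names, not the C++ classes from this commit, and plain Python lists stand in for at::Tensor.

```python
class InterpreterStateSketch:
    """Hypothetical stand-in for the interpreter state; not the real C++ class."""

    def __init__(self, registers):
        # Each instance owns its own list of tensor handles.
        self.registers = list(registers)

    def clone(self):
        # Copy the list itself but share the objects it refers to,
        # so no tensor data is duplicated.
        return InterpreterStateSketch(self.registers)


data = [1.0, 2.0, 3.0]              # stands in for tensor storage
original = InterpreterStateSketch([data])
cloned = original.clone()

cloned.registers.append([4.0])      # only the clone's list grows
```

The clone can then be run with different arguments without disturbing the original's register list, while both still point at the same underlying data.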
* Code Mod
Interpreter -> InterpreterState
Function -> Code
Add other requested comments.
* RegList -> ListHandle<T>
Change the RegList functions to be safer by identifying the type of
each argument list, and checking that list insert does not try
to add to two different lists at once.
* Use exactly equal for interp tests
Allow in-place operations on views
Adds VariableViewImpl, a subclass of VariableImpl which has a pointer to
the base Variable on which it is a view. In-place operations on views
change the grad_fn of the base.
Note that in-place operations only work on views that are the first output of the function that created them. All C++/ATen implemented functions have this behavior, but it's possible to write Python-implemented autograd functions that do not. In-place operations on these views will raise an exception.
Fixes #3313
* enable size from ATen type
* temp commit aten thd
* port copy, math
* port random
* changes after rebase
* lapack bind
* thd and csrc compile
* fix min/max reductions in DataChannelTCP
* clean up changes
* re-enable tensor constructors
* port MPI to at::Tensor
* fix storage methods to not cast to thpp storage ptrs
This breaks a lot of the onnx-pytorch tests because the abstraction
barriers are not respected. I'll spin up a patch for that separately.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
The pieces:
- I improved the lint / asserts to catch some bugs which I
committed while working on my export. There are two new
properties which the linter checks now:
(1) "Anticipated uses". If a node says that it is used by
M, M better appear later in the topsort. Previously,
we only checked if it was in all_nodes.
(2) If you are a select node, you better be a multi-type node;
if you're not a select node, you better not be! And you
should never have an input that is multi-type.
- There is a new peephole optimization pass, for simple, local
transformations to graphs. Right now, it implements a simple
optimization: remove 'expand' invocations that are no-ops
(the size before matches the size after), but we can add other
things to it later. I needed this for ONNX because no-op expands
show up in the left-hand argument, which we don't support.
- There is now a broadcast fuser, which fuses ATen expand ops
into broadcastable ONNX ops (Add, Div, Mul, Pow, Sub, Gemm.)
It only fuses when the original size is a suffix of the new
size, as per the ONNX spec.
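The two size checks behind the peephole pass and the broadcast fuser can be sketched as below. This is a minimal illustration under stated assumptions: the function names and the `(op_name, size_before, size_after)` node format are hypothetical and are not the JIT IR C++ API.

```python
def is_noop_expand(size_before, size_after):
    # An expand whose output size equals its input size does nothing.
    return tuple(size_before) == tuple(size_after)


def is_suffix(original_size, new_size):
    # True when original_size matches the trailing dimensions of new_size.
    original_size, new_size = tuple(original_size), tuple(new_size)
    if len(original_size) > len(new_size):
        return False
    return new_size[len(new_size) - len(original_size):] == original_size


def remove_noop_expands(nodes):
    # nodes: list of (op_name, size_before, size_after) tuples (hypothetical format).
    return [n for n in nodes
            if not (n[0] == "expand" and is_noop_expand(n[1], n[2]))]
```

For example, expanding size (3,) to (2, 3) passes the suffix check, while expanding (2,) to (2, 3) does not and would be left unfused.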
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
- Cleaned up THNN and THCUNN code and kernels
- Improved THCUNN kernel performance 5x, making it match cuDNN performance
- Added support for computing softmax over arbitrary dims
NOTE: The default dim for 3D inputs is now 1 (used to be 0)
- Both functions now accept inputs with arbitrarily many dimensions
- Autograd functions no longer save the input (it's unnecessary)
- Added cuDNN bindings for softmax, but they are unused as THCUNN
matches or even exceeds cuDNN performance
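As a reference for what "softmax over arbitrary dims" means, here is a pure-Python sketch. It is not the THNN/THCUNN kernel; splitting the shape into outer, dim, and inner extents is an assumption about one common way to index a contiguous tensor, and the `softmax(data, shape, dim)` signature is hypothetical.

```python
import math


def softmax(data, shape, dim):
    """Softmax along `dim` of a row-major (contiguous) flat list `data`."""
    outer = math.prod(shape[:dim])        # number of slices before `dim`
    dim_size = shape[dim]
    inner = math.prod(shape[dim + 1:])    # stride between neighbours along `dim`
    out = [0.0] * len(data)
    for o in range(outer):
        for i in range(inner):
            idx = [o * dim_size * inner + d * inner + i for d in range(dim_size)]
            m = max(data[k] for k in idx)             # subtract the max for stability
            exps = [math.exp(data[k] - m) for k in idx]
            total = sum(exps)
            for k, e in zip(idx, exps):
                out[k] = e / total
    return out
```

With a shape of (2, 3, 2) and dim=1, each group of three elements that differ only in their middle index sums to 1.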
* skeleton commit for building and linking nnpack library in PyTorch
* first stab at conv forward binding + integration
* bind NNPACK gradient kernels
* move nnpack forward, input gradient calls deeper
* nnpack conv api mimics nn
* fix symbol error; use memory across calls
* clean up warnings, add shape checking, thread safety, configurable thread specification
* add batch size threshold, also bind for single-element batch for the future
This adds some generated autograd functions implemented in C++, which
are generated from derivatives.yaml. It also generates Python bindings
for the Variable methods. The generated files are:
Functions.cpp/h: subclasses of torch::autograd::Function
VariableType.cpp/h: The at::Type for autograd Variables
python_variable_methods.cpp: Python bindings to torch::autograd::Variable
python_variable_methods_dispatch.h: wrapper which releases GIL and sets the
CUDA device
python_functions.cpp/h: exposes generated autograd functions as Python
objects
The generated functions are mostly shadowed by the definitions in
variable.py. We'll remove the Python implementations in favor of the
generated C++ implementations in a subsequent commit.
Variable is now a subclass of at::Tensor backed by a VariableImpl* pImpl. The implementation of the ATen functions is defined in the auto-generated VariableType.h/cpp file.
Currently, only functions which fall through to the base type, such as sizes() and isCuda() are implemented. Differentiable ops like add() and mul() will be added in a subsequent PR.
- Reduce setup.py diff.
- Expunge WITH_TOFFEE from codebase.
- Elaborate on a comment.
- Move gen_toffee.sh to tools
- Delete densenet test.
- Use 'using' to inherit a constructor.
- Delete outdated comment.
- Comment about why primspecs can return fewer outputs.
- Remove dead, commented out includes.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
If it's not set, CMAKE_DEBUG_POSTFIX sets it to 'd' which means the
static library gets named something different when built in debug mode.
This is annoying because it means if you build in debug mode, the
library is in a different place. Rather than teach the build system
to find the correct name, just set this POSTFIX so names don't change.
Also, update setup.py to look for the non-debug archive.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
General strategy:
- nanopb is statically linked into PyTorch. It must be built
with -fPIC.
- Generated nanopb files for toffee.proto are checked into
our repo.
- Because nanopb generated protobufs are C only, we wrote a
wrapper around it to give a Google C++ style interface.
More on this shortly.
How does the wrapper work?
- It's called "micropb" because it is less small than nanopb :)
- nanopb requires all variable-length fields to be written out
using a "callbacks" mechanism.
- We wrote pre-canned callbacks for all of the types ToffeeIR
writes out and lists; these are micropb_callback and
micropb_callback_list. These operate simply by dynamically
allocating and storing the data to be written out in
data (this defeats the purpose of the callback mechanism,
but it's easy to implement)
- Finally some boilerplate to actually implement the wrapper
classes and have owning pointers to the actual data.
Testing strategy:
- Take the serialized protobuf from nanopb, parse it again
with ToffeeIR and print it. Worked with all of test_jit.py!
These tests don't run without 'toffee' being installed.
TODO:
- Update CI to install ToffeeIR, so we can run the Toffee tests
in CI
- Update E2E with Caffe2 tests so that they work with new stuff.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
The general strategy:
- We put all the toffee files in torch/csrc/toffee; they will only be
added when toffee is enabled
- Toffee is enabled if torch/lib/ToffeeIR is present (since we
don't have a submodule/subtree thing going on)
- The most prevalent place you will need to use WITH_TOFFEE is for
primspec definitions on C++ autograd functions. There is a
macro HAS_PRIMSPEC to ameliorate optionally defining primspec()
virtual overrides on Function classes. HasPrimspec is always
available but will be a zero field class when Toffee is disabled.
NB: We might revert this commit in the future if we figure out a way
to unconditionally enable Toffee that everyone likes.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
We want all the conversion code to live in one place. Away it goes!
This means that alexnet protobuf no longer works. It will start working
again when we port changes.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
This commit adds a new exporter pass which takes a graph and returns
a string of the human-readable protobuf representation of a model.
We have two strategies for how conversions are implemented:
- If a Python autograd function has a primspec static method, we invoke
it to get the Toffee conversion. Use torch.toffee.op to generate the
format expected to be returned. The particular data representation is opaque
and subject to change in the future.
- Otherwise, there's a giant if statement in the exporter, which manually
uses the JIT IR C++ API and Toffee IR C++ protobuf API to convert.
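The two conversion strategies amount to a simple dispatch, sketched here in Python. Only the idea of checking for a `primspec` static method comes from this commit message; `convert_node`, the fallback table, and the tuple return values are hypothetical stand-ins (the real format produced by torch.toffee.op is opaque).

```python
def convert_node(fn_class, inputs, fallback_table):
    """Pick a conversion strategy for one autograd function class."""
    primspec = getattr(fn_class, "primspec", None)
    if primspec is not None:
        # Strategy 1: the function describes its own conversion.
        return primspec(*inputs)
    # Strategy 2: fall back to hand-written conversions
    # (standing in for the exporter's big if statement).
    name = fn_class.__name__
    if name in fallback_table:
        return fallback_table[name](*inputs)
    raise NotImplementedError("no conversion for " + name)


class Relu:
    @staticmethod
    def primspec(x):
        return ("Relu", x)


class Add:
    pass  # no primspec, so the fallback table is consulted


fallbacks = {"Add": lambda a, b: ("Add", a, b)}
```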
You must check out a copy of the ToffeeIR repo
https://github.com/ProjectToffee/ToffeeIR at torch/lib; at the moment
we don't have a subtree/submodule set up.
Technical debt in this commit:
- To get protobuf headers in scope, we unconditionally add $CONDA_PREFIX/include
to the include path. This needs to be replaced with a more robust mechanism.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Approach is based on the approach of THC's pointwiseApply{1,2,3} family of kernels,
but doesn't have any dependencies on that code.
Adjacent contiguous dimensions of input tensors are compressed to reduce the complexity of indexing math.
For the completely contiguous case, the indexing logic simplifies to just the linear index.
In simple tests, this code matched or beat the equivalent from THC.
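The dimension compression step can be illustrated with a small Python sketch. This is not the kernel code from this commit; `collapse_dims` is a hypothetical helper that shows the rule for merging two adjacent dimensions: the outer stride must equal the inner size times the inner stride.

```python
def collapse_dims(sizes, strides):
    """Merge adjacent dimensions that are contiguous with each other."""
    # Size-1 dimensions never contribute to the offset, so drop them first.
    dims = [(s, st) for s, st in zip(sizes, strides) if s != 1]
    if not dims:
        return [1], [1]
    out = [dims[0]]
    for size, stride in dims[1:]:
        prev_size, prev_stride = out[-1]
        if prev_stride == size * stride:
            # The outer dim steps exactly over the inner one: fuse them.
            out[-1] = (prev_size * size, stride)
        else:
            out.append((size, stride))
    return [s for s, _ in out], [st for _, st in out]
```

A fully contiguous tensor of sizes (2, 3, 4) with strides (12, 4, 1) collapses to a single dimension of size 24 with stride 1, which is the linear-index case mentioned above. A transposed tensor with sizes (3, 2) and strides (1, 3) cannot be collapsed and keeps both dimensions.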
Previously, our AST was a DAG, where shared Nodes indicated a computation
should be reused. This commit rewrites the IR into a new functional
representation which represents sharing explicitly using variable
bindings.
We offer a few justifications for this new style:
1. The new representation is not all that different from the
old one; it is about as easy to construct, and the lack of an
explicit graph doesn't negatively impact our ability to interpret
the graph, since we've chosen, as a matter of design, to NOT have
the IR participate in the actual execution of a graph.
2. The new let-binding representation has an implicit ordering,
which we can use to conveniently keep track of the original order
the trace showed up as. This automatically gives us a topsort,
and gives us an easier to read textual representation of our
IR:
%14 = Embedding %11, %0, -1, None, 2, False, False
%15 = Dropout %14, 0.2, True, False
%16 = Index %12, 0
%17 = Index %12, 1
%18 = Index %13, 0
%19 = Index %13, 1
%20 = Index %15, 0
%21 = Linear %20, %1, %3
%22 = Linear %16, %2, %4
3. It moves us closer to a Futhark style language
(http://futhark-lang.org/publications/pldi17.pdf).
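To illustrate why the implicit ordering of let-bindings already gives a topsort, here is a toy Python evaluator. It is only a sketch: the `(target, op_name, args)` tuple format, the `ops` table, and `run_bindings` are hypothetical and unrelated to the C++ Expr/Arg classes.

```python
def run_bindings(bindings, env, ops):
    """Evaluate bindings in program order, which is already a topological order."""
    env = dict(env)  # do not mutate the caller's environment
    for target, op_name, args in bindings:
        # Arguments starting with '%' are locals; everything else is a constant.
        values = [env[a] if isinstance(a, str) and a.startswith("%") else a
                  for a in args]
        env[target] = ops[op_name](*values)
    return env


ops = {"Add": lambda a, b: a + b, "Mul": lambda a, b: a * b}
program = [
    ("%2", "Add", ["%0", "%1"]),
    ("%3", "Mul", ["%2", 10]),
]
result = run_bindings(program, {"%0": 1, "%1": 2}, ops)
```

Because every local is defined before it is used, a single forward pass over the list is enough; no separate graph traversal is needed.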
Major aspects of the diff
- Node is replaced with Expr and Arg, a pair of mutually recursive
structures which represent our new language. In BNF, the language
looks like this:
a ::= c | %i
e ::= %i, ... = e
| PyOp e, ...
| Ret %i, ...
Technically, Ret is not actually a return (no control flow is involved),
it just tuples up a series of tensors (identified by variables).
One important invariant is that locals are always tensors; they
are never constants (this is asymmetric with Args.)
- Arguments support Python constants. This is an important piece because
many operators take extra Python literals like integers and tuples in
order to specify extra parameters about how an operator operates. Adding
this was essential to getting word_language_model to work.
- As both Expr and Arg have multiple variants, there is new infrastructure
for doing case on the variants using ExprVisitor and ArgVisitor. The
strategy here is adapted from WebAssembly's visitors, although we have
generalized to permit arbitrary argument forwarding, which is necessary
to support tail-recursive visitor calls. TCO is important because our
interpreter may recurse arbitrarily deep into a stack of nested lets.
If users wish, they can also manually case on the type tag.
- Tracing is now turned on and off using _tracer_enter/_tracer_exit in
torch._C. _tracer_enter accepts a list of variables which are to be
treated as arguments; _tracer_exit accepts the list of traced variables
which should be returned when you reexecute the trace, and returns
the trace expression which can be reexecuted. GlobalTracingState
is a global variable which tracks whether we are currently tracing.
- You use run_forward to execute a trace on some set of parameters.
- While tracing, each variable keeps track, via trace_local, of the name
of its corresponding local in the IR.
Here is a simple runner which leaks memory but can be used to JIT models:
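    import torch.autograd.function as F
    import torch._C
    from torch.autograd import Variable  # needed for Variable._execution_engine

    def jit(model):
        import types
        real_forward = model.forward
        def forward(self, *args):
            def flatten(x):
                return tuple(F._iter_variables(x))
            if not hasattr(self, "saved_trace"):
                # First call: run the real forward under the tracer and keep the trace.
                torch._C._tracer_enter(tuple(self.parameters()) + flatten(args))
                out = real_forward(*args)
                self.saved_trace = torch._C._tracer_exit(flatten(out))
                self.saved_outs = out
                return out
            else:
                # Later calls: re-execute the saved trace instead of the Python code.
                flat_out = Variable._execution_engine.run_forward(self.saved_trace, tuple(self.parameters()) + flatten(args))
                return F._unflatten(flat_out, self.saved_outs)
        # Presumably intended (otherwise `types` is unused): install the wrapper.
        model.forward = types.MethodType(forward, model)
        return model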
Major problems:
- Sanity checking is spotty at best, especially when users pass in variables.
- The interpreter leaks tensor memory from the store. When we add back def-use
we should be able to deallocate tensors as soon as we know they are no longer
necessary.
- The interpreter needs to reach feature parity with the old execution engine.
From there, we need to see if backwards can be subsumed as well.
- I still have no confidence that memory is managed correctly everywhere.
This requires a close look.
- Rather than return an *open* expression as a trace, we should return a
*lambda* instead, which knows about how many formal parameters it
requires.
- The IR is not introspectable from Python at the moment, but this is simply a
matter of implementing all the binding code.
- The tracer is NOT reentrant (you can't trace while you're inside a trace.)
Furthermore, no sanity checking is done if you try to incorrectly reuse
things from one trace in another.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
When working on PyTorch dependencies we often want to rebuild only that
dependency and the Python extension. You can now do that by running:
python setup.py build_thc
to only re-build THC
Primary things I had to fix:
- Suppress _XOPEN_SOURCE warnings by ensuring that Python.h is included
first, because it always unconditionally defines this macro.
- Turn off strict aliasing, because Python 2 doesn't work with strict
aliasing.
- Workaround setuptools bug, where it's incorrectly passing
-Wstrict-prototypes to C++ compilers (where this doesn't make
any sense)
To compile csrc with -Werror, run `CFLAGS="-Werror" python setup.py build_ext`
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Because of this Variables can no longer appear in the graph.
Every usage of a leaf Variable will leave an AccumulateGrad
function that has no outputs, but modifies var.grad as a side
effect.
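The role of AccumulateGrad as a graph sink can be sketched in pure Python. The classes below are hypothetical stand-ins, not the C++ implementation; plain floats stand in for gradient tensors.

```python
class LeafSketch:
    """Hypothetical stand-in for a leaf Variable."""

    def __init__(self):
        self.grad = None


class AccumulateGradSketch:
    """Graph sink: consumes a gradient and returns no outputs."""

    def __init__(self, var):
        self.var = var

    def __call__(self, grad):
        # Side effect only: add the incoming gradient into var.grad.
        if self.var.grad is None:
            self.var.grad = grad
        else:
            self.var.grad = self.var.grad + grad
        return ()


leaf = LeafSketch()
sink = AccumulateGradSketch(leaf)
first = sink(1.5)    # gradient arriving from one use of the leaf
second = sink(2.0)   # gradient arriving from another use
```

After both calls the leaf's grad holds the sum of the two incoming gradients, and neither call produced an output for the engine to propagate further.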
This ensures that we use the same library at the C++ level and with
Python ctypes. It moves the searching for the correct library from
run-time to compile-time.
The core autograd Variable, Function, and Engine no longer depend on the
Python API. This lets us implement functions in C++. In the future, we
can also multithread engine and release the GIL for most of the
non-Python backwards.
See issue #20
The torch.Size class is a tuple subclass which distinguishes sizes from
other tuples so that torch.Tensor(size) is interpreted as size instead
of data.
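The idea of using a tuple subclass as a marker can be sketched as follows. The `Size` class here is a toy, not torch.Size itself, and `interpret_constructor_arg` is a hypothetical helper showing how a constructor could tell the two cases apart.

```python
class Size(tuple):
    """Marker subclass: behaves like a tuple but is recognisable as a size."""


def interpret_constructor_arg(arg):
    # A Size means "allocate a tensor of this shape";
    # any other sequence is treated as the data itself.
    if isinstance(arg, Size):
        return ("size", tuple(arg))
    return ("data", list(arg))
```

Since `Size` is still a tuple, it compares equal to an ordinary tuple with the same elements and supports indexing and unpacking; only the type check distinguishes it.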
The from_buffer method is similar to numpy's frombuffer. It decodes a Python
buffer object into a Storage object. For byte and char storages, it
simply copies the bytes.
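What "decoding a buffer" involves can be illustrated with the standard library's struct module. This is a sketch, not the from_buffer implementation: the function names and the `byte_order` parameter are assumptions made for the example.

```python
import struct


def floats_from_buffer(buf, byte_order="little"):
    """Decode raw bytes into a list of 32-bit floats."""
    prefix = "<" if byte_order == "little" else ">"
    count = len(buf) // 4                      # 4 bytes per float32
    return list(struct.unpack(prefix + str(count) + "f", buf[:count * 4]))


def bytes_from_buffer(buf):
    # For byte-sized elements no decoding is needed: just copy.
    return bytearray(buf)


raw = struct.pack("<2f", 1.0, 2.5)
values = floats_from_buffer(raw)
```

Multi-byte element types need to know the byte order of the source buffer, while byte and char data can be copied as-is.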