Summary:
This PR allows Timer to collect deterministic instruction counts for (some) snippets. Because of the intrusive nature of Valgrind (effectively replacing the CPU with an emulated one) we have to perform our measurements in a separate process. This PR writes a `.py` file containing the Timer's `setup` and `stmt`, and executes it within a `valgrind` subprocess along with a plethora of checks and error handling. There is still a bit of jitter around the edges due to the Python glue that I'm using, but the PyTorch signal is quite good and thus this provides a low friction way of getting signal. I considered using JIT as an alternative, but:
A) Python specific overheads (e.g. parsing) are important
B) JIT might do rewrites which would complicate measurement.
Consider the following bit of code, related to https://github.com/pytorch/pytorch/issues/44484:
```
from torch.utils._benchmark import Timer
counts = Timer(
"x.backward()",
setup="x = torch.ones((1,)) + torch.ones((1,), requires_grad=True)"
).collect_callgrind()
for c, fn in counts[:20]:
print(f"{c:>12} {fn}")
```
```
812800 ???:_dl_update_slotinfo
355600 ???:update_get_addr
308300 work/Python/ceval.c:_PyEval_EvalFrameDefault'2
304800 ???:__tls_get_addr
196059 ???:_int_free
152400 ???:__tls_get_addr_slow
138400 build/../c10/core/ScalarType.h:c10::typeMetaToScalarType(caffe2::TypeMeta)
126526 work/Objects/dictobject.c:_PyDict_LoadGlobal
114268 ???:malloc
101400 work/Objects/unicodeobject.c:PyUnicode_FromFormatV
85900 work/Python/ceval.c:_PyEval_EvalFrameDefault
79946 work/Objects/typeobject.c:_PyType_Lookup
72000 build/../c10/core/Device.h:c10::Device::validate()
70000 /usr/include/c++/8/bits/stl_vector.h:std::vector<at::Tensor, std::allocator<at::Tensor> >::~vector()
66400 work/Objects/object.c:_PyObject_GenericGetAttrWithDict
63000 ???:pthread_mutex_lock
61200 work/Objects/dictobject.c:PyDict_GetItem
59800 ???:free
58400 work/Objects/tupleobject.c:tupledealloc
56707 work/Objects/dictobject.c:lookdict_unicode_nodummy
```
Moreover, if we backport this PR to 1.6 (just copy the `_benchmarks` folder) and load those counts as `counts_1_6`, then we can easily diff them:
```
print(f"Head instructions: {sum(c for c, _ in counts)}")
print(f"1.6 instructions: {sum(c for c, _ in counts_1_6)}")
count_dict = {fn: c for c, fn in counts}
for c, fn in counts_1_6:
_ = count_dict.setdefault(fn, 0)
count_dict[fn] -= c
count_diffs = sorted([(c, fn) for fn, c in count_dict.items()], reverse=True)
for c, fn in count_diffs[:15] + [["", "..."]] + count_diffs[-15:]:
print(f"{c:>8} {fn}")
```
```
Head instructions: 7609547
1.6 instructions: 6059648
169600 ???:_dl_update_slotinfo
101400 work/Objects/unicodeobject.c:PyUnicode_FromFormatV
74200 ???:update_get_addr
63600 ???:__tls_get_addr
46800 work/Python/ceval.c:_PyEval_EvalFrameDefault
33512 work/Objects/dictobject.c:_PyDict_LoadGlobal
31800 ???:__tls_get_addr_slow
31700 build/../aten/src/ATen/record_function.cpp:at::RecordFunction::RecordFunction(at::RecordScope)
28300 build/../torch/csrc/utils/python_arg_parser.cpp:torch::FunctionSignature::parse(_object*, _object*, _object*, _object**, bool)
27800 work/Objects/object.c:_PyObject_GenericGetAttrWithDict
27401 work/Objects/dictobject.c:lookdict_unicode_nodummy
24115 work/Objects/typeobject.c:_PyType_Lookup
24080 ???:_int_free
21700 work/Objects/dictobject.c:PyDict_GetItemWithError
20700 work/Objects/dictobject.c:PyDict_GetItem
...
-3200 build/../c10/util/SmallVector.h:at::TensorIterator::binary_op(at::Tensor&, at::Tensor const&, at::Tensor const&, bool)
-3400 build/../aten/src/ATen/native/TensorIterator.cpp:at::TensorIterator::resize_outputs(at::TensorIteratorConfig const&)
-3500 /usr/include/c++/8/x86_64-redhat-linux/bits/gthr-default.h:std::unique_lock<std::mutex>::unlock()
-3700 build/../torch/csrc/utils/python_arg_parser.cpp:torch::PythonArgParser::raw_parse(_object*, _object*, _object**)
-4207 work/Objects/obmalloc.c:PyMem_Calloc
-4500 /usr/include/c++/8/bits/stl_vector.h:std::vector<at::Tensor, std::allocator<at::Tensor> >::~vector()
-4800 build/../torch/csrc/autograd/generated/VariableType_2.cpp:torch::autograd::VariableType::add__Tensor(at::Tensor&, at::Tensor const&, c10::Scalar)
-5000 build/../c10/core/impl/LocalDispatchKeySet.cpp:c10::impl::ExcludeDispatchKeyGuard::ExcludeDispatchKeyGuard(c10::DispatchKey)
-5300 work/Objects/listobject.c:PyList_New
-5400 build/../torch/csrc/utils/python_arg_parser.cpp:torch::FunctionParameter::check(_object*, std::vector<pybind11::handle, std::allocator<pybind11::handle> >&)
-5600 /usr/include/c++/8/bits/std_mutex.h:std::unique_lock<std::mutex>::unlock()
-6231 work/Objects/obmalloc.c:PyMem_Free
-6300 work/Objects/listobject.c:list_repeat
-11200 work/Objects/listobject.c:list_dealloc
-28900 build/../torch/csrc/utils/python_arg_parser.cpp:torch::FunctionSignature::parse(_object*, _object*, _object**, bool)
```
Remaining TODOs:
* Include a timer in the generated script for cuda sync.
* Add valgrind to CircleCI machines and add a unit test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44717
Reviewed By: soumith
Differential Revision: D24010742
Pulled By: robieta
fbshipit-source-id: df6bc765f8efce7193893edba186cd62b4b23623
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42617
While we figure out the random plan, I want to initially disable
support for random operations. This is because there is an ambiguity in
what randomness means. For example,
```
tensor = torch.zeros(B0, 1)
vmap(lambda t: t.normal_())(tensor)
```
in the above example, should tensor[0] and tensor[1] be equal (i.e.,
use the same random seed), or should they be different?
The mechanism for disabling random support is as follows:
- We add a new dispatch key called VmapMode
- Whenever we're inside vmap, we enable VmapMode for all tensors.
This is done via at::VmapMode::increment_nesting and
at::VmapMode::decrement_nesting.
- DispatchKey::VmapMode's fallback kernel is the fallthrough kernel.
- We register kernels that raise errors for all random functions on
DispatchKey::VmapMode. This way, whenever someone calls a random
function on any tensor (not just BatchedTensors) inside of a vmap block,
an error gets thrown.
Test Plan: - pytest test/test_vmap.py -v -k "Operators"
Reviewed By: ezyang
Differential Revision: D22954840
Pulled By: zou3519
fbshipit-source-id: cb8d71062d4087e10cbf408f74b1a9dff81a226d
Summary:
This PR adds the `torch.linalg` namespace as part of our continued effort to be more compatible with NumPy. The namespace is tested by adding a single function, `torch.linalg.outer`, and testing it in a new test suite, test_linalg.py. It follows the same pattern that https://github.com/pytorch/pytorch/pull/41911, which added the `torch.fft` namespace, did.
Future PRs will likely:
- add more functions to torch.linalg
- expand the testing done in test_linalg.py, including legacy functions, like torch.ger
- deprecate existing linalg functions outside of `torch.linalg` in preference to the new namespace
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42664
Reviewed By: ngimel
Differential Revision: D22991019
Pulled By: mruberry
fbshipit-source-id: 39258d9b116a916817b3588f160b141f956e5d0b
Summary:
This PR creates a new namespace, torch.fft (torch::fft) and puts a single function, fft, in it. This function is analogous to is a simplified version of NumPy's [numpy.fft.fft](https://numpy.org/doc/1.18/reference/generated/numpy.fft.fft.html?highlight=fft#numpy.fft.fft) that accepts no optional arguments. It is intended to demonstrate how to add and document functions in the namespace, and is not intended to deprecate the existing torch.fft function.
Adding this namespace was complicated by the existence of the torch.fft function in Python. Creating a torch.fft Python module makes this name ambiguous: does it refer to a function or module? If the JIT didn't exist, a solution to this problem would have been to make torch.fft refer to a callable class that mimicked both the function and module. The JIT, however, cannot understand this pattern. As a workaround it's required to explicitly `import torch.fft` to access the torch.fft.fft function in Python:
```
import torch.fft
t = torch.randn(128, dtype=torch.cdouble)
torch.fft.fft(t)
```
See https://github.com/pytorch/pytorch/issues/42175 for future work. Another possible future PR is to get the JIT to understand torch.fft as a callable class so it need not be imported explicitly to be used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41911
Reviewed By: glaringlee
Differential Revision: D22941894
Pulled By: mruberry
fbshipit-source-id: c8e0b44cbe90d21e998ca3832cf3a533f28dbe8d
Summary:
According to pytorch/rfcs#3
From the goals in the RFC:
1. Support subclassing `torch.Tensor` in Python (done here)
2. Preserve `torch.Tensor` subclasses when calling `torch` functions on them (done here)
3. Use the PyTorch API with `torch.Tensor`-like objects that are _not_ `torch.Tensor`
subclasses (done in https://github.com/pytorch/pytorch/issues/30730)
4. Preserve `torch.Tensor` subclasses when calling `torch.Tensor` methods. (done here)
5. Propagating subclass instances correctly also with operators, using
views/slices/indexing/etc. (done here)
6. Preserve subclass attributes when using methods or views/slices/indexing. (done here)
7. A way to insert code that operates on both functions and methods uniformly
(so we can write a single function that overrides all operators). (done here)
8. The ability to give external libraries a way to also define
functions/methods that follow the `__torch_function__` protocol. (will be addressed in a separate PR)
This PR makes the following changes:
1. Adds the `self` argument to the arg parser.
2. Dispatches on `self` as well if `self` is not `nullptr`.
3. Adds a `torch._C.DisableTorchFunction` context manager to disable `__torch_function__`.
4. Adds a `torch::torch_function_enabled()` and `torch._C._torch_function_enabled()` to check the state of `__torch_function__`.
5. Dispatches all `torch._C.TensorBase` and `torch.Tensor` methods via `__torch_function__`.
TODO:
- [x] Sequence Methods
- [x] Docs
- [x] Tests
Closes https://github.com/pytorch/pytorch/issues/28361
Benchmarks in https://github.com/pytorch/pytorch/pull/37091#issuecomment-633657778
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37091
Reviewed By: ngimel
Differential Revision: D22765678
Pulled By: ezyang
fbshipit-source-id: 53f8aa17ddb8b1108c0997f6a7aa13cb5be73de0
Summary:
Adds `torch.experimental.deterministic` flag to enforce deterministic algorithms across all of pytorch.
Adds `torch.experimental.deterministic_error_level` to allow users to choose between error/warning/silent if determinism for an operation is not available.
Adds `torch.experimental.alert_not_deterministic()` which should be called within operations that are not deterministic.
Offers both Python and ATen interfaces
Issue https://github.com/pytorch/pytorch/issues/15359
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38683
Differential Revision: D21998093
Pulled By: ezyang
fbshipit-source-id: 23aabbddd20f6199d846f97764ff24d728163737
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38173
- Introduce torch.types.Device representing all "device-like" types
- Stubbed torch.device.__reduce__
- Stubbed all torch._C functions comprehensively
- Deleted _safe_call which is unused throughout the codebase
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D21497399
Pulled By: ezyang
fbshipit-source-id: 1f534442b0ec9a70d556545d072f2c06a08b9d15
Summary:
It just depends on a single `torch_python` library.
C library does not depend on standard C++ library and as result it closes https://github.com/pytorch/pytorch/issues/36941
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39375
Reviewed By: orionr
Differential Revision: D21840645
Pulled By: malfet
fbshipit-source-id: 777c189feee9d6fc686816d92cb9f109b8aac7ca
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35614
Python 2 has reached end-of-life and is no longer supported by PyTorch.
Now we can clean up a lot of cruft that we put in place to support it.
These changes were all done manually, and I skipped anything that seemed
like it would take more than a few seconds, so I think it makes sense to
review it manually as well.
Test Plan: CI
Differential Revision: D20842876
Pulled By: dreiss
fbshipit-source-id: 18abf0d324ed2185ec6d27c864e935d856dcc6ad
Summary:
Fixes https://github.com/pytorch/pytorch/issues/37259, fixes https://github.com/pytorch/pytorch/issues/20156
This lazily calls `at::init_num_threads` once for each thread by adding a call to `lazy_init_num_threads` in `at::parallel_for` and `at::parallel_reduce`.
If this solution is okay, then we should add the same to guard other places that might use MKL or OpenMP.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37461
Reviewed By: ezyang
Differential Revision: D21472763
Pulled By: ilia-cher
fbshipit-source-id: 889d6664f5bd4080037ade02ee324b1233992915
Summary:
Following up on this: https://github.com/pytorch/pytorch/pull/35851 cross dtype storage copy is not being used internally, so I have not included cross dtype copy for complex.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35771
Differential Revision: D21319650
Pulled By: anjali411
fbshipit-source-id: 07c72996ee598eba0cf401ad61534494d6f5b5b3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37527
This is yet another place that needs to be updated for adding a new "Backend" and is unnecessary. Instead, just use layout_from_backend and have a map from Layout -> THPLayout.
Other changes:
- rename torch::getDtype and torch::getLayout to torch::getTHPDtype and torch::getTHPLayout since e.g. for layout you are both passing in and returning a "layout" type.
- add NumOptions to Layout to match the dtype/ScalarType formulation.
Test Plan: Imported from OSS
Differential Revision: D21309836
Pulled By: gchanan
fbshipit-source-id: ede0e4f3bf7ff2cd04a9b17df020f0d4fd654ba3
Summary:
Reland of https://github.com/pytorch/pytorch/pull/35061 ; removed
the get qualified type name magic from debug strings to work around
MSVC 2017 bug.
Main points of the new API:
- You can register implementations (impl) without having to specify a schema.
- Registrations are commutative, so no matter what order your static
initializers run, you end up with the same end result.
op_registration_test.cpp contains a reasonably comprehensive accounting
for the available API surface
How does this implementation proceed? The basic concept is to relax the
internal invariants of Dispatcher data structures to allow the
possibility that a FunctionSchema is not specified in an Operator.
- DispatchKeyExtractor has an uninitialized state where it doesn't look
for dispatch keys in any arguments of the stack. It can have a
schema (de)registered to itself post facto with
registerSchema/unregisterSchema.
- DispatchTable has a new constructor taking only an OperatorName for
the uninitialized state. It can have a schema (de)registered to itself
post facto with registerSchema/unregisterSchema
- OperatorDef maintains counts of both defs and well as defs_and_impls.
defs_and_impls keeps track of the outstanding impl registrations; you
may have impl registrations but no defs. If there are no defs (no
schema), the operator is not returned by findSchema. A new
findOperatorByName fucntion unconditionally returns the OperatorHandle
even if there's no schema. OperatorHandle::hasSchema can be used
to check if the operator has schema.
- Replaced 'registerKernel' with 'registerImpl', which is the new
interface for directly registering kernels without implementations.
- Because 'registerImpl' no longer requires an OperatorHandle, change
'registerDef' to only return a RegistrationHandleRAII. This is marginally
less efficient (since we're doing two hash table lookups on a registration
now), but this won't matter in the long term, and probably doesn't
matter now either.
- Rename registerBackendFallbackKernel to registerFallback (this exposed
a bunch of places where we're improperly directly interfacing with Dispatcher;
we need to add this capability to the true public API)
- All code generated internal registrations are switched to use the new
API. This includes VariableType registrations (which previously
weren't converted) and the mobile autograd stuff
- Switch the new-style def()/impl() APIs to interact directly with Dispatcher,
rather than indirecting through the old API
- We deleted alias analysis kind merging entirely. As a nod to BC, it's
possible to define a full schema with alias analysis kind, and then
later do another full schema def with missing alias analysis kind, but
the opposite direction is not allowed. We can remove this entirely
following the plan at https://github.com/pytorch/pytorch/issues/35040
- Schema matching is moved inside the dispatcher, because we might not
be able to immediately schema match at the point of an impl() (because
we don't have the schema yet). To do this, we store the inferred
function schema inside a KernelEntry, so we can check it when we get
the real schema.
- Registered kernel functions now store a debug string which
can be used to more easily identify them. Tests use this to
distinguish between multiple distinct registrations; regular
invocations get only very basic information.
Because we need our static initializers to work no matter what order
they're run, the testing strategy on this PR is quite involved.
The general concept:
- Bind a (very gimped) version of the dispatcher API from Python,
so that we can easily write a more complex testing harness
using expect tests.
- For series of registrations we want to test, exhaustively
test every possible permutation of registrations (and
deregistrations), and show that the intermediate states
agree no matter what path is taken.
- Intermediate states are rendered using a new dumpState()
debugging method that prints the internal state of the
dispatcher. This method may be generally useful for people
who want to see what's in the dispatcher.
- Simultaneously, add a new invariant testing function which
checks that the internal invariants of the dispatcher are
upheld (so we don't have to print internal implementation
details of the dispatcher)
The testing framework found a few bugs in development. For example,
here is a case where we registered schema too early, before checking
if it was valid:
```
Traceback (most recent call last):
File "test/test_dispatch.py", line 164, in test_def_impl_schema_mismatch
], raises=True)
File "test/test_dispatch.py", line 135, in commute
results=results, raises=raises)
File "test/test_dispatch.py", line 83, in run_permutation
.format(ctor_order[:i], op_ix))
File "test/test_dispatch.py", line 59, in check_invariants
.format(expected_provenance, actual_provenance)
AssertionError: 'name[16 chars]ema: (none)\ncatchall: boxed unboxed :: (Tenso[18 chars]0)\n' != 'name[16 chars]ema: test::foo(Tensor x, Tensor y) -> (Tensor)[53 chars]0)\n'
name: test::foo
- schema: (none)
+ schema: test::foo(Tensor x, Tensor y) -> (Tensor)
catchall: boxed unboxed :: (Tensor _0) -> (Tensor _0)
: expected from running ctors (1,); actual from running ctors (1,) and then failing to run ctor 0 (did this failure leave the dispatcher in a wedged state? it shouldn't!)
```
There are also C++ smoketests for the API. These tests comprehensively
cover the C++ API surface of the new operator registration API, but
don't check very hard if the API does the right thing (that's what
test_dispatch.py is for)
Some miscellaneous changes which could have been split into other
PRs, but I was too lazy to do so:
- Add torch::jit::parseName (mirroring parseSchema/parseSchemaOrName)
- Add cloneWithName functionality to FunctionSchema
- Unconditionally generate schema registration, even when type_method_dispatch
is a dict. The one exception is for manual registrations....
- Add fallback, CppFunction::makeFallthrough and
CppFunction::makeFromBoxedFunction to public API of op_registration, so we can
stop calling internal registerImpl directly
- Add new syntax sugar dispatch_autograd for registering autograd kernels
- Minor OperatorName cleanup, storing OperatorName in DispatchTable
and defining operator<< on OperatorName
- Refactored the op registration API to take FunctionSchema directly.
We now do namespacing by post facto fixing up the OperatorName
embedded in FunctionSchema. This also means that you can
now do torch::import("ns1").def("ns2::blah") and have the ns2
override ns1 (although maybe this is not the correct behavior.)
- New torch::schema public API, for attaching alias analysis kind
annotation kinds. This meant we had to template up some function
signatures which previously took const char*. There's now a nice
comment explaining this strategy.
- torch::import now takes std::string which means we can use
the namespacing from Python
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35629
Differential Revision: D20724551
Pulled By: ezyang
fbshipit-source-id: befa46a1affb4ec4ae1fb39e3564a63695a6ca41
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35061
Main points of the new API:
- You can register implementations (impl) without having to specify a schema.
- Registrations are commutative, so no matter what order your static
initializers run, you end up with the same end result.
op_registration_test.cpp contains a reasonably comprehensive accounting
for the available API surface
How does this implementation proceed? The basic concept is to relax the
internal invariants of Dispatcher data structures to allow the
possibility that a FunctionSchema is not specified in an Operator.
- DispatchKeyExtractor has an uninitialized state where it doesn't look
for dispatch keys in any arguments of the stack. It can have a
schema (de)registered to itself post facto with
registerSchema/unregisterSchema.
- DispatchTable has a new constructor taking only an OperatorName for
the uninitialized state. It can have a schema (de)registered to itself
post facto with registerSchema/unregisterSchema
- OperatorDef maintains counts of both defs and well as defs_and_impls.
defs_and_impls keeps track of the outstanding impl registrations; you
may have impl registrations but no defs. If there are no defs (no
schema), the operator is not returned by findSchema. A new
findOperatorByName fucntion unconditionally returns the OperatorHandle
even if there's no schema. OperatorHandle::hasSchema can be used
to check if the operator has schema.
- Replaced 'registerKernel' with 'registerImpl', which is the new
interface for directly registering kernels without implementations.
- Because 'registerImpl' no longer requires an OperatorHandle, change
'registerDef' to only return a RegistrationHandleRAII. This is marginally
less efficient (since we're doing two hash table lookups on a registration
now), but this won't matter in the long term, and probably doesn't
matter now either.
- Rename registerBackendFallbackKernel to registerFallback (this exposed
a bunch of places where we're improperly directly interfacing with Dispatcher;
we need to add this capability to the true public API)
- All code generated internal registrations are switched to use the new
API. This includes VariableType registrations (which previously
weren't converted) and the mobile autograd stuff
- Switch the new-style def()/impl() APIs to interact directly with Dispatcher,
rather than indirecting through the old API
- We deleted alias analysis kind merging entirely. As a nod to BC, it's
possible to define a full schema with alias analysis kind, and then
later do another full schema def with missing alias analysis kind, but
the opposite direction is not allowed. We can remove this entirely
following the plan at https://github.com/pytorch/pytorch/issues/35040
- Schema matching is moved inside the dispatcher, because we might not
be able to immediately schema match at the point of an impl() (because
we don't have the schema yet). To do this, we store the inferred
function schema inside a KernelEntry, so we can check it when we get
the real schema.
- Registered kernel functions now store a debug string which
can be used to more easily identify them. There's some best
effort stuff based on __FUNCSIG__ but this is only really
capable of reporting types and not function symbols. Tests
use this to distinguish between multiple distinct registrations.
Because we need our static initializers to work no matter what order
they're run, the testing strategy on this PR is quite involved.
The general concept:
- Bind a (very gimped) version of the dispatcher API from Python,
so that we can easily write a more complex testing harness
using expect tests.
- For series of registrations we want to test, exhaustively
test every possible permutation of registrations (and
deregistrations), and show that the intermediate states
agree no matter what path is taken.
- Intermediate states are rendered using a new dumpState()
debugging method that prints the internal state of the
dispatcher. This method may be generally useful for people
who want to see what's in the dispatcher.
- Simultaneously, add a new invariant testing function which
checks that the internal invariants of the dispatcher are
upheld (so we don't have to print internal implementation
details of the dispatcher)
The testing framework found a few bugs in development. For example,
here is a case where we registered schema too early, before checking
if it was valid:
```
Traceback (most recent call last):
File "test/test_dispatch.py", line 164, in test_def_impl_schema_mismatch
], raises=True)
File "test/test_dispatch.py", line 135, in commute
results=results, raises=raises)
File "test/test_dispatch.py", line 83, in run_permutation
.format(ctor_order[:i], op_ix))
File "test/test_dispatch.py", line 59, in check_invariants
.format(expected_provenance, actual_provenance)
AssertionError: 'name[16 chars]ema: (none)\ncatchall: boxed unboxed :: (Tenso[18 chars]0)\n' != 'name[16 chars]ema: test::foo(Tensor x, Tensor y) -> (Tensor)[53 chars]0)\n'
name: test::foo
- schema: (none)
+ schema: test::foo(Tensor x, Tensor y) -> (Tensor)
catchall: boxed unboxed :: (Tensor _0) -> (Tensor _0)
: expected from running ctors (1,); actual from running ctors (1,) and then failing to run ctor 0 (did this failure leave the dispatcher in a wedged state? it shouldn't!)
```
There are also C++ smoketests for the API. These tests comprehensively
cover the C++ API surface of the new operator registration API, but
don't check very hard if the API does the right thing (that's what
test_dispatch.py is for)
Some miscellaneous changes which could have been split into other
PRs, but I was too lazy to do so:
- Add torch::jit::parseName (mirroring parseSchema/parseSchemaOrName)
- Add cloneWithName functionality to FunctionSchema
- Unconditionally generate schema registration, even when type_method_dispatch
is a dict. The one exception is for manual registrations....
- Add fallback, CppFunction::makeFallthrough and
CppFunction::makeFromBoxedFunction to public API of op_registration, so we can
stop calling internal registerImpl directly
- Add new syntax sugar dispatch_autograd for registering autograd kernels
- Minor OperatorName cleanup, storing OperatorName in DispatchTable
and defining operator<< on OperatorName
- Refactored the op registration API to take FunctionSchema directly.
We now do namespacing by post facto fixing up the OperatorName
embedded in FunctionSchema. This also means that you can
now do torch::import("ns1").def("ns2::blah") and have the ns2
override ns1 (although maybe this is not the correct behavior.)
- New torch::schema public API, for attaching alias analysis kind
annotation kinds. This meant we had to template up some function
signatures which previously took const char*. There's now a nice
comment explaining this strategy.
- torch::import now takes std::string which means we can use
the namespacing from Python
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D20680520
Pulled By: ezyang
fbshipit-source-id: 5d39a28e4ec7c73fe4b1fb2222e865ab65e188f5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33636
Fixes https://github.com/pytorch/pytorch/issues/32119, https://github.com/pytorch/pytorch/issues/26116,
https://github.com/pytorch/pytorch/issues/33072
Makes RRef control messages idempotent and enables sending with retries for distributed autograd cleanup and RRef internal messages.
In order to effectively test whether these RRef and distributed autograd cleanup work with network failures/retries, I implemented an RPC Agent with a faulty send function, and enabled running tests using this as a third backend (in addition to Thrift and PGA). The tests using this backend are in a separate class (the test cases are similar but with minor changes to ensure short-running tests wait for retried RPCs to finish).
This faulty RPC Agent is pretty configurable. The tests can configure which messages types to fail, and how many messages to fail, but going forward, other RPC functionality can be overriden with faulty methods to test with failures injected.
Differential Revision: D20019236
fbshipit-source-id: 540a977e96b2e29aa0393ff12621fa293fe92b48
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34047
This PR integrates the added xnnpack conv2d and linear op via
custom class registration for packed weights. The packed struct
is serializable.
Test Plan:
python test test/test_xnnpack_integration.py
Imported from OSS
Differential Revision: D20185657
fbshipit-source-id: fc7e692d8f913e493b293b02d92f4e78536d7698
Summary:
Closes https://github.com/pytorch/pytorch/issues/30027
The idea here is that you can bind a function with `pybind11` in a single line and without modifying the function:
```cpp
m.def("foo", foo, py::call_guard<torch::PyWarningHandler>());
```
Where warnings are handled by the [`call_guard`](https://pybind11.readthedocs.io/en/stable/advanced/functions.html#call-guard) and exceptions are handled by the `pybind11` exception translator. To do this, I have added support for handling C++ exceptions in `torch::PyWarningHandler`'s destructor without setting the python error state before hand.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30588
Differential Revision: D19905626
Pulled By: albanD
fbshipit-source-id: 90c0a5e298b123cc0c8ab9c52c91be4e96ea47c6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29986
Previously in addition to generating a python binding for each op,
we would generate an almost-trivial helper for each overload.
This PR eliminates the helpers, simplifying codegen logic a bit and
reducing the source-level indirection by a step.
Perf should be unchanged.
codegen diff: 1f2f07fb60
Note: in the interests of keeping the diff contained, there's only
some light cleanup here beyond what's necessary for the codegen changes.
Plan is to do some more substantial refactoring in followup PRs that
leave generated code unchanged.
Test Plan: Imported from OSS
Differential Revision: D18567980
Pulled By: bhosmer
fbshipit-source-id: eb9a81babb4489abd470842757af45580d4c9906
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31117
After this diff, we will have completely removed the named tensor
feature flagging. This means that named tensors are always on and that
there is no mechanism to turn them off. There should be no more follow-up
diffs.
I performed the deletion of the header with
```
find . -type f -print0 | xargs -0 sed -i '/#include
<ATen\/core\/EnableNamedTensor.h>/d'
```
Test Plan: - wait for CI
Differential Revision: D18934952
Pulled By: zou3519
fbshipit-source-id: 253d059074b910fef15bdf885ebf71e0edf5bea5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31116
Changelist:
- remove BUILD_NAMEDTENSOR macro
- remove torch._C._BUILD_NAMEDTENSOR
- remove all python behavior that relies on torch._C._BUILD_NAMEDTENSOR
Future:
- In the next diff, I will remove all usages of
ATen/core/EnableNamedTensor.h since that header doesn't do anything
anymore
- After that, we'll be done with the BUILD_NAMEDTENSOR removal.
Test Plan: - run CI
Differential Revision: D18934951
Pulled By: zou3519
fbshipit-source-id: 0a0df0f1f0470d0a01c495579333a2835aac9f5d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30894
This PR begins the process of removing BUILD_NAMEDTENSOR macros. There
will be followups.
Reasons for removing the macros:
- BUILD_NAMEDTENSOR is always on and has been on since pytorch 1.3.0.
- Since we don't test building without it, it is useless to keep around.
- Code becomes nicer to read without the macros
Reasons for not removing the macros:
- potential for feature flagging
Now, I argue against needing to feature flag. The main reason why we
might want to feature flag is if we need to disable the feature.
We'd need a fast switch to disable the feature if someone discovers
in the future that named tensors caused some regression in some existing workflows.
In https://github.com/pytorch/pytorch/pull/25798, I did a variety of
macro- and micro- benchmarks to determine the performance impact of named
tensors on regular tensors.
[The
microbenchmarks](https://github.com/pytorch/pytorch/pull/25798#issuecomment-529014810)
were not very stable, and running the
microbenchmarks for more iterations doesn't actually help because the
noise is not distributed in a nice way. Instead of microbenchmarks I ran
a [profiler
(perf)](https://github.com/pytorch/pytorch/pull/25798#issuecomment-555707645)
to estimate how much overhead named tensors add to unnamed code. I
estimated the overhead to be less than 100ns for `add` and even smaller
for `mm`; there are ways to optimize even futher if we find this to be a
problem.
[Initial
macrobenchmarks](https://github.com/pytorch/pytorch/pull/25798#issuecomment-530539104)
were also not very stable. I ran imagenet for some number of epochs. To
make them more stable, I got rid of the data loading (which seemed to
vary between runs). [In some benchmarkers without data
loading](https://github.com/pytorch/pytorch/pull/25798#issuecomment-562214053),
we can see that the results are less noisy now. These results support
no noticeable regressions in speed.
Test Plan: - wait for CI
Differential Revision: D18858543
Pulled By: zou3519
fbshipit-source-id: 08bf3853a9f506c6b084808dc9ddd1e835f48c13
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29213
A trivial use of make_variable is one where requires_grad=False. This
transformation is not technically semantics preserving, as make_variable
will create a shallow copy of the tensor in question; however, I
am guessing that we have the invariant that we don't actually make
use of this shallow copy in a nontrivial way.
There were some cases where the surrounding code expected a Variable proper
to be returned; I retained those sites.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D18353503
Pulled By: ezyang
fbshipit-source-id: 57fe34d82e009c0cc852266fb0b79d6d9c62bb03
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26620
This change updates torch.backend.quantized.engine to accept string ("fbgemm"/"qnnpack"/"none" for now).
set_qengine and get_qengine return an int which represents the at::QEngine enum
Test Plan:
python test/test_torch.py
Imported from OSS
Differential Revision: D17533582
fbshipit-source-id: 5103263d0d59ff37d43dec27243cb76ba8ba633f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26135
This change adds the support to call QNNPACK using the refactored API for Linear operators (Fully Connected)
It also has certain cmake changes to enable builing and using pytorch_qnnpack inside aten
I have disabled USE_QNNPACK in CMakeLists.txt. Enabling it results in picking kernels from third_party/QNNPACK during runtime since the function names are the same.
Test Plan:
python test/test_quantized.py TestQNNPackOps.test_qlinear_qnnpack
Imported from OSS
Differential Revision: D17434885
fbshipit-source-id: 084698026938f4529f61d12e86dfe82534ec73dd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26060
This PR enables BUILD_NAMEDTENSOR by default. This is done via including
a header, `c10/core/EnableNamedTensor`, that sets `BUILD_NAMEDTENSOR`.
In the future, the plan is to get rid of the flag entirely: we can
incrementally delete usages after this PR goes in.
This PR also maintains the namedtensor ci vs regular ci distinction.
`test/test_namedtensor.py` only runs if TEST_NAMEDTENSOR=1 is specified.
TEST_NAMEDTENSOR=1 is set on the namedtensor ci. I'll remove this
distinction later and send out an announcement about it; devs will be
responsible for named tensor failures after that.
The initial reason why we had the BUILD_NAMEDTENSOR flag was so that we
could quickly prototype named tensor features without worrying about
adding overhead to the framework. The overheads can be categorized as
memory overhead and performance overhead.
Memory overhead: named tensors adds 1 additional word per Tensor. This
is because TensorImpl stores a `unique_ptr<NamedTensorMetaInterface>`
field. This is not a lot of overhead.
Performance overhead: At all entry points to name inference, we check
if inputs to an op are named. If inputs are not named, we short-circuit
and don't do name inference. These calls should therefore be as
efficient as error-checking code and not take up a lot of time.
My plan is to benchmark a few functions and then post the results in a
comment to this PR.
Test Plan: - [namedtensor ci]
Differential Revision: D17331635
Pulled By: zou3519
fbshipit-source-id: deed901347448ae2c26066c1fa432e3dc0cadb92
Summary:
Follow-up to gh-25483, more of the same fixes for warnings like:
```
../torch/csrc/autograd/python_variable.cpp:503:31: warning: cast between incompatible function types from ‘PyObject* (*)(THPVariable*)’ {aka ‘_object* (*)(THPVariable*)’} to ‘getter’ {aka ‘_object* (*)(_object*, void*)’} [-Wcast-function-type]
503 | {"_backward_hooks", (getter)THPVariable_get_backwards_hooks, (setter)THPVariable_set_backwards_hooks, nullptr, nullptr},
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
This takes the build log output for a full rebuild with GCC 9.1 from ~10,000 to ~7,000 lines.
`clang-tidy` is going to complain, no way around that - see discussion at the end of gh-25483.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26104
Differential Revision: D17396831
Pulled By: ezyang
fbshipit-source-id: d71696bfe4dbe25519e4bcb7753151c118bd39f7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25680
Add a runtime flag to choose between FBGEMM and QNNPACK when compiled with both.
The flag can be set by using torch.backends.quantized.engine = torch.fbgemm/torch.qnnpack or ctx::setPreferredQuantizedEngine(at::QEngine)
ghstack-source-id: 89935643
Test Plan: Verified torch.backends.quantized.engine works
Differential Revision: D17198233
fbshipit-source-id: e5449d06f4136385e0e6d18bd4237f8654a61672
Summary:
This PR is about add torch.backends.mkldnn.enabled flag said in https://github.com/pytorch/pytorch/issues/25186 which can be used disable mkldnn at runtime step as torch.backends.cudnn.enabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25459
Differential Revision: D17258926
Pulled By: ezyang
fbshipit-source-id: e179ad364cc608fdaa7d0f37e2e762ceb5eda598
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25352
It doesn't appear to be necessary anymore; assuming this works I'll kill the codegen in a follow-up PR.
Test Plan: Imported from OSS
Differential Revision: D17101573
Pulled By: gchanan
fbshipit-source-id: bd3d1724ee5c659185a161b1e291e30af52f0a8a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24875
As per https://github.com/pytorch/pytorch/issues/23110, each autograd pass
would be assigned a unique autograd_context_id. In this change we introduce a
DistAutogradContainer per worker which holds information for each autograd pass
currently running.
DistAutogradContainer has a map from the autograd_context_id to
DistAutogradContext (which holds all the relevant information for the autograd
pass). DistAutogradContext currently only stores the autograd_context_id and
more information would be added to it later as we build out the rest of the
framework.
The autograd_context_id is a 64 bit globally unique integer where the first 16
bits are the worker_id and next 48 bits are auto-incrementing for uniqueness.
Sample python code on how this would be used for distributed autograd:
```
import torch.distributed.autograd as dist_autograd
worker_id = 0
dist_autograd.init(worker_id)
with dist_autograd.context() as context_id:
# forward pass...
# backward pass...
# optimizer step...
```
ghstack-source-id: 89119248
Test Plan: unit tests.
Differential Revision: D16356694
fbshipit-source-id: d1a8678da0c2af611758dbb5d624d554212330ce
Summary:
https://github.com/pytorch/pytorch/pull/23228 caused build failure on OSX, because rpc.h is included as long as USE_DISTRIBUTED=1, but rpc/init.cpp (and others) is only included when NOT APPLE. So, it cannot find python_functions defined in init.cpp on MacOS. This PR attempt to fix it by wrapping rpc.h with USE_C10D, which is only set when NOT APPLE.
I tried this fix locally and it works.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23998
Differential Revision: D16706087
Pulled By: mrshenli
fbshipit-source-id: d04fe6717a181a3198289cdef51439708c2e291d
Summary:
Features:
* sync and async RPC for builtin operators
* RpcAgent API
* ProcessGroupAgent implementation
Goal:
* have a minimum working and testable RPC implementation
* make sure the RpcAgent API is sufficient for future ThriftAgent and TensorPipeAgent implementation
* For tensor pipe implementation, it might allocate multiple underlying communication channels with different types, and might also use streaming serialization/deserialization for large tensors. To support this requirement, the current implementation only convert a BuiltinOp into a Message which contains a byte vector and a tensor table. It is up to the RpcAgent implementation to determine how it would like to serialize a Message object.
* For ThriftAgent, as Thrift has it own request/response matching solution, the Message.id is no longer necessary. Hence the id can be dropped during serialization. All it needs to do is to pass the response Message object to the Future returned by send(...).
* support blocking and non-blocking RequestCallback
* blocking means the callback won't return before sending out the response
* non-blocking can be achieved by enqueue the `(from, request, RpcAgent&)` tuple and use a different thread to process them. That is why there is an `RpcAgent&` arg in the param list.
We are not exporting this diff until we finalize distributed autograd design and publish the API review publicly.
https://fb.quip.com/FabTAZKVgQpf
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23228
ghstack-source-id: 87816717
Reviewed By: zhaojuanmao
Differential Revision: D15194693
fbshipit-source-id: 7adb600796613cde6073db6c227451b89940ecaf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23623
This is a quick, not-user-facing check for if pytorch was built with BUILD_NAMEDTENSOR=1.
Test Plan:
- run tests [namedtensor ci]
gh-metadata: pytorch pytorch 23623 gh/zou3519/85/head
Differential Revision: D16621829
Pulled By: zou3519
fbshipit-source-id: d7e1161dc176bab2c1f953265722daeba1e63102
Summary:
we used to not print device when it's on xla. It's sometimes confusing as it looks the same as cpu tensor...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22094
Differential Revision: D15975405
Pulled By: ailzhang
fbshipit-source-id: f19ceb9e26f5f2f6e7d659de12716f0dfe065f42
Summary:
This is useful for measuring inference performance of your
models. This is a very basic benchmark for now. We don't support
batching on the benchmark side, no inter and intra op parallelizm is
supported yet, just caller based parallelizm.
Main phylosophy here is that user should be able to provide inputs
from python and just stack them within the benchmark. API should be
exactly the same as passing inputs to module.forward.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20766
Test Plan: Added a new unit test
Differential Revision: D15435461
Pulled By: salexspb
fbshipit-source-id: db08829dc3f4398bb1d8aa16cc4a58b6c72f16c6
Summary:
Resubmit #20698 which got messed up.
Idea is that when PyTorch is used in a custom build environment (e.g. Facebook), it's useful to track usage of various APIs centrally. This PR introduces a simple very lightweight mechanism to do so - only first invocation of a trigger point would be logged. This is significantly more lightweight than #18235 and thus we can allow to put logging in e.g. TensorImpl.
Also adds an initial list of trigger points. Trigger points are added in such a way that no static initialization triggers them, i.e. just linking with libtorch.so will not cause any logging. Further suggestions of what to log are welcomed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20745
Differential Revision: D15429196
Pulled By: dzhulgakov
fbshipit-source-id: a5e41a709a65b7ebccc6b95f93854e583cf20aca
Summary:
As part of the Variable/Tensor merge work: https://github.com/pytorch/pytorch/issues/13638, we make the following changes in this PR:
1. Remove the `Variable::Impl` class and the `DifferentiableViewImpl` class
2. Change all `Variable.data()` call sites to either use `Variable` directly, or use `Variable.tensor_data()`
3. Remove `Variable.data()` API
3. Add `Variable.variable_data()` that matches `tensor.data` in Python API, which creates a new `Variable` that shares the same storage and tensor metadata with the original `Variable`, but with a completely new autograd history.
After this PR, Variable doesn't wrap a Tensor internally anymore, and both Variable and Tensor use the same TensorImpl class as its `impl_`. The only difference is that Variable always has AutogradMeta in its TensorImpl, but Tensor doesn't.
**Note that this PR is BC-breaking in the following use cases:**
**Use Case 1:**
Previously, `x.data = y` works even if `x` and `y` are of different TensorImpl type (e.g. `x` is a CPU dense tensor whose impl is of type TensorImpl, while `y` is a CPU sparse tensor whose impl is of type SparseTensorImpl). However, after this PR, `x.data = y` doesn't work anymore if `x` and `y` are of different TensorImpl type, because the underlying implementation `variable.set_data(tensor)` no longer works if `variable` and `tensor` have different TensorImpl type.
**Use Case 2:**
If a tensor `x`'s `grad` is sparse, accumulating dense gradients to `x` will change the tensor that `x.grad` is pointing to. This is better illustrated with the following example:
```python
params = torch.tensor([1.5, 1.5]).requires_grad_()
with torch.no_grad():
# Change gradient to a sparse tensor
params.grad = torch.sparse_coo_tensor(torch.tensor([[1, 1]]).long(), torch.tensor([1., 1.]))
grad_saved = params.grad
params.backward(torch.tensor([1.5, 1.5]))
assert id(grad_saved) == id(params.grad) # This will fail after this PR
```
The assertion in the last line will fail after this PR, because adding dense gradients to sparse gradients will change the `params.grad` tensor reference.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17072
Differential Revision: D14075257
Pulled By: yf225
fbshipit-source-id: 0e681df641270dea586042dd26db59f2e76b5957
Summary:
#19975 was separated by 2 PRs.
This one:
Introduce MemoryFormat argument to the `x.is_contiguous(memory_format=torch.channels_last)` and to the `y = x.contiguous(memory_format=torch.channels_last)` functions.
At this moment both functions just operate with strides and doesn't store any tensor state.
(Original RFC #19092)
-----
Expands functionality of two tensor functions `.is_contiguous` and `.contiguous` (both python and c++ api).
Note: We had several complaints about `.to(memory_format)` function, and decided not to support it.
1. `.contiguous` now support optional keyword-only argument - `memory_format`, which can be either `torch.contiguous_format` or `torch.channels_last`.
- Using `torch.contiguous_format` will preserve existing `.contiguous()` behavior.
- Calling `x.contiguous(memory_format=torch.channels_last)` returns new tensor which maintain same semantical layout (NCHW), but have different memory allocation pattern.
`x.contiguous(memory_format=torch.channels_last)` expects input tensor to be 3d, 4d or 5d; and fails otherwise.
2. `.is_contiguous` now support optional keyword-only argument - `memory_format`, which can be either `torch.contiguous_format` or `torch.channels_last`.
- `x.is_contiguous(memory_format=torch.contiguous_format)` preserves same functionality as `x.is_contiguous()` and remains unchanged.
- `x.is_contiguous(memory_format=torch.channels_last)` returns true if A) input tensor is contiguous in memory AND B) allocated in the memory in NWHC (or similar for 3d,5d) format.
Note: By the end of the phase one `x.is_contiguous(memory_format=torch.channels_last)` will calculate state of the Tensor on every call. This functionality going to be updated later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20455
Differential Revision: D15341577
Pulled By: VitalyFedyunin
fbshipit-source-id: bbb6b4159a8a49149110ad321109a3742383185d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18362
ghimport-source-id: 374b7ab97e2d6a894368007133201f510539296f
Stack from [ghstack](https://github.com/ezyang/ghstack):
* #18242 Test running a CUDA build on CPU machine.
* **#18362 Add ability to query if built with CUDA and MKL-DNN.**
Fixes#18108.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D14584430
fbshipit-source-id: 7605a1ac4e8f2a7c70d52e5a43ad7f03f0457473
Summary:
This is the first commit from a series of planned changes in order to add boolean tensors to PyTorch. The whole plan looks like this:
0. Storage Implementation (this change)
1. Tensor Creation.
2. Tensor Conversions.
3. Tensor Indexing.
4. Tensor Operations.
5. Back compatibility related changes.
This feature was requested by the community:
https://github.com/pytorch/pytorch/issues/4764https://github.com/pytorch/pytorch/issues/4219https://github.com/pytorch/pytorch/issues/4288
**Change**:
Added boolean type to the Storage class for CPU and CUDA backends.
**Tested via**:
1. unit tests
2. running this:
-> import torch
-> torch.BoolStorage
<class 'torch.BoolStorage'>
-> torch.cuda.BoolStorage
<class 'torch.cuda.BoolStorage'>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16810
Reviewed By: gchanan
Differential Revision: D14087246
Pulled By: izdeby
fbshipit-source-id: 042642ced1cb0fd1bb6bff05f9ca871a5c54ee5e
Summary:
1. Added `torch/csrc/cuda/Event.h` and `torch/csrc/cuda/Event.cpp` to bind Python Event class to C++ implementation.
2. Move all CUDA runtime invocations from `torch/cuda/streams.py` to C++
3. Added tests to cover Stream and Event APIs. ~(event IPC handle tests is introduced in #15974)~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15937
Differential Revision: D13649001
Pulled By: mrshenli
fbshipit-source-id: 84ca58f35f6ba679a4ba33150ceba678d760d240
Summary:
This PR enables C++ frontend modules to be bound into Python and added as submodules of Python modules. For this, I added lots of pybind11 bindings for the `torch::nn::Module` class, and modified the `torch.nn.Module` class in Python to have a new Metaclass that makes `isinstance(m, torch.nn.Module)` return true when `m` is a C++ frontend module. The methods and fields of C++ modules are bound in such a way that they work seamlessly as submodules of Python modules for most operations (one exception I know of: calling `.to()` ends up calling `.apply()` on each submodule with a Python lambda, which cannot be used in C++ -- this may require small changes on Python side).
I've added quite a bunch of tests to verify the bindings and equality with Python. I think I should also try out adding a C++ module as part of some large PyTorch module, like a WLM or something, and see if everything works smoothly.
The next step for inter-op across our system is ScriptModule <-> C++ Frontend Module inter-op. I think this will then also allow using C++ frontend modules from TorchScript.
apaszke zdevito
CC dzhulgakov
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13481
Differential Revision: D12981996
Pulled By: goldsborough
fbshipit-source-id: 147370d3596ebb0e94c82cec92993a148fee50a7
Summary:
Anywhere we used #include "foo.h", we now say #include <foo.h>
Paths are adjusted to be rooted out of aten/src, torch/lib, or
the root level directory.
I modified CMakeLists.txt by hand to remove TH and THC from
the include paths.
I used the following script to do the canonicalization:
```
import subprocess
import re
import os.path
files = subprocess.check_output(['git', 'ls-files']).decode('utf-8').rstrip().split('\n')
for fn in files:
if not any(fn.endswith(suff) for suff in ['.cu', '.cpp', '.in', '.h', '.hpp', '.cu', '.cuh', '.cc']):
continue
if not any(fn.startswith(pref) for pref in ["aten/", "torch/"]):
continue
with open(fn, 'r') as f:
c = f.read()
def fmt(p):
return "#include <{}>".format(p)
def repl(m):
p = m.group(1)
if p in ["dlfcn.h", "unistd.h", "nvrtc.h", "cuda.h", "cuda_runtime.h", "cstdint", "cudnn.h", "Python.h", "cusparse.h", "cuda_runtime_api.h", "cuda_fp16.h", "cublas_v2.h", "stdint.h", "curand_kernel.h"]:
return fmt(p)
if any(p.startswith(pref) for pref in ["torch/csrc", "c10/", "ATen/", "caffe2/", "TH/", "THC/", "Eigen/", "gtest/", "zdl/", "gloo/", "onnx/", "miopen/"]):
return fmt(p)
for root in ["aten/src", "torch/lib", ""]:
for bad_root in [os.path.dirname(fn), "aten/src/TH", "aten/src/THC", "torch/csrc"]:
new_p = os.path.relpath(os.path.join(bad_root, p), root)
if not new_p.startswith("../") and (os.path.exists(os.path.join(root, new_p)) or os.path.exists(os.path.join(root, new_p + ".in"))):
return fmt(new_p)
print("ERROR: ", fn, p)
return m.group(0)
new_c = re.sub(r'#include "([^"]+)"', repl, c)
if new_c != c:
print(fn)
with open(fn, 'w') as f:
f.write(new_c)
```
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14849
Reviewed By: dzhulgakov
Differential Revision: D13363445
Pulled By: ezyang
fbshipit-source-id: 52361f878a672785f9306c9e9ab2513128092b68
Summary:
If torch.multiprocessing.spawn is used to launch non-daemonic
processes (the default since #14391), the spawned children won't be
automatically terminated when the parent terminates.
On Linux, we can address this by setting PR_SET_PDEATHSIG, which
delivers a configurable signal to child processes when their parent
terminates.
Fixes#14394.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14491
Differential Revision: D13270374
Pulled By: pietern
fbshipit-source-id: 092c9d3c3cea2622c3766b467957bc27a1bd500c
Summary:
Add to the Tensor doc info about `.device`, `.is_cuda`, `.requires_grad`, `.is_leaf` and `.grad`.
Update the `register_backward_hook` doc with a warning stating that it does not work in all cases.
Add support in the `_add_docstr` function to add docstring to attributes.
There is an explicit cast here but I am not sure how to handle it properly. The thing is that the doc field for getsetdescr is written as being a const char * (as all other doc fields in descriptors objects) in cpython online documentation. But in the code, it is the only one that is not const.
I assumed here that it is a bug in the code because it does not follow the doc and the convention of the others descriptors and so I cast out the const.
EDIT: the online doc I was looking at is for 3.7 and in that version both the code and the doc are const. For older versions, both are non const.
Please let me know if this should not be done. And if it should be done if there is a cleaner way to do it !
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14339
Differential Revision: D13243266
Pulled By: ezyang
fbshipit-source-id: 75b7838f7cd6c8dc72b0c61950e7a971baefaeeb
Summary:
This is the next minimal step towards moving _C into cmake. For now,
leave _C in setup.py, but reduce it to an empty stub file. All of its
sources are now part of the new torch-python cmake target.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12742
Reviewed By: soumith
Differential Revision: D13089691
Pulled By: anderspapitto
fbshipit-source-id: 1c746fda33cfebb26e02a7f0781fefa8b0d86385
Summary:
There are still a few work to be done:
- Move logging and unify AT_WARN with LOG(ERROR).
- A few header files are still being plumbed through, need cleaning.
- caffe2::EnforceNotMet aliasing is not done yet.
- need to unify the macros. See c10/util/Exception.h
This is mainly a codemod and not causing functional changes. If you find your job failing and trace back to this diff, usually it can be fixed by the following approaches:
(1) add //caffe2/c10:c10 to your dependency (or transitive dependency).
(2) change objects such as at::Error, at::Optional to the c10 namespace.
(3) change functions to the c10 namespace. Especially, caffe2::MakeString is not overridden by the unified c10::str function. Nothing else changes.
Please kindly consider not reverting this diff - it involves multiple rounds of rebasing and the fix is usually simple. Contact jiayq@ or AI Platform Dev for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12354
Reviewed By: orionr
Differential Revision: D10238910
Pulled By: Yangqing
fbshipit-source-id: 7794d5bf2797ab0ca6ebaccaa2f7ebbd50ff8f32
Summary:
To illustrate the benefits of this commit, I'll use the time/iter I got from one of the JIT benchmarks on my machine.
| Run | Time |
|----------------------------------------------|-------------------------|
| No profiler | 45ms |
| With profiler | 56ms |
| Use `clock_gettime` instead of `std::chrono` | 48ms |
| Touch all pages on block allocation | 48ms (less jitter) |
| Use `const char*` instead of `std::string` | 47ms (even less jitter) |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11773
Differential Revision: D9886858
Pulled By: apaszke
fbshipit-source-id: 58f926f09e95df0b11ec687763a72b06b66991d0
Summary:
Will use USE_DISTRIBUTED for both c10d and THD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11237
Differential Revision: D9647825
Pulled By: teng-li
fbshipit-source-id: 06e0ec9b5e2f8f38780fc88718f8499463e9e969
Summary:
This was lingering after #10731.
cc orionr
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11240
Differential Revision: D9645437
Pulled By: pietern
fbshipit-source-id: d02c33354b094be3bb0872cf54a45721e20c4e7d
Summary:
How did we get so many uses of `NULL` again?
ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11047
Differential Revision: D9566799
Pulled By: goldsborough
fbshipit-source-id: 83469f352ac69aa65bdaf1a1a21f922d892e0db3
Summary:
Currently our `skipIfLapack` has uses a try-catch block and regex match the error message. It is highly unreliable. This PR adds `hasLAPACK` and `hasMAGMA` on ATen context, and expose the flags to python.
Also fixes refcounting bug with `PyModule_AddObject`. The method steals reference, but we didn't `Py_INCREF` in some places before calling it with `Py_True` or `Py_False`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11024
Differential Revision: D9564898
Pulled By: SsnL
fbshipit-source-id: f46862ec3558d7e0058ef48991cd9c720cb317e2
Summary:
These could use some autograd tests, which are coming in a later PR, but using them in autograd is probably pretty rare.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9947
Reviewed By: ezyang
Differential Revision: D9032778
Pulled By: gchanan
fbshipit-source-id: fa5a6509d3bac31ea4fae25143e82de62daabfbd
Summary:
Usually DLPack consumer is expected to call DLManagedTensor's
deleter to signal that it doesn't need the contents.
This patch calls the deleter when freeing unconsumed
DLPack capsules created by PyTorch.
Test script:
```
import torch
import torch.utils.dlpack
import gc
for i in range(10000):
a = torch.randn(1000,1000, dtype=torch.float32, device='cuda')
b = torch.utils.dlpack.to_dlpack(a)
gc.collect()
```
Before patch: consume all GPU ram.
After patch: constant GPU ram consumption.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9297
Differential Revision: D8781571
Pulled By: soumith
fbshipit-source-id: 2ebadec6c857646220d632ca64110af430dbd52f
* Bag of fixes
* Rename tensor_range.h to tensor_list_view.h
* Post rebase fixes
* Rename torch::tensor namespace to torch::tensors due to name conflict
* Avoid recursion in Module::to
* Some 0-sized dimension support, port catArray away from resizeLegacy.
The goal of this PR is to port catArray away from resizeLegacy (so we can delete the legacy resize calls), but since catArray has some weird behavior because
we don't have arbitrary 0-sized dimension support, I made some effort to fix these both in one pass.
The major changes here are:
1) catArray uses the new resize API, no longer the old resizeLegacy API.
2) As 1) is the last usage of resizeLegacy, it is deleted.
3) If compiled with USE_TH_SIZE_ZERO_DIM, catArray will work and properly check shapes for n-dimensional empty tensors.
4) However, we retain the old behavior of "ignoring" size [0] tensors in catArray. We previously allowed this because we didn't have n-dimensional empty tensors.
5) To get the above to work, we also add support for n-dimensional empty tensors for narrow and slice (ifdef USE_TH_SIZE_ZERO_DIM).
6) We change the stride formula for empty tensors to match NumPy; basically, we never multiply by 0 as the size, always at least 1, so the
strides are monotonically increasing in the empty tensor case.
7) We print the size of empty tensors if size != [0]; this matches NumPy behavior (even in cases where the size could be inferred from the brackets.
8) For test purposes, we add torch._C._use_zero_size_dim() to add tests for the above.
* Fix flake8.
* Address review comments.
* Build and install c10d from tools/build_pytorch_libs.sh
* Create initial Python bindings for c10d
* clang-format
* Switch link order to include more symbols
* Add bindings and tests for ProcessGroupGloo
* Add broadcast test
* Separate build flag for c10d
* Explicit PIC property
* Skip c10d tests if not available
* Remove c10d from Windows blacklist
Let it skip by itself because it won't be available anyway.
* Make lint happy
* Comments
* Move c10d module into torch.distributed
* Close tempfile such that it is deleted
* Test if ASAN is actually working as part of ASAN tests.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Drop explicit use of libstdc++, we should not care.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Build with DEBUG=1
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Increase main thread stack size when using ASAN.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
This makes the JIT tracer much more robust, by allowing it to record
dependencies on tensor sizes. For example, if you were to trace this
function
def fn(x):
return x.view(x.size(1), -1)
before this patch, then it would embed the actual value of x.size(1)
in the trace as a constant, making it very hard to have e.g. batch size
independent traces. Now, this will correctly record the dependency, and
will retrieve the size of x at every run.
* Split set_default_tensor_type(dtype) into set_default_dtype(dtype).
* Fix flake8.
The difference between this one and set_default_tensor_type is that it only sets scalar type what determines the type + device of a tensor returned from a factory function with defaults is the default tensor type + the current device (if the default tensor type is cuda). This just changes the scalar type of the default tensor type.
We do eventually want to deprecate set_default_tensor_type; it is not clear how to do that in a sensible and backwards compatible way.
* Separate cuda-ness from dtype.
There are no longer torch.cuda.int64, etc; only torch.int64 that correspond to at::ScalarType.
At the python arg parser level, the corresponding ATen type is selected from the combination of (ScalarType, Layout, Device).
There is also currently unused code in here for support ScalarType in native_functions; this will be used for specifying aggregate types
on reduction functions.
* Fix test_autograd.
* Add defaults to randint_like.
* Track is_cuda in py tensor types.
* Fix test_sparse.
* Fix multiprocessing.
* Fix rnn.
* Fix test_nn.
* Fix flake8.
* Add string-style devices to all tensors.
Previously, tensors only had a 'get_device' method which would throw an exception on a CPU tensor. This made it necessary to if/else code that
was meant to be device agnostic.
This PR implements the following:
1) Adds a 'device' property to all tensors that returns a string representation of the device for all tensors.
For cpu tensors this is 'cpu'. For cuda tensors this is 'cuda:X', where X is the cuda device ordinal.
2) Adds a DeviceSpec class. This is just a helper class for separating device_type and device_index specification and to allow partial specification.
For example, you can call DeviceSpec('cuda'), DeviceSpec('cuda:0'), DeviceSpec('cuda', 1).
Also has backwards compatibility support for specifying integers, which are treated as cuda devices.
DeviceSpecs have the following properties:
a) device_type: string representation of the device type (i.e. 'cpu' or 'cuda')
b) device_index: integer for the device index (None if not specified)
c) cuda_device_index: for backwards compatibility; behaves roughly like `get_device` did previously. I.e. if a function previously took integers for cuda devices,
it can now take DeviceSpecs (or strings), and can maintain the old functionality by calling `old_index = DeviceSpec(old).cuda_device_index`.
3) tensor methods and torch. functions that took integer devices can now take integers, strings, or DeviceSpecs. For example:
torch.randn((2,3), dtype=torch.cuda.float32, device='cuda:1')
TODO in future PRs:
A) Split out cuda from dtype so you don't need to overspecify cuda-ness
B) We currently only support strings/DeviceSpecs in tensor methods and torch. functions. We should have equivalents torch.cuda.device(...), torch.cuda.device_of, etc.
at the torch. level that work on strings/DeviceSpecs
* Add deviceInt64 to python arg parser.
* device_str.
* Remove device_str.
* remove device prefix from attributes.
* Use const char * instead of string.
* Move autogpu index out of Device.
* comment on is_default.
* Rename torch.DeviceSpec to torch.device.
* comment.
* Fix tests.
* Fix flake8.
* Fix sparse_coo_tensor parameter name.
* Improve error message.
* Remove device_ prefix from C++ device object.
* Allocate static strings.
* Return not implemented from rich compare.
* Move torch::Device to THPDevice.
* Remove cuda index.
* Py_RETURN_NOTIMPLEMENTED doesn't exist in python2.
This changes type(tensor) to return `torch.Tensor` instead of
`torch.autograd.Variable`.
This requires a few implementation changes:
- torch.Tensor is now a regular Python class instead of a
pseudo-factory like torch.FloatTensor/torch.DoubleTensor
- torch.autograd.Variable is just a shell with a __new__ function.
Since no instanes are constructed it doesn't have any methods.
- Adds torch.get_default_dtype() since torch.Tensor.dtype returns
<attribute 'dtype' of 'torch._C._TensorBase' objects>
We had a bug in the Buck build of PyTorch due to symbols from _C
being present in two shared libraries that were both loaded at
runtime. This caused global variables to be initialized twice and
destructed twice on exit. The second destruction often caused
segfaults on exit.
This attempts to detect that sort of situation early on. If
Module.cpp is compiled twice, the symbol
pytorch_duplicate_guard()::initialized will be shared. The second
initialization will print an error message and abort.
* Introduce torch.layout and split layout from dtypes.
Tensors (and tensor types) now have a 'layout' attribute that returns either 'torch.strided' or 'torch.sparse_coo'.
Previously, dtypes were 1-to-1 with ATen types/PyTensorTypes; the impetus behind this decision was to make things easy in the common case
(i.e. specifying a type in a factory function). But this doesn't really follow for sparity, which isn't a common case.
It also doesn't properly represent the concept or a dtype, which in numpy are proper scalar types (i.e. roughly the type returned from indexing the
last dimension of an n-d array). But this should be the same whether or not the tensor is represented via strides, sparsity, etc.
This is accomplished by:
1) having the dtype of tensor return the (device-type, scalar-type) combination, i.e. torch.cuda.float32, so both
torch.cuda.FloatTensor and torch.cuda.sparse.FloatTensor have the same dtype
2) Adding a layout parameter to python functions, where the combination of (dtype, layout) maps to an ATen type that is used for dispatch.
* Formatting, make init throw python_error.
* Fix cuda not enabled error message.
* Fix test.
This is the first of three PRs that #5537 will be split into.
This PR adds mkl headers to included files, and provides helper functions for MKL fft and cuFFT.
In particular, on POSIX, headers are using mkl-include from conda, and on Windows, it is from a new file @yf225 and I made and uploaded to s3.
* add mkl-include to required packages
* include MKL headers; add AT_MKL_ENABLED flag; add a method to query MKL availability
* Add MKL and CUFFT helpers
- Remove some uses of mega-header THP.h
- Use HANDLE_TH_ERRORS in functions that may throw
- Move NumPy includes to common header
- Delete unused allocator
* Add dtype to torch.Tensor, torch.FloatTensor, etc.
* Support passing dtypes to set_default_tensor_type.
* Check dtype exception.
* Correctly handle new type initialization order.
* Move handling of torch.Storage alias to C++.
* Delete function that erroneously reappeared.
This deletes most of the dead Tensor code paths, including the TensorMethods cwrap and generic/Tensor.cpp.
This also moves the THNN.cwrap/.cpp generation to generate_code which can use ninja if installed.
This replaces the torch.Tensor constructors with factories that produce
Variables. Similarly, functions on the torch module (e.g. torch.randn)
now return Variables.
To keep the PR to a reasonable size, I've left most of the unused tensor
code. Subsequent PRs will remove the dead code, clean-up calls to
torch.autograd.Variable, and rename Variable to Tensor everywhere.
There are some breaking changes because Variable and Tensors had
slightly different semantics. There's a list of those changes here:
https://github.com/pytorch/pytorch/wiki/Breaking-Changes-from-Variable-and-Tensor-merge
* Add numpy-style dtypes to Variable factories.
1) Add numpy-style dtypes corresponding to torch tensor types. These are:
torch.float16, torch.float32, torch.float64, torch.uint8, torch.int8, torch.int16, torch.int32, torch.int64
as well as torch.cuda, torch.sparse, and torch.cuda.sparse equivalents.
2) Adds "legacy" names for the above dtypes that correspond more closely to existing tensor names. These are:
torch.half, torch.float, torch.double, torch.short, torch.int, torch.long.
torch.byte and torch.char don't exist because they either don't match numpy semantics or differ on different architectures.
3) Adds a "dtype" parameter to Variable factories (e.g. zeros, ones) that allows the user to specify the type without changing the default tensor type.
4) Adds a "dtype" getter to Variables that return the canonical dtype from 1)
This PR is missing the following useful features that should be added in the future:
A) We only add the "dtype" parameter to auto-generated factories; hand-written factories like in tensor_new.cpp don't support this yet.
B) We don't allow type conversions to use dtypes; that should be added to type(param) or a new function.
C) We don't yet have a "device" parameter for these factories; right now, they will only create Variables on the default device.
* backend_to_string can be private.
* Define python binding argument indexes in a more simple way.
* add all_declared_types, still need to hook it up to THPDType.
* Fix all_declared_types for missing types (it's Sparse + Half).
* Ensure cuda dtypes are created even if compiled with NO_CUDA=1.
* Fix case where dtype is provided but dispatch is via namespace.
This happens in ones_like, empty_like, randn_like.
There is some question if we should do:
1) at::ones_like(tensor).toType(dtype)
2) at::ones_like(tensor.toType(dtype))
I did the former because this matches with the numpy documentation, i.e.:
"Overrides the data type of the result." and it's easier to implement.
Note that the above causes an extra copy, either of the input or output.
Here's a better implementation:
1) Make zeros_like, ones_like native functions that take an optional type (named dtype?).
2) Match the type argument with the dtype, so we don't have two different parameters.
3) Call at::zeros_like(input, type) -> at::native::zeros_like(input, type) -> type.zeros(input.sizes())
* Don't return from maybe_initialize_cuda.
* Don't leak DType name.
* Address cpp review comments.
* Share code between sparse and non-sparse test_dtypes.
* Rewrite _like functions as native function with explicit type parameter.
* Use type 'Type' instead of 'dtype' for consistency.
* Address review comments.
* Handle arg_idx when there is requires_grad but no dtype in python_binding_arguments.
* Enable scalars if compiled with WITH_SCALAR environment variable.
We are pretty close to enabling scalars (0-dimensional arrays); this allows turning them on
for development purposes and to be able to write code that works both with and without scalars enabled.
WITH_SCALARS is currently broken with distributions, but should work for test_torch, test_autograd, test_nn.
* Fix unsqueeze.
* Fix wrap dim, wrapping with Scalar.
* Use ATen infer_size implementation rather than TH.
The only substantitive difference between the two implementations is in how empty sizes are handled;
in ATen these are treated as scalars (i.e., can be expanded to anything), whereas in TH they are treated
as a special case of empty tensors (i.e., can't be expanded to anything). Therefore, this change is
necessary to support scalars (0-dimensional tensors). We could also take a bool parameter for determining
how we treat empty tensors but this seems unnecessary: if one tries to expand an empty tensors (as a result
of an infer_size calculation), the expansion will fail.
* Make changes for review.
* Attempt to fix windows build.
* long -> int.
This adds overrides in VariableType for the xxx_out ATen functions and
implements Python bindings. There is no support for automatic
differentiation. If any of the inputs (or outputs) requires grad, then the
function will throw an exception unless it's running in "no-grad" mode.
The bindings for calling torch.xxx functions on Variables are moved to a
different object. Previously, they were static method on VariableBase.
This change prevents users from accidentally calling static methods as if
they were instance methods.
* Fix the inconsistency of `polygamma` on Tensor and Variable.
Signed-off-by: HE, Tao <sighingnow@gmail.com>
* Regression test for #4466, polygamma works on variables.
Signed-off-by: HE, Tao <sighingnow@gmail.com>
* Add macro IMPLEMENT_STATELESS_SWAP to dispatch stateless methods on Variables correctly.
When call stateless methods with more than one arguments and the `self` comes second,
the `self` argument needs to be swapped to the first position before dispatching.
The macro `IMPLEMENT_STATELESS_ADDXX` is still reserved for deprecated `add**`
methods.
Signed-off-by: HE, Tao <sighingnow@gmail.com>
- Rename THNN convolution to have thnn_ prefix.
- Propagate CuDNN benchmark and deterministic to at::Context
- Add 'convolution', 'convNd' and 'conv_transposeNd' native wrappers, with defaults
The conv_transposeNd wrappers are updated to have the same argument
order as Python.
- torch.nn.functional directly dispatches to the native wrappers
- Make it possible to turn off tracing for some native wrappers, so I don't
have to write symbolics for all the functions above
- Spectral ops can now make use of CuDNN convolution if possible
- Better commentary on cudnn_batch_norm
- Turn on DCE for all JIT tests.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Refactor cudnn code layout / make build more robust.
When I previously moved cuDNN into ATen, I wasn't too familiar with the
ATen native function directory layout, and so I did a number of
suboptimal things. This commit fixes those problems.
- If NO_CUDA was set but cuDNN is installed on your system, we'd incorrectly
assume that CUDNN was enabled, to hilarious effect.
- We now distinguish between cudnn implementation files and cudnn
native function files. The native files now live in ATen/native/cudnn,
and are *unconditionally compiled*, even when we are not building with cuDNN.
This means that we can unconditionally declare cudnn functions in yaml
and they are always available, even if they are broken. The cuDNN specific
files live in 'cudnn', they are *never* installed, and they are used
purely for implementation purposes. I had to add stub implementations of
all ATen functions to achieve this.
- I had written headers for at::native functions manually, but codegen
will generate them for me automatically. So I deleted the headers.
That lets me get rid of some header install logic as well.
- There's a new note about ATen preprocessor philosophy.
This removes volatile from Variable. The functionality is mostly
replaced by a global (thread-local) flag, which is controlled by
torch.set_grad_enabled() and the context manager torch.no_grad().
In C++, the flag is exposed through GradMode::is_enabled() and GradMode::set_enabled()
Fixes#3627
* Support CPU Apply directly in ATen and implement standard_gamma using it.
Main changes in this PR:
1) Added a TH_APPLY-style templatized function for CPU apply calls (currently only 2 and 3 tensor argument
versions are supported, but more are easy to add). In fact, this is basically identical to TH_APPLY, except
it uses ATen functions and the API is a template instead of a macro. The template takes an operation that
is performed on the data (and an indicator to signal early termination); i.e. you don't need to know that
x_data is a pointer to the current data location of x.
2) Refactors the ATen dispatch code to easily generate dispatch code for different subsets of the scalar types.
This is in preference to the template_scalar path, which requires valid specialization of each scalar type. Valid
specializations are particularly annoying with CUDA because you most likely can't put the specializations
in a header so need to write some sort of for-all-scalar-type macro to get the correct specializations.
Currently, we only generate dispatch_all (all scalar types, the equivalent existed already), and
dispatch_cpu_floating_types (which is used by standard_gamma).
3) Implements standard_gamma using the above changes (this is an arbitrary choice, it was the latest
apply macro to be committed). The forward is bound via Declarations.yaml,
the backward via the Apply template, and then they are hooked together in derivatives.yaml. This eliminates
needing to change TH at all going forward, which means one can write idiomatic C++ instead of the TH-style macros
(e.g. TH_MATH_NAME).
* Generate Dispatch code with nicer spacing.
* Small cleanups.
* Fix typo.
* Add TODOs for changing macros, remove dead code.
* Use a lambda function.
* Get rid of early exit.
* Rename Scalar,ScalarType template parameters to CScalar.
* Reorder _standard_gamma_grad parameters.
* Add comments explaining calling convention.
* Don't generate Dispatch.h anymore.
* Get rid of backend specific checks in dispatch.
* Fix empty/scalar check.
This is not currently used by anything, but eventually ATen
will need to make decisions about whether or not to use
CuDNN functions or not, which means we need to propagate
this variable to ATen.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Bind cauchy_, exponential_, normal_, uniform_ functions to THPVariable.
Also changes the error messages around Generator parser; previously, you'd get an error
like: torch._C.Generator is not a torch.Generator; now the check is proper but returns
that only None is supported.
* Support passing Generators to ATen Variable-bound methods.
This involves changing THPGenerator to have an at::Generator rather than a THGenerator.
TH getRNGState, setRNGState are still called directly because they are not bound from ATen yet;
they should probably be on the Generators and return (opaque) GenerateState objects.
* Fix default values.
* Properly use THRandom_initialSeed.
* update standard gamma to use new default generator.
Implements from_numpy using ATen tensors. Variable.from_numpy is a
convenient placeholder for the variant that returns Variables until we
merge Tensor and Variable.
The behavior is slightly changed:
- from_numpy() on an empty array now returns an empty tensor instead of
throwing an exception. The shape may not be preserved.
- CharTensor(ndarray) used to throw an exception. It now copies the
ndarray. Copying is implemented via ATen toType.
* Comprehensive rewrite of Torch CuDNN bindings / a bit of ATen infra
The executive summary is that this moves the torch/csrc/cudnn
library into ATen, adding a number of new cudnn_ methods to ATen
for batchnorm, convolution, affine grid generator and grid sampler.
ATen infra changes:
- TensorGeometry was moved to ATen
- TensorGeometry was modified to make its interface resemble that of
Tensor; in particular, sizes is no longer a field, it's a method.
- AT_CUDA_ENABLED macro is set via ATen/Config.h header which is
generated at cmake configure time.
Fixes https://github.com/zdevito/ATen/issues/168
- Change AT_CUDA_ENABLED macro to be a function macro, so that we
error if it is not defined
- Introduce a new TensorArg class, which is a Tensor plus a little
metadata. This helps us give good error messages when checking
dimensions/shapes of tensors.
Fixes https://github.com/zdevito/ATen/issues/169
- Also introduce a TensorGeometryArg class, for when you don't
need the actual tensor data (which is most of the time.)
- Add ATen/Check.h, which contains a number of utility functions
for testing shapes, types and devices of input tensors. This
will be particulary useful for native methods, which don't get
code generated input testing code. These functions take a
'CheckedFrom' argument, at the moment just a string, which
specifies some extra information about what function was
doing the actual checking; this greatly improves error messages.
- Many check functions take initializer lists, which let you
test that all tensors have some property. This API is
peculiar, in that we IGNORE undefined tensors in this case.
This is handled by filterDefined.
- Add AT_CUDNN_ENABLED macro
- CuDNN linking from ATen was improved; for example, we now actually
add the CuDNN headers to our include path.
- Add some missing override specifiers to some methods
- We now actually build tests with CUDA functionality accessible
(previously, AT_CUDA_ENABLED was not defined, meaning that
the headers were missing all CUDA-only functionality.)
- Native functions now support giving explicit names to return
outputs in yaml. This makes it possible to hook into the NN
autogenerated derivatives codepath using native functions.
CuDNN rewrite changes:
- torch/csrc/cudnn now uses ATen (rather than passing around
THVoidTensor) and lives in ATen. This lets us remove tensorPointer
shenanigans. The functions are exposed to ATen as native functions
described in aten/src/ATen/cudnn/cuDNN.yaml
- ATen now builds and links against CuDNN when enabled. The cmake
package script was taken from Caffe2.
- Some header reorganization was done to help reduce dependencies
on headers (this reorg is no longer used but I've kept it)
- Rename CHECK to CUDNN_CHECK
- Rip out old shape/type testing code in favor of modern ATen/Check.h
interface using TensorArg. In many cases, increase the robustness of
the checking code.
- Change the inputs of the public facing functions, so that they can
be bound by ATen
- Delete THCState*; this is retrieved from the global ATen context
- Delete cudnnHandle_t, this is retrieved from the global Handles.h
- Delete cudnnDataType_t, this is retrieved from the Tensor type
- Delete Convolution class, instead its constituent arguments are
passed individually
- Change functions to return tensors, rather than take an appropriately
sized output tensor as an input.
- Redo how transposed convolution / backward convolution is implemented
(knock on effect of returning tensors). Previously it was assumed
that you would always pass an appropriately sized output tensor, but
we don't want to do this anymore. For backwards, we instead give
the desired output tensor (input, really) size, because that is
readily available. For *transposed* convolution, however, we take
output_padding, and otherwise do the shape calculation.
- Redo how legacy group convolution is implemented (knock on effect from
porting cudnn to ATen.) Previously, group convolution was implemented
by manually constructing sizes and strides and then outputting
appropriate, with macros switching between individual groups and
all-at-once based on CuDNN version. Now, the code looks exactly what
you'd expect: there's a top-level wrapping function that supports
group convolution no matter the version of CuDNN, and a low-level
wrapper which supports only what CuDNN supports. The top-level
function conditions on CuDNN version, and invokes the low-level
interface 1 or n times.
- There is now a debugging printer for tensor descriptors.
- Convolution struct is replaced with ConvolutionArgs, which is not
part of the public API but is used internally to conveniently
pass around all of the arguments needed for Convolution.
- Add some constexprs for well-known dimensions, reduce amount of
magic numbers in code.
- Put 'deterministic' in to ConvParams. Fixes#3659
- Lots more comments.
- Some pessimizations, in the name of code clarity:
- The descriptors are initialized on every invocation of convolution
forward/backward. Previously, the descriptors were cached, so that
you didn't have to initialize them again on backwards. This is
difficult to support in the ATen interface so I didn't support it.
- Legacy group convolution initializes its workspace for *every* group
it performs. I did not feel motivated to fix this because the
legacy codepath is already quite slow.
- Affine grid generator and grid sampler automatically call contiguous
on their arguments as necessary.
- Batchnorm input checking is greatly beefed up, it now checks for
the following input characteristics:
- Definedness
- GPU location
- Type
- Contiguity
- Size
PyTorch binding code changes
- batchnorm now uses consistent var/data naming
- batchnorm and convolution make use of new ATen bindings
- Affine grid generator and grid sampler make use of ATen CuDNN
bindings via derivatives.yaml. This means I had to restructure
the code a little, since the THNN bindings still go through
a legacy Python class.
- I fixed some warnings:
- s/friend class/friend struct/ on InterpreterStateImpl
- Removed pessimizing move 'detached' in torch/csrc/autograd/variable.cpp
- Removed unused pack_list on Scalar
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
GCC 4.8 buildfix
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Add TensorGeometry to ATen.h
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
CUDNN_CHECK
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Update TODO comment
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Delete return in cudnn_grid_sampler
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
s/cudnnSetStreamToCurrent/setCuDNNStreamToCurrent/g
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Don't allocate a new vector when filtering defined.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Remove Check overloads, convert to pass references.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Some more microbenchmarking.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Add torch.take and Tensor.put_
These are similar to numpy.take and numpy.put. The take function allows
you to linearly index into a tensor without viewing it as a 1D tensor
first. The output has the same shape as the indices. The put function
copies value into a tensor also using linear indices.
This includes some changes to the dispatch code for torch.xxx functions:
- Since Variable.addmm is an instance-method, the self argument has to
come first. The dispatch code swaps the first two arguments if
necessary to suppor the deprecated signatures where 'alpha' or 'beta'
comes before the 'self' tensor.
- Delete IMPLEMENT_STATELESS_REVERSED. These functions require output
arguments to be passed in using the keyword 'out'. They were meant to
handle torch.gt(out, a, b), but we haven't allowed that for a while.
ATen has it's own default CPU RNG. Use this as the default in PyTorch so
that random functions called through ATen have the same behavior as
random functions called through TensorMethods
Along the way I added converters for Variable and TracingInput. Variable should
probably be moved to a more widely known spot.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Previously, our AST was a DAG, where shared Nodes indicated a computation
should be reused. This commit rewrites the IR into a new functional
representation which represents sharing explicitly using variable
bindings.
We offer a few justifications for this new style:
1. The new representation is not all that different from the
old one; it is about as easy to construct, and the lack of an
explicit graph doesn't negatively impact our ability to interpret
the graph, since we've chosen, as a matter of design, to NOT have
the IR participate in the actual execution of a graph.
2. The new let-binding representation has an implicit ordering,
which we can use to conveniently keep track of the original order
the trace showed up as. This automatically gives us a topsort,
and gives us an easier to read textual representation of our
IR:
%14 = Embedding %11, %0, -1, None, 2, False, False
%15 = Dropout %14, 0.2, True, False
%16 = Index %12, 0
%17 = Index %12, 1
%18 = Index %13, 0
%19 = Index %13, 1
%20 = Index %15, 0
%21 = Linear %20, %1, %3
%22 = Linear %16, %2, %4
3. It moves us closer to a Futhark style language
(http://futhark-lang.org/publications/pldi17.pdf).
Major aspects of the diff
- Node is replaced with Expr and Arg, a pair of mutually recursive
structures which represent our new language. In BNF, the language
looks like this:
a ::= c | %i
e ::= %i, ... = e
| PyOp e, ...
| Ret %i, ...
Technically, Ret is not actually a return (no control flow is involved),
it just tuples up a series of tensors (identified by variables).
One important invariant is that locals are always tensors; they
are never constants (this is asymmetric with Args.)
- Arguments support Python constants. This is an important piece because
many operators take extra Python literals like integers and tuples in
order to specify extra parameters about how an operator operates. Adding
this was essential to getting word_language_model to work.
- As both Expr and Arg have multiple variants, there is new infrastructure
for doing case on the variants using ExprVisitor and ArgVisitor. The
strategy here is adapted from WebAssembly's visitors, although we have
generalized to permit arbitrary argument forwarding, which is necessary
to support tail-recursive visitor calls. TCO is important because our
interpreter may recurse arbitrarily deep into a stack of nested lets.
If users wish, they can also manually case on the type tag.
- Tracing is now turned on and off using _tracer_enter/_tracer_exit in
torch._C. _tracer_enter accepts a list of variables which are to be
treated as arguments; _tracer_exit accepts the list of traced variables
which should be returned when you reexecute the trace, and returns
the trace expression which can be reexecuted. GlobalTracingState
is a global variable which tracks whether or not we are tracing or not.
- You use run_forward to execute a trace on some set of parameters.
- When under tracing, variables keep track, via trace_local, what the
name of their variables in the IR are.
Here is a simple runner which leaks memory but can be used to JIT models:
import torch.autograd.function as F
import torch._C
def jit(model):
import types
real_forward = model.forward
def forward(self, *args):
def flatten(x):
return tuple(F._iter_variables(x))
if not hasattr(self, "saved_trace"):
torch._C._tracer_enter(tuple(self.parameters()) + flatten(args))
out = real_forward(*args)
self.saved_trace = torch._C._tracer_exit(flatten(out))
self.saved_outs = out
return out
else:
flat_out = Variable._execution_engine.run_forward(self.saved_trace, tuple(self.parameters()) + flatten(args))
return F._unflatten(flat_out, self.saved_outs)
Major problems:
- Sanity checking is spotty at best, especially when users pass in variables.
- The interpreter leaks tensor memory from the store. When we add back def-use
we should be able to deallocate tensors as soon as we know they are no longer
necessary.
- The interpreter needs to reach feature parity with the old execution engine.
From there, we need to see if backwards can be subsumed as well.
- I still have no confidence in having memory managed everything correctly.
This requires a close look.
- Rather than return an *open* expression as a trace, we should return a
*lambda* instead, which knows about how many formal parameters it
requires.
- The IR is not introspectable from Python at the moment, but this is simply a
matter of implementing all the binding code.
- The tracer is NOT reentrant (you can't trace while you're inside a trace.)
Furthermore, no sanity checking is done if you try to incorrectly reuse
things from one trace in another.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Setting torch.utils.backcompat.broadcast.warning.enabled=True
will cause Python warnings in the case where broadcast occurs
but previously 1-d view style pointwise ops occured.
Otherwise, on many machines, the size of the OpenMP thread pool will
change between MKL and our OpenMP enabled functions. The constant thread
creation and destruction results in worse performance and leaks memory
on GCC 5.4
The core autograd Variable, Function, and Engine no longer depend on the
Python API. This let's us implement functions in C++. In the future, we
can also multithread engine and release the GIL for most of the
non-Python backwards.
This hooks into the (internal) ForkingPickler class in multiprocessing
to reduce tensors, storages, and CUDA events instead of our queue from
joblib. This makes it easier to use the standard multiprocessing classes
in later versions of Python.
This also exposes:
- Tensor/Storage.share_memory_()
- Module.share_memory()
These methods move the CPU tensors and storages to shared memory. If
you're using the "fork" method of multiprocessing, these objects can be
directly inherited instead of serialized through a queue.
See issue #20
The torch.Size class is a tuple subclass which distinguishes sizes from
other tuples so that torch.Tensor(size) is interpreted as size instead
of data.