Summary:
Per title.
This PR introduces a global flag that lets PyTorch prefer one of the many backend implementations when calling linear algebra functions on GPU.
Usage:
```python
torch.backends.cuda.preferred_linalg_library('cusolver')
```
Available options (str): `'default'`, `'cusolver'`, `'magma'`.
Issue https://github.com/pytorch/pytorch/issues/63992 inspired me to write this PR. No heuristic is perfect on all devices, library versions, matrix shapes, workloads, etc. We can obtain better performance if we can conveniently switch linear algebra backends at runtime.
Performance of linear algebra operators after this PR should be no worse than before. The flag is set to **`'default'`** by default, which makes everything the same as before this PR.
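To illustrate how a global preference flag of this kind can override a heuristic, here is a minimal pure-Python sketch. It is not the PyTorch implementation: `_preferred` and `pick_backend` are hypothetical, and the function below only models the call shown under Usage. Only the three option strings come from this PR description.

```python
# Hypothetical model of a process-wide backend preference; not PyTorch code.
_VALID = ("default", "cusolver", "magma")
_preferred = "default"


def preferred_linalg_library(backend=None):
    """Return the current preference; if `backend` is given, set it first."""
    global _preferred
    if backend is not None:
        if backend not in _VALID:
            raise ValueError(f"unknown backend {backend!r}; expected one of {_VALID}")
        _preferred = backend
    return _preferred


def pick_backend(heuristic_choice):
    """Use the heuristic's choice unless the user expressed a preference."""
    return heuristic_choice if _preferred == "default" else _preferred
```

With the flag left at `'default'`, `pick_backend` returns whatever the heuristic chose, which mirrors the "same as before this PR" behavior described above.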
The implementation of this PR is basically following that of https://github.com/pytorch/pytorch/pull/67790.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67980
Reviewed By: mruberry
Differential Revision: D32849457
Pulled By: ngimel
fbshipit-source-id: 679fee7744a03af057995aef06316306073010a6
Summary:
https://github.com/pytorch/pytorch/issues/67578 disabled reduced precision reductions for FP16 GEMMs. After benchmarking, we've found that this has substantial performance impacts for common GEMM shapes (e.g., those found in popular instantiations of multiheaded-attention) on architectures such as Volta. As these performance regressions may come as a surprise to current users, this PR adds a toggle to disable reduced precision reductions
`torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = `
rather than making it the default behavior.
CC ngimel ptrblck
stas00 Note that the behavior after the previous PR can be replicated with
`torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False`
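As a generic illustration of why the precision of the accumulator matters in a reduction, the sketch below sums fp16 values once with an fp16 running sum and once with a wider one. It uses only the standard library and does not model cuBLAS or any GEMM kernel; the helper names are hypothetical.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def sum_fp16_accumulator(values):
    """Accumulate with the running sum rounded to fp16 after every add."""
    acc = 0.0
    for v in values:
        acc = to_fp16(acc + to_fp16(v))
    return acc


def sum_wide_accumulator(values):
    """Accumulate fp16 inputs in a wider (double precision) accumulator."""
    return sum(to_fp16(v) for v in values)


ones = [1.0] * 4096
# The fp16 running sum stalls once the gap between adjacent fp16 values exceeds 1.
narrow = sum_fp16_accumulator(ones)
wide = sum_wide_accumulator(ones)
```

The narrow accumulator is cheaper but loses information on long reductions, which is the trade-off the toggle exposes.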
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67946
Reviewed By: zou3519
Differential Revision: D32289896
Pulled By: ngimel
fbshipit-source-id: a1ea2918b77e27a7d9b391e030417802a0174abe
Summary:
Previously, the NNAPI converter failed when given one const value and one tensor.
Code suggestions from dreiss
Test Plan:
pytest test/test_nnapi.py::TestNNAPI::test_pointwise_binary
Imported from OSS
Reviewed By: anshuljain1
Differential Revision: D28893881
fbshipit-source-id: 59240373fb03c6fdafa4cb2fa4d8408dd20092f6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62225
Rewrote the preprocess function for Android NNAPI delegate.
Previously, `preprocess()` called `convert_model_to_nnapi()` using Pybind and returned a NnapiModule that is serialized for mobile. Now, `preprocess()` calls a sub-function of `convert_model_to_nnapi()` and returns several preprocessed items (that were previously components of NnapiModule).
Dictionary returned contains:
"shape_compute_module": torch::jit::Module,
"ser_model": torch::Tensor,
"weights": List[torch.Tensor],
"inp_mem_fmts": List[int],
"out_mem_fmts": List[int]
**Purpose and Future:**
The purpose of these changes is to move more implementation from bytecode and Torchscript to the delegate API, since bytecode is less efficient.
Now, only the shape computation uses bytecode. In the future, shape computation will be moved out of Torchscript as well.
**nnapi_backend_preprocess.cpp:** preprocess implementation
**prepare.py**: refactored a portion of `convert_model_to_nnapi()` to `process_for_nnapi()`, so preprocess can get components of NnapiModule
**Test:**
Ran `python test/test_jit.py TestNnapiBackend` and `python test/test_nnapi.py` on OSS successfully
ghstack-source-id: 134444190
Test Plan: Ran `python test/test_jit.py TestNnapiBackend` and `python test/test_nnapi.py` on OSS successfully
Reviewed By: raziel
Differential Revision: D29922279
fbshipit-source-id: cadcf8908d8a745dc7abbe286e97d6ead937d4ab
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61796
We can easily handle NNAPI conversion for NHWC inputs
that have 1 channel or whose H and W are both 1
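One way to see why these cases are easy: when C is 1, or H and W are both 1, a contiguous NCHW buffer and a contiguous NHWC buffer place every element at the same offset. The sketch below checks this by brute force; the function names are hypothetical and it is not taken from the converter.

```python
from itertools import product


def flat_offsets(shape_nchw, order):
    """Map each logical (n, c, h, w) index to its offset in a contiguous buffer.

    `order` is "nchw" or "nhwc" and names the physical dimension order.
    """
    n_, c_, h_, w_ = shape_nchw
    offsets = {}
    for n, c, h, w in product(range(n_), range(c_), range(h_), range(w_)):
        if order == "nchw":
            off = ((n * c_ + c) * h_ + h) * w_ + w
        else:  # nhwc
            off = ((n * h_ + h) * w_ + w) * c_ + c
        offsets[(n, c, h, w)] = off
    return offsets


def layouts_coincide(shape_nchw):
    """True when NCHW and NHWC place every element at the same offset."""
    return flat_offsets(shape_nchw, "nchw") == flat_offsets(shape_nchw, "nhwc")
```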
Test Plan:
pytest test/test_nnapi.py::TestNNAPI::test_flatten
Imported from OSS
Reviewed By: saketh-are
Differential Revision: D29827735
fbshipit-source-id: 65dee4b42fceef1b032bf5dd1c4cc6e020d01e14
Summary:
To add a serializer for custom ops, we can subclass the default serializer
and update ADDER_MAP
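A rough sketch of the subclassing pattern, using stand-in classes rather than the real serializer. Only the name `ADDER_MAP` comes from this commit; `BaseSerializer`, `CustomSerializer`, `emit`, `add_node`, and the `my_ops::swish` op are hypothetical, and the real class layout may differ.

```python
# Hypothetical stand-ins; the real serializer's attribute layout may differ.
class BaseSerializer:
    # Maps an op name to the function that serializes it.
    ADDER_MAP = {
        "aten::relu": lambda self, node: self.emit("RELU", node),
    }

    def __init__(self):
        self.emitted = []

    def emit(self, opcode, node):
        self.emitted.append((opcode, node))

    def add_node(self, kind, node):
        adder = self.ADDER_MAP.get(kind)
        if adder is None:
            raise KeyError(f"unsupported op: {kind}")
        adder(self, node)


class CustomSerializer(BaseSerializer):
    # Copy the base map so the parent class is left untouched, then extend it.
    ADDER_MAP = {
        **BaseSerializer.ADDER_MAP,
        "my_ops::swish": lambda self, node: self.emit("CUSTOM_SWISH", node),
    }
```

Copying the map in the subclass, instead of mutating it in place, keeps the default serializer's behavior unchanged for other users.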
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61025
Test Plan:
* pytest test/test_nnapi.py::TestNNAPI for current serializer
* Custom serializers to be tested with custom ops
Imported from OSS
Reviewed By: anshuljain1
Differential Revision: D29480745
fbshipit-source-id: 37e3f8de3c97f6c8a486f9879ce11430ea89af34
Summary: As title
Test Plan: pytest test/test_nnapi.py::TestNNAPI::test_cat
Reviewed By: anshuljain1
Differential Revision: D29480747
fbshipit-source-id: 161803054ff1a4c2c750fc30a5f0fc6d8a24b2c9
Summary:
Same as title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61021
Test Plan: pytest test/test_nnapi.py::TestNNAPI
Reviewed By: anshuljain1
Differential Revision: D29480746
fbshipit-source-id: 7217c8f3a811db8c3c373f3e7ca31caf9502ef22
Summary:
Add support for aten::slice op in the NNAPI model converter
* If start = 0; end = max -> identity
* Flexible shapes can be passed through
* Flexible shapes can't be sliced over
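The three rules above can be sketched as a shape function. This is one reading of the bullets, not the converter's code: it assumes "max" means `sys.maxsize`, that a dimension of size 0 marks a flexible size, and that only non-negative indices are used. `slice_output_shape` is a hypothetical name.

```python
import sys

FLEXIBLE = 0  # a dimension of size 0 marks a size that is only known later


def slice_output_shape(in_shape, dim, start, end, step=1):
    """Return the output shape of slicing `in_shape` along `dim`."""
    if start == 0 and end == sys.maxsize and step == 1:
        # Full-range slice: identity, so flexible dimensions pass through untouched.
        return list(in_shape)
    if in_shape[dim] == FLEXIBLE:
        raise ValueError("cannot slice over a flexible dimension")
    size = in_shape[dim]
    # Clamp the bounds to the dimension (non-negative indices only).
    start = min(max(start, 0), size)
    end = min(max(end, 0), size)
    out = list(in_shape)
    out[dim] = max(0, (end - start + step - 1) // step)
    return out
```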
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59364
Test Plan: pytest test/test_nnapi.py::TestNNAPI::test_slice
Reviewed By: anshuljain1
Differential Revision: D28881039
fbshipit-source-id: 3c1c630ff27b5bba6eda403d87570c61d43ae90e
Summary:
* Add support for aten::detach op in the NNAPI model converter as a no-op
* Also add flexible op support for add_pointwise_simple_unary_op
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58543
Test Plan: pytest test/test_nnapi.py::TestNNAPI::test_detatch
Reviewed By: anshuljain1
Differential Revision: D28531942
fbshipit-source-id: 4387dbbbadd8ce6b690841f3a903e68a380b849d
Summary:
Add support for aten::div op in the NNAPI model converter. Startup-time
variable size isn't supported, as shapes are passed as inputs to the NNAPI op.
Runtime variable size support is to be added soon.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60885
Test Plan: pytest test/test_nnapi.py::TestNNAPI::test_flatten
Reviewed By: anshuljain1
Differential Revision: D29451725
fbshipit-source-id: 8902745f7758c8cc88ad4b4ce02b8301ff894bd4
Summary:
Add support for aten::div op in the NNAPI model converter. Add variable
size input test as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58541
Test Plan: pytest test/test_nnapi.py::TestNNAPI::test_div
Reviewed By: anshuljain1
Differential Revision: D28531943
fbshipit-source-id: e96342146f6de216f7b88443618edfc54963747c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58540
Add support for aten::to op in the NNAPI model converter for simple
cases like to("cpu"), to("gpu")
Test Plan: pytest test/test_nnapi.py::TestNNAPI::test_to
Reviewed By: anshuljain1
Differential Revision: D28531941
fbshipit-source-id: 0c934f7aceaff2669307c3426efe32046d8c44f3
Summary:
Add support for aten::softmax op in the NNAPI model converter with
flexible size
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58539
Test Plan: pytest test/test_nnapi.py::TestNNAPI::test_softmax
Reviewed By: anshuljain1
Differential Revision: D28531946
fbshipit-source-id: 8633f3e3f7f52795f9866ff16ad0867ea36a19e8
Summary:
Add support for aten::avgpool2d op in the NNAPI model converter with var
size support
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58538
Test Plan: pytest test/test_nnapi.py::TestNNAPI::test_avgpool2d
Reviewed By: anshuljain1
Differential Revision: D28531944
fbshipit-source-id: 43ff8c9389365698c282f204042b49c7ec84d824
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57563
Add flexible size support for upsample_nearest2d op in nnapi model conversion
Test Plan:
pytest test/test_nnapi.py
Imported from OSS
Reviewed By: dreiss
Differential Revision: D28200847
fbshipit-source-id: 901fe3f6e68e4c16ece730f3ffa68dc88c6ed6c3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57562
Add flexible size support for qadd op in nnapi model conversion
Test Plan:
pytest test/test_nnapi.py
Imported from OSS
Reviewed By: dreiss
Differential Revision: D28200849
fbshipit-source-id: d5b2ea8e9eb8ae405ff2c960f7549cef60bc0991
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57561
Add flexible size support for conv2d op in nnapi model conversion
Test Plan:
pytest test/test_nnapi.py
Imported from OSS
Reviewed By: dreiss
Differential Revision: D28200848
fbshipit-source-id: d94ccf48a3d8453aa8e96c7cac02948c4cd870cc
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48141
~Mypy is complaining about a missing arg in a function call.~
```bash
torch/backends/_nnapi/serializer.py:806: error: Too few arguments for "_do_add_binary" [call-arg]
Found 1 error in 1 file (checked 1140 source files)
```
9392137dbe/torch/backends/_nnapi/serializer.py (L804-L806)
~dreiss, would you mind take a look when you have some cycles to spare and see what would be the appropriated value for `fuse_code` here? Thanks :)~
Edit: https://github.com/pytorch/pytorch/issues/48925 got merged a couple of days ago. The blocking part is now unblocked, and I just pushed the changes to make mypy happy again. This PR is ready for review.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48142
Reviewed By: ezyang
Differential Revision: D28006249
Pulled By: walterddr
fbshipit-source-id: 5e43eeba7143512a549efaad31541f86718add7c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54701
We need NNAPI models to support inputs (and, by extension, intermediate
values and outputs) whose shape is only determined at load time. For
example, a vision model's input shape might be dependent on the aspect
ratio of the device camera. While NNAPI has full support for variable
shapes (by setting components of the operand shape to 0), the guidance
we have received is that vendor-provided drivers for real hardware are
not able to support this efficiently. Therefore, we take a hybrid
approach where shapes are calculated at model load time to
semi-dynamically construct our NNAPI model. While this doesn't let us
have truly dynamic input shapes, it does allow us to ensure that the
vendor driver only sees fixed shapes, so we get maximum performance.
In this initial commit, only PReLU supports dynamic shapes. Additional
operators will be converted in separate diffs.
- In order to convert a flexible-shape model, the user supplies inputs
with shapes containing dimensions of size 0 for the flexible
dimensions.
- During conversion, we generate code to compute the shapes of all
intermediates and outputs as a function of the input shapes.
- We no longer run the input model to produce the output templates.
Instead, we generate code to return properly-sized templates, given
the input shapes.
- All of this generated code goes into a "ShapeComputeModule" that is
used by the NnapiModule during initialization.
- The ShapeComputeModule mutates the serialized model to fill in the
computed sizes for each operand. This requires us to change the dtype
for the serialized model to int32, but this should be fine because
everything in it is already 4-byte aligned.
- NnapiInitWrapper no longer exists. Instead, initialization is
performed on the first run, based on the real arguments. We plan to
provide an API for doing eager initialization.
- Unit test updated to allow separate arguments to be given for trace,
conversion, and inference. A flexible-shape test case was added for
PReLU.
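The load-time flow described above can be sketched in plain Python. Everything here is a simplified stand-in: the function names, the flat-list model format, and the operand offset are hypothetical, and the real ShapeComputeModule is generated TorchScript rather than handwritten code.

```python
def resolve_shape(template, actual):
    """Fill the flexible (size-0) dimensions of `template` from `actual`."""
    if len(template) != len(actual):
        raise ValueError("rank mismatch")
    resolved = []
    for t, a in zip(template, actual):
        if t != 0 and t != a:
            raise ValueError(f"fixed dimension {t} does not match input {a}")
        resolved.append(a)
    return resolved


def prelu_output_shape(input_shape):
    """PReLU is elementwise, so its output shape equals its input shape."""
    return list(input_shape)


def patch_operand_dims(ser_model, offset, shape):
    """Write `shape` into a serialized model stored as a flat list of int32 words."""
    patched = list(ser_model)
    patched[offset:offset + len(shape)] = shape
    return patched
```

After patching, every operand in the serialized model carries a concrete size, so the driver only ever sees fixed shapes.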
Test Plan: Unit test
Reviewed By: axitkhurana
Differential Revision: D27536796
Pulled By: dreiss
fbshipit-source-id: 105585f247987b1e6ec6946a6fe44401237cb0a0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54700
This is an internal method just to make it more clear what
len(self.operands) is doing.
Test Plan: Unit test
Reviewed By: axitkhurana
Differential Revision: D27536794
Pulled By: dreiss
fbshipit-source-id: 678cee8a47df6757dd2e6feabf2560fd82d32e26
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54699
We'll soon be adding support for flexible-size tensors to the NNAPI
converter, but it won't be added to all ops at once. Create
get_tensor_operand_by_jitval_fixed_size as a wrapper for
get_tensor_operand_by_jitval that verifies that the argument has a fixed
shape. Update all call sites. As flexible size support is added to
each op, the call sites can be converted back and proper size checks
added.
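A sketch of the wrapper idea, under the assumption that a flexible dimension is represented by size 0. The `Serializer` class and its operand table are hypothetical stand-ins; only the two method names come from this commit.

```python
class Serializer:
    """Hypothetical stand-in holding a map from value name to (operand id, shape)."""

    def __init__(self, operands):
        self._operands = operands

    def get_tensor_operand_by_jitval(self, jitval):
        return self._operands[jitval]

    def get_tensor_operand_by_jitval_fixed_size(self, jitval):
        op_id, shape = self.get_tensor_operand_by_jitval(jitval)
        # Assumption: a dimension of size 0 denotes a flexible size. Ops that
        # have not been converted yet must reject such operands.
        if any(dim == 0 for dim in shape):
            raise ValueError("flexible size is not supported for this operand")
        return op_id, shape
```

Call sites for ops that gain flexible-size support can then switch back to the unchecked accessor and do their own size handling.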
Test Plan: Unit test
Reviewed By: axitkhurana
Differential Revision: D27536791
Pulled By: dreiss
fbshipit-source-id: 6fb1fea814d767b6ff263fd8b88240a51be74777
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54698
"mf" was short for memory format, but the concept that this variable
represents was renamed to "dim_order", so rename the variable.
Test Plan: Unit test
Reviewed By: axitkhurana
Differential Revision: D27536793
Pulled By: dreiss
fbshipit-source-id: 2b31c70da1ff221a7833e67486690fa606f01dea
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54697
Previously, models being converted to NNAPI were expected to take inputs
as separate arguments, but the generated NNAPI model could only take
multiple inputs as a list. Now the generated model always takes inputs
(single or multiple) as separate tensor arguments.
Previously, models being converted to NNAPI were expected to return
outputs as a single tensor or tuple of tensors, but the generated NNAPI
model would return multiple outputs as a list. Now the generated model
returns a tuple as well (or single tensor).
Internally, we decide what output format to use (single tensor or tuple)
based on the conversion process, rather than by running the model.
Test Plan: Unit test
Reviewed By: axitkhurana
Differential Revision: D27536790
Pulled By: dreiss
fbshipit-source-id: c0f93c85d450757e568985947cc2f32043795859
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54696
This was originally developed for a Python version where array was not
available.
Test Plan: Unit test
Reviewed By: axitkhurana
Differential Revision: D27536792
Pulled By: dreiss
fbshipit-source-id: 39e5507e37d4f91871113439fe752a4d5373eaba
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48812
This came up in a squeeze-and-excitation model. Starting with an NHWC
tensor T, we perform a mean operation across H and W, giving an NxC
tensor, which (after some fully connected layers) is reshaped to
NxCx1x1, then multiplied with T. To handle this, we detect the specific
case of a binary op with one NHWC input and one contiguous input with
H,W == 1,1 and allow the op to be applied (after transposing the
contiguous input).
Test Plan: Unit test.
Reviewed By: axitkhurana
Differential Revision: D25317939
Pulled By: dreiss
fbshipit-source-id: b4c17ab3b874d1a7defa04664010ba82115f1c20
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54695
Previously, torch.nn.Linear was calling aten::addmm internally. Now
it's calling aten::linear, so add support for that.
Test Plan: Unit test
Reviewed By: axitkhurana
Differential Revision: D27536795
Pulled By: dreiss
fbshipit-source-id: 42c8d2a80b20ac12ed9bba599c5e0e874256bb13
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47518
This was left over from an old version of the code. The idea was that
instead of indexing into separate tensors for each weight, you could
bundle them all into a single file and use different offsets into that
file. With the current design, this is nontrivial to support, so drop
the code for now.
Test Plan: CI
Reviewed By: axitkhurana
Differential Revision: D25317935
Pulled By: dreiss
fbshipit-source-id: e26ab3a8d437cb1bbb50319209fa56d9c571ce61
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47517
While we're unlikely to see this in practice, it comes up in unit tests.
This type annotation is necessary for `torch.jit.script` to figure out
the type of the list if it is empty.
Test Plan: Unit tests in a later diff.
Reviewed By: axitkhurana
Differential Revision: D25317937
Pulled By: dreiss
fbshipit-source-id: de8b6665c6fcd3cd2b39e3c696a39336c064e4c1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46780
This is in prototype status, but pretty functional. There are two major
parts.
- Model converter. This is a pure Python component that consumes a
model in TorchScript format, converts the operations into NNAPI
semantics, and serializes the model in a custom format. It then wraps
the result in a new TorchScript model that can invoke NNAPI under the
hood.
- Runtime. This is a TorchBind object that deserializes the model and
sends the result to NNAPI. This is fairly simple since the serialized
format is basically just a list of NNAPI calls to make, so most of the
code is spent on bounds checking.
A few notes on the design.
- Currently, all tensor sizes need to be fixed, and those fixed sizes
are burned directly into the serialized model. This will probably
need to change. NNAPI supports variable-sized tensors, but the
important hardware backends do not. However, we're seeing use cases
crop up where the input size is not known until around the time that
the model is loaded (for example, it might depend on the camera aspect
ratio). I think the proper fix here is to remove the code in the
converter that eagerly calculates the sizes of the intermediate
tensors and replace it with a code generator that will generate some
TorchScript code that will perform those calculations at model load
time. This way, we will be able to support models that have
variable-sized inputs while still only showing fixed-sized operands to
NNAPI.
- The important hardware backends want operands to be in NHWC order, but
PyTorch natively represents all tensors as NCHW. The strategy for
this is to keep NCHW during most of the conversion process, but track
an additional value per operand representing the "dimension order".
The dimension order gets propagated through convolutions and pointwise
ops. When we're ready to serialize the model, we reorder the
dimensions for "channels last" operands to NHWC.
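The dimension-order strategy can be sketched as follows. The enum members and function names are illustrative assumptions, not the converter's definitions: shapes stay in NCHW throughout, and the permutation to NHWC happens only when the serialized shape is produced.

```python
from enum import Enum


class DimOrder(Enum):
    PRESUMED_CONTIGUOUS = 0  # keep the NCHW shape as-is
    CHANNELS_LAST = 1        # serialize as NHWC


def propagate_pointwise(in_order):
    """Pointwise ops and convolutions hand their input's order to the output."""
    return in_order


def serialized_shape(shape_nchw, order):
    """Shape written to the serialized model for an operand tracked in NCHW."""
    if order is DimOrder.CHANNELS_LAST:
        n, c, h, w = shape_nchw
        return (n, h, w, c)
    return tuple(shape_nchw)
```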
Test Plan:
Some local testing with FB prod models. I'll need to add some examples
and automated tests.
Reviewed By: iseeyuan
Differential Revision: D24574040
Pulled By: dreiss
fbshipit-source-id: 6adc8571b234877ee3666ec0c0de24da35c38a1f
Summary:
Reland of https://github.com/pytorch/pytorch/issues/38140. It got reverted since it broke slow tests which were only run on master branch (thanks mruberry!). Enabling all CI tests in this PR to make sure they pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38288
Reviewed By: mruberry
Differential Revision: D21524923
Pulled By: ailzhang
fbshipit-source-id: 3a9ecc7461781066499c677249112434b08d2783
Summary:
I'm mostly done with cleaning up the test/ folder. There are a bunch of remaining callsites, but they're "valid" in that they test `type()` functionality. We cannot remove them until it's fully deprecated.
The next PR will mainly focus on moving some callsites to an internal API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38140
Differential Revision: D21483808
Pulled By: ailzhang
fbshipit-source-id: 12f5de6151bae59374cfa0372e827651de7e1c0f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34047
This PR integrates the added xnnpack conv2d and linear op via
custom class registration for packed weights. The packed struct
is serializable.
Test Plan:
python test test/test_xnnpack_integration.py
Imported from OSS
Differential Revision: D20185657
fbshipit-source-id: fc7e692d8f913e493b293b02d92f4e78536d7698
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26620
This change updates torch.backends.quantized.engine to accept a string ("fbgemm"/"qnnpack"/"none" for now).
set_qengine and get_qengine return an int which represents the at::QEngine enum
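A small sketch of the string-to-int translation this implies. The integer values assigned to each engine are assumptions made for illustration, and `qengine_id` / `qengine_name` are hypothetical helper names, not the functions added by this change.

```python
# Assumed enum values, for illustration only.
_QENGINE_IDS = {"none": 0, "fbgemm": 1, "qnnpack": 2}
_QENGINE_NAMES = {v: k for k, v in _QENGINE_IDS.items()}


def qengine_id(name):
    """Translate a user-facing engine string to its integer id."""
    try:
        return _QENGINE_IDS[name]
    except KeyError:
        raise RuntimeError(f"{name!r} is not a valid quantized engine") from None


def qengine_name(engine_id):
    """Translate an integer id back to the user-facing string."""
    return _QENGINE_NAMES[engine_id]
```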
Test Plan:
python test/test_torch.py
Imported from OSS
Differential Revision: D17533582
fbshipit-source-id: 5103263d0d59ff37d43dec27243cb76ba8ba633f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25680
Add a runtime flag to choose between FBGEMM and QNNPACK when compiled with both.
The flag can be set by using torch.backends.quantized.engine = torch.fbgemm/torch.qnnpack or ctx::setPreferredQuantizedEngine(at::QEngine)
ghstack-source-id: 89935643
Test Plan: Verified torch.backends.quantized.engine works
Differential Revision: D17198233
fbshipit-source-id: e5449d06f4136385e0e6d18bd4237f8654a61672
Summary:
This PR adds the torch.backends.mkldnn.enabled flag requested in https://github.com/pytorch/pytorch/issues/25186, which can be used to disable mkldnn at runtime, in the same way as torch.backends.cudnn.enabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25459
Differential Revision: D17258926
Pulled By: ezyang
fbshipit-source-id: e179ad364cc608fdaa7d0f37e2e762ceb5eda598
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18362
ghimport-source-id: 374b7ab97e2d6a894368007133201f510539296f
Stack from [ghstack](https://github.com/ezyang/ghstack):
* #18242 Test running a CUDA build on CPU machine.
* **#18362 Add ability to query if built with CUDA and MKL-DNN.**
Fixes #18108.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D14584430
fbshipit-source-id: 7605a1ac4e8f2a7c70d52e5a43ad7f03f0457473
Summary:
This is used commonly in `nn` functions. This PR adds it as a weak
module (and also alters the conversion of weak modules to strong modules
to accept ordinary `object`s)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13057
Differential Revision: D10846618
Pulled By: driazati
fbshipit-source-id: 028b9f852d40e2e53ee85b93282c98cef8cd336b
Summary:
The goal of this PR was to add support for dropout descriptors in the C++ API's RNN class.
The end result is a 4x-5x speedup for our RNN integration tests since they can now use cuDNN instead of autograd when dropout is set.
To achieve this, I had to move `_cudnn_init_dropout_state` to the `TensorOptions` API.
I also fixed a bug around `RNN::cuda()` not flattening parameters for cuDNN.
ebetica ezyang
Closes https://github.com/pytorch/pytorch/pull/9012
Reviewed By: pjh5
Differential Revision: D8689786
Pulled By: goldsborough
fbshipit-source-id: 44fb191f5a38e41c4ded5417306b5bbc012cd56c
* cache cufft plans
* use an LRU cache
* suffix CuFFTParams members with _
* import print_function for py2
* lint
* fix potential race; add dummy impl for CPU only builds
* cpp formatting; remove nccl makefile change
* Use CUDA hooks instead
* comments and doc
* update the error message
* move LRU cache to a separate file and native::detail namespace
* update comment
* specify NOTE location in CuFFTPlanCache.h
* update disabled_features.yaml to make amd ci work
* another fix for AMD CI in disabled_features.yaml
* Wrap cufft_plan_cache_* methods in __HIP_PLATFORM_HCC__
* improve the notes
* lint
* revert onnx change
* put back inlining for CUFFT_CHECK
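For readers unfamiliar with the LRU idea behind the plan cache, here is a minimal Python sketch. It is not the C++ implementation from this change: `PlanCache` and `build_plan` are hypothetical, and the sketch is not thread-safe.

```python
from collections import OrderedDict


class PlanCache:
    """Tiny LRU cache keyed by hashable plan parameters."""

    def __init__(self, max_size, build_plan):
        self.max_size = max_size
        self._build_plan = build_plan  # called on a cache miss
        self._plans = OrderedDict()

    def get(self, params):
        if params in self._plans:
            self._plans.move_to_end(params)  # mark as most recently used
            return self._plans[params]
        plan = self._build_plan(params)
        self._plans[params] = plan
        if len(self._plans) > self.max_size:
            self._plans.popitem(last=False)  # evict the least recently used
        return plan

    def __len__(self):
        return len(self._plans)
```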
* Split libATen.so into libATen_cpu.so and libATen_cuda.so
Previously, ATen could be built with either CPU-only support, or
CPU/CUDA support, but only via a compile-time flag, requiring
two separate builds. This means that if you have a program which
indirectly uses a CPU-only build of ATen, and a CPU/CUDA-build of
ATen, you're gonna have a bad time. And you might want a CPU-only
build of ATen, because it is 15M (versus the 300M of a CUDA build).
This commit splits libATen.so into two libraries, CPU/CUDA, so
that it's not necessary to do a full rebuild to get CPU-only
support; instead, if you link against libATen_cpu.so only, you
are CPU-only; if you additionally link/dlopen libATen_cuda.so,
this enables CUDA support. This brings ATen's dynamic library
structure more similar to Caffe2's. libATen.so is no more
(this is BC BREAKING)
The general principle for how this works is that we introduce
a *hooks* interface, which introduces a dynamic dispatch indirection
between a call site and implementation site of CUDA functionality,
mediated by a static initialization registry. This means that we can continue
to, for example, lazily initialize CUDA from Context (a core, CPU class) without
having a direct dependency on the CUDA bits. Instead, we look up
in the registry if, e.g., CUDA hooks have been loaded (this loading
process happens at static initialization time), and if they
have been we dynamic dispatch to this class. We similarly use
the hooks interface to handle Variable registration.
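The hooks idea can be sketched in a few lines of Python. This is only an analogy for the C++ design: the registry here is a plain dict filled by an explicit call rather than by static initialization, and every name except the concept of "CUDA hooks" is hypothetical.

```python
class CUDAHooksInterface:
    """Default hooks used when no CUDA library has been loaded."""

    def has_cuda(self):
        return False

    def init_cuda(self):
        raise RuntimeError("CUDA support is not loaded")


_hooks_registry = {}


def register_hooks(name, factory):
    """Called by the optional library when it is loaded."""
    _hooks_registry[name] = factory


def get_cuda_hooks():
    """Core code calls this instead of referencing CUDA code directly."""
    factory = _hooks_registry.get("CUDAHooks")
    return factory() if factory is not None else CUDAHooksInterface()


class RealCUDAHooks(CUDAHooksInterface):
    """Would live in the separately loaded CUDA library."""

    def has_cuda(self):
        return True

    def init_cuda(self):
        return "initialized"
```

The core side depends only on `CUDAHooksInterface` and the registry lookup, so it never needs to link against the CUDA side directly.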
We introduce a new invariant: if the backend of a type has not
been initialized (e.g., its library has not been dlopened; for
CUDA, this also includes CUDA initialization), then the Type
pointers in the context registry are NULL. If you access the
registry directly you must maintain this invariant.
There are a few potholes along the way. I document them here:
- Previously, PyTorch maintained a separate registry for variable
types, because no provision for them was made in the Context's
type_registry. Now that we have the hooks mechanism, we can easily
have PyTorch register variables in the main registry. The code
has been refactored accordingly.
- There is a subtle ordering issue between Variable and CUDA.
We permit libATen_cuda.so and PyTorch to be loaded in either
order (in practice, CUDA is always loaded "after" PyTorch, because
it is lazily initialized.) This means that, when CUDA types are
loaded, we must subsequently also initialize their Variable equivalents.
Appropriate hooks were added to VariableHooks to make this possible;
similarly, getVariableHooks() is not referentially transparent, and
will change behavior after Variables are loaded. (This is different
to CUDAHooks, which is "burned in" after you try to initialize CUDA.)
- The cmake is adjusted to separate dependencies into either CPU
or CUDA dependencies. The generator scripts are adjusted to either
generate a file as a CUDA (cuda_file_manager) or CPU file (file_manager).
- I changed all native functions which were CUDA-only (the cudnn functions)
to have dispatches for CUDA only (making it permissible to not specify
all dispatch options.) This uncovered a bug in how we were handling
native functions which dispatch on a Type argument; I introduced a new
self_ty keyword to handle this case. I'm not 100% happy about it
but it fixed my problem.
This also exposed the fact that set_history incompletely handles
heterogenous return tuples combining Tensor and TensorList. I
swapped this codegen to use flatten() (at the possible cost of
a slight perf regression, since we're allocating another vector now
in this code path).
- thc_state is no longer a public member of Context; use getTHCState() instead
- This PR comes with Registry from Caffe2, for handling static initialization.
I needed to make a bunch of fixes to Registry to make it more portable
- No more ##__VA_ARGS__ token pasting; instead, it is mandatory to pass at
least one argument to the var-args. CUDAHooks and VariableHooks pass a nullary
struct CUDAHooksArgs/VariableHooksArgs to solve the problem. We must get rid of
token pasting because it does not work with MSVC.
- It seems MSVC is not willing to generate code for constructors of template
classes at use sites which cross DLL boundaries. So we explicitly instantiate
the class to get around the problem. This involved tweaks to the boilerplate
generating macros, and also required us to shuffle around namespaces a bit,
because you can't specialize a template unless you are in the same namespace as
the template.
- Insertion of AT_API to appropriate places where the registry must be exported
- We have a general problem which is that on recent Ubuntu distributions,
--as-needed is enabled for shared libraries, which is (cc @apaszke who was
worrying about this in #7160 see also #7160 (comment)). For now, I've hacked
this up in the PR to pass -Wl,--no-as-needed to all of the spots necessary to
make CI work, but a more sustainable solution is to attempt to dlopen
libATen_cuda.so when CUDA functionality is requested.
- The JIT tests somehow manage to try to touch CUDA without loading libATen_cuda.so. So
we pass -Wl,--no-as-needed when linking libATen_cuda.so to _C.so
- There is a very subtle linking issue with lapack, which is solved by making sure libATen_cuda.so links against LAPACK. There's a comment in aten/src/ATen/CMakeLists.txt about this as well as a follow up bug at #7353
- autogradpp used AT_CUDA_ENABLED directly. We've expunged these uses and added
a few more things to CUDAHooks (getNumGPUs)
- Added manualSeedAll to Generator so that we can invoke it polymorphically (it
only does something different for CUDAGenerator)
- There's a new cuda/CUDAConfig.h header for CUDA-only ifdef macros (AT_CUDNN_ENABLED, most prominently)
- CUDAHooks/VariableHooks structs live in at namespace because Registry's
namespace support is not good enough to handle it otherwise (see Registry
changes above)
- There's some modest moving around of native functions in ReduceOps and
UnaryOps to get the CUDA-only function implementations into separate files, so
they are only compiled into libATen_cuda.so. sspaddmm needed a separate CUDA
function due to object linkage boundaries.
- Some direct uses of native functions in CUDA code have to go away, since these
functions are not exported, so you have to go through the dispatcher
(at::native::empty_like to at::empty_like)
- Code in THC/THCS/THCUNN now properly use THC_API macro instead of TH_API
(which matters now that TH and THC are not in the same library)
- Added code debt in torch/_thnn/utils.py and other THNN parsing code to handle
both TH_API and THC_API
- TensorUtils.h is now properly exported with AT_API
- Dead uses of TH_EXPORTS and co expunged; we now use ATen_cpu_exports and
ATen_cuda_exports (new, in ATenCUDAGeneral.h) consistently
- Fix some incorrect type annotations on _cudnn_rnn_backward, where we didn't
declare a type as possibly undefined when we should have. We didn't catch this
previously because optional annotations are not tested on "pass-through" native
ATen ops (which don't have dispatch). Upstream issue at #7316
- There's a new cmake macro aten_compile_options for applying all of our
per-target compile time options. We use this on the cpu and cuda libraries.
- test/test_cpp_extensions.py can be run directly by invoking in Python,
assuming you've set up your PYTHONPATH correctly
- type_from_string does some new funny business to only query for all valid CUDA
types (which causes CUDA initialization) when we see "torch.cuda." in the
requested string
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Last mile libtorch fixes
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* pedantic fix
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Codemod to update our codebase to 0.4 standard
* Update some of the test scripts
* remove Variable in test_clip_grad_value
* fix _symbolic_override_wrapper_maker
* Separate cuda-ness from dtype.
There are no longer torch.cuda.int64, etc; only torch.int64 that correspond to at::ScalarType.
At the python arg parser level, the corresponding ATen type is selected from the combination of (ScalarType, Layout, Device).
There is also currently unused code in here for supporting ScalarType in native_functions; this will be used for specifying aggregate types
on reduction functions.
* Fix test_autograd.
* Add defaults to randint_like.
* Track is_cuda in py tensor types.
* Fix test_sparse.
* Fix multiprocessing.
* Fix rnn.
* Fix test_nn.
* Fix flake8.
This is the first of three PRs that #5537 will be split into.
This PR adds mkl headers to included files, and provides helper functions for MKL fft and cuFFT.
In particular, on POSIX, headers are using mkl-include from conda, and on Windows, it is from a new file @yf225 and I made and uploaded to s3.
* add mkl-include to required packages
* include MKL headers; add AT_MKL_ENABLED flag; add a method to query MKL availability
* Add MKL and CUFFT helpers
* Support native namespace functions with type dispatch.
Use 'ones' as an example. Note this is a "halfway" solution; i.e. the call chain is:
at::ones(shape, dtype) -> dtype.ones(shape, dtype) -> CPUFloatType.ones(shape, dtype) -> at::native::ones(shape, dtype)
The "nicer" solution would probably be something like:
at::ones(shape, dtype) -> dtype.ones(shape) -> CPUFloatType.ones(shape) -> at::native::ones(shape, this)
* Fix type inference.
* Fix test install.
* Fix extensions.
* Put dtype argument at the beginning.
* Fix extension.cpp.
* Fix rnn.
* Move zeros in the same manner.
* Fix cuda.
* Change randn.
* Change rand.
* Change randperm.
* Fix aten contrib.
* Resize in randperm_out.
* Implement eye.
* Fix sparse zeros.
* linspace, logspace.
* arange.
* range.
* Remove type dispatch from gen_python_functions.
* Properly generate maybe_init_cuda for type dispatch functions not named type.
* Don't duplicate dtype, this parameters for native type dispatched functions.
* Call VariableType factory methods from the base type so it gets version number 0.
* Address review comments.
* Port cuDNN RNN dropout state initialization to ATen and make Python code use it.
Fixes #5138.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Variable/Tensor bugfix
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
The Tensor and Variable classes are being merged.
autograd.Function.forward is now called on Variables, but with "no-grad"
mode (torch.no_grad()) enabled.
One benefit is that we no longer have to explicitly track shared
storages.
* Add transpose() to TensorGeometry.
This code is dead; I briefly used it in my RNN patchset but
eventually rewrote it to not be necessary. However, it seemed
like a useful gadget so I kept it. In general, it seems that it
would be useful for TensorGeometry to support all operations that
Tensor does, but it only computes the changes to sizes/strides
instead of actually doing the computation.
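As a rough illustration of "only computing the changes to sizes/strides", here is a minimal Python sketch. The `Geometry` type and `transpose` helper are hypothetical stand-ins for this example, not the actual TensorGeometry API.

```python
from typing import NamedTuple, Tuple


class Geometry(NamedTuple):
    """Hypothetical stand-in for TensorGeometry: shape metadata only, no data."""
    sizes: Tuple[int, ...]
    strides: Tuple[int, ...]


def transpose(g: Geometry, dim0: int, dim1: int) -> Geometry:
    """Swap two dimensions by swapping their sizes and strides; no data moves."""
    sizes = list(g.sizes)
    strides = list(g.strides)
    sizes[dim0], sizes[dim1] = sizes[dim1], sizes[dim0]
    strides[dim0], strides[dim1] = strides[dim1], strides[dim0]
    return Geometry(tuple(sizes), tuple(strides))
```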
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Turn on wrap_dim behavior for TensorGeometry
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Support for hard-coded differentiable outputs.
Some outputs of functions are nondifferentiable, and should always
be returned with requires_grad=False. Traditionally, we have used
the presence of 'grad' to signal that only the first output is
differentiable, and the rest are not, but cudnn_rnn (to be
implemented) breaks this pattern; its first three outputs are differentiable,
but its last output is a buffer that is just consumed by backwards.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* TensorGeometry constructor from just sizes
The sizes are assumed to form a contiguous tensor, and we compute
the strides we would get in that case.
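The stride computation for a contiguous tensor can be sketched as follows. This is an illustrative Python version under the usual row-major assumption; `contiguous_strides` is a hypothetical name, not the function added by this commit.

```python
from typing import List, Sequence


def contiguous_strides(sizes: Sequence[int]) -> List[int]:
    """Row-major strides (in elements) for a contiguous tensor of these sizes."""
    strides = [1] * len(sizes)
    # Walk from the second-innermost dimension outwards; each stride is the
    # product of all sizes to its right.
    for i in range(len(sizes) - 2, -1, -1):
        strides[i] = strides[i + 1] * sizes[i + 1]
    return strides


print(contiguous_strides([2, 3, 4]))  # [12, 4, 1]
```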
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Support saving TensorList for backwards.
There is some back story here. Saved TensorList in backwards will
be used by cudnn_rnn, and it is worth asking, why is it necessary to
save a list of tensors? Indeed, *technically* speaking a list of
tensors is not necessary, we only need to save the sizes of each
of the weight tensors. (We need the sizes because cuDNN is only
going to blast the derivative of weights into a flat buffer, but
we need to match the sizes of the views into the buffer when we
eventually return the derivatives.)
However, it was surprisingly awful trying to implement passing just
sizes, because as non-Tensor arguments, the JIT interpreter generation
code is expected to handle all non-Tensor arguments as attributes in the
trace, and our attributes struct doesn't actually know how to do
arrays of arrays. Saved TensorList code was much easier to get working,
so that's what this patch does.
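To make the "sizes of the views into the buffer" point concrete, here is a small sketch of carving a flat gradient buffer back into per-weight pieces. It uses plain Python lists rather than tensors, and `split_flat` is a hypothetical helper written for this example only.

```python
from math import prod
from typing import List, Sequence, Tuple


def split_flat(flat: Sequence[float],
               shapes: Sequence[Tuple[int, ...]]) -> List[Tuple[Tuple[int, ...], List[float]]]:
    """Split a flat buffer into consecutive chunks, one per weight shape."""
    pieces = []
    offset = 0
    for shape in shapes:
        numel = prod(shape)  # number of elements this weight occupies
        pieces.append((shape, list(flat[offset:offset + numel])))
        offset += numel
    if offset != len(flat):
        raise ValueError("shapes do not account for the whole buffer")
    return pieces
```

Only the shapes are needed to do the split, which is why saving sizes alone would have been sufficient in principle.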
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* MatrixRef - an ArrayRef with a stride, making it a 2D ArrayRef.
Like ArrayRef, this class does not own the underlying data, it is expected
to be used in situations where the data resides in some other buffer.
This is intended to be trivially copyable, so it should be passed by
value.
For now, 2D only (so the copies are actually cheap, without having
to write a SmallVector class) and contiguous only (so we can
return non-strided ArrayRef on index).
The intended use-case (not in this commit) is to make it easier to
work with RNN weights, which form a num_weights x num_layers matrix of
parameters.
P.S. dimension 0 indexes rows, dimension 1 indexes columns
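The idea can be sketched in Python as a non-owning 2D view over a flat sequence. `MatrixView` is a hypothetical analogue written for illustration; it is not the C++ MatrixRef class, and unlike ArrayRef, slicing a Python list copies the row.

```python
from typing import Generic, List, Sequence, TypeVar

T = TypeVar("T")


class MatrixView(Generic[T]):
    """2D, contiguous-only view: a flat sequence plus a row stride."""

    def __init__(self, data: Sequence[T], stride0: int) -> None:
        if stride0 <= 0 or len(data) % stride0 != 0:
            raise ValueError("data length must be a multiple of stride0")
        self._data = data        # not owned; lives in some other buffer
        self._stride0 = stride0  # number of elements per row

    def size(self, dim: int) -> int:
        if dim == 0:
            return len(self._data) // self._stride0  # rows
        if dim == 1:
            return self._stride0                     # columns
        raise IndexError("MatrixView is 2D only")

    def __getitem__(self, row: int) -> List[T]:
        if not 0 <= row < self.size(0):
            raise IndexError("row out of range")
        start = row * self._stride0
        return list(self._data[start:start + self._stride0])
```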
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Generalize getDataType in Descriptors.h
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Change copy_range to take Tensor, and change cat_tensors_backward accordingly
Should a backward function return a Variable or a Tensor? For the most
part, all of our backward functions return Tensor, except cat_tensors_backward,
which returns a variable_list (which is really the only thing that matters,
because Tensor and Variable are interconvertible). But this is kind of weird,
because it means that you can't implement a backwards in ATen that returns
a std::vector<Tensor>, and then hook it up transparently with the derivatives
code. So I switched it over.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Support 5-ary return Tensor tuple.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Support code generation with mixed Tensor/TensorList in output.
I don't think I ended up using this in cudnn_rnn, but it seems like
it might be useful for someone else later.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Support 4-ary boolean array
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Add support for retain_variables in tools/autograd/derivatives.yaml
'retain_variables', a bool which is true if a user has specified
that saved variables should be retained in case the backwards is
run again later. This allows an optimization where we can
destroy saved buffers if we know variables are not going to be retained,
e.g., it is (will be) used by _cudnn_rnn
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Lazily initialize cuDNN descriptors
Previously, cuDNN descriptors were eagerly allocated as soon
as a FooDescriptor object was created. However, in some uses
of TensorDescriptor, this is problematic: some tensors are optional
and cuDNN's API expects to be given a nullptr TensorDescriptor
in this case, not an uninitialized (but allocated) descriptor.
Lazily initializing the descriptors makes it less likely for
us to use uninitialized memory and matches the usual semantics of
unique_ptr. It's good sense!
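A minimal Python sketch of the lazy-initialization pattern described above. The class is hypothetical; the method names `desc` and `mut_desc` are loosely modeled on the C++ accessors, and `create` stands in for whatever cuDNN allocation call the real descriptor uses.

```python
from typing import Callable, Generic, Optional, TypeVar

H = TypeVar("H")


class LazyDescriptor(Generic[H]):
    """Allocates the underlying handle only when it is first mutated."""

    def __init__(self, create: Callable[[], H]) -> None:
        self._create = create
        self._handle: Optional[H] = None

    def desc(self) -> Optional[H]:
        # Read-only access: stays None (the analogue of nullptr) until set.
        return self._handle

    def mut_desc(self) -> H:
        # First mutable access triggers the allocation.
        if self._handle is None:
            self._handle = self._create()
        return self._handle
```

An optional tensor's descriptor can then be passed along as `None` simply by never calling `mut_desc()` on it.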
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Port cuDNN RNNs to ATen.
This brings three new functions:
- _cudnn_rnn_flatten_weight: flatten a matrix of weight tensors into
a single contiguous weight buffer as required by cuDNN
- _cudnn_rnn: run RNN forwards
- _cudnn_rnn_backward: run RNN backwards
RNNs have a lot of parameters, so we restructured what was previously
a single 'fn' object that recorded all the parameters into three
objects: RNNDescriptorParams, TensorDescriptorListParams and
DropoutDescriptorParams.
We make use of MatrixRef to organize the weight tensors (which are
weight/bias x number of layers), but I did not teach the codegen
how to pass these as arguments/return values natively, so instead
a MatrixRef is passed as its constituent ArrayRef and int64_t stride0.
cudnn_rnn has three differentiable outputs and one nondifferentiable
one, so it makes use of the support for hard-coded differentiable outputs.
I haven't deleted all of the descriptor code from Python, because dropout
initialization still goes through this codepath. That should be fixed soon,
but I don't see it as essential for this PR.
This commit also removes the last use of NestedIOFunction from PyTorch.
There are some shenanigans with cuDNN dropout descriptor initialization,
see below:
Note [cuDNN dropout descriptor initialization]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In most cases, setting descriptors in cuDNN is cheap (e.g.,
cudnnSetTensorNdDescriptor). However, this is not the case for
cudnnSetDropoutDescriptor: in cuDNN 6/7 (and possibly others) it does an
expensive precomputation to initialize the random number generator states. In
cuDNN 6, this is the ONLY official mechanism to initialize a dropout descriptor,
which means that law-abiding clients were expected to generate a dropout
descriptor once and cache it. However, our ATen interface is (1) stateless (so
we can't cache the descriptors) and (2) does not accept arbitrary user types in
its interface (so we can't pass the descriptor in). This puts us in a pickle.
In cuDNN 7, a new function, cudnnRestoreDropoutDescriptor was added, which
forgoes the expensive initialization process, and can initialize the
descriptor with a pre-initialized state CUDA tensor. This is great, because
it means we can simply pass in the state tensor and then initialize the
descriptor internally. Unfortunately, this function is not available in
cuDNN 6.
To work around this, we break the cuDNN abstraction barrier, and rely on
the struct layout of the underlying dropout descriptor. With this struct,
we can reimplement cudnnRestoreDropoutDescriptor from scratch. Great!
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Fix cuDNN 7 behavior.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Delete some unused, controversial methods from MatrixRef.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Add missing filter_dim_a slice
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Replace nested for-loop with itertools.chain.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* CR comment on mut_desc()
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Refactor DropoutDescriptor API.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Use cached CurrentDeviceProperties from Context.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Document _cudnn_rnn outputs.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Improve fmap docs, convert some functions to use it.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Move IndexRange to autograd/function.h
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Elaborate on CUDNN_STATUS_INVALID_VALUE return some more.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Add an all-in-one setter for RNNDescriptorParams.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Print what the unrecognized RNN mode was
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* RNN TensorDescriptor improvements
- Have an explicit size/stride overload for set TensorDescriptor,
so you don't have to create a goofy view to feed in.
- Change the padding to 3D rather than 5D, which is all you actually
need (it's just 2D that is not supported by cuDNN API.)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Fix implementation of cudnnRestoreDropoutDescriptor, plus test.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Better comments about input layout.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Add comment about no-DropoutDescriptor argument RNNDescriptor function.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Rename vocab_size back to input_size.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Don't use backslash in comment.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Bugfix for contiguous TensorGeometry calculation.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Don't allocate a dummy tensor when setting TensorDescriptor for flatten_weight.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Make contiguity errors more user-friendly.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* s/fn.dropout.train/fn_train/
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* s/_cudnn_rnn_backward_grad/_cudnn_rnn_backward_input/
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Make dcx properly undefined when not required.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Remove old TODO.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Add state size check in cudnnRestoreDropoutDescriptor
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Explicitly narrow int64_t to size_t
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Restore copyParams comment.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Update benchmark numbers, and slight engineering improvements.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Typofix.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Three stage plan to no more stupidly weird "why isn't cuDNN enabled"
bugs:
- Add torch.backends.cudnn.disable_global_flags(), which as its name suggests,
disables global flag setting in cuDNN, so that you are not allowed to
make changes to this state. However, the flags() context
manager continues to work (since it only makes non-global changes).
- Call disable_global_flags() in test/common.py
- Switch all of the manual flag setting/unsetting in test/test_nn.py
to use the context manager.
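The intended behavior can be modeled with a small toy class. This is an illustrative sketch only, not the torch.backends.cudnn implementation; the `Flags` class, its `enabled` attribute and `set_enabled` method are assumptions made for the example.

```python
from contextlib import contextmanager
from typing import Iterator


class Flags:
    """Toy model: global writes can be locked, scoped overrides still work."""

    def __init__(self) -> None:
        self.enabled = True
        self._global_writes_allowed = True

    def disable_global_flags(self) -> None:
        self._global_writes_allowed = False

    def set_enabled(self, value: bool) -> None:
        # Global (unscoped) mutation: rejected once global flags are disabled.
        if not self._global_writes_allowed:
            raise RuntimeError("global flag setting is disabled; use flags()")
        self.enabled = value

    @contextmanager
    def flags(self, enabled: bool) -> Iterator[None]:
        # Scoped override: always restores the previous value on exit.
        previous = self.enabled
        self.enabled = enabled
        try:
            yield
        finally:
            self.enabled = previous
```

With global writes locked in the test harness, a test that forgets to restore a flag fails loudly instead of silently affecting later tests.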
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
- Rename THNN convolution to have thnn_ prefix.
- Propagate CuDNN benchmark and deterministic to at::Context
- Add 'convolution', 'convNd' and 'conv_transposeNd' native wrappers, with defaults
The conv_transposeNd wrappers are updated to have the same argument
order as Python.
- torch.nn.functional directly dispatches to the native wrappers
- Make it possible to turn off tracing for some native wrappers, so I don't
have to write symbolics for all the functions above
- Spectral ops can now make use of CuDNN convolution if possible
- Better commentary on cudnn_batch_norm
- Turn on DCE for all JIT tests.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
This is not currently used by anything, but eventually ATen
will need to decide whether or not to use CuDNN functions,
which means we need to propagate this variable to ATen.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Check cuDNN version at runtime
This checks that the version from cudnn.h matches the version from
libcudnn.so.
Fixes #1476
* Only check major and minor version numbers
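A sketch of a major/minor-only comparison, assuming the version integer is encoded as major * 1000 + minor * 100 + patch (the CUDNN_VERSION convention in cudnn.h for the cuDNN releases of this era). The function names are hypothetical and this is not the check as implemented in the commit.

```python
from typing import Tuple


def major_minor(version: int) -> Tuple[int, int]:
    """Decode (major, minor), assuming major*1000 + minor*100 + patch."""
    return version // 1000, (version % 1000) // 100


def versions_compatible(compile_version: int, runtime_version: int) -> bool:
    """Header and library versions match if major and minor agree."""
    return major_minor(compile_version) == major_minor(runtime_version)


print(versions_compatible(7003, 7005))  # True: differs only in patch level
print(versions_compatible(7005, 7104))  # False: minor version differs
```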
This ensures that we use the same library at the C++ level and with
Python ctypes. It moves the searching for the correct library from
run-time to compile-time.
Here's the command I used to invoke autopep8 (in parallel!):
git ls-files | grep '\.py$' | xargs -n1 -P`nproc` autopep8 -i
Several rules are ignored in setup.cfg. The goal is to let autopep8
handle everything which it can handle safely, and to disable any rules
which are tricky or controversial to address. We may want to come back
and re-enable some of these rules later, but I'm trying to make this
patch as safe as possible.
Also configures flake8 to match pep8's behavior.
Also configures TravisCI to check the whole project for lint.