Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30642
Adding a couple of basic metrics for distributed autograd which would
help in determining stuckness.
ghstack-source-id: 95156189
Test Plan: waitforbuildbot
Differential Revision: D18776478
fbshipit-source-id: a0556ad6fe2b7c3cd0082ee2350c1c78cafaaec5
Summary:
Fixes https://github.com/pytorch/pytorch/issues/29161.
I looked a bit at the code changes related to this and think I have all of the use cases of `DeprecatedTypeProperties` covered in the message, but suggestions from someone with more context on this would be very much appreciated :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30281
Differential Revision: D18830818
Pulled By: ezyang
fbshipit-source-id: 1a7fcee15354ae09e6644577e7fa33bd26acfe20
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27940
1) If we receive an error for outstanding rpcs, we enqueue an appropriate error
on the local autograd engine.
2) Add an `exit_on_error` mode for the local autograd engine, where the
computation stops if we see an error.
ghstack-source-id: 92603377
Test Plan: Added unit tests to test failures.
Differential Revision: D17916844
fbshipit-source-id: 199a7832f1033c36a9bbcc1e80d86576c04965d0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27022
This change implements the "FAST" mode distributed autograd backward
pass as described in https://github.com/pytorch/pytorch/issues/23110.
At a high level the backward pass works as follows:
1. We start by computing dependencies on the node that calls
`torch.distributed.backward`.
2. This node computes the dependencies starting from the root nodes provided in
the backward call and all the 'send' functions present in the current autograd
context. The "FAST" mode assumes all 'send' functions are part of the autograd
computation.
3. Once the dependency computation is done, the distributed autograd engine
calls the local autograd engine to execute the autograd graph. Note that the
autograd graph on a single node is not necessarily connected because of
inter-node communication. As a result, we have special handling to ensure the
local autograd engine executes the entire graph starting from the
provided roots and all 'send' functions on the node.
4. When the local autograd engine hits a 'recv' function, it performs an async
RPC to send the gradients over to the appropriate node and stores a future in
the autograd context to keep track of this RPC.
5. On the destination node, the appropriate 'send' function is looked up and
enqueued on the local autograd engine. If this is the first time the node is
hearing about this autograd context id on the backward pass, then the node
computes dependencies for the local autograd engine.
6. As part of compute dependencies, the distributed autograd engine discovers
all leaf nodes and ensures those are passed as 'outputs' to the local autograd
engine. This avoids running the 'AccumulateGrad' function.
7. The gradients computed for the leaf nodes are then actually accumulated in
`DistAutogradContext` for the appropriate autograd context id.
8. The distributed autograd engine waits for the local autograd engine
to complete and also waits for all the 'Futures' (stored in 4.) for respective
RPCs to finish.
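The dependency computation in steps 1 to 3 can be sketched in a few lines. This is only an illustration, not the engine's code: the `Node` class and the `compute_dependencies` function are hypothetical stand-ins, and the graph is made of plain Python objects. It shows why the traversal has to start from both the roots and every 'send' function when the local graph may be disconnected.

```python
from collections import deque


class Node:
    """Hypothetical stand-in for an autograd function node."""

    def __init__(self, name, next_nodes=()):
        self.name = name
        self.next_nodes = list(next_nodes)


def compute_dependencies(roots, send_functions):
    """Count how many incoming edges each reachable node has.

    The traversal starts from the roots and from every 'send' function,
    because the local graph is not necessarily connected.
    """
    dependencies = {}
    seen = set()
    queue = deque()
    for start in list(roots) + list(send_functions):
        if id(start) not in seen:
            seen.add(id(start))
            queue.append(start)
    while queue:
        node = queue.popleft()
        for nxt in node.next_nodes:
            dependencies[nxt.name] = dependencies.get(nxt.name, 0) + 1
            if id(nxt) not in seen:
                seen.add(id(nxt))
                queue.append(nxt)
    return dependencies


# A tiny graph: both the local root and a 'send' function feed into `mul`.
leaf = Node("leaf")
mul = Node("mul", [leaf])
root = Node("root", [mul])
send = Node("send", [mul])
deps = compute_dependencies([root], [send])
```

Here `mul` ends up with two dependencies, so a scheduler built on these counts would wait for both the local root and the 'send' function before running it.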
We have made the following changes to the local autograd engine for this
purpose:
1. Expose GraphTask and NodeTask so that the distributed autograd engine can
use them.
2. Expose an `execute_with_graph_task` API which allows the distributed engine
to build a GraphTask and pass it to the local autograd engine.
3. Expose an `enqueue_on_cpu` API, which allows the distributed engine to build
a `NodeTask` for a 'send' function and enqueue it on the local autograd engine.
In addition to this a few general improvements:
1. Added a `PropagateGradients` RPC call for the 'recv' function to pass
gradients to the appropriate node during the backward pass.
2. Use IValues as much as possible in serialization for RpcWithAutograd.
3. If the message held by Future.wait() has type EXCEPTION, we throw an
appropriate exception instead of just returning the message. This is in line
with what most Future.wait() APIs do.
4. Added a `get_gradients(context_id)` API which allows users to retrieve a map
from Tensor to respective gradient for the provided context_id on the local
node.
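The `Future.wait()` behaviour from item 3 can be illustrated with a minimal sketch. Everything here is hypothetical (the `Message`, `RpcError`, and `Future` classes and the string message types are invented for illustration and are not the RPC layer's real types); it only shows the idea of raising on an EXCEPTION message instead of returning it.

```python
import threading


class RpcError(RuntimeError):
    """Hypothetical exception type raised for failed RPCs."""


class Message:
    """Hypothetical RPC message with a type tag and a payload."""

    def __init__(self, msg_type, payload):
        self.msg_type = msg_type
        self.payload = payload


class Future:
    """Minimal future whose wait() raises if the result is an error message."""

    def __init__(self):
        self._done = threading.Event()
        self._message = None

    def mark_completed(self, message):
        self._message = message
        self._done.set()

    def wait(self):
        # Block until a message is available.
        self._done.wait()
        # Surface remote errors as exceptions rather than as return values.
        if self._message.msg_type == "EXCEPTION":
            raise RpcError(self._message.payload)
        return self._message


ok = Future()
ok.mark_completed(Message("RESPONSE", "gradients sent"))

failed = Future()
failed.mark_completed(Message("EXCEPTION", "remote node failed"))
```

Calling `ok.wait()` returns the message, while `failed.wait()` raises `RpcError`, so callers cannot accidentally treat an error payload as a result.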
ghstack-source-id: 91794926
Test Plan: unit tests.
Differential Revision: D17652615
fbshipit-source-id: 96f65c52adb2706ee29f4b49e1655afaa0a3bec3
Summary:
This PR addresses issue https://github.com/pytorch/pytorch/issues/7601.
Currently models that use streams explicitly in forward have to do a lot of extra work to make backwards respect those streams. This PR extends the (recently added) input tracing (see TypeAndShape) to record the devices and streams of inputs. The autograd engine then uses this metadata to enact the expected stream parallelism without extra work from the user.
For example, a model with forward declared like (original example courtesy of ngimel):
```
def forward(self, x):
    x0 = x.clone()
    torch._C._cuda_setStream(self.stream1._cdata)
    y0 = self.fc1(x0)
    self.event1.record(stream=torch.cuda.current_stream())
    torch._C._cuda_setStream(self.stream2._cdata)
    y1 = self.fc2(x)
    self.event2.record(stream=torch.cuda.current_stream())
    self.stream2.wait_event(self.event1)
    return y0 + y1
```
currently will backward on a single stream. With this change the kernels will go on the streams they are assigned in forward and both forward and backward will (for appropriate sizes) run the fc1 and fc2 kernels simultaneously.
The crux of this change is, as mentioned, an expansion of the TypeAndShape tracing and a relatively simple change to the autograd engine to use cuda events for stream synchronization. To make this efficient I also added a new AutoGPUAndStream class, exposed getting and setting streams on devices, and removed InputBuffer's AutoGPU (it's now redundant). While making these modifications I also fixed AutoGPU to check before setting the GPU when it's destroyed and to use THCudaCheck instead of its custom error handler. These changes mean that an often excessive cudaSetDevice() is not being called when inputs are added to a buffer.
In addition to allowing users to easily set and use streams that are respected in both forward and backward, this change may encourage modules to do the same, and the expanded tracing might allow further optimizations in the autograd engine. (apaszke: for example, after the initial enumeration we now know the number of devices that will be used by a graph task, which might help provide a sense of the "level of parallelism" we should expect.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8354
Test Plan: Two tests were added specifically for this behavior.
Differential Revision: D17275980
Pulled By: mruberry
fbshipit-source-id: 92bd50ac782ffa973b159fcbbadb7a083802e45d
Summary:
Improve handling of mixed-type tensor operations.
This PR affects the arithmetic (add, sub, mul, and div) operators implemented via TensorIterator (so dense but not sparse tensor ops).
For these operators, we will now promote to reasonable types where possible, following the rules defined in https://github.com/pytorch/pytorch/issues/9515, and error in cases where the cast would require floating point -> integral or non-boolean to boolean downcasts.
The details of the promotion rules are described here:
https://github.com/nairbv/pytorch/blob/promote_types_strict/docs/source/tensor_attributes.rst
Some specific backwards incompatible examples:
* Now `int_tensor * float` will result in a float tensor, whereas previously the floating point operand was first cast to an int. Previously `torch.tensor(10) * 1.9` => `tensor(10)` because the 1.9 was downcast to `1`. Now the result will be the more intuitive `tensor(19.)`
* Now `int_tensor *= float` will error, since the floating point result of this operation can't be cast into the in-place integral type result.
See more examples/detail in the original issue (https://github.com/pytorch/pytorch/issues/9515), in the above linked tensor_attributes.rst doc, or in the test_type_promotion.py tests added in this PR:
https://github.com/nairbv/pytorch/blob/promote_types_strict/test/test_type_promotion.py
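As a rough illustration of the two rules above (promote across categories, and reject in-place results that would need a downcast), here is a toy model in plain Python. It is not PyTorch's implementation: the `CATEGORY` table and both functions are hypothetical, only three dtype names are modelled, and the real rules also account for sizes within a category and for scalars versus tensors.

```python
# Toy category ordering: bool < integral < floating point.
# This table is an assumption made for illustration only.
CATEGORY = {"bool": 0, "int64": 1, "float32": 2}


def result_dtype(a, b):
    """Return the operand dtype with the higher category."""
    return a if CATEGORY[a] >= CATEGORY[b] else b


def check_inplace(target, other):
    """Model `target op= other`: fail if the result would not fit in target."""
    promoted = result_dtype(target, other)
    if CATEGORY[promoted] > CATEGORY[target]:
        raise TypeError(
            "result type {} can't be cast to the in-place type {}".format(
                promoted, target
            )
        )
    return target


out_of_place = result_dtype("int64", "float32")
```

In this model `int64 * float32` promotes to `float32`, while `check_inplace("int64", "float32")` raises, mirroring the `int_tensor *= float` example.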
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22273
Reviewed By: gchanan
Differential Revision: D16582230
Pulled By: nairbv
fbshipit-source-id: 4029cca891908cdbf4253e4513c617bba7306cb3
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/22397
Test Plan:
Added test for reentrant backwards with checkpoint and a test for a recursive backwards function (which should fail if we run all the reentrant tasks recursively in the same thread) and for testing priority of reentrant tasks.
~~Will add a test for priority of reentrant tasks in future pr.~~
Imported from OSS
Differential Revision: D16131955
fbshipit-source-id: 18301d45c1ec9fbeb566b1016dbaf7a84a09c7ac
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17991
changes:
- Breaks BC: Tensor::type() now returns DeprecatedTypeProperties& rather than Type&.
- Added DeprecatedTypeProperties, it serves as a temporary replacement for Type as the return value of Tensor::type(). This contributes to making Type just for dispatch purposes so that we can make it dtype agnostic.
- Tensor::dispatch_type() now returns Type& like Tensor::type() used to do.
- Changed callsites of Tensor::type() appropriately.
Reviewed By: ezyang
Differential Revision: D14443117
fbshipit-source-id: 239ccb7a09626279a71d1a37f8f82e7f57bf7d9e
Summary:
Allow the comparison function used in ReadyQueue to handle the empty FunctionTasks created by the reentrant autograd.
Fixes #11732
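A minimal sketch of a ready queue whose ordering tolerates empty tasks is shown below. The source only states that the comparison must handle empty FunctionTasks; the specific policy used here (empty tasks first, then higher sequence numbers first) is an assumption, and the `FunctionTask`, `priority_key`, and `ReadyQueue` definitions are hypothetical Python stand-ins for the C++ types.

```python
import heapq
import itertools


class FunctionTask:
    """Hypothetical task; sequence_nr is None for an 'empty' task."""

    def __init__(self, sequence_nr=None):
        self.sequence_nr = sequence_nr


def priority_key(task):
    # Never touch the function of an empty task: give it a fixed key instead.
    # Assumed policy: empty tasks run first, then later-created functions.
    if task.sequence_nr is None:
        return (0, 0)
    return (1, -task.sequence_nr)


class ReadyQueue:
    def __init__(self):
        self._heap = []
        # Tie-breaker so that tasks themselves are never compared.
        self._counter = itertools.count()

    def push(self, task):
        heapq.heappush(self._heap, (priority_key(task), next(self._counter), task))

    def pop(self):
        return heapq.heappop(self._heap)[2]


queue = ReadyQueue()
for nr in (3, None, 7):
    queue.push(FunctionTask(nr))
order = [queue.pop().sequence_nr for _ in range(3)]
```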
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15791
Differential Revision: D13598006
Pulled By: soumith
fbshipit-source-id: 0bfdf28a735fbfe44f0fdbaf8b74a6198e6a1984
Summary:
This PR adds the final set of clang-tidy checks we should add for our codebase: a last set of performance-related checks. Most fixes here are around changing `auto` to `const auto&` in a few places where unnecessary copies were made, and adding `reserve()` calls before loops doing repeated `push_back()`. Also a few cases of calling `std::string::find` with a single-character string literal instead of a single char, which uses a less efficient string search algorithm meant for searching larger substrings.

ezyang apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15198
Differential Revision: D13468797
Pulled By: goldsborough
fbshipit-source-id: 2bed1ea1c7c162b7f3e0e1026f17125e88c4d5b2
Summary:
This PR fixes around 250 places in the codebase where we were making unnecessary copies of objects (some large, some small).
ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15026
Differential Revision: D13458784
Pulled By: goldsborough
fbshipit-source-id: be5148b2ce09493588d70952e6f6d6ff5ec5199b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14248
This diff also introduces a horrifying hack to override CUDA's DeviceGuardImpl
with a HIPGuardImplMasqueradingAsCUDA, to accommodate PyTorch's current
behavior of pretending CUDA is HIP when you build with ROCm enabled.
Reviewed By: bddppq
Differential Revision: D13145293
fbshipit-source-id: ee0e207b6fd132f0d435512957424a002d588f02
Summary:
```
This diff changes the HIPification of ATen to be out-of-place.
We now have the following mappings:
- ATen/cuda => ATen/hip
- ATen/native/cuda => ATen/native/hip
- ATen/native/sparse/cuda => ATen/native/sparse/hip
- THC => THH
- THCUNN => THHUNN
The build system is adjusted to know about these new build paths,
and HIPify is taught how to adjust include paths and
THC_GENERIC_FILE appropriately. ATen_hip is now built as
the ATen_hip library, rather than reusing ATen_cuda.
However, despite these new filepaths, none of the identifiers in ATen
have actually changed. So, e.g., THHGeneral.h still defines functions
named THC_blahblah, and HIP still shows up as CUDA in PyTorch itself.
We'll tackle this in a subsequent PR; this diff is just to get the files
out-of-place.
Minor extra improvements:
- Don't edit tmp_install when hipifying
- HIP no longer builds native_cudnn_cpp; it was unnecessary
- Caffe2_HIP_INCLUDES is now Caffe2_HIP_INCLUDE, for consistency
with all the other variables.
- HIP build now properly respects ATEN_CUDA_FILES_GEN_LIB (it
did not previously.)
- You can now override file extension matching in pyHIPIFY
by explicitly specifying its full name in the matching list.
This is used so we can HIPify CMakeLists.txt in some situations.
A little bit of string and sealing wax:
- gen.py grows a --rocm flag so that it knows to generate CUDA
files which actually refer to the HIP headers (e.g., THH.h)
We'll get rid of this eventually and generate real HIP files,
but not for this PR.
- Management of HIP dependencies is now completely deleted
from the ATen CMakeLists.txt. The old code was dead (because
it was shoveled in ATen_CUDA_DEPENDENCY_LIBS and promptly
ignored by the Caffe2 build system) and didn't actually work.
```
Stacked on https://github.com/pytorch/pytorch/pull/14849 review last commit only
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14866
Differential Revision: D13419475
Pulled By: ezyang
fbshipit-source-id: cb4c843df69a1d8369314c9fab1b7719520fa3db
Summary:
Anywhere we used #include "foo.h", we now say #include <foo.h>
Paths are adjusted to be rooted out of aten/src, torch/lib, or
the root level directory.
I modified CMakeLists.txt by hand to remove TH and THC from
the include paths.
I used the following script to do the canonicalization:
```
import subprocess
import re
import os.path

files = subprocess.check_output(['git', 'ls-files']).decode('utf-8').rstrip().split('\n')
for fn in files:
    if not any(fn.endswith(suff) for suff in ['.cu', '.cpp', '.in', '.h', '.hpp', '.cu', '.cuh', '.cc']):
        continue
    if not any(fn.startswith(pref) for pref in ["aten/", "torch/"]):
        continue
    with open(fn, 'r') as f:
        c = f.read()
    def fmt(p):
        return "#include <{}>".format(p)
    def repl(m):
        p = m.group(1)
        if p in ["dlfcn.h", "unistd.h", "nvrtc.h", "cuda.h", "cuda_runtime.h", "cstdint", "cudnn.h", "Python.h", "cusparse.h", "cuda_runtime_api.h", "cuda_fp16.h", "cublas_v2.h", "stdint.h", "curand_kernel.h"]:
            return fmt(p)
        if any(p.startswith(pref) for pref in ["torch/csrc", "c10/", "ATen/", "caffe2/", "TH/", "THC/", "Eigen/", "gtest/", "zdl/", "gloo/", "onnx/", "miopen/"]):
            return fmt(p)
        for root in ["aten/src", "torch/lib", ""]:
            for bad_root in [os.path.dirname(fn), "aten/src/TH", "aten/src/THC", "torch/csrc"]:
                new_p = os.path.relpath(os.path.join(bad_root, p), root)
                if not new_p.startswith("../") and (os.path.exists(os.path.join(root, new_p)) or os.path.exists(os.path.join(root, new_p + ".in"))):
                    return fmt(new_p)
        print("ERROR: ", fn, p)
        return m.group(0)
    new_c = re.sub(r'#include "([^"]+)"', repl, c)
    if new_c != c:
        print(fn)
        with open(fn, 'w') as f:
            f.write(new_c)
```
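To see what the substitution in the script above does without touching a checkout, here is a reduced sketch that runs the same regular expression on an in-memory string. The `canonicalize` helper, the shortened prefix list, and the `lookup` table are hypothetical simplifications; the filesystem search from the real script is replaced by a dictionary.

```python
import re

# Shortened, illustrative subset of the prefixes that are already canonical.
KNOWN_PREFIXES = ("ATen/", "c10/", "torch/csrc")


def canonicalize(source, resolve):
    """Rewrite quoted includes to angle-bracket form where a path is known."""

    def repl(match):
        path = match.group(1)
        if path.startswith(KNOWN_PREFIXES):
            return "#include <{}>".format(path)
        resolved = resolve(path)
        if resolved is None:
            # Unknown header: leave the original include untouched.
            return match.group(0)
        return "#include <{}>".format(resolved)

    return re.sub(r'#include "([^"]+)"', repl, source)


sample = '#include "ATen/ATen.h"\n#include "THTensor.h"\n#include "mystery.h"\n'
lookup = {"THTensor.h": "TH/THTensor.h"}  # stands in for the os.path search
result = canonicalize(sample, lookup.get)
```

As in the original script, an include that cannot be resolved is left exactly as written rather than guessed at.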
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14849
Reviewed By: dzhulgakov
Differential Revision: D13363445
Pulled By: ezyang
fbshipit-source-id: 52361f878a672785f9306c9e9ab2513128092b68
Summary:
Previously symbolic AD formulas assumed that no broadcasting happened,
and would return gradients of incorrect shapes (possibly leading to
silent errors later).
Fixes a few bugs (known and unknown):
- #11736
- ArgumentSpec didn't compute the input types correctly [(it didn't advance the offset for non-tensor args)](https://github.com/pytorch/pytorch/pull/14485/files#diff-4fd3157a056596aefb8cdf41022a208bR153)
- Symbolic AD could suffer from use after free (dangling pointers in grad map), because [`EliminateDeadCode` could have removed nodes](https://github.com/pytorch/pytorch/pull/14485/files#diff-25d33ad1ed6855684dec79d927ca6142L781) that referenced gradients of certain values.
- Undefined behavior in `aten::size`
During my tests I've also found a few new problems, and I have opened issues for them:
- FusionGroup seems to think that cat nodes broadcast their inputs (#14483)
- `prim::ConstantChunk` derivative formula doesn't handle undefined inputs (#14484)
This patch unfortunately deoptimizes some of our code (Fusion doesn't happen past chunk nodes, and outputs more tensors only because we have to get their size). I know how to fix those issues, but wanted to fix this terrible bug quickly.
cc zou3519 zdevito ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14485
Reviewed By: eellison
Differential Revision: D13312888
Pulled By: suo
fbshipit-source-id: ad46bfb4d0a306ad9451002f8270f7a790f72d58
Summary:
Previously symbolic AD formulas assumed that no broadcasting happened,
and would return gradients of incorrect shapes (possibly leading to
silent errors later).
Fixes a few bugs (known and unknown):
- #11736
- ArgumentSpec didn't compute the input types correctly [(it didn't advance the offset for non-tensor args)](https://github.com/pytorch/pytorch/pull/14485/files#diff-4fd3157a056596aefb8cdf41022a208bR153)
- Symbolic AD could suffer from use after free (dangling pointers in grad map), because [`EliminateDeadCode` could have removed nodes](https://github.com/pytorch/pytorch/pull/14485/files#diff-25d33ad1ed6855684dec79d927ca6142L781) that referenced gradients of certain values.
- Undefined behavior in `aten::size`
During my tests I've also found a few new problems, and I have opened issues for them:
- FusionGroup seems to think that cat nodes broadcast their inputs (#14483)
- `prim::ConstantChunk` derivative formula doesn't handle undefined inputs (#14484)
This patch unfortunately deoptimizes some of our code (Fusion doesn't happen past chunk nodes, and outputs more tensors only because we have to get their size). I know how to fix those issues, but wanted to fix this terrible bug quickly.
cc zou3519 zdevito ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14485
Differential Revision: D13280899
Pulled By: soumith
fbshipit-source-id: 80cc5ec9331be80e1bb9ddfe85b81c2b997e0b0c
Summary:
Rebased version of https://github.com/pytorch/pytorch/pull/13337.
I don't think the lint errors in the original PR had to do with files I touched, so hopefully the rebase fixes them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14587
Differential Revision: D13277428
Pulled By: soumith
fbshipit-source-id: f04c186b1dd4889b4250597eef87f9e9bf7b2426
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13342
This PR introduces a few new concepts:
- DeviceGuardImplInterface, and implementations for CPU and CUDA, which
provide a generic interface for interfacing with device and stream state,
without requiring a direct dependency on the code in question.
- InlineDeviceGuard, a general template for generating both specialized
and dynamically dispatched device guard implementations. Dynamic
dispatch is done by specializing it on a VirtualGuardImpl.
- Provide a device-independent DeviceGuard class, which can be used even
from CPU code. It uses the aforementioned dynamic dispatch.
- CUDA-specialized CUDAGuard class, which doesn't have a dynamic dispatch
but can only be used from CUDA.
- StreamGuard, which is the same as above, but for streams rather than
devices.
- Optional variants of all the aforementioned guards, which are a no-op if
no device/stream is specified
- CUDAMultiStreamGuard, specifically for the case when we want to set
a device on every guard.
There are some subtle semantic changes, which have been thoroughly documented
in the class definition.
BC-breaking changes:
- Move constructor/assignment have been removed from all device guard
implementations.
- In some cases where you previously wrote 'set_device' (or 'set_stream'), you now must write
'reset_device', because if you switch devices/device types, the stream/device on the
previous device is unset. This is different from previous behavior.
- CUDAGuard no longer handles streams, or multiple streams. Use CUDAStreamGuard
or CUDAMultiStreamGuard as appropriate for your use case.
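The general guard idea (switch the device for a scope, restore the previous one on exit, and do nothing when no device is given) can be sketched with a Python context manager. This is an analogy only: the module-level `_current_device` variable and the `device_guard` function are hypothetical and do not model streams, device types, or the C++ class hierarchy described above.

```python
from contextlib import contextmanager

# Hypothetical process-wide "current device" used only for this sketch.
_current_device = 0


def current_device():
    return _current_device


def set_device(index):
    global _current_device
    _current_device = index


@contextmanager
def device_guard(index=None):
    """Switch to `index` for the duration of the block, then restore.

    With index=None the guard is a no-op, like the optional guard variants.
    """
    if index is None:
        yield
        return
    previous = current_device()
    set_device(index)
    try:
        yield
    finally:
        # Restore even if the body raised.
        set_device(previous)


with device_guard(2):
    inside = current_device()
after = current_device()
```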
Reviewed By: dzhulgakov
Differential Revision: D12849620
fbshipit-source-id: f61956256f0b12be754b3234fcc73c2abc1be04e
Summary:
Enables almost all `modernize-*` checks in clang-tidy. This warns against things such as:
- Use of `const std::string&` instead of new-style `std::string` + move,
- Using old-style loops instead of range-for loops,
- Use of raw `new`
- Use of `push_back` instead of `emplace_back`
- Use of `virtual` together with `override` (`override` is sufficient)
ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13196
Differential Revision: D12891837
Pulled By: goldsborough
fbshipit-source-id: 4d0f782a09eb391ee718d3d66f74c095ee121c09
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13232
DeviceGuard should be device agnostic, which means that it shouldn't
assume that int64_t means select the CUDA device.
Reviewed By: gchanan
Differential Revision: D10858024
fbshipit-source-id: b40e8337e4046906fd8f83a95e6206367fb29dbe
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12792
This is a follow up diff after D10238910.
Only non-codemod change is the removal of ATen/Error.h and ATen/core/Error.h. Other files are basically changing the inclusion path + clang format for inclusion order.
Reviewed By: bddppq
Differential Revision: D10437824
fbshipit-source-id: 7f885f80ab5827468d1351cfb2765d0e3f555a69
Summary:
Linting `torch/csrc/` (non-recursive) and `torch/csrc/autograd` (non-recursive).
Fixed things like:
- `typedef` vs `using`
- Use `.empty()` instead of comparing with empty string/using `.size() == 0`
- Use range for loops instead of old style loops (`modernize-`)
- Remove some `virtual` + `override`
- Replace `stdint.h` with `cstdint`
- Replace `return Type(x, y)` with `return {x, y}`
- Use boolean values (`true`/`false`) instead of numbers (1/0)
- More ...
ezyang apaszke cpuhrsch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11050
Differential Revision: D9597505
Pulled By: goldsborough
fbshipit-source-id: cb0fb4793ade885a8dbf4b10484487b84c64c7f2
Summary:
This PR extends the existing type and shape metadata tracing and verification done in autograd with device information. This expansion of tracing is required for #8354, is likely useful in other scenarios, and is a healthy sanity check, just like type and shape tracing.
The precise changes are:
- TypeAndShape -> InputMetadata, now includes device()
- Creating InputMetadata is simplified to just require a tensor, and callers were updated to use this simpler invocation wherever possible
- The gradient accumulator of a variable is now reset when set_data() is called if either the type or device changes, and this reset now locks to avoid contention with acquiring the gradient accumulator
- Mismatched devices during backward() will throw a runtime error, just like mismatched type and shape
- (Bonus!) Two uninitialized pointers in THCReduce are now initialized (to nullptr) to prevent build warnings
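The kind of check this enables can be sketched as follows. The `InputMetadata` dataclass and `validate_gradient` function below are hypothetical Python stand-ins (the real types live in the C++ autograd code and hold richer objects than strings); the sketch only shows device being verified alongside type and shape.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class InputMetadata:
    """Hypothetical record of what a function expects for one input."""

    dtype: str
    shape: Tuple[int, ...]
    device: str


def validate_gradient(expected, actual):
    """Raise a runtime error on the first mismatched field."""
    for name in ("dtype", "shape", "device"):
        want = getattr(expected, name)
        got = getattr(actual, name)
        if want != got:
            raise RuntimeError(
                "expected {} {!r} but got {!r}".format(name, want, got)
            )


expected = InputMetadata("float32", (2, 3), "cuda:0")
matching = InputMetadata("float32", (2, 3), "cuda:0")
wrong_device = InputMetadata("float32", (2, 3), "cpu")
```

`validate_gradient(expected, matching)` passes silently, while passing `wrong_device` raises, which is the behaviour described for mismatched devices during backward().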
fyi colesbury
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9796
Reviewed By: goldsborough
Differential Revision: D9119325
Pulled By: ezyang
fbshipit-source-id: 76d1861b8d4f74db0575ff1f3bd965e18f9463de
Summary:
More clang tidy cleanups in `torch/csrc`. This time:
1. `hicpp-use-equals-default` recommends `= default` instead of `{}` for constructors/destructors. This is better practice because it expresses the intent better (https://stackoverflow.com/questions/6502828/what-does-default-mean-after-a-class-function-declaration)
2. `readability-inconsistent-declaration-parameter-name` enforces that parameter names in the declaration match parameter names in the definition. This is just generally useful and can prevent confusion and bugs.
Also updated my script a little bit.
apaszke ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9737
Differential Revision: D9069069
Pulled By: goldsborough
fbshipit-source-id: f7b3f3a4eb4c9fadc30425a153566d3b613a41ae
Summary:
```
This adds TensorIterator, a helper class for computing element-wise
operations that's intended to replace the CPU and CUDA apply utils
functions.
CPU kernels are implemented as functions that operate on strided 1-d
tensors compared to CPUApplyUtils which operated on individual elements. This
allows the kernels to handle vectorization, while TensorIterator handles
parallelization and non-coalesced dimensions.
GPU kernels continue to operate on elements, but the number of
specializations is reduced. The contiguous case remains the same. The
non-contiguous case uses a single (reduced) shape for all operands and
the fast integer division from THCIntegerDivider. To avoid extra
specializations for indexing with 64-bits, large operations are split
into smaller operations that can be indexed with 32-bits.
Major semantic changes:
- No more s_add, s_mul, s_div, or s_sub. Broadcasting is handled by
TensorIterator. The autograd engine performs the reduction assuming
standard broadcasting if the gradient shape does not match the
expected shape. Functions that do not use standard broadcasting rules
should either continue to trace the expand calls or handle the
reduction in their derivative formula.
- Use ONNX v7, which supports broadcasting ops.
Performance impact:
- Small increased fixed overhead (~0.5 us)
- Larger overhead for wrapped numbers (~2.5 us)
- No significant change for ops on contiguous tensors
- Much faster worst-case performance for non-contiguous GPU tensors
- Faster CPU bias addition (~2x)
- Faster GPU bias addition (~30% faster)
Future work:
- Decrease overhead, especially for wrapping numbers in Tensors
- Handle general inter-type operations
- Extend to unary ops and reductions
- Use buffering for compute-bound operations on non-contiguous tensors
(pull in from CPUApplyUtils)
```
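The reduction mentioned under "Major semantic changes" follows from standard broadcasting: a gradient with the broadcast shape has to be summed over the dimensions that were added or expanded. The sketch below only computes which dimensions those are; `dims_to_reduce` is a hypothetical helper working on shape tuples, not a function from the codebase.

```python
def dims_to_reduce(grad_shape, expected_shape):
    """Return the dims of grad_shape to sum over to recover expected_shape.

    Assumes standard broadcasting: shapes are aligned on the right, and a
    size-1 dimension may have been expanded to any size.
    """
    lead = len(grad_shape) - len(expected_shape)
    if lead < 0:
        raise ValueError("gradient has fewer dimensions than expected")
    # Leading dimensions were added by broadcasting and are always reduced.
    dims = list(range(lead))
    for i, size in enumerate(expected_shape):
        grad_size = grad_shape[lead + i]
        if size == 1 and grad_size != 1:
            # This dimension was expanded from size 1.
            dims.append(lead + i)
        elif size != grad_size:
            raise ValueError("shapes are not broadcast-compatible")
    return dims


bias_dims = dims_to_reduce((4, 3, 5), (3, 1))
```

For a gradient of shape `(4, 3, 5)` and an input of shape `(3, 1)`, dimensions 0 and 2 are summed (keeping dimension 2 as size 1) to get back to the input's shape.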
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8919
Differential Revision: D8677600
Pulled By: colesbury
fbshipit-source-id: 61bc9cc2a36931dfd00eb7153501003fe0584afd
* Created TensorOptions
Storing the type in TensorOptions to solve the Variable problem
Created convenience creation functions for TensorOptions and added tests
Converted zeros to TensorOptions
Converted rand to TensorOptions
Fix codegen for TensorOptions and multiple arguments
Put TensorOptions convenience functions into torch namespace too
All factory functions except *_like support TensorOptions
Integrated with recent JIT changes
Support *_like functions
Fix in place modification
Some cleanups and fixes
Support sparse_coo_tensor
Fix bug in Type.cpp
Fix .empty calls in C++ API
Fix bug in Type.cpp
Trying to fix device placement
Make AutoGPU CPU compatible
Remove some auto_gpu.h uses
Fixing some headers
Fix some remaining CUDA/AutoGPU issues
Fix some AutoGPU uses
Fixes to dispatch_tensor_conversion
Reset version of new variables to zero
Implemented parsing device strings
Random fixes to tests
Self review cleanups
flake8
Undo changes to variable.{h,cpp} because they fail on gcc7.2
Add [cuda] tag to tensor_options_cuda.cpp
Move AutoGPU::set_index_from into .cpp file because Windows is stupid and sucks
Fix linker error in AutoGPU.cpp
Fix bad merge conflict in native_functions.yaml
Fixed caffe2/contrib/aten
Fix new window functions added to TensorFactories.cpp
* Removed torch::TensorOptions
Added code to generate wrapper functions for factory methods
Add implicit constructor from Backend to TensorOptions
Remove Var() from C++ API and use torch:: functions
Use torch:: functions more subtly in C++ API
Make AutoGPU::set_device more exception safe
Check status directly in DynamicCUDAHooksInterface
Rename AutoGPU to DeviceGuard
Removed set_requires_grad from python_variables.h and warn appropriately in Variable::set_requires_grad
remove python_default_init: self.type()
Add back original factory functions, but with deprecation warnings
Disable DeviceGuard for a couple functions in ATen
Remove print statement
Fix DeviceGuard construction from undefined tensor
Fixing CUDA device compiler issues
Moved as many methods as possible into header files
Don't generate python functions for deprecated factories
Remove merge conflict artefact
Fix tensor_options_cuda.cpp
Fix set_requires_grad not being checked
Fix tensor_new.h
TEMPORARILY put some methods in .cpp files to see if it solves issues on windows and mac
Fix bug in DeviceGuard.h
Missing includes
TEMPORARILY moving a few more methods into .cpp to see if it fixes windows
Fixing linker errors
* Fix up SummaryOps to use new factories
Undo device agnostic behavior of DeviceGuard
Use -1 instead of optional for default device index
Also move DeviceGuard methods into header
Fixes around device index after optional -> int32_t switch
Fix use of DeviceGuard in new_with_tensor_copy
Fix tensor_options.cpp
* Fix Type::copy(
* Remove test_non_float_params from ONNX tests
* Set requires_grad=False in ONNX tests that use ints
* Put layout/dtype/device on Tensor
* Post merge fixes
* Change behavior of DeviceGuard to match AutoGPU
* Fix C++ API integration tests
* Fix flip functions
* Factor python dependency out of interpreter
* Remove NO_PYTHON for the autograd engine
If there are no python bindings, then a default Engine is constructed
the first time it is requested.
If the python libraries are loaded, then they override the default
accessor and the default engine becomes a python Engine.
Note: it is possible for two engines to be generated if a non-python
one gets created before the python bindings are loaded. This case
is rare, and just results in additional threads being spawned.
* Fixing AlexNet test which is skipped in CI
* Add backward() to Tensor and Variable
* Add at:: in front of Tensor
* Trying to not move optional to appease windows?
* Move implementation into cpp file
* Undo some formatting changes