Commit Graph

24 Commits

Author SHA1 Message Date
Pearu Peterson
77f1ca0065 Replace SparseCsrTensorImpl crow/col_indices with compressed/plain_indices
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77504

Approved by: https://github.com/cpuhrsch
2022-05-16 18:51:32 +00:00
Michael Carilli
7f2592195d Adds stream recording for cross-stream uses of gradients in streaming backward (#60230)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/33909.

I _think_ the two recordDataPtrOnStream calls I added are necessary and sufficient. They're the ones that worked for dmitrivainbrand's intricate multistream pipelining in https://github.com/pytorch/pytorch/issues/33909 and I can more or less convince myself they're enough, but it's hard to be sure (and hard to test).

PRing without a test now for visibility. I'll try to come up with something.

input_buffer.cpp needs to compile in cuda or cpu-only builds, so I can't call `c10::cuda::CUDACachingAllocator::recordStream` directly. I planned to work around by adding a binding in VirtualGuardImpl but https://github.com/pytorch/pytorch/pull/57047 spared me the trouble, thanks lw .

Recording a usage stream on a generic tensor was uglier than I expected, see https://github.com/pytorch/pytorch/issues/60306. Up to you guys if adding a unified way to record streams on a tensor backed by any TensorImpl should block this PR (and if so, whether it should happen in a separate PR or as part of this PR).
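
To make the scenario concrete, here is a hedged sketch (not from this PR) of a cross-stream use of gradients in a streaming backward: backward runs on a side stream, so a consumer on the default stream must synchronize before touching `.grad`. The module and sizes are illustrative.

```
import torch

# Hedged illustration only: backward populates gradients on side stream s,
# so the default stream must wait on s before it is safe to read .grad.
model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(64, 1024, device="cuda")
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())       # s waits for the setup work above
with torch.cuda.stream(s):
    loss = model(x).sum()
    loss.backward()                              # grads are produced on stream s
torch.cuda.current_stream().wait_stream(s)       # sync before consuming the grads
grad_norm = model.weight.grad.norm()
```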

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60230

Reviewed By: mrshenli

Differential Revision: D29289392

Pulled By: albanD

fbshipit-source-id: 1339d382b7d238a461b082597b3962847b5201fe
2021-06-22 12:16:07 -07:00
albanD
e6110d4d5d Fix input_buffer check if inplace update is valid (#59817)
Summary:
Fixes an issue introduced in  https://github.com/pytorch/pytorch/issues/17182

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59817

Reviewed By: bdhirsh

Differential Revision: D29040738

Pulled By: albanD

fbshipit-source-id: 67fd4e9fa0dadf507ddd954d20e119d8781c4de0
2021-06-11 07:29:03 -07:00
Wanchao Liang
618104185b [autograd] enable graph level thread parallelism on CPU (#33157)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33157

This PR enables graph-level thread parallelism on CPU for the Autograd
Engine. It replaces https://github.com/pytorch/pytorch/pull/29574 because of
the drawbacks of task-level parallelism with the existing autograd system.

Fixes https://github.com/pytorch/pytorch/issues/18333

The design of graph-level parallelism on CPU:

1. Remove the single CPU thread that was initialized in the Engine itself and
   allow the owning thread (the one that calls Engine::execute) to drive the
   Engine execution, so that outer threading can provide thread parallelism.
2. Maintain a separate ReadyQueue per CPU thread, and stash the ReadyQueue for
   different devices/threads into a thread-local shared_ptr; the Engine itself
   memorizes the shared_ptr of the ReadyQueue for each device (other than CPU).
3. The CPU thread-local ReadyQueue is initialized per CPU-thread
   Engine::execute call (i.e. per `backward()` or `grad()` call), and its
   shared_ptr is memorized in the GraphTask, since every `backward()` call has
   its own GraphTask.
4. Cross-device NodeTask pushes are accomplished by 2 and 3: we can refer to a
   device's ReadyQueue from the Engine and to the CPU's ReadyQueue from the
   GraphTask, which means we can push to a different ReadyQueue according to
   the device.
5. Termination of the CPU thread: once we mark the graph_task as completed, we
   exit the while loop and terminate the current backward execution, because
   it is guaranteed that all other NodeTasks are finished before we mark a
   GraphTask as complete.
6. The re-entrant thread logic stays the same; reentrant thread detection is
   similar to before: we set worker_device to NO_DEVICE initially and set it
   to CPU afterward to detect whether this is a reentrant call or not.
7. We still have the reentrant thread pool that creates new threads for deep
   reentrant cases, and it reuses the ReadyQueue of the parent thread for
   performance.

Since we introduce thread parallelism on CPU, we have to ensure the thread
safety of the GraphTask. This is not a problem if we execute each forward in a
different thread, since we then build separate GraphTasks in different threads
and each GraphTask is a separate instance that shares nothing; i.e. Hogwild
training on CPU should be fine in this case.
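
As a hedged illustration of this "shares nothing" case (names and sizes are illustrative, not from the PR), each thread below builds and backpropagates its own independent graph:

```
import threading
import torch

# Each thread builds its own graph (its own GraphTask) and runs backward
# independently on CPU.
def worker():
    x = torch.randn(32, 10, requires_grad=True)
    w = torch.randn(10, 1, requires_grad=True)
    loss = (x @ w).pow(2).sum()
    loss.backward()          # each call drives the engine on the calling thread

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
```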

But there might be cases where the user would like to do part of the work in a
single thread and do the rest of the work in several threads concurrently, so
thread safety is crucial in those cases. The thread safety strategy for
multithreaded autograd is as follows:

1. Add a mutex to protect thread safety in the Autograd Node/Function, and
   hold the lock for the different data-racing cases.
2. Lock the mutex during Node::apply(); this ensures that Nodes writing to
   shared variables are not racing across threads (i.e. AccumulateGrad and
   custom C++ Autograd Nodes that write to shared variables).
3. Lock the mutex during Node::release_variables(); this serves the purpose
   that when we release saved_variables from one thread, no other thread can
   call Node::apply(), which ensures that variable references from other
   threads aren't dangling.
4. If the Node doesn't release any variables and has no shared data
   reads/writes, i.e. it is purely functional, we don't lock the mutex.

This way we protect the thread safety of the Autograd Node, but we still
cannot protect the thread safety of Node pre/post C++ hooks (Python hooks are
automatically thread safe); we rely on the user to write thread-safe C++ hooks
if they want the hooks to be applied correctly in a multithreaded environment.

**User visible changes**:
There are not many user-visible changes: since we use the owning thread to
drive the autograd execution, users can write their own threading code and do
not block on the Autograd engine. Some behaviors users should be aware of:

**Non-determinism**:
This applies when calling backward() on multiple threads concurrently but with
shared inputs (i.e. Hogwild CPU training). Since parameters are automatically shared across threads, gradient accumulation might become non-deterministic across the backward calls, because two backward calls might access and try to accumulate the same .grad attribute. This is technically unsafe; it might result in a race condition and the result might be invalid to use.

But this is an expected pattern if users drive the whole training process with
multithreading while using shared parameters; users who use multithreading
should have the threading model in mind and should expect this to happen.
Users can use the functional interface `torch.autograd.grad()` to calculate
the gradients instead of calling `backward()` on the loss.
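
A hedged sketch of this Hogwild-style pattern (module and sizes are illustrative, not from the PR):

```
import threading
import torch

# Several threads call backward() concurrently on graphs that share the same
# parameters, so accumulation into the shared .grad is not deterministic.
model = torch.nn.Linear(10, 1)

def worker(seed):
    torch.manual_seed(seed)
    x = torch.randn(32, 10)
    loss = model(x).pow(2).sum()
    loss.backward()            # accumulates into the shared model.weight.grad
    # alternatively, keep gradients thread-local with the functional interface:
    # grads = torch.autograd.grad(loss, model.parameters())

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
```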

**Graph retaining**:
If part of the autograd graph is shared between threads, e.g. the first part
of the forward runs in a single thread and the second part runs in multiple
threads, then the first part of the graph is shared. In this case different
threads executing grad() or backward() on the same graph might destroy the
graph on the fly in one thread, causing the other thread to crash. We will
error out to the user, similar to calling `backward()` twice without
`retain_graph=True`, and let the user know they should use `retain_graph=True`.
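
A hedged sketch of the shared-graph case (names are illustrative, not from the PR):

```
import threading
import torch

# The first part of the forward runs once; two threads then call backward()
# through that shared part. Without retain_graph=True one thread frees the
# shared buffers and the other errors out, as described above.
x = torch.randn(5, 5, requires_grad=True)
shared = x * x                      # first part of the graph, built on the main thread

def worker():
    y = shared.sum()                # second part, built per thread
    y.backward(retain_graph=True)   # retain_graph keeps the shared part alive

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
```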

**TODOs**:

[ ] benchmark the PR with example models and datasets to demonstrate
the performance gain in CPU training
[ ] ensure that we don't regress the single thread autograd performance

**Follow ups**:

[ ] a correct and tight integration with distributed autograd
[ ] try to unify the thread pool between JIT and Autograd, and see if
   there's a unifying pattern that we could apply universally

Test Plan: Imported from OSS

Differential Revision: D20236771

Pulled By: wanchaol

fbshipit-source-id: 1e0bd4eec14ffebeffdb60b763b8d6f0e427eb64
2020-03-26 17:17:52 -07:00
Michael Carilli
4bdfc71421 Fix race condition for to() backward that spans devices (#31930)
Summary:
While putting finishing touches on the gradient scaling PR (https://github.com/pytorch/pytorch/pull/26512), I discovered my multi-GPU test (which uses `to()` to transfer tensors between devices) was intermittently failing with bad numerics.  I knew it was going to be [a weird case from the start](https://www.imdb.com/title/tt8946378/quotes/qt4868203) and spent a week descending into madness.  It turns out, for backward ops that create gradients on a different device from the device on whose stream the op is executed, the streaming backward synchronizations in [input_buffer.cpp](https://github.com/pytorch/pytorch/blob/master/torch/csrc/autograd/input_buffer.cpp#L46-L83) do not properly tell later ops to wait on the population/creation of those gradients.  For example, a cross-device `to()` backward (CopyBackward Node) enqueues a cudaMemcpyAsync on the current stream of the source (incoming gradient's) device, then [syncs getCurrentCUDAStream on the destination device with the cudaMemcpyAsync](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cuda/Copy.cu#L76).  However, `input_buffer.cpp` in such cases ([case (3)](https://github.com/pytorch/pytorch/blob/master/torch/csrc/autograd/input_buffer.cpp#L77-L81)) was not properly telling `opt_consumer_stream` to wait on the current stream of the destination device (`var`'s device).

Circumstances needed to repro in current master (see [my test](https://github.com/pytorch/pytorch/compare/master...mcarilli:backward_to_race_fix#diff-e68a7bc6ba14f212e5e7eb3727394b40R1901)):
- 2 devices, with non-default streams used for forward-pass ops on both devices (which is the default behavior in test_cuda.py)
- A `to()` that transfers a tensor requiring grad from one device to another
- A backward pass that routes back through to()'s backward (aka CopyBackward).

Under these circumstances, backward ops following CopyBackward on CopyBackward's destination device (aka the original forward-pass source device) race with the device-to-device transfer, and execute using partially-transferred data.
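
For concreteness, a minimal hedged sketch of that repro shape (assumes at least two CUDA devices; the actual test additionally runs the forward ops on non-default streams):

```
import torch

a = torch.randn(8, 8, device="cuda:0", requires_grad=True)
b = a.mm(a)                # forward work on device 0
c = b.to("cuda:1")         # cross-device to(); its backward node is CopyBackward
loss = (c * c).sum()       # forward work on device 1
loss.backward()            # backward routes through CopyBackward back to device 0;
                           # before this fix, later backward ops on device 0 could
                           # race with the device-to-device gradient copy
```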

The present PR fixes the race condition and ensures that later ops wait on the CopyBackward transfer.  This PR should also make streaming backward safe for other backward ops that span devices, as long as they play nice and populate any new gradients they create using the "current stream" of the device(s) on which they create those gradients.

There are a couple minor issues where I'm not sure of the best approach:
- Should we guard onto the var's device for the entire body of InputBuffer::add?
- I'm fairly sure we need to `recordStream` on `var` if the consumer stream is different from the stream on which (we expect) `var` was created, but calling `c10::cuda::CUDACachingAllocator::recordStream` in input_buffer.cpp might break CPU-only builds.  I couldn't find a different API call to record streams that seemed CPU-build-agnostic.  Could I wrap the call with a macro?

Thanks to mruberry for helpful suggestions and also the organization/naming of the stream pool and streaming backward code that allowed me to (just barely) wrap my head around the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31930

Differential Revision: D19517617

Pulled By: mruberry

fbshipit-source-id: 183d5460aefa5d27366b465b0473b80ec80fa044
2020-01-22 16:32:24 -08:00
Mike Ruberry
87a2c92615 Updates autograd engine to respect streams set in forward (#8354)
Summary:
This PR addresses issue https://github.com/pytorch/pytorch/issues/7601.

Currently models that use streams explicitly in forward have to do a lot of extra work to make backwards respect those streams. This PR extends the (recently added) input tracing (see TypeAndShape) to record the devices and streams of inputs. The autograd engine then uses this metadata to enact the expected stream parallelism without extra work from the user.

For example, a model with forward declared like (original example courtesy of ngimel):

```
def forward(self, x):
    x0 = x.clone()
    torch._C._cuda_setStream(self.stream1._cdata)
    y0 = self.fc1(x0)
    self.event1.record(stream=torch.cuda.current_stream())

    torch._C._cuda_setStream(self.stream2._cdata)
    y1 = self.fc2(x)
    self.event2.record(stream=torch.cuda.current_stream())
    self.stream2.wait_event(self.event1)
    return y0 + y1
```

currently will run its backward on a single stream. With this change the kernels go on the streams they were assigned in forward, and both forward and backward will (for appropriate sizes) run the fc1 and fc2 kernels simultaneously.

The crux of this change is, as mentioned, an expansion of the TypeAndShape tracing and a relatively simple change to the autograd engine to use cuda events for stream synchronization. To make this efficient I also added a new AutoGPUAndStream class, exposed getting and setting streams on devices, and removed InputBuffer's AutoGPU (it's now redundant). While making these modifications I also fixed AutoGPU to check before setting the GPU when it's destroyed and to use THCudaCheck instead of its custom error handler. These changes mean that an often excessive cudaSetDevice() is not being called when inputs are added to a buffer.

In addition to allowing users to easily set and use streams that are respected in both forward and backward, this change may encourage modules to do the same and the expanded tracing might allow further optimizations in the autograd engine. (apaszke, for example, now after initial enumeration we know the number of devices that will be used by a graph task, which might help provide a sense of the "level of parallelism" we should expect.)
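
For reference, a hedged sketch of the same two-stream pattern written against the public stream API (`torch.cuda.stream` / `wait_stream`); the module and layer names merely mirror the example above and are not part of this PR:

```
import torch

class TwoStreamNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(1024, 1024).cuda()
        self.fc2 = torch.nn.Linear(1024, 1024).cuda()
        self.stream1 = torch.cuda.Stream()
        self.stream2 = torch.cuda.Stream()

    def forward(self, x):
        x0 = x.clone()
        # side streams must wait for work already queued on the current stream
        self.stream1.wait_stream(torch.cuda.current_stream())
        self.stream2.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(self.stream1):
            y0 = self.fc1(x0)
        with torch.cuda.stream(self.stream2):
            y1 = self.fc2(x)
        # the current stream must wait on both side streams before the add
        torch.cuda.current_stream().wait_stream(self.stream1)
        torch.cuda.current_stream().wait_stream(self.stream2)
        return y0 + y1
```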
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8354

Test Plan: Two tests were added specifically for this behavior.

Differential Revision: D17275980

Pulled By: mruberry

fbshipit-source-id: 92bd50ac782ffa973b159fcbbadb7a083802e45d
2019-09-10 23:46:51 -07:00
Mikhail Zolotukhin
6ca38d9840 Cleanup includes in torch/csrc/autograd/* (#19923)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19923
ghimport-source-id: 54debdd21ca0f4230b1915905673de274807a2e5

Differential Revision: D15125016

Pulled By: ZolotukhinM

fbshipit-source-id: 8d54f436e4508067089a1d05ce192093220aa1bb
2019-05-06 13:48:42 -07:00
Davide Libenzi
272a48f6fe Enable autograd to recognize the XLA backend as one providing multiple devices (#17847)
Summary:
Enable autograd to recognize the XLA backend as one providing multiple devices, while not being CUDA/HIP.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17847

Differential Revision: D14545634

Pulled By: ezyang

fbshipit-source-id: 417181bf2ff4f8978544afe2fb6b042e787854ed
2019-03-20 13:58:36 -07:00
Natalia Gimelshein
b4572668b4 Add sparse gradient option to gather operation (#17182)
Summary:
This PR allows `gather` to optionally return sparse gradients, as requested in #16329. It also allows the autograd engine to accumulate sparse gradients in place when it is safe to do so.
I've commented out the size.size() check in `SparseTensor.cpp` that also caused #17152; it does not seem to me that the check serves a useful purpose, but please correct me if I'm wrong and a better fix is required.
Motivating example:
For this commonly used label smoothing loss function
```
def label_smoothing_opt(x, target):
    padding_idx = 0
    smoothing = 0.1
    logprobs = torch.nn.functional.log_softmax(x, dim=-1, dtype=torch.float32)
    pad_mask = (target == padding_idx)
    ll_loss = logprobs.gather(dim=-1, index=target.unsqueeze(1), sparse=True).squeeze(1)
    smooth_loss = logprobs.mean(dim=-1)
    loss = (smoothing - 1.0) * ll_loss - smoothing * smooth_loss
    loss.masked_fill_(pad_mask, 0)
    return loss.sum()
```
backward goes from 12.6 ms with dense gather gradients to 7.3 ms with sparse gradients, for 9K tokens x 30K vocab, which is a single-digit-percent end-to-end improvement, and also an improvement in peak memory required.
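
For reference, a small hedged usage sketch (not from the PR text); note that in current PyTorch the keyword argument is spelled `sparse_grad`, and the sizes below are arbitrary:

```
import torch

logits = torch.randn(8, 100, requires_grad=True)
target = torch.randint(0, 100, (8,))
picked = logits.gather(dim=1, index=target.unsqueeze(1), sparse_grad=True)
picked.sum().backward()
print(logits.grad.is_sparse)   # True: the gather backward produces a sparse gradient
```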
Shout-out to core devs: adding python-exposed functions with keyword arguments through native_functions.yaml is very easy now!

cc gchanan apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17182

Differential Revision: D14158431

Pulled By: gchanan

fbshipit-source-id: c8b654611534198025daaf7a634482b3151fbade
2019-02-27 11:42:48 -08:00
Edward Yang
517c7c9861 Canonicalize all includes in PyTorch. (#14849)
Summary:
Anywhere we used #include "foo.h", we now say #include <foo.h>.
Paths are adjusted to be rooted out of aten/src, torch/lib, or
the root level directory.

I modified CMakeLists.txt by hand to remove TH and THC from
the include paths.

I used the following script to do the canonicalization:

```
  import subprocess
  import re
  import os.path

  files = subprocess.check_output(['git', 'ls-files']).decode('utf-8').rstrip().split('\n')
  for fn in files:
      if not any(fn.endswith(suff) for suff in ['.cu', '.cpp', '.in', '.h', '.hpp', '.cu', '.cuh', '.cc']):
          continue
      if not any(fn.startswith(pref) for pref in ["aten/", "torch/"]):
          continue
      with open(fn, 'r') as f:
          c = f.read()
      def fmt(p):
          return "#include <{}>".format(p)
      def repl(m):
          p = m.group(1)
          if p in ["dlfcn.h", "unistd.h", "nvrtc.h", "cuda.h", "cuda_runtime.h", "cstdint", "cudnn.h", "Python.h", "cusparse.h", "cuda_runtime_api.h", "cuda_fp16.h", "cublas_v2.h", "stdint.h", "curand_kernel.h"]:
              return fmt(p)
          if any(p.startswith(pref) for pref in ["torch/csrc", "c10/", "ATen/", "caffe2/", "TH/", "THC/", "Eigen/", "gtest/", "zdl/", "gloo/", "onnx/", "miopen/"]):
              return fmt(p)
          for root in ["aten/src", "torch/lib", ""]:
              for bad_root in [os.path.dirname(fn), "aten/src/TH", "aten/src/THC", "torch/csrc"]:
                  new_p = os.path.relpath(os.path.join(bad_root, p), root)
                  if not new_p.startswith("../") and (os.path.exists(os.path.join(root, new_p)) or os.path.exists(os.path.join(root, new_p + ".in"))):
                      return fmt(new_p)
          print("ERROR: ", fn, p)
          return m.group(0)
      new_c = re.sub(r'#include "([^"]+)"', repl, c)
      if new_c != c:
          print(fn)
          with open(fn, 'w') as f:
              f.write(new_c)
```

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14849

Reviewed By: dzhulgakov

Differential Revision: D13363445

Pulled By: ezyang

fbshipit-source-id: 52361f878a672785f9306c9e9ab2513128092b68
2018-12-08 19:38:30 -08:00
Edward Yang
e35418b3be New implementations of DeviceGuard, StreamGuard and MultiStreamGuard (with CUDA specializations) (#13342)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13342

This PR introduces a few new concepts:

- DeviceGuardImplInterface, and implementations for CPU and CUDA, which
  provide a generic interface for interfacing with device and stream state,
  without requiring a direct dependency on the code in question.
- InlineDeviceGuard, a general template for generating both specialized
  and dynamically dispatched device guard implementations.  Dynamic
  dispatch is done by specializing it on a VirtualGuardImpl.
- Provide a device-independent DeviceGuard class, which can be used even
  from CPU code. It uses the aforementioned dynamic dispatch.
- CUDA-specialized CUDAGuard class, which doesn't have a dynamic dispatch
  but can only be used from CUDA.
- StreamGuard, which is the same as above, but for streams rather than
  devices.
- Optional variants of all the aforementioned guards, which are a no-op if
  no device/stream is specified
- CUDAMultiStreamGuard, specifically for the case when we want to set
  a stream on every device.

There are some subtle semantic changes, which have been thoroughly documented
in the class definition.
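
As an aside, a hedged Python-level analogue of the guard idea (`torch.cuda.device` / `torch.cuda.stream` scope device and stream state and restore it on exit); the C++ guard classes above are not exposed one-to-one in Python:

```
import torch

assert torch.cuda.is_available()
side_stream = torch.cuda.Stream()
with torch.cuda.device(0):                 # device guard: current device is 0 inside
    with torch.cuda.stream(side_stream):   # stream guard: current stream is side_stream
        y = torch.randn(4, device="cuda")
# on exit, both the current device and the current stream are restored
```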

BC-breaking changes:

- Move constructor/assignment have been removed from all device guard
  implementations.
- In some cases where you previously wrote 'set_device' (or 'set_stream'), you now must write
  'reset_device', because if you switch devices/device types, the stream/device on the
  previous device is unset.  This is different from previous behavior.
- CUDAGuard no longer handles streams, or multiple streams.  Use CUDAStreamGuard
  or CUDAMultiStreamGuard as appropriate for your use case.

Reviewed By: dzhulgakov

Differential Revision: D12849620

fbshipit-source-id: f61956256f0b12be754b3234fcc73c2abc1be04e
2018-11-11 12:11:10 -08:00
Richard Zou
e60a7c2c88 codemod tensor.type().is_cuda(), tensor.type().is_sparse() (#13590)
Summary:
Followup to #12841

Changed these to not require type dispatch:
tensor.type().is_cuda() -> tensor.is_cuda()
tensor.type().is_sparse() -> tensor.is_sparse()
isVariable(tensor.type()) -> tensor.is_variable()

This probably does not affect performance
very much in most cases but it is nice to have.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13590

Reviewed By: ezyang

Differential Revision: D12929301

Pulled By: zou3519

fbshipit-source-id: 8ac5c6200c579dd7a44fb4ee58fc9bb170feb1d7
2018-11-07 07:27:42 -08:00
Peter Goldsborough
f62bc01dfe Remove TORCH_ASSERT (#9575)
Summary:
I got some tensor->variable conversion exceptions from `torch/csrc/autograd/variable.h`, which used the `TORCH_ASSERTM` macros instead of `AT_CHECK`, so they didn't have backtraces. This was such a substantial loss for debuggability that I decided to update the whole codebase to use the backtrace-enabled ATen macros instead of `TORCH_ASSERT` and `JIT_ASSERT`, the latter having been an alias of the former.

ezyang apaszke zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9575

Differential Revision: D8924566

Pulled By: goldsborough

fbshipit-source-id: 7a4013b13eec9dbf024cef94cf49fca72f61d441
2018-07-24 18:10:06 -07:00
Peter Goldsborough
372d1d6735
Create ATen tensors via TensorOptions (#7869)
* Created TensorOptions

Storing the type in TensorOptions to solve the Variable problem

Created convenience creation functions for TensorOptions and added tests

Converted zeros to TensorOptions

Converted rand to TensorOptions

Fix codegen for TensorOptions and multiple arguments

Put TensorOptions convenience functions into torch namespace too

All factory functions except *_like support TensorOptions

Integrated with recent JIT changes

Support *_like functions

Fix in place modification

Some cleanups and fixes

Support sparse_coo_tensor

Fix bug in Type.cpp

Fix .empty calls in C++ API

Fix bug in Type.cpp

Trying to fix device placement

Make AutoGPU CPU compatible

Remove some auto_gpu.h uses

Fixing some headers

Fix some remaining CUDA/AutoGPU issues

Fix some AutoGPU uses

Fixes to dispatch_tensor_conversion

Reset version of new variables to zero

Implemented parsing device strings

Random fixes to tests

Self review cleanups

flake8

Undo changes to variable.{h,cpp} because they fail on gcc7.2

Add [cuda] tag to tensor_options_cuda.cpp

Move AutoGPU::set_index_from into .cpp file because Windows is stupid and sucks

Fix linker error in AutoGPU.cpp

Fix bad merge conflict in native_functions.yaml

Fixed caffe2/contrib/aten

Fix new window functions added to TensorFactories.cpp

* Removed torch::TensorOptions

Added code to generate wrapper functions for factory methods

Add implicit constructor from Backend to TensorOptions

Remove Var() from C++ API and use torch:: functions

Use torch:: functions more subtly in C++ API

Make AutoGPU::set_device more exception safe

Check status directly in DynamicCUDAHooksInterface

Rename AutoGPU to DeviceGuard

Removed set_requires_grad from python_variables.h and warn appropriately in Variable::set_requires_grad

remove python_default_init: self.type()

Add back original factory functions, but with deprecation warnings

Disable DeviceGuard for a couple functions in ATen

Remove print statement

Fix DeviceGuard construction from undefined tensor

Fixing CUDA device compiler issues

Moved as many methods as possible into header files

Dont generate python functions for deprecated factories

Remove merge conflict artefact

Fix tensor_options_cuda.cpp

Fix set_requires_grad not being checked

Fix tensor_new.h

TEMPORARILY put some methods in .cpp files to see if it solves issues on windows and mac

Fix bug in DeviceGuard.h

Missing includes

TEMPORARILY moving a few more methods into .cpp to see if it fixes windows

Fixing linker errors

* Fix up SummaryOps to use new factories

Undo device agnostic behavior of DeviceGuard

Use -1 instead of optional for default device index

Also move DeviceGuard methods into header

Fixes around device index after optional -> int32_t switch

Fix use of DeviceGuard in new_with_tensor_copy

Fix tensor_options.cpp

* Fix Type::copy()

* Remove test_non_float_params from ONNX tests

* Set requires_grad=False in ONNX tests that use ints

* Put layout/dtype/device on Tensor

* Post merge fixes

* Change behavior of DeviceGuard to match AutoGPU

* Fix C++ API integration tests

* Fix flip functions
2018-06-16 00:40:35 -07:00
gchanan
516f067641 InputBuffers should AutoGPU for accumulation. (#6826) 2018-04-20 20:15:51 -04:00
Peter Goldsborough
e4f1d3b538
Better warnings (#6428)
* Better warnings

* Remove -Wc++14-extensions because gcc does not know it

* Warning fix in input_buffer.cpp

* Remove pedantic for torch/csrc/

* Also use Wextra and Wall for ATen

* Use check_env_flag

* Undo changes in shape_analysis.cpp

* Remove C linkage flag
2018-04-10 23:34:25 -07:00
Adam Paszke
16b7f3a35d Clean up InputBuffer 2017-12-14 15:14:35 +01:00
Gregory Chanan
9a2b54e08b [ATen] Rename isCuda -> is_cuda. 2017-11-15 18:33:07 -08:00
Sam Gross
42de0df411
Add assertion that 'pos' is in-bounds (#3466) 2017-11-03 12:54:55 -04:00
Sam Gross
1290e586fb Use at::Tensor based autograd Variable (#2676)
Variable is now a subclass of at::Tensor backed by a VariableImpl* pImpl. The implementation of the ATen functions is defined in the auto-generated VariableType.h/cpp file.

Currently, only functions which fall through to the base type, such as sizes() and isCuda(), are implemented. Differentiable ops like add() and mul() will be added in a subsequent PR.
2017-09-12 11:36:01 -04:00
Adam Paszke
ba2b2bcdc1 Change calling convention for C++ autograd functions 2017-09-05 17:48:55 -04:00
Trevor Killeen
c304d04fc6 Replace thpp::Tensor with ATen Tensor in autograd csrc (#2170) 2017-07-28 10:18:37 -04:00
albanD
626840aef3 C function wrapper uniqueness (#1912)
* add SharedFunctionMaker to create Function shared in the graph

* Clean shared_ptr usage for only function that will be used in the graph

* make Function binding match Variable one

* remove unnecessary changes

* fix comments

* proper weakref implementation

* add call to clear in dealloc
2017-07-25 13:12:54 +05:30
Adam Paszke
2ca787fcf4 Refactor attribute names in autograd 2017-05-01 16:44:56 -04:00