107 Commits

Author SHA1 Message Date
Jeremy Lilley
1b2d2ba504 [PyTorch] Fix write-after-free (TSAN) in GraphTask::set_error() (#33156)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33156

When dist_autograd_spawn_thrift's 'test_backward_node_failure_python_udf' test
ran, it hit a TSAN error caused by holding the mutex while the underlying data
structure was being deallocated.

In this change, we simply get a shared_ptr<> reference to the future, and
set_exception() without having the lock held, to avoid deallocing underneath
the lock.
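
The locking pattern described above can be sketched roughly as follows. This is a Python analogy, not the engine's C++ code: `GraphTaskSketch`, its fields, and the `has_error` guard are hypothetical names chosen for illustration. The point is only the ordering: copy a reference to the future while holding the lock, release the lock, then complete the future.

```python
import threading
from concurrent.futures import Future


class GraphTaskSketch:
    """Hypothetical stand-in for a graph task that owns a lock and a future."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._future = Future()
        self.has_error = False

    def set_error(self, exc):
        # Hold the lock only long enough to update shared state and take
        # our own reference to the future.
        with self._mutex:
            if self.has_error:
                return
            self.has_error = True
            future = self._future
        # The lock is released at this point. Completing the future can run
        # callbacks that drop the last reference to this object, so it must
        # not happen while the mutex is still held.
        future.set_exception(exc)


task = GraphTaskSketch()
fut = task._future
task.set_error(RuntimeError("node failed"))
```

In C++ the equivalent of `future = self._future` would be copying the `shared_ptr<>`, which keeps the future alive independently of the object that owns the mutex.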
ghstack-source-id: 98303434

Test Plan: buck test mode/opt-tsan //caffe2/test/distributed/rpc:dist_autograd_spawn_thrift -- 'test_backward_node_failure_python_udf \(test_dist_autograd_spawn\.DistAutogradTestWithSpawn\)'

Differential Revision: D19821362

fbshipit-source-id: 82f735e33f8e608552418ae71592400fa3621e40
2020-02-14 09:32:17 -08:00
Pritam Damania
05d18ffaf5 Distributed Autograd: Allow multiple backward passes to accumulate gradients. (#32506)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32506

In this PR, we've introduced a `retain_graph` parameter to distributed
autograd similar to `torch.autograd.backward`.

In terms of design, this parameter is sent over RPC to all nodes and is used to
create the GraphTask on the local nodes. This enables us to run
`dist_autograd.backward()` multiple times in the same context.

The use case currently for this is to benchmark only the backward pass for
distributed autograd. We'd like to measure the QPS for the backward pass and as
a result, running a single forward pass and multiple backward passes in a loop
is one way to benchmark backward pass performance.
ghstack-source-id: 97868900

Test Plan: waitforbuildbot

Differential Revision: D19521288

fbshipit-source-id: 7ad8521059fd400d7b5a6ab77ce56e1927ced90a
2020-02-06 23:27:21 -08:00
Edward Yang
67c1d930eb Lock graph_task before writing leaf_streams. (#31995)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31995

Fixes #31906.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D19331259

Pulled By: ezyang

fbshipit-source-id: 5d24bf3555e632211a9b6f8e50ff241603c18b3d
2020-01-09 13:26:36 -08:00
Alban Desmaison
1314f7f4f4 Ensure the original grad_mode is restored during backward (#31884)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31884

Fix #31715

Test Plan: Imported from OSS

Differential Revision: D19301076

Pulled By: albanD

fbshipit-source-id: 2d20c01bfb6364fa96c8fe5aa5ce7ea39defa3ce
2020-01-08 14:16:51 -08:00
Pritam Damania
bf8e1c0710 Integrate async mode for autograd engine with distributed autograd. (#31508)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31508

This PR builds on top of https://github.com/pytorch/pytorch/pull/31230
to ensure that distributed autograd doesn't block an RPC thread anymore during
the backward pass.

I've also added a unit test where all ranks hammer rank 0 with about 60
backward calls (which would cause a deadlock earlier), but now such a test
passes without any issues.
ghstack-source-id: 96345097

Test Plan: waitforbuildbot

Differential Revision: D19188749

fbshipit-source-id: b21381b38175699afd0f9dce1ddc8ea6a220f589
2020-01-07 11:01:16 -08:00
Pritam Damania
5cc62f2913 Ensure autograd callbacks are called only once for reentrant backward. (#31909)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31909

https://github.com/pytorch/pytorch/pull/31230 introduced a bug where
we would end up calling `graph_task_post_processing` twice for reentrant
backward calls (once when we mark the future completed and then again when we called
graph_task_post_processing in execute_with_graph_task).

This PR fixes the issue by verifying the future we return in that case is
completed and we remove the call to graph_task_post_processing.

In addition to that I added a test that reproduced the problem and verified it
is fixed by this PR.
ghstack-source-id: 96349102

Test Plan: waitforbuildbot

Differential Revision: D19296363

fbshipit-source-id: dc01a4e95989709ad163bb0357b1d191ef5a4fb2
2020-01-07 10:35:04 -08:00
Pritam Damania
fde94e7556 Provide async mode for local autograd engine. (#31230)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31230

A major issue with distributed autograd currently is that we block an
RPC thread when we call Engine::execute_with_graph_task.

To resolve this issue, I've made modifications to the local autograd engine
such that `execute_with_graph_task` returns a Future instead. The `execute()`
methods for Engine::execute() and DistEngine::execute() still wait() on this
Future which ensures there is no change in behavior yet.
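
As a rough illustration of the shape of this change (a Python sketch using `concurrent.futures`, not the engine's actual C++ API), the asynchronous entry point returns a future immediately and the blocking entry point is a thin wrapper that waits on it. The thread pool and the callable `graph_task` argument are assumptions made for the example.

```python
from concurrent.futures import Future, ThreadPoolExecutor

# Hypothetical worker pool standing in for the engine's worker threads.
_pool = ThreadPoolExecutor(max_workers=2)


def execute_with_graph_task(graph_task):
    """Start the work and return a Future right away, without blocking."""
    return _pool.submit(graph_task)


def execute(graph_task):
    """Blocking entry point: waits on the Future, so callers see no change."""
    return execute_with_graph_task(graph_task).result()


result = execute(lambda: sum(range(10)))
```

A caller that must not block, such as an RPC thread, could instead keep the returned future and attach a callback with `add_done_callback`.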

In follow up PRs we can modify the distributed autograd engine to take
advantage of this Future.

Closes #26359
ghstack-source-id: 96298057

Test Plan: waitforbuildbot

Differential Revision: D18999709

fbshipit-source-id: 388f54467fd2415a0acb7df17bd063aedc105229
2020-01-05 00:29:28 -08:00
Shen Li
8a57362000 Fix index out of bound error in Engine::ready_queue_size when called before start_threads
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30967

Test Plan: Imported from OSS

Differential Revision: D18887178

Pulled By: mrshenli

fbshipit-source-id: 67baeac9214a4749ce7e9b4d89862c93620b2d5e
2019-12-09 14:39:07 -08:00
Pritam Damania
776fdda753 Add debug info API for distributed autograd. (#30642)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30642

Adding a couple of basic metrics for distributed autograd which would
help in determining stuckness.
ghstack-source-id: 95156189

Test Plan: waitforbuildbot

Differential Revision: D18776478

fbshipit-source-id: a0556ad6fe2b7c3cd0082ee2350c1c78cafaaec5
2019-12-07 13:56:51 -08:00
Nathan Goldbaum
f531815526 Deprecate tensor.type() (#30281)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/29161.

I looked a bit at the code changes related to this and think I have all of the use cases of `DeprecatedTypeProperties` covered in the message, but suggestions from someone with more context on this would be very much appreciated :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30281

Differential Revision: D18830818

Pulled By: ezyang

fbshipit-source-id: 1a7fcee15354ae09e6644577e7fa33bd26acfe20
2019-12-05 10:55:34 -08:00
Prasun Anand
3cf8382984 detect_anomaly() for SparseTensors (#29803)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/28649

1. Modified detect_anomaly() to use isnan()
2. isnan() for SparseTensors returns a bool Tensor of _values.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29803

Differential Revision: D18594299

Pulled By: ezyang

fbshipit-source-id: 3f4190c569f53219be330584fc604ca43c4a6c7a
2019-12-03 15:42:51 -08:00
Pritam Damania
1322daa506 Improve error handling for distributed autograd engine. (#27940)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27940

1) If we receive an error for outstanding rpcs, we enqueue an appropriate error
on the local autograd engine.
2) Add an `exit_on_error` mode for the local autograd engine, where the
computation stops if we see an error.
ghstack-source-id: 92603377

Test Plan: Added unit tests to test failures.

Differential Revision: D17916844

fbshipit-source-id: 199a7832f1033c36a9bbcc1e80d86576c04965d0
2019-10-25 12:07:27 -07:00
Pritam Damania
3bccd3fc0d Distributed Autograd - FAST mode backward pass implementation. (#27022)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27022

This change implements the "FAST" mode distributed autograd backward
pass as described in https://github.com/pytorch/pytorch/issues/23110.

At a high level the backward pass works as follows:
1. We start by computing dependencies on the node that calls
`torch.distributed.backward`.
2. This node computes the dependencies starting from the root nodes provided in
the backward call and all the 'send' functions present in the current autograd
context. The "FAST" mode assumes all 'send' functions are part of the autograd
computation.
3. Once the dependency computation is done, the distributed autograd engine
calls the local autograd engine to execute the autograd graph. Note that the
autograd graph on a single node is not necessarily connected because of
inter-node communication. As a result, we have special handling to ensure the
local autograd engine executes the entire graph starting from the
provided roots and all 'send' functions on the node.
4. When the local autograd engine hits a 'recv' function, it performs an async
RPC to send the gradients over to the appropriate node and stores a future in
the autograd context to keep track of this RPC.
5. On the destination node, the appropriate 'send' function is looked up and
enqueued on the local autograd engine. If this is the first time the node is
hearing about this autograd context id on the backward pass, then the node
computes dependencies for the local autograd engine.
6. As part of compute dependencies, the distributed autograd engine discovers
all leaf nodes and ensures those are passed as 'outputs' to the local autograd
engine. This avoids running the 'AccumulateGrad' function.
7. The gradients computed for the leaf nodes are then actually accumulated in
`DistAutogradContext` for the appropriate autograd context id.
8. The distributed autograd engine waits for the local autograd engine
to complete and also waits for all the 'Futures' (stored in 4.) for respective
RPCs to finish.
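
The dependency computation in steps 1 and 2 can be sketched as a breadth-first traversal. This is a simplified Python model, not the engine's code: the graph is a plain dict from a node name to the nodes it feeds gradients into, and `compute_dependencies` and the node names are hypothetical.

```python
from collections import deque


def compute_dependencies(graph, roots, send_functions):
    """Count, for each reachable node, how many reachable edges point at it."""
    dependencies = {}
    seen = set()
    queue = deque()
    # Start from the user-provided roots and every 'send' function, since
    # the local graph is not necessarily connected.
    for start in list(roots) + list(send_functions):
        if start not in seen:
            seen.add(start)
            queue.append(start)
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            dependencies[nxt] = dependencies.get(nxt, 0) + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return dependencies


# 'mul' receives gradients both from the local root and from a 'send'
# function that is only reachable through another node's RPC.
graph = {"root": ["mul"], "send1": ["mul"], "mul": ["w", "recv1"]}
deps = compute_dependencies(graph, roots=["root"], send_functions=["send1"])
```

If the traversal started from the roots alone, `mul` would be recorded with one dependency instead of two and could run before the gradient arriving through `send1` was available, which is why the 'send' functions are treated as additional starting points.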

We have made the following changes to the local autograd engine for this
purpose:

1. Expose GraphTask and NodeTask so that the distributed autograd engine can
use them.
2. Expose an `execute_with_graph_task` API which allows the distributed engine
to build a GraphTask and pass it to the local autograd engine.
3. Expose an `enqueue_on_cpu` API, which allows the distributed engine to build
a `NodeTask` for a 'send' function and enqueue it on the local autograd engine.

In addition to this a few general improvements:
1. Added a `PropagateGradients` RPC call for the 'recv' function to pass
gradients to the appropriate node during the backward pass.
2. Use IValues as much as possible in serialization for RpcWithAutograd.
3. If Future.wait() contains a message of type EXCEPTION, we throw an appropriate
exception instead of just returning the message. This is in line with what most
Future.wait() APIs do.
4. Added a `get_gradients(context_id)` API which allows users to retrieve a map
from Tensor to respective gradient for the provided context_id on the local
node.
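
The `Future.wait()` behavior in item 3 can be illustrated with a small Python sketch. `FutureSketch`, `Message`, and `MessageType` here are hypothetical stand-ins, not the RPC layer's real classes; the sketch only shows a wait that raises when the completed message is of type EXCEPTION.

```python
import threading
from enum import Enum


class MessageType(Enum):
    RESPONSE = 1
    EXCEPTION = 2


class Message:
    def __init__(self, message_type, payload):
        self.message_type = message_type
        self.payload = payload


class FutureSketch:
    """Hypothetical future that completes with a Message."""

    def __init__(self):
        self._done = threading.Event()
        self._message = None

    def mark_completed(self, message):
        self._message = message
        self._done.set()

    def wait(self):
        self._done.wait()
        if self._message.message_type is MessageType.EXCEPTION:
            # Surface the remote failure as a local exception instead of
            # handing the error message back to the caller.
            raise RuntimeError(self._message.payload)
        return self._message


ok = FutureSketch()
ok.mark_completed(Message(MessageType.RESPONSE, "grads sent"))

failed = FutureSketch()
failed.mark_completed(Message(MessageType.EXCEPTION, "remote node failed"))
```

Calling `ok.wait()` returns the response message, while `failed.wait()` raises, so callers cannot mistake an error payload for a normal result.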
ghstack-source-id: 91794926

Test Plan: unit tests.

Differential Revision: D17652615

fbshipit-source-id: 96f65c52adb2706ee29f4b49e1655afaa0a3bec3
2019-10-12 09:47:49 -07:00
Mike Ruberry
87a2c92615 Updates autograd engine to respect streams set in forward (#8354)
Summary:
This PR addresses issue https://github.com/pytorch/pytorch/issues/7601.

Currently models that use streams explicitly in forward have to do a lot of extra work to make backwards respect those streams. This PR extends the (recently added) input tracing (see TypeAndShape) to record the devices and streams of inputs. The autograd engine then uses this metadata to enact the expected stream parallelism without extra work from the user.

For example, a model with forward declared like (original example courtesy of ngimel):

```
def forward(self,x):
        x0 = x.clone()
        torch._C._cuda_setStream(self.stream1._cdata)
        y0 = self.fc1(x0)
        self.event1.record(stream = torch.cuda.current_stream())

        torch._C._cuda_setStream(self.stream2._cdata)
        y1 = self.fc2(x)
        self.event2.record(stream = torch.cuda.current_stream())
        self.stream2.wait_event(self.event1)
        return y0 + y1
```

currently will backward on a single stream. With this change the kernels will go on the streams they are assigned in forward and both forward and backward will (for appropriate sizes) run the fc1 and fc2 kernels simultaneously.

The crux of this change is, as mentioned, an expansion of the TypeAndShape tracing and a relatively simple change to the autograd engine to use cuda events for stream synchronization. To make this efficient I also added a new AutoGPUAndStream class, exposed getting and setting streams on devices, and removed InputBuffer's AutoGPU (it's now redundant). While making these modifications I also fixed AutoGPU to check before setting the GPU when it's destroyed and to use THCudaCheck instead of its custom error handler. These changes mean that an often excessive cudaSetDevice() is not being called when inputs are added to a buffer.

In addition to allowing users to easily set and use streams that are respected in both forward and backward, this change may encourage modules to do the same and the expanded tracing might allow further optimizations in the autograd engine. (apaszke, for example, now after initial enumeration we know the number of devices that will be used by a graph task, which might help provide a sense of the "level of parallelism" we should expect.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8354

Test Plan: Two tests were added specifically for this behavior.

Differential Revision: D17275980

Pulled By: mruberry

fbshipit-source-id: 92bd50ac782ffa973b159fcbbadb7a083802e45d
2019-09-10 23:46:51 -07:00
Brian Vaughan
88e4cee3e7 Improve handling of mixed-type tensor operations (#22273)
Summary:
Improve handling of mixed-type tensor operations.

This PR affects the arithmetic (add, sub, mul, and div) operators implemented via TensorIterator (so dense but not sparse tensor ops).

For these operators, we will now promote to reasonable types where possible, following the rules defined in https://github.com/pytorch/pytorch/issues/9515, and error in cases where the cast would require floating point -> integral or non-boolean to boolean downcasts.

The details of the promotion rules are described here:
https://github.com/nairbv/pytorch/blob/promote_types_strict/docs/source/tensor_attributes.rst

Some specific backwards incompatible examples:
* now `int_tensor * float` will result in a float tensor, whereas previously the floating point operand was first cast to an int. Previously `torch.tensor(10) * 1.9` => `tensor(10)` because the 1.9 was downcast to `1`. Now the result will be the more intuitive `tensor(19)`
* Now `int_tensor *= float` will error, since the floating point result of this operation can't be cast into the in-place integral type result.
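
The two rules above can be modeled with a toy category ordering. This is a deliberately simplified Python sketch, not PyTorch's implementation: it tracks only three hypothetical categories ("bool", "int", "float") rather than real dtypes, and the function names are made up for the example.

```python
# Toy ordering of type categories, lowest to highest.
CATEGORY_ORDER = {"bool": 0, "int": 1, "float": 2}


def promote(a, b):
    """Out-of-place ops: the result takes the higher of the two categories."""
    return a if CATEGORY_ORDER[a] >= CATEGORY_ORDER[b] else b


def check_inplace(target, other):
    """In-place ops: the result must fit in the target's category."""
    result = promote(target, other)
    if CATEGORY_ORDER[result] > CATEGORY_ORDER[target]:
        raise TypeError(
            "result category {} can't be cast to in-place target {}".format(
                result, target
            )
        )
    return target


out_of_place = promote("int", "float")    # like int_tensor * float
in_place_ok = check_inplace("float", "int")  # like float_tensor *= int
```

Under this model `check_inplace("int", "float")` raises, mirroring the `int_tensor *= float` case, and so does `check_inplace("bool", "int")`, mirroring the non-boolean to boolean downcast.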

See more examples/detail in the original issue (https://github.com/pytorch/pytorch/issues/9515), in the above linked tensor_attributes.rst doc, or in the test_type_promotion.py tests added in this PR:
https://github.com/nairbv/pytorch/blob/promote_types_strict/test/test_type_promotion.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22273

Reviewed By: gchanan

Differential Revision: D16582230

Pulled By: nairbv

fbshipit-source-id: 4029cca891908cdbf4253e4513c617bba7306cb3
2019-09-05 18:26:09 -07:00
Ilia Cherniavskii
936632b120 Thread local debug info
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/22365

Test Plan:
USE_CUDA=0 python setup.py develop
./build/bin/test_jit

Imported from OSS

Reviewed By: ajyu

Differential Revision: D16065129

Pulled By: ilia-cher

fbshipit-source-id: f300985459a83c2c1049ed8c4fefd23a3144047a
2019-08-12 14:53:57 -07:00
mal
e7a9b0d62f Rename torch::autograd::Function to torch::autograd::Node
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/23269

Test Plan: Imported from OSS

Differential Revision: D16454878

fbshipit-source-id: b1e840fc2d3901955280d141e5ad6efd5e9d66af
2019-07-23 20:52:22 -07:00
mal
0140a756d8 Prioritize reentrant tasks and execute them recursively until close to limit
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/22397

Test Plan:
Added test for reentrant backwards with checkpoint and a test for a recursive backwards function (which should fail if we run all the reentrant tasks recursively in the same thread) and for testing priority of reentrant tasks.
~~Will add a test for priority of reentrant tasks in future pr.~~

Imported from OSS

Differential Revision: D16131955

fbshipit-source-id: 18301d45c1ec9fbeb566b1016dbaf7a84a09c7ac
2019-07-05 08:51:06 -07:00
mal
c8b5f1d2f8 Switch autograd to use a pool of workers for each device (#21911)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21911
ghimport-source-id: 3b7d37481201aa4b4ca8f7767603d0dfd13f871f

Test Plan:
Tested on https://github.com/pytorch/pytorch/issues/6959 and ensured no Recursion Error.
Performance testing:
[word_language_model](https://gist.github.com/malvika2147/34c214871d549f9275812f2d20506990) (no significant change)
[mnist](https://gist.github.com/malvika2147/77890eef102099490a1029122fb20dd0) (no significant change)
[Comparison of performance](https://gist.github.com/malvika2147/c0a8790910b8513bd2e20b224bdd6300) on https://github.com/pytorch/pytorch/issues/6959 with smaller inputs. (slower by about ~25%, expected)

Imported from OSS

Differential Revision: D15985852

fbshipit-source-id: ca172690857fd1718462b80f3a244af9d8825d6c
2019-06-25 09:08:26 -07:00
mal
f308b07e8c Don't leak threads on exit (#21438)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21438
ghimport-source-id: 33f145f5b3508163365442c22a223c4a44e677d8

Differential Revision: D15738856

fbshipit-source-id: 656e8d0e3d0d22f116e3ab66bf0282608d6f1a76
2019-06-10 09:14:13 -07:00
Malvika Joshi
9deab0cf0e Documentation for locking discipline in engine.cpp/.h (#21548)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21548

Added documentation as titled.

Reviewed By: ezyang

Differential Revision: D15723146

fbshipit-source-id: fab4a35c62f07256673318c0874701f7628b2f7a
2019-06-10 07:50:01 -07:00
mal
4980b8b95c Renaming member variables in engine.cpp/h (#21283)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21283
ghimport-source-id: 360a138e420ace3cd4ca6ccbc761c8e68319440d

Differential Revision: D15607428

fbshipit-source-id: f8df6b42796a49c4d68fa8366b6a68d5715f6421
2019-06-03 12:54:50 -07:00
Roy Li
ab78449e8c Add ScalarType argument to Type::options() (#19270)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19270
ghimport-source-id: a5ade6131f3260066c5750ea1fa9ed5c998bb791

Differential Revision: D14938707

Pulled By: li-roy

fbshipit-source-id: 018fb3f01706531a06515d6d861e5683a455a705
2019-04-21 21:16:07 -07:00
Ilia Cherniavskii
646cb6157d Move OMP/MKL thread initialization into ATen/Parallel (#19011)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19011
ghimport-source-id: 432e31eccfd0e59fa21a790f861e6b2ff4fdbac6

Differential Revision: D14846034

Pulled By: ilia-cher

fbshipit-source-id: d9d03c761d34bac80e09ce776e41c20fd3b04389
2019-04-16 00:16:32 -07:00
Roy Li
c705d9eb1e Introduce DeprecatedTypeProperties class (#17991)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17991

changes:
-Breaks bc: Tensor::type() now returns DeprecatedTypeProperties& rather than Type&.
-Added DeprecatedTypeProperties, it serves as a temporary replacement for Type as the return value of Tensor::type(). This contributes to making Type just for dispatch purposes so that we can make it dtype agnostic.
-Tensor::dispatch_type() now returns Type& like Tensor::type() used to do.
-Changed callsites of Tensor::type() appropriately.

Reviewed By: ezyang

Differential Revision: D14443117

fbshipit-source-id: 239ccb7a09626279a71d1a37f8f82e7f57bf7d9e
2019-04-04 02:24:13 -07:00
Davide Libenzi
272a48f6fe Enable autograd to recognize the XLA backend as one providing multiple devices (#17847)
Summary:
…e devices, while not being CUDA/HIP.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17847

Differential Revision: D14545634

Pulled By: ezyang

fbshipit-source-id: 417181bf2ff4f8978544afe2fb6b042e787854ed
2019-03-20 13:58:36 -07:00
bddppq
27f6a29fd0 Remove USE_CUDA and USE_ROCM in engine.cpp
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/15893

Differential Revision: D13627319

Pulled By: zdevito

fbshipit-source-id: 7c72c1c6cc242143fb66383423c668c9b9810884
2019-01-10 14:45:11 -08:00
albanD
828cb18fa3 Allow ReadyQueue to handle empty tasks (#15791)
Summary:
Allow the comparison function used in ReadyQueue to handle the empty FunctionTasks created by the reentrant autograd.
Fix #11732
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15791

Differential Revision: D13598006

Pulled By: soumith

fbshipit-source-id: 0bfdf28a735fbfe44f0fdbaf8b74a6198e6a1984
2019-01-08 20:06:04 -08:00
Peter Goldsborough
7a61306031 Enable all clang-tidy performance checks (#15198)
Summary:
This PR adds the final set of clang-tidy checks we should add for our codebase: a last set of performance-related checks. Most fixes here are around changing `auto` to `const auto&` in a few places where unnecessary copies were made, and adding `reserve()` calls before loops doing repeated `push_back()`. Also a few cases of calling `std::string::find` with a single-character string literal instead of a single char, which uses a less efficient string search algorithm meant for searching larger substrings.

![image](https://user-images.githubusercontent.com/6429851/49978940-adc1a780-ff01-11e8-99da-a4e431361f07.png)

ezyang apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15198

Differential Revision: D13468797

Pulled By: goldsborough

fbshipit-source-id: 2bed1ea1c7c162b7f3e0e1026f17125e88c4d5b2
2018-12-14 13:32:47 -08:00
Peter Goldsborough
1e9c384afb Enable performance-unnecessary-value-param in .clang-tidy (#15026)
Summary:
This PR fixes around 250 places in the codebase where we were making unnecessary copies of objects (some large, some small).

ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15026

Differential Revision: D13458784

Pulled By: goldsborough

fbshipit-source-id: be5148b2ce09493588d70952e6f6d6ff5ec5199b
2018-12-13 16:15:35 -08:00
Edward Yang
2d485ffb17 Move CUDAGuard, CUDAStream and CUDAGuardImpl to c10/cuda (#14248)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14248

This diff also introduces a horrifying hack to override CUDA's DeviceGuardImpl
with a HIPGuardImplMasqueradingAsCUDA, to accommodate PyTorch's current
behavior of pretending CUDA is HIP when you build with ROCm enabled.

Reviewed By: bddppq

Differential Revision: D13145293

fbshipit-source-id: ee0e207b6fd132f0d435512957424a002d588f02
2018-12-12 11:24:26 -08:00
Edward Yang
b710642969 Make ATen HIPify out-of-place, but still reuse CUDA names. (#14866)
Summary:
```
    This diff changes the HIPification of ATen to be out-of-place.
    We now have the following mappings:

    - ATen/cuda => ATen/hip
    - ATen/native/cuda => ATen/native/hip
    - ATen/native/sparse/cuda => ATen/native/sparse/hip
    - THC => THH
    - THCUNN => THHUNN

    The build system is adjusted to know about these new build paths,
    and HIPify is taught how to adjust include paths and
    THC_GENERIC_FILE appropriately.  ATen_hip is now built as
    the ATen_hip library, rather than reusing ATen_cuda.

    However, despite these new filepaths, none of the identifiers in ATen
    have actually changed.  So, e.g., THHGeneral.h still defines functions
    named THC_blahblah, and HIP still shows up as CUDA in PyTorch itself.
    We'll tackle this in a subsequent PR; this diff is just to get the files
    out-of-place.

    Minor extra improvements:

    - Don't edit tmp_install when hipifying
    - HIP no longer builds native_cudnn_cpp; it was unnecessary
    - Caffe2_HIP_INCLUDES is now Caffe2_HIP_INCLUDE, for consistency
      with all the other variables.
    - HIP build now properly respects ATEN_CUDA_FILES_GEN_LIB (it
      did not previously.)
    - You can now override file extension matching in pyHIPIFY
      by explicitly specifying its full name in the matching list.
      This is used so we can HIPify CMakeLists.txt in some situations.

    A little bit of string and sealing wax:

    - gen.py grows a --rocm flag so that it knows to generate CUDA
      files which actually refer to the HIP headers (e.g., THH.h)
      We'll get rid of this eventually and generate real HIP files,
      but not for this PR.
    - Management of HIP dependencies is now completely deleted
      from the ATen CMakeLists.txt.  The old code was dead (because
      it was shoveled in ATen_CUDA_DEPENDENCY_LIBS and promptly
      ignored by the Caffe2 build system) and didn't actually work.
```

Stacked on https://github.com/pytorch/pytorch/pull/14849 review last commit only
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14866

Differential Revision: D13419475

Pulled By: ezyang

fbshipit-source-id: cb4c843df69a1d8369314c9fab1b7719520fa3db
2018-12-11 19:15:27 -08:00
Edward Yang
517c7c9861 Canonicalize all includes in PyTorch. (#14849)
Summary:
Anywhere we used #include "foo.h", we now say #include <foo.h>
Paths are adjusted to be rooted out of aten/src, torch/lib, or
the root level directory.

I modified CMakeLists.txt by hand to remove TH and THC from
the include paths.

I used the following script to do the canonicalization:

```
  import subprocess
  import re
  import os.path

  files = subprocess.check_output(['git', 'ls-files']).decode('utf-8').rstrip().split('\n')
  for fn in files:
      if not any(fn.endswith(suff) for suff in ['.cu', '.cpp', '.in', '.h', '.hpp', '.cu', '.cuh', '.cc']):
          continue
      if not any(fn.startswith(pref) for pref in ["aten/", "torch/"]):
          continue
      with open(fn, 'r') as f:
          c = f.read()
      def fmt(p):
          return "#include <{}>".format(p)
      def repl(m):
          p = m.group(1)
          if p in ["dlfcn.h", "unistd.h", "nvrtc.h", "cuda.h", "cuda_runtime.h", "cstdint", "cudnn.h", "Python.h", "cusparse.h", "cuda_runtime_api.h", "cuda_fp16.h", "cublas_v2.h", "stdint.h", "curand_kernel.h"]:
              return fmt(p)
          if any(p.startswith(pref) for pref in ["torch/csrc", "c10/", "ATen/", "caffe2/", "TH/", "THC/", "Eigen/", "gtest/", "zdl/", "gloo/", "onnx/", "miopen/"]):
              return fmt(p)
          for root in ["aten/src", "torch/lib", ""]:
              for bad_root in [os.path.dirname(fn), "aten/src/TH", "aten/src/THC", "torch/csrc"]:
                  new_p = os.path.relpath(os.path.join(bad_root, p), root)
                  if not new_p.startswith("../") and (os.path.exists(os.path.join(root, new_p)) or os.path.exists(os.path.join(root, new_p + ".in"))):
                      return fmt(new_p)
          print("ERROR: ", fn, p)
          return m.group(0)
      new_c = re.sub(r'#include "([^"]+)"', repl, c)
      if new_c != c:
          print(fn)
          with open(fn, 'w') as f:
              f.write(new_c)
```

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14849

Reviewed By: dzhulgakov

Differential Revision: D13363445

Pulled By: ezyang

fbshipit-source-id: 52361f878a672785f9306c9e9ab2513128092b68
2018-12-08 19:38:30 -08:00
Junjie Bai
6651fae827 Make autograd engine compatible with hip
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14873

Differential Revision: D13375053

Pulled By: bddppq

fbshipit-source-id: f3051640386667bbf0566856ed433eb83276c39e
2018-12-07 00:12:06 -08:00
Adam Paszke
8812a5d42e Reduce broadcasted inputs in derivative code (#14485)
Summary:
Previously symbolic AD formulas assumed that no broadcasting happened,
and would return gradients of incorrect shapes (possibly leading to
silent errors later).
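
To make the shape problem concrete, here is a minimal pure-Python sketch (nested lists instead of tensors, and a hypothetical helper name). For `out = a + b` with `a` of shape (2, 3) and `b` of shape (3,), `b` is broadcast across rows in the forward pass, so its gradient has to be summed back over the broadcast dimension to recover shape (3,).

```python
def reduce_grad_to_row(grad_output):
    """Sum an (m, n) gradient over its rows to get an (n,) gradient."""
    return [sum(column) for column in zip(*grad_output)]


# Upstream gradient for out = a + b, with the same shape as `a`: (2, 3).
grad_output = [[1.0, 1.0, 1.0],
               [1.0, 1.0, 1.0]]

grad_a = grad_output                      # same shape as `a`, no reduction
grad_b = reduce_grad_to_row(grad_output)  # reduced to the shape of `b`
```

Returning `grad_output` unchanged as the gradient of `b` would yield a (2, 3) value for a (3,) input, which is the kind of incorrectly shaped gradient described above.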

Fixes a few bugs (known and unknown):
- #11736
- ArgumentSpec didn't compute the input types correctly [(it didn't advance the offset for non-tensor args)](https://github.com/pytorch/pytorch/pull/14485/files#diff-4fd3157a056596aefb8cdf41022a208bR153)
- Symbolic AD could suffer from use after free (dangling pointers in grad map), because [`EliminateDeadCode` could have removed nodes](https://github.com/pytorch/pytorch/pull/14485/files#diff-25d33ad1ed6855684dec79d927ca6142L781) that referenced gradients of certain values.
- Undefined behavior in `aten::size`

During my tests I've also found a few new problems, and I have opened issues for them:
- FusionGroup seems to think that cat nodes broadcast their inputs (#14483)
- `prim::ConstantChunk` derivative formula doesn't handle undefined inputs (#14484)

This patch unfortunately deoptimizes some of our code (Fusion doesn't happen past chunk nodes, and outputs more tensors only because we have to get their size). I know how to fix those issues, but wanted to fix this terrible bug quickly.

cc zou3519 zdevito ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14485

Reviewed By: eellison

Differential Revision: D13312888

Pulled By: suo

fbshipit-source-id: ad46bfb4d0a306ad9451002f8270f7a790f72d58
2018-12-04 00:16:21 -08:00
Michael Suo
9ac845f734 Revert D13280899: [pytorch][PR] Reduce broadcasted inputs in derivative code
Differential Revision:
D13280899

Original commit changeset: 80cc5ec9331b

fbshipit-source-id: 2335093cca8fd7db95470fd83b9299adfa17aa8e
2018-12-03 14:55:02 -08:00
Adam Paszke
68ffe46991 Reduce broadcasted inputs in derivative code (#14485)
Summary:
Previously symbolic AD formulas assumed that no broadcasting happened,
and would return gradients of incorrect shapes (possibly leading to
silent errors later).

Fixes a few bugs (known and unknown):
- #11736
- ArgumentSpec didn't compute the input types correctly [(it didn't advance the offset for non-tensor args)](https://github.com/pytorch/pytorch/pull/14485/files#diff-4fd3157a056596aefb8cdf41022a208bR153)
- Symbolic AD could suffer from use after free (dangling pointers in grad map), because [`EliminateDeadCode` could have removed nodes](https://github.com/pytorch/pytorch/pull/14485/files#diff-25d33ad1ed6855684dec79d927ca6142L781) that referenced gradients of certain values.
- Undefined behavior in `aten::size`

During my tests I've also found a few new problems, and I have opened issues for them:
- FusionGroup seems to think that cat nodes broadcast their inputs (#14483)
- `prim::ConstantChunk` derivative formula doesn't handle undefined inputs (#14484)

This patch unfortunately deoptimizes some of our code (Fusion doesn't happen past chunk nodes, and outputs more tensors only because we have to get their size). I know how to fix those issues, but wanted to fix this terrible bug quickly.

cc zou3519 zdevito ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14485

Differential Revision: D13280899

Pulled By: soumith

fbshipit-source-id: 80cc5ec9331be80e1bb9ddfe85b81c2b997e0b0c
2018-12-03 13:44:18 -08:00
Michael Carilli
edb3ddf1a5 Accumulate grad fix (#14587)
Summary:
Rebased version of https://github.com/pytorch/pytorch/pull/13337.

I don't think the lint errors in the original PR had to do with files I touched, so hopefully the rebase fixes them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14587

Differential Revision: D13277428

Pulled By: soumith

fbshipit-source-id: f04c186b1dd4889b4250597eef87f9e9bf7b2426
2018-11-30 10:49:15 -08:00
Edward Yang
e35418b3be New implementations of DeviceGuard, StreamGuard and MultiStreamGuard (with CUDA specializations) (#13342)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13342

This PR introduces a few new concepts:

- DeviceGuardImplInterface, and implementations for CPU and CUDA, which
  provide a generic interface for interfacing with device and stream state,
  without requiring a direct dependency on the code in question.
- InlineDeviceGuard, a general template for generating both specialized
  and dynamically dispatched device guard implementations.  Dynamic
  dispatch is done by specializing it on a VirtualGuardImpl.
- Provide a device-independent DeviceGuard class, which can be used even
  from CPU code. It uses the aforementioned dynamic dispatch.
- CUDA-specialized CUDAGuard class, which doesn't have a dynamic dispatch
  but can only be used from CUDA.
- StreamGuard, which is the same as above, but for streams rather than
  devices.
- Optional variants of all the aforementioned guards, which are a no-op if
  no device/stream is specified
- CUDAMultiStreamGuard, specifically for the case when we want to set
  a device on every guard.

There are some subtle semantic changes, which have been thoroughly documented
in the class definition.

BC-breaking changes:

- Move constructor/assignment have been removed from all device guard
  implementations.
- In some cases where you previously wrote 'set_device' (or 'set_stream'), you now must write
  'reset_device', because if you switch devices/device types, the stream/device on the
  previous device is unset.  This is different from previous behavior.
- CUDAGuard no longer handles streams, or multiple streams.  Use CUDAStreamGuard
  or CUDAMultiStreamGuard as appropriate for your use case.
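
The guard pattern described above can be sketched as a small RAII class. This is only an illustration under stated assumptions: `ToyDeviceGuard`, `current_device()` and `device_inside_scope()` are hypothetical names, and a thread-local integer stands in for the real device state that the actual guards reach through `DeviceGuardImplInterface`. Copy and move operations are deleted to mirror the BC-breaking note about move constructor/assignment.

```cpp
#include <cassert>

// Hypothetical stand-in for the per-thread "current device" state.
inline int& current_device() {
  static thread_local int device = 0;
  return device;
}

// Minimal RAII guard: saves the current device on construction, switches to
// the requested one, and restores the saved device on destruction.
class ToyDeviceGuard {
 public:
  explicit ToyDeviceGuard(int device) : original_(current_device()) {
    current_device() = device;
  }
  ~ToyDeviceGuard() { current_device() = original_; }

  // No copies and no moves: the guard is tied to the scope that created it.
  ToyDeviceGuard(const ToyDeviceGuard&) = delete;
  ToyDeviceGuard& operator=(const ToyDeviceGuard&) = delete;
  ToyDeviceGuard(ToyDeviceGuard&&) = delete;
  ToyDeviceGuard& operator=(ToyDeviceGuard&&) = delete;

  int original_device() const { return original_; }

 private:
  int original_;
};

// Returns the device that is current while the guard is alive.
inline int device_inside_scope(int device) {
  ToyDeviceGuard guard(device);
  return current_device();
}
```

Calling `device_inside_scope(3)` observes device 3 inside the scope, and the previous device is current again once the function returns.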

Reviewed By: dzhulgakov

Differential Revision: D12849620

fbshipit-source-id: f61956256f0b12be754b3234fcc73c2abc1be04e
2018-11-11 12:11:10 -08:00
Peter Goldsborough
0479517325 Add modernize-* checks to clang-tidy (#13196)
Summary:
Enables almost all `modernize-*` checks in clang-tidy. This warns against things such as:

- Use of `const std::string&` instead of new-style `std::string` + move
- Using old-style loops instead of range-for loops,
- Use of raw `new`
- Use of `push_back` instead of `emplace_back`
- Use of `virtual` together with `override` (`override` is sufficient)
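
As a rough illustration of what these checks steer code toward, the sketch below shows the "after" form for each bullet. The types (`Shape`, `Triangle`, `Label`) and helper functions are hypothetical and not taken from the PyTorch sources; `std::make_unique` assumes a C++14 toolchain.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct Shape {
  virtual ~Shape() = default;
  virtual int sides() const = 0;
};

struct Triangle : Shape {
  // `override` alone is enough; repeating `virtual` is redundant.
  int sides() const override { return 3; }
};

class Label {
 public:
  // Take the string by value and move it into the member instead of
  // taking `const std::string&` and copying (modernize-pass-by-value).
  explicit Label(std::string text) : text_(std::move(text)) {}
  const std::string& text() const { return text_; }

 private:
  std::string text_;
};

inline std::vector<std::unique_ptr<Shape>> make_triangles(int count) {
  std::vector<std::unique_ptr<Shape>> shapes;
  for (int i = 0; i < count; ++i) {
    // make_unique instead of raw `new`; emplace_back instead of push_back.
    shapes.emplace_back(std::make_unique<Triangle>());
  }
  return shapes;
}

inline int total_sides(const std::vector<std::unique_ptr<Shape>>& shapes) {
  int total = 0;
  // Range-based for instead of an index or iterator loop.
  for (const auto& shape : shapes) {
    total += shape->sides();
  }
  return total;
}
```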

ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13196

Differential Revision: D12891837

Pulled By: goldsborough

fbshipit-source-id: 4d0f782a09eb391ee718d3d66f74c095ee121c09
2018-11-02 20:30:40 -07:00
Edward Yang
e5d56659ec Delete DeviceGuard(int64_t) constructor. (#13232)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13232

DeviceGuard should be device agnostic, which means that it shouldn't
assume that int64_t means select the CUDA device.
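
One way to express this idea is to make the guard accept only a typed device and reject bare integers at compile time. The sketch below uses hypothetical names (`ToyDeviceType`, `ToyDevice`, `ToyGuard`) and is not the actual `DeviceGuard` implementation; in particular, it marks the integer constructor `= delete`, which is an assumption about how such a restriction could be written rather than a description of this change.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

enum class ToyDeviceType { CPU, CUDA };

// A device is a (type, index) pair, so the device type is always explicit.
struct ToyDevice {
  ToyDeviceType type;
  std::int64_t index;
};

class ToyGuard {
 public:
  explicit ToyGuard(ToyDevice device) : device_(device) {}

  // A bare integer would silently mean "CUDA device N"; refuse to compile it.
  ToyGuard(std::int64_t) = delete;

  ToyDevice device() const { return device_; }

 private:
  ToyDevice device_;
};

static_assert(!std::is_constructible<ToyGuard, std::int64_t>::value,
              "a bare index must not construct a guard");
static_assert(std::is_constructible<ToyGuard, ToyDevice>::value,
              "a typed device constructs a guard");
```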

Reviewed By: gchanan

Differential Revision: D10858024

fbshipit-source-id: b40e8337e4046906fd8f83a95e6206367fb29dbe
2018-10-31 07:55:11 -07:00
Yangqing Jia
08aab4dfdd remove ATen/Error.h and ATen/core/Error.h (#12792)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12792

This is a follow up diff after D10238910.

The only non-codemod change is the removal of ATen/Error.h and ATen/core/Error.h. The other files basically just change the inclusion path and apply clang-format for inclusion order.

Reviewed By: bddppq

Differential Revision: D10437824

fbshipit-source-id: 7f885f80ab5827468d1351cfb2765d0e3f555a69
2018-10-17 17:25:42 -07:00
Yangqing Jia
13cf39294d Remove ATen/Error.h and use ATen/core/Error.h instead. (#12132)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12132

TSIA. No code change involved.

Reviewed By: bwasti

Differential Revision: D10083237

fbshipit-source-id: bdab029015b9d0f1fa1f866c68aa5945cc68db9d
2018-09-27 10:11:17 -07:00
Christian Puhrsch
a9e6a673ae Remove caffe2::Tensor::capacity_nbytes, at::Tensor::to##name##Data, (#11876)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11876

Modern C++ API instead of macros; item() is aligned with the Python frontend. caffe2::Tensor::capacity_nbytes is effectively unused and confusing w.r.t. caffe2::Tensor::nbytes().

codemod -d caffe2           --extensions cc,cpp,cu,cuh,h,py,hpp,mm toCByte   "item<uint8_t>"
codemod -d caffe2           --extensions cc,cpp,cu,cuh,h,py,hpp,mm toCLong   "item<int64_t>"
codemod -d caffe2           --extensions cc,cpp,cu,cuh,h,py,hpp,mm toCInt    "item<int32_t>"
codemod -d caffe2           --extensions cc,cpp,cu,cuh,h,py,hpp,mm toCDouble "item<double>"
codemod -d caffe2           --extensions cc,cpp,cu,cuh,h,py,hpp,mm toCFloat  "item<float>"

codemod -d caffe2           --extensions cc,cpp,cu,cuh,h,py,hpp,mm toByteData   "data<uint8_t>"
codemod -d caffe2           --extensions cc,cpp,cu,cuh,h,py,hpp,mm toLongData   "data<int64_t>"
codemod -d caffe2           --extensions cc,cpp,cu,cuh,h,py,hpp,mm toIntData    "data<int32_t>"
codemod -d caffe2           --extensions cc,cpp,cu,cuh,h,py,hpp,mm toDoubleData "data<double>"
codemod -d caffe2           --extensions cc,cpp,cu,cuh,h,py,hpp,mm toFloatData  "data<float>"

codemod -d hphp           --extensions cc,cpp,cu,cuh,h,py,hpp,mm toCByte   "item<uint8_t>"
codemod -d hphp           --extensions cc,cpp,cu,cuh,h,py,hpp,mm toCLong   "item<int64_t>"
codemod -d hphp           --extensions cc,cpp,cu,cuh,h,py,hpp,mm toCInt    "item<int32_t>"
codemod -d hphp           --extensions cc,cpp,cu,cuh,h,py,hpp,mm toCDouble "item<double>"
codemod -d hphp           --extensions cc,cpp,cu,cuh,h,py,hpp,mm toCFloat  "item<float>"

codemod -d hphp           --extensions cc,cpp,cu,cuh,h,py,hpp,mm toByteData   "data<uint8_t>"
codemod -d hphp           --extensions cc,cpp,cu,cuh,h,py,hpp,mm toLongData   "data<int64_t>"
codemod -d hphp           --extensions cc,cpp,cu,cuh,h,py,hpp,mm toIntData    "data<int32_t>"
codemod -d hphp           --extensions cc,cpp,cu,cuh,h,py,hpp,mm toDoubleData "data<double>"
codemod -d hphp           --extensions cc,cpp,cu,cuh,h,py,hpp,mm toFloatData  "data<float>"

codemod -d caffe2 --extensions cc,cpp,cu,cuh,h,py,hpp,mm toCComplexDouble "item<std::complex<double>>"

codemod -d tc           --extensions cc,cpp,cu,cuh,h,py,hpp,mm toCFloat  "item<float>"
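
The renames above replace a family of per-type accessors with a single templated one. A minimal sketch of that idea, using a hypothetical `ToyScalar` class rather than `at::Tensor`:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical one-element "tensor" that stores its value as a double.
class ToyScalar {
 public:
  explicit ToyScalar(double value) : value_(value) {}

  // One templated accessor takes the place of macro-generated
  // toCFloat(), toCInt(), toCLong(), ... style methods.
  template <typename T>
  T item() const {
    return static_cast<T>(value_);
  }

 private:
  double value_;
};
```

With this shape, a call site that used to read `x.toCFloat()` becomes `x.item<float>()`, which is the textual substitution the codemod commands perform.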

Reviewed By: ezyang

Differential Revision: D9948572

fbshipit-source-id: 70c9f5390d92b82c85fdd5f8a5aebca338ab413c
2018-09-24 10:40:10 -07:00
Tongzhou Wang
7df6650e9c Fix empty embedding bag on cuda (#11740)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/11739
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11740

Differential Revision: D9881392

Pulled By: SsnL

fbshipit-source-id: 2964d314f199dd9b4bb69e36592b67efdf5e0760
2018-09-17 14:40:03 -07:00
Peter Goldsborough
dccd0f2de6 Bag of clang tidy fixes for torch/csrc/ and torch/csrc/autograd (#11050)
Summary:
Linting `torch/csrc/` (non-recursive) and `torch/csrc/autograd` (non-recursive).

Fixed things like:
- `typedef` vs `using`
- Use `.empty()` instead of comparing with empty string/using `.size() == 0`
- Use range for loops instead of old style loops (`modernize-`)
- Remove some `virtual` + `override`
- Replace `stdint.h` with `cstdint`
- Replace `return Type(x, y)` with `return {x, y}`
- Use boolean values (`true`/`false`)  instead of numbers (1/0)
- More ...
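
For illustration, the snippet below shows the preferred form for several of the items listed above. All names (`Range`, `is_blank`, `make_range`, `is_valid`) are made up for this sketch and do not come from `torch/csrc`.

```cpp
#include <cassert>
#include <cstdint>  // preferred over <stdint.h>
#include <string>
#include <utility>

// `using` alias instead of `typedef`.
using Range = std::pair<std::int64_t, std::int64_t>;

inline bool is_blank(const std::string& name) {
  // `.empty()` instead of `name == ""` or `name.size() == 0`.
  return name.empty();
}

inline Range make_range(std::int64_t begin, std::int64_t end) {
  // Braced return instead of `return Range(begin, end)`.
  return {begin, end};
}

inline bool is_valid(const Range& range) {
  if (range.first > range.second) {
    return false;  // boolean literal instead of 0
  }
  return true;  // boolean literal instead of 1
}
```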

ezyang apaszke cpuhrsch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11050

Differential Revision: D9597505

Pulled By: goldsborough

fbshipit-source-id: cb0fb4793ade885a8dbf4b10484487b84c64c7f2
2018-09-05 19:55:50 -07:00
mruberry
9b1a65bec3 Extends type and shape tracing with device (#9796)
Summary:
This PR extends the existing type and shape metadata tracing and verification done in autograd with device information. This expansion of tracing is required for #8354, is likely useful in other scenarios, and is a healthy sanity check, just like type and shape tracing.

The precise changes are:

- TypeAndShape -> InputMetadata, now includes device()
- Creating InputMetadata is simplified to just require a tensor, and callers were updated to use this simpler invocation wherever possible
- The gradient accumulator of a variable is now reset when set_data() is called if either the type or device changes, and this reset now locks to avoid contention with acquiring the gradient accumulator
- Mismatched devices during backward() will throw a runtime error, just like mismatched type and shape
- (Bonus!) Two uninitialized pointers in THCReduce are now initialized (to nullptr) to prevent build warnings
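
The metadata check described above can be sketched roughly as follows. `ToyTensor`, `ToyInputMetadata`, `validate_gradient` and `gradient_is_valid` are hypothetical stand-ins (a string for the type, an integer for the device); the real `InputMetadata` works on actual tensors, and the error messages here are invented.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical tensor description: scalar type, shape and device index.
struct ToyTensor {
  std::string type;
  std::vector<std::int64_t> shape;
  int device;
};

// Records what a function input looked like in the forward pass.
// Construction only needs the tensor itself.
class ToyInputMetadata {
 public:
  explicit ToyInputMetadata(const ToyTensor& tensor)
      : type_(tensor.type), shape_(tensor.shape), device_(tensor.device) {}

  const std::string& type() const { return type_; }
  const std::vector<std::int64_t>& shape() const { return shape_; }
  int device() const { return device_; }

 private:
  std::string type_;
  std::vector<std::int64_t> shape_;
  int device_;
};

// Throws if the incoming gradient disagrees with the recorded metadata.
inline void validate_gradient(const ToyInputMetadata& expected,
                              const ToyTensor& grad) {
  if (grad.type != expected.type()) {
    throw std::runtime_error("gradient has an unexpected type");
  }
  if (grad.shape != expected.shape()) {
    throw std::runtime_error("gradient has an unexpected shape");
  }
  if (grad.device != expected.device()) {
    throw std::runtime_error("gradient is on an unexpected device");
  }
}

// Convenience wrapper that reports the outcome as a bool.
inline bool gradient_is_valid(const ToyInputMetadata& expected,
                              const ToyTensor& grad) {
  try {
    validate_gradient(expected, grad);
    return true;
  } catch (const std::runtime_error&) {
    return false;
  }
}
```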

fyi colesbury
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9796

Reviewed By: goldsborough

Differential Revision: D9119325

Pulled By: ezyang

fbshipit-source-id: 76d1861b8d4f74db0575ff1f3bd965e18f9463de
2018-08-07 12:25:17 -07:00
Peter Goldsborough
04939a4745 Match parameter names and = default (#9737)
Summary:
More clang tidy cleanups in `torch/csrc`. This time:

1. `hicpp-use-equals-default` recommends `= default` instead of `{}` for constructors/destructors. This is better practice because it expresses the intent better (https://stackoverflow.com/questions/6502828/what-does-default-mean-after-a-class-function-declaration)
2. `readability-inconsistent-declaration-parameter-name` enforces that parameter names in the declaration match parameter names in the definition. This is just generally useful and can prevent confusion and bugs.
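
A small sketch of both points, with hypothetical types. Besides expressing intent, `= default` on its first declaration keeps the constructor from being user-provided, so the type can remain trivially default constructible, whereas an empty body `{}` cannot; the `static_assert`s below check exactly that.

```cpp
#include <cassert>
#include <type_traits>

struct WithEmptyBody {
  WithEmptyBody() {}  // user-provided: the type is no longer trivial
  int value;
};

struct WithDefault {
  WithDefault() = default;  // compiler-generated: the type stays trivial
  int value;
};

static_assert(!std::is_trivially_default_constructible<WithEmptyBody>::value,
              "an empty-body constructor is user-provided");
static_assert(std::is_trivially_default_constructible<WithDefault>::value,
              "a defaulted constructor keeps the type trivial");

// Declaration and definition use the same parameter names, which is what
// readability-inconsistent-declaration-parameter-name enforces.
int scale(int value, int factor);

int scale(int value, int factor) {
  return value * factor;
}
```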

Also updated my script a little bit.

apaszke ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9737

Differential Revision: D9069069

Pulled By: goldsborough

fbshipit-source-id: f7b3f3a4eb4c9fadc30425a153566d3b613a41ae
2018-07-30 14:10:00 -07:00
Sam Gross
829d763c69 Implement add, sub, mul, div using TensorIterator (#8919)
Summary:
```
This adds TensorIterator, a helper class for computing element-wise
operations that's intended to replace the CPU and CUDA apply utils
functions.

CPU kernels are implemented as functions that operate on strided 1-d
tensors, compared to CPUApplyUtils, which operated on individual elements. This
allows the kernels to handle vectorization, while TensorIterator handles
parallelization and non-coalesced dimensions.

GPU kernels continue to operate on elements, but the number of
specializations is reduced. The contiguous case remains the same. The
non-contiguous case uses a single (reduced) shape for all operands and
the fast integer division from THCIntegerDivider. To avoid extra
specializations for indexing with 64-bits, large operations are split
into smaller operations that can be indexed with 32-bits.

Major semantic changes:

 - No more s_add, s_mul, s_div, or s_sub. Broadcasting is handled by
   TensorIterator. The autograd engine performs the reduction assuming
   standard broadcasting if the gradient shape does not match the
   expected shape. Functions that do not use standard broadcasting rules
   should either continue to trace the expand calls or handle the
   reduction in their derivative formula.

 - Use ONNX v7, which supports broadcasting ops.

Performance impact:

 - Small increased fixed overhead (~0.5 us)
 - Larger overhead for wrapped numbers (~2.5 us)
 - No significant change for ops on contiguous tensors
 - Much faster worst-case performance for non-contiguous GPU tensors
 - Faster CPU bias addition (~2x)
 - Faster GPU bias addition (~30% faster)

Future work:

 - Decrease overhead, especially for wrapping numbers in Tensors
 - Handle general inter-type operations
 - Extend to unary ops and reductions
 - Use buffering for compute-bound operations on non-contiguous tensors
   (pull in from CPUApplyUtils)
```
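
The reduction mentioned under "Major semantic changes" can be illustrated with a small standalone function. This is a sketch under standard broadcasting rules, not the autograd engine's code: `sum_to_shape`, `Dims` and `numel` are hypothetical names, and the gradient is a flat row-major `std::vector<double>` rather than a tensor. Dimensions that were broadcast (missing leading dimensions, or size 1 in the expected shape) are summed away.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

using Dims = std::vector<std::int64_t>;

inline std::int64_t numel(const Dims& dims) {
  std::int64_t count = 1;
  for (std::int64_t d : dims) {
    count *= d;
  }
  return count;
}

// Sums a row-major gradient of shape `grad_dims` down to `target_dims`.
inline std::vector<double> sum_to_shape(const std::vector<double>& grad,
                                        const Dims& grad_dims,
                                        const Dims& target_dims) {
  if (target_dims.size() > grad_dims.size() ||
      numel(grad_dims) != static_cast<std::int64_t>(grad.size())) {
    throw std::invalid_argument("shapes are inconsistent");
  }
  const std::size_t ndim = grad_dims.size();

  // Right-align the target shape by padding it with leading ones.
  Dims padded(ndim - target_dims.size(), 1);
  padded.insert(padded.end(), target_dims.begin(), target_dims.end());

  // Row-major strides into the output; broadcast dimensions get stride 0,
  // so every element along them accumulates into the same output slot.
  Dims strides(ndim, 0);
  std::int64_t stride = 1;
  for (std::size_t i = ndim; i-- > 0;) {
    if (padded[i] != 1 && padded[i] != grad_dims[i]) {
      throw std::invalid_argument("target is not broadcastable to grad");
    }
    strides[i] = (padded[i] == 1) ? 0 : stride;
    stride *= padded[i];
  }

  std::vector<double> out(static_cast<std::size_t>(numel(target_dims)), 0.0);
  Dims index(ndim, 0);
  for (std::size_t flat = 0; flat < grad.size(); ++flat) {
    std::int64_t offset = 0;
    for (std::size_t i = 0; i < ndim; ++i) {
      offset += index[i] * strides[i];
    }
    out[static_cast<std::size_t>(offset)] += grad[flat];

    // Advance the multi-dimensional index like an odometer.
    for (std::size_t i = ndim; i-- > 0;) {
      if (++index[i] < grad_dims[i]) {
        break;
      }
      index[i] = 0;
    }
  }
  return out;
}
```

For example, a bias of shape `{3}` broadcast against a `{2, 3}` input receives a `{2, 3}` gradient, which this function sums over the first dimension to get back to `{3}`.
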
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8919

Differential Revision: D8677600

Pulled By: colesbury

fbshipit-source-id: 61bc9cc2a36931dfd00eb7153501003fe0584afd
2018-07-27 14:43:24 -07:00
Mary McBreen
483ae8cb5d Replaces const ref with && for apply (#9175)
Summary:
Addresses https://github.com/pytorch/pytorch/issues/5011
Tested with python test/test_autograd.py
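
As a rough sketch of what switching a parameter from a const reference to `&&` buys, the example below uses a hypothetical `ToyVariableList` (a vector of strings) and a hypothetical `apply_identity` function; it is not the signature from this change. Taking the inputs as an rvalue reference lets the callee take over their storage instead of copying it.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

using ToyVariableList = std::vector<std::string>;

// With `const ToyVariableList&` the callee would have to copy to keep the
// inputs; with `&&` it can move them, reusing the caller's buffer.
inline ToyVariableList apply_identity(ToyVariableList&& inputs) {
  return std::move(inputs);
}
```

Callers then pass a temporary or wrap a named list in `std::move(...)` to make the hand-off explicit.
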
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9175

Reviewed By: zdevito

Differential Revision: D8736377

Pulled By: marymcbreen

fbshipit-source-id: ff86f427f7b2cf0cab5912e7f32812bd0f49a712
2018-07-12 08:31:59 -07:00