Commit Graph

217 Commits

Author SHA1 Message Date
cyy
d0e4ca233e some reference and move fixes (#95942)
This PR introduces some modifications:
1. We find out some const function parameters that can be passed by reference and add the reference.
2. We find more opportunists of passing by value and change them accordingly.
3. Some use-after-move errors are fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95942
Approved by: https://github.com/Skylion007
2023-03-10 03:44:09 +00:00
Kazuaki Ishizaki
69aa6b4bb9 fix typo in comments under torch/csrc/autograd (#96061)
This PR fixes typos in comments of `.cpp` and `.h` files under `torch/csrc/autograd` directory
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96061
Approved by: https://github.com/soulitzer
2023-03-06 18:05:14 +00:00
Aaron Gokaslan
0247ed27cc Apply Clang-Tidy readability-container-size-empty (#93236)
Not only is this change usually shorter and more readable, it also can yield better performance. size() is not always a constant time operation (such as on LinkedLists), but empty() always is.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93236
Approved by: https://github.com/malfet
2023-01-29 23:28:19 +00:00
cyy
f172feae0d More tidy fixes (#93069)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93069
Approved by: https://github.com/Skylion007
2023-01-27 06:40:50 +00:00
cyy
e292ddff4e More clang-tidy fixes (#92944)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92944
Approved by: https://github.com/Skylion007
2023-01-25 19:11:51 +00:00
Kshiteej K
550f98332b [fix] vmap and anomaly mode interaction (#92672)
Fixes https://github.com/pytorch/functorch/issues/1049

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92672
Approved by: https://github.com/albanD
2023-01-24 18:12:52 +00:00
Aaron Gokaslan
8c8cd9539d Add missing moves to torch autograd (#92772)
Applies some additional std::move functions to torch/csrc/autograd to opportunities that were found via static analysis.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92772
Approved by: https://github.com/ezyang
2023-01-24 02:01:52 +00:00
soulitzer
a112814a7f Simplify retains grad hook implementation (#92604)
How the old retains_grad hooks was implemented:
- retains_grad hooks are stored on the autograd_meta, as entries in a vector
- upon registration, a wrapper hook CppFunctionTensorPreHook is created to wrap that vector, and then that wrapper hook is registered to the grad_fn, i.e., by appending it to a vector of retains_grad hooks on the grad_fn
- upon in-place, for the old grad_fn we set the retains_grad hook to nullptr, so that even though the old grad_fn still references the vector, the vector contains a single nullptr. For the new grad_fn, we create a new wrapper hook around the vector (storing the single retains_grad hook) on autograd_meta.

The new retains_grad hook implementation:
- we store std::function by value, and we store it on the grad_fn rather than the autograd_meta
- a single grad_fn can have multiple outputs, so it can potentially hold multiple retains_grad hooks. We use an unordered_map (previously a vector).
- on in-place we remove the hook from the old grad_fn and put it in the new grad_fn (small implication of this change is that  we we now need to have access to both the old grad_fn and new grad_fn, this isn't a problem)

Other details:
- CppFunctionTensorPreHook took a shared_ptr to vector of std::function. In our new implementation, we add a new wrapper hook CppFunctionSingleTensorPreHook, which takes a single std::function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92604
Approved by: https://github.com/albanD
2023-01-23 20:10:46 +00:00
soulitzer
9db4323e4c Deprecate capture hooks except distributed use case (#92653)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92653
Approved by: https://github.com/albanD
2023-01-20 20:51:46 +00:00
soulitzer
1bc60c6b31 [reland] Improve hooks ordering behavior (#92559)
This reverts commit e525f433e1.

Original PR:  #85849
Fixes #ISSUE_NUMBER

In addition to reverting the revert, this PR:
- defines the virtual destructor of FunctionPreHook in the header. Why? Presumably the internal build imports the header from somewhere, but does not have function_hooks.cpp (where the virtual destructor was previously defined) in the same compilation unit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92559
Approved by: https://github.com/albanD
2023-01-19 08:17:32 +00:00
PyTorch MergeBot
e525f433e1 Revert "Improve hooks ordering behavior (#85849)"
This reverts commit 049838f249.

Reverted https://github.com/pytorch/pytorch/pull/85849 on behalf of https://github.com/albanD due to fails internal build
2023-01-18 15:27:22 +00:00
soulitzer
049838f249 Improve hooks ordering behavior (#85849)
Addresses: https://github.com/pytorch/pytorch/issues/35802

Design doc: https://docs.google.com/document/d/19xSib7FFknRQ5f3ptGFUmiOt3BrgXSUlTQH2xMcZJYg/edit#

### Changes in this PR

#### Implementation
- We have now have 3 fields: pre_hooks, retains_grad_hooks, and tensor_pre_hooks so that we can more precisely define their ordering and when they are executed.
- Since retains grad uses an entirely new field, we cannot reuse the old retains grad, logic. We refactor retains grad to call directly into the variable.cpp logic. Other logic in variable.cpp that handle cpp hooks must also be updated.

#### Hooks ordering and execution:
- Defines pre-hooks registered on tensor to run before pre-hooks registered on grad_fn
- Updates pre-hooks registered on tensor to always run, even if they are the inputs= to .grad()
- Post hooks (and pre hooks) can now observe the modifications to gradient by the tensor pre hook

#### Retains grad hooks
- retains grad hooks always execute last, even if there are other tensor pre-hooks registered

#### Unchanged:
- pre_hooks registered to grad_fn aren't expected to execute if they are the inputs= to .grad()

Follow ups:
- simplify retains_grad field to not be a vector, since it always holds a single hook
- potentially merge capture hooks with tensor pre hooks, this would involve some additional refactoring since
- python hooks registered to tensor behavior on in-place is still wrong

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85849
Approved by: https://github.com/albanD
2023-01-17 16:23:21 +00:00
cyy
9b716a0682 Clean up more clang-tidy supression (#92203)
1. remove unused NOLINTNEXTLINE(performance-move-const-arg)
2. add more std::move

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92203
Approved by: https://github.com/Skylion007
2023-01-17 05:43:08 +00:00
Aidyn-A
b95e1d76a8 [CUDA12] Conditionally set device in autograd engine (#91191)
CUDA 12 introduces behavioral changes in `cudaSetDevice`. In the old version it would just set the device to be used for kernel launches and memory allocations without creating a CUDA context. Now, in CUDA 12, every time `cudaSetDevice` is called for the first time it creates a CUDA context. See issue #91122.

The autograd engine iterates over all devices and sets them:
f8b348c1fc/torch/csrc/autograd/engine.cpp (L1399-L1402)
f8b348c1fc/torch/csrc/autograd/engine.cpp (L349)

Which causes pollution of CUDA contexts on sibling devices.
This PR introduces a workaround this issue by conditionally setting the device.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91191
Approved by: https://github.com/ngimel
2022-12-22 19:54:45 +00:00
Nikita Shulga
621158cd7f [BE] Do not assign string literal to char * (#87949)
Not sure, what I was thinking when writing something like:
```
auto foo = std::getenv("BAR");
if (!foo) {
   foo = "baz";
}
```
as `std::getenv` return `char *` (i.e. mutable string), but string literals are immutable. (i.e. `const char *`)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87949
Approved by: https://github.com/kit1980
2022-10-30 01:04:55 +00:00
soulitzer
adb76ef510 Expose API for backward execution order (#87507)
In this PR:
- graph_task stores graph roots on construction so that we can later traverse through the graph
- before the nodes are returned, they needed to be converted from raw_ptr to shared_ptr, and this should be OK because the graph is guaranteed to be alive

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87507
Approved by: https://github.com/albanD
2022-10-26 21:28:45 +00:00
Kurt Mohler
1dbc8ad3b7 Add Warning class and refactor C++ warnings to use it (#84101)
Also adds `TORCH_WARN_WITH` and `TORCH_WARN_DEPRECATION` macros

Part of #72948

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84101
Approved by: https://github.com/albanD
2022-10-18 20:02:42 +00:00
soulitzer
ba3fde6aa0 Add multi-grad hooks (#86260)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86260
Approved by: https://github.com/albanD
2022-10-07 21:16:45 +00:00
Pearu Peterson
88b882cd1c Support sum on a sparse COO tensor. (#86300)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86300
Approved by: https://github.com/cpuhrsch
2022-10-06 18:39:28 +00:00
Pearu Peterson
f104490d63 Support autograd on Linear with sparse compressed weight. (#86137)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86137
Approved by: https://github.com/cpuhrsch
2022-10-06 18:39:25 +00:00
Elias Ellison
d04889323e Add Context Manager for Disabling Multithreading in Backwards, use in aot autograd (#86245)
We were running into a few issues with running multithreaded backwards in aot_autograd: such as https://github.com/pytorch/pytorch/issues/86136, and `FakeTensorMode` getting into a weird state as a result of not executing functions completely sequentially. The multithreaded backwards is lost in translation when we trace out the backwards anyway, and adds a lot of additional complexity.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86245
Approved by: https://github.com/albanD, https://github.com/yf225
2022-10-06 03:27:42 +00:00
soulitzer
a876432aea Expose torch._will_engine_execute_node (#84773)
Addresses: https://github.com/pytorch/pytorch/issues/83617

This PR a way to query the TLS graph task's exec_info which is a map mapping the Node to a bool indicating whether it will be executed in the current backward pass (as determined by the inputs= argument for .grad of .backward).
- this works with both custom Function nodes and normal codegened nodes
-  to be able to verify whether the pyobject passed is an actual node, we now store pointers to PyTypeObjects into a set on registration.
- error out when .backward without inputs= to avoid silently returning True

Alternatives:
- not sure if it is possible to bind to Python from a raw pointer to Node. At least we wouldn't be able to use existing logic, and the Python object should only hold a weak reference to the Node.
- other solutions to the motivating issue seem to require more extensive modification to the engine

See the issue linked for an example of usage
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84773
Approved by: https://github.com/albanD
2022-09-28 20:13:52 +00:00
soulitzer
31fad3926a Add option to run anomaly mode without nan checking (#83481)
Fixes https://github.com/pytorch/pytorch/issues/83117

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83481
Approved by: https://github.com/albanD
2022-08-16 22:56:23 +00:00
Horace He
ea51e87b52 Added list clearing codegen to AOTAutograd (hidden behind config.aot_clear_list (#83137)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83137
Approved by: https://github.com/jansel, https://github.com/albanD
2022-08-12 22:52:16 +00:00
richard
382ef1fda7 Autograd graphtask trim unnecessary edges (#82544)
### Introduction
<!-- What did you change and why was it needed? -->

Removing unnecessary weight gradient calculation is very important for applications that need high-order derivatives during training. However, this is not supported by the current Autograd engine.

For more detail: The backward function of a `matmul` operator (e.g., `linear` `addmm` `mm`), has two matmuls, one for `input gradient` and another for `weight gradient`. For a typical neural network (nn) with a few linear layers and activation functions, if the user calls `torch.autograd.grad()` to calculate the derivative of the nn output `y` w.r.t the nn input `x`,  only the `input gradient` of the `matmul` operator is needed, and the `weight gradient` is discarded. However, the current PyTorch autograd engine will always calculate the `weight gradient` if `weight` requires gradient (the calculation of the high-order derivative is performed during training).

The figure attached shows the autograd graph of the following code snippet:
```py
y = torch.nn.functional.linear(x, weight, bias)
y = y.pow(2)
# first order derivative
y__x, = torch.autograd.grad(y, x, grad_outputs=grad_outputs, create_graph=True)
# first order derivative
y__x__x, = torch.autograd.grad(y__x, x, grad_outputs=grad_outputs, create_graph=True)
```
The path with  is not needed when calculating derivatives.

<img width="50%" alt="image" src="https://user-images.githubusercontent.com/9999318/182018117-719c5a23-bcc6-4a63-8e8d-1bca3ebda2e3.png">

### Issue
<!-- Link to Issue ticket or RFP -->
Related issue: https://github.com/pytorch/pytorch/issues/56500

### Method
When calling `torch.autograd.grad`, `exec_info_` is created for each GraphTask, which allows filtering paths on the graph that are not needed. However, when the GraphTask calls into the node, the node still does not know whether the edges are needed or not. In the case of matmul, `weight.requires_grad is True` so the weight gradient is always calculated.

Following https://github.com/pytorch/pytorch/issues/56500#issuecomment-825694656, this PR passes the graph task's thread_local `exec_info_` into the node, so it could trim unnecessary edges during `torch.autograd.grad` calls.

### Benchmark
Benchmark script: https://gist.github.com/yueyericardo/24158433a2021c51eeef9c3e2722df99

Benchmark result:
6 hidden layers, batch size 10000, on A100

FP32 result
| hessian benchmark             | FP32 (before) | FP32 (After)      | FP32 (Functorch v0.1.1) |
| ----------------------------- | ------------- | ----------------- | ----------------------- |
| Linear + ReLU (no backward)   | 55.658 ms     | 29.392 ms (1.90X) | 29.547 ms (1.90X)       |
| Linear + ReLU (with backward) | 81.173 ms     | 54.917 ms (1.47X) | 68.988 ms (1.18X)       |

TF32 result
| hessian benchmark             | TF32 (before) | TF32 (after)      | TF32 (Functorch v0.1.1) |
| ----------------------------- | ------------- | ----------------- | ----------------------- |
| Linear + ReLU (no backward)   | 19.801 ms     | 11.259 ms (1.76X) | 10.754 ms (1.84X)       |
| Linear + ReLU (with backward) | 29.167 ms     | 20.466 ms (1.42X) | 22.784 ms (1.28X)       |

For FP32 result, we could get 1.9X speed up for hessian calculation, and 1.47X speed up during training, which is even faster than functorch `vmap(jacfwd(jacrev` implementation. (functorch has performance regression on v0.2.0, https://github.com/pytorch/functorch/issues/989, so we are using v0.1.1 for benchmark)

@zou3519 does functorch also includes similar optimizations during hessian calculation? If not, what do we need to do so the functorch could also benefit from this PR?

### Testing
<!-- How did you test your change? -->

- [x] we need to figure out a way for unittest

### Thanks
Thanks for the great blog: [How Computational Graphs are Executed in PyTorch | PyTorch](https://pytorch.org/blog/how-computational-graphs-are-executed-in-pytorch/)

cc @zasdfgbnm @albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82544
Approved by: https://github.com/soulitzer
2022-08-11 18:50:09 +00:00
Jeff Daily
7d031db4a5 move ROCmBackwardPassGuard from autograd engine.cpp to function.h (#82187)
This moves the ROCmBackwardPassGuard back to its previous, verified location.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82187
Approved by: https://github.com/albanD
2022-07-26 22:33:18 +00:00
jjsjann123
9e86796fe3 simple c10 implementation for std::call_once (#78051)
A long standing bug on std::call_once: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66146
It could hang during re-entry after an exception handling.

Added a c10 implementation yielding a bulky mutex. Not the most efficient thing but at least it shouldn't hang.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78051
Approved by: https://github.com/albanD
2022-06-28 15:47:03 +00:00
drisspg
f9656817df Add nested tensor support to autograd (#79446)
The issue that is tracking this work is: #79447

This is one in a series of PRs to add autograd support for nested tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79446
Approved by: https://github.com/soulitzer
2022-06-16 21:09:17 +00:00
Michael Suo
30fb2c4aba [lint] autoformat test/cpp and torch/csrc
Let's have some fun.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/78828

Approved by: https://github.com/ezyang
2022-06-11 21:11:16 +00:00
drisspg
10dfdf2066 engine changes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79296

Approved by: https://github.com/soulitzer
2022-06-11 00:07:27 +00:00
Jeff Daily
de86146c61 rocblas alt impl during backward pass only (#71881)
In preparation of adopting future rocblas library options, it is necessary to track when the backward pass of training is executing.  The scope-based helper class `BackwardPassGuard` is provided to toggle state.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71881
Approved by: https://github.com/albanD
2022-05-18 19:42:58 +00:00
Scott Wolchok
52af4fc5ba [PyTorch] Make RecordFunction store inputs as ArrayRef (#72484)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72484

Stepping stone toward stack-allocating array of inputs.

Funnily enough, this seems to improve performance too.
ghstack-source-id: 155492056

Test Plan:
1) CI
2) framework overhead benchmark with --stressTestRecordFunction --captureRecordFunctionInputs goes from 0.76 usec/iter to 0.72.

Reviewed By: chaekit, robieta

Differential Revision: D34061169

fbshipit-source-id: 073fedf1d3d162f927c4e9867cfda7dbfabba215
(cherry picked from commit dae77cf1cd8813d902d73999ad97133a3ef8e291)
2022-05-05 21:38:42 +00:00
Hannes Friederich
5932c37198 [caffe2] drop XROS ports (#76366)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76366

caffe2 is not currently being built for XROS.

Test Plan: CI

Reviewed By: kimishpatel

Differential Revision: D35923922

fbshipit-source-id: 260dacadf0bd5b6bab7833a4ce81e896d280b053
(cherry picked from commit 8370b8dd2519d55a79fa8d45e7951ca8dc0b21a8)
2022-04-26 23:54:22 +00:00
Ivan Yashchuk
bba4780232 Enable autograd wrt sparse CSR tensors
This pull request enables accumulating gradients for the CSR tensor.
Functions that work and are tested:
- tensor.abs()
- tensor.neg()
- tensor.conj_physical()
- torch.addmm

`torch.mm` also works, but tests will be added later.

In addition, this PR adds throwing an error when trying to access strides, storage, and contiguity info on a CSR tensor.

`tensor.to_sparse_csr().to_sparse_csr()` was failing and now fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75435
Approved by: https://github.com/cpuhrsch
2022-04-19 18:42:45 +00:00
Alban Desmaison
4d04ef62a1 Allow forking until a worker thread is created in autograd engine (#72689)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72689

Fix https://github.com/pytorch/pytorch/issues/69839

Should we add a private python binding to check if the bad fork guard has been set and add test in CI to make sure that it is never set on our CPU-only CI build? Not sure how flaky that will be out of CI for people that run CPU build on a machine that cuda installed...
EDIT: turns out, we already had such tests in test_multiprocessing. So should be tested and enforced now!

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D34180243

Pulled By: albanD

fbshipit-source-id: 3284db52dcf4568362244b60e3c5657153e64fa4
(cherry picked from commit 6e23f7a33a)
2022-02-12 01:52:57 +00:00
Alban Desmaison
a877441494 Clean up use of cpu ready queue in autograd engine (#72688)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72688

Refactor how we know what to run on the cpu queue.
The Lazy Tensor moved there as it is always present as a device guard and would make the number of devices 1 all the time (forcing the creation of a thread).

FYI wconstab  you most likely don't care about this unless you ever use multiple Lazy device?
This should slightly improve the perf if you run backward with Lazy Tensors as the work will be done in the main thread and not a worker thread.

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D34180245

Pulled By: albanD

fbshipit-source-id: 88c5d5bdd631ad01bf271d720d1eab69aba84fc0
(cherry picked from commit da7e9b902f)
2022-02-12 01:52:56 +00:00
Alban Desmaison
1312524759 Remove un-used function in autograd engine (#72687)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72687

Not sure who would use that. It is not used in the code base as far as I can see. And I don't know of anyone working with the engine directly out of tree. So tentatively removing it.

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D34180244

Pulled By: albanD

fbshipit-source-id: 678ba1c4a1cbd9a0458d33be97664d1e3d1bd86b
(cherry picked from commit 3968ca3a38)
2022-02-12 01:52:56 +00:00
rraminen
07ca1fc88b remove hasPrimaryContext workaround on ROCm (#71146)
Summary:
As issue https://github.com/pytorch/pytorch/issues/59750 is fixed, this PR is to remove the workaround implemented for it on ROCm.

Enabled hasPrimaryContext() related PyTorch unit tests on ROCm.

cc: amathews-amd, jithunnair-amd

cc jeffdaily sunway513 jithunnair-amd ROCmSupport KyleCZH

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71146

Reviewed By: anjali411

Differential Revision: D33754615

Pulled By: albanD

fbshipit-source-id: b3a5c65a20c6d52d5f2ffc9e6f9628c819329b5d
(cherry picked from commit cfdd12166c)
2022-01-25 20:32:12 +00:00
Peter Bell
b08d64202a Remove THGeneral (#69041)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69041

`TH_CONCAT_{N}` is still being used by THP so I've moved that into
it's own header but all the compiled code is gone.

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D32872477

Pulled By: ngimel

fbshipit-source-id: 06c82d8f96dbcee0715be407c61dfc7d7e8be47a
2021-12-13 16:14:28 -08:00
Veselin Petrov
065018d812 [pytorch][xros] Ensure all pytorch mobile operators build ok in XROS mode (#68266)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68266
* Use `if...endif` to adjust pyTorch internals towards XROS

Test Plan: CI

Reviewed By: kkosik20

Differential Revision: D32190771

fbshipit-source-id: cce073dea53c2b5681d913321101cd83c6472019
2021-11-15 19:52:45 -08:00
Peter Bell
5f45927d15 Autograd: Delay warnings until the end of backward execution (#66235)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50209

This adds a new warning handler that stores all warnings in a shared
queue, which can be "replayed" at a later time and, crucially, on
another thread. Then, I use this inside the autograd engine to ensure
that warnings are processed by the handler registered on the main
thread.

For testing, I also add an operator that always warns in the backward
pass and test that the warning is a normal Python warning.

cc ezyang albanD zou3519 gqchen pearu nikitaved soulitzer Lezcano Varal7

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66235

Reviewed By: ejguan

Differential Revision: D31505413

Pulled By: albanD

fbshipit-source-id: 1a7f60b038f55c20591c0748b9e86735b3fec2f9
2021-10-13 15:38:04 -07:00
Pruthvi Madugundu
085e2f7bdd [ROCm] Changes not to rely on CUDA_VERSION or HIP_VERSION (#65610)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65610

- Replace HIP_PLATFORM_HCC with USE_ROCM
- Dont rely on CUDA_VERSION or HIP_VERSION and use USE_ROCM and ROCM_VERSION.

- In the next PR
   - Will be removing the mapping from CUDA_VERSION to HIP_VERSION and CUDA to HIP in hipify.
   - HIP_PLATFORM_HCC is deprecated, so will add HIP_PLATFORM_AMD to support HIP host code compilation on gcc.

cc jeffdaily sunway513 jithunnair-amd ROCmSupport amathews-amd

Reviewed By: jbschlosser

Differential Revision: D30909053

Pulled By: ezyang

fbshipit-source-id: 224a966ebf1aaec79beccbbd686fdf3d49267e06
2021-09-29 09:55:43 -07:00
Alban Desmaison
b858993c97 Fix engine check for case where grad is a subclass (#65568)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65568

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D31158089

Pulled By: albanD

fbshipit-source-id: 2a77df9b6340107de02a043b57a36cb7ae68df34
2021-09-24 08:41:19 -07:00
anjali411
158393e1a1 Fix autograd engine checks and update InputMetadata (#65235)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65235

1. Updated the legacy type checks in `torch/csrc/autograd/engine.cpp` to individually validate the dtype, device, and layout equality for grad and tensor.
2. Removed device field from `InputMetadata` since it's already stored via storing options. Also, added `dtype()` and `layout()` methods to `InputMetadata`. To make this change, some calls had to be updated due to the change in constructor.
3. To fix https://github.com/pytorch/pytorch/issues/65016:
     a. Added a `is_tensor_subclass` field in `InputMetadata` to skip device checks for grad and tensor when the tensor has
         python key set on it (tensor subclass).

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D31117318

Pulled By: anjali411

fbshipit-source-id: 825401df98695c48bf9b320be54585f6aff500bd
2021-09-22 11:01:19 -07:00
Brian Hirsh
152f0236c3 Revert D31082693: Fix autograd engine checks and update InputMetadata
Test Plan: revert-hammer

Differential Revision:
D31082693 (9324d682fd)

Original commit changeset: cb551cd438c6

fbshipit-source-id: fc60f86b80fc70058984df6bccbf240d27f5843e
2021-09-22 10:00:08 -07:00
anjali411
9324d682fd Fix autograd engine checks and update InputMetadata (#65235)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65235

1. Updated the legacy type checks in `torch/csrc/autograd/engine.cpp` to individually validate the dtype, device, and layout equality for grad and tensor.
2. Removed device field from `InputMetadata` since it's already stored via storing options. Also, added `dtype()` and `layout()` methods to `InputMetadata`. To make this change, some calls had to be updated due to the change in constructor.
3. To fix https://github.com/pytorch/pytorch/issues/65016:
     a. Added a `is_tensor_subclass` field in `InputMetadata` to skip device checks for grad and tensor when the tensor has
         python key set on it (tensor subclass).

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D31082693

Pulled By: anjali411

fbshipit-source-id: cb551cd438c6ca40b0f18a4d0009e0861cf0fd4e
2021-09-22 07:49:52 -07:00
Rohan Varma
421d8f86b6 Add a record scope around autograd::engine::evaluate_function (#63619)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63619

Adds a RECORD_FUNCTION with the function that is being valuate as part
of backwards execution. This has been useful in picking up some operations
in the backwards pass that otherwise would not show up, for example custom cpp
functions that use custom C++ code.
ghstack-source-id: 137041723

Test Plan:
CI

benchmark:
buck run mode/opt //scripts/rvarm1/ddp:bench

Reviewed By: albanD

Differential Revision: D30439492

fbshipit-source-id: 955917770cdf2a2edb0303223ace710b668ba388
2021-09-01 12:32:30 -07:00
albanD
733755f72c remove special grad_mode tls handling (#63116)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63116

This PR removes the special flag to disable grad mode tracking on the ThreadLocalState and replaces it with an explicit setter that users can use.
This allows to reduce complexity of ThreadLocalState.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D30388098

Pulled By: albanD

fbshipit-source-id: 85641b3d711179fb78ff6a41ed077548dc821a2f
2021-08-26 07:51:30 -07:00
albanD
ab954cb0d1 clean up engine.cpp thread state (#63115)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63115

This actually changes:
- callbacks now run with proper grad mode even in worker threads
- graphtask's Future callbacks now run with proper TLS when erroring
  out from a worker thread

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D30388100

Pulled By: albanD

fbshipit-source-id: 7ae9c461c2f0040548dd9e1e314f25e8da0c2e67
2021-08-25 11:08:43 -07:00
Nikita Shulga
a9b0a921d5 Disable avoid-non-const-global-variables lint check (#62008)
Summary:
As GoogleTest `TEST` macro is non-compliant with it as well as `DEFINE_DISPATCH`

All changes but the ones to `.clang-tidy` are generated using following script:
```
for i in `find . -type f -iname "*.c*" -or -iname "*.h"|xargs grep cppcoreguidelines-avoid-non-const-global-variables|cut -f1 -d:|sort|uniq`;  do sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" $i; done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008

Reviewed By: driazati, r-barnes

Differential Revision: D29838584

Pulled By: malfet

fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
2021-07-22 18:04:40 -07:00