Commit Graph

188 Commits

Author SHA1 Message Date
drisspg
10dfdf2066 engine changes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79296

Approved by: https://github.com/soulitzer
2022-06-11 00:07:27 +00:00
Jeff Daily
de86146c61 rocblas alt impl during backward pass only (#71881)
In preparation of adopting future rocblas library options, it is necessary to track when the backward pass of training is executing.  The scope-based helper class `BackwardPassGuard` is provided to toggle state.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71881
Approved by: https://github.com/albanD
2022-05-18 19:42:58 +00:00
Scott Wolchok
52af4fc5ba [PyTorch] Make RecordFunction store inputs as ArrayRef (#72484)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72484

Stepping stone toward stack-allocating array of inputs.

Funnily enough, this seems to improve performance too.
ghstack-source-id: 155492056

Test Plan:
1) CI
2) framework overhead benchmark with --stressTestRecordFunction --captureRecordFunctionInputs goes from 0.76 usec/iter to 0.72.

Reviewed By: chaekit, robieta

Differential Revision: D34061169

fbshipit-source-id: 073fedf1d3d162f927c4e9867cfda7dbfabba215
(cherry picked from commit dae77cf1cd8813d902d73999ad97133a3ef8e291)
2022-05-05 21:38:42 +00:00
Hannes Friederich
5932c37198 [caffe2] drop XROS ports (#76366)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76366

caffe2 is not currently being built for XROS.

Test Plan: CI

Reviewed By: kimishpatel

Differential Revision: D35923922

fbshipit-source-id: 260dacadf0bd5b6bab7833a4ce81e896d280b053
(cherry picked from commit 8370b8dd2519d55a79fa8d45e7951ca8dc0b21a8)
2022-04-26 23:54:22 +00:00
Ivan Yashchuk
bba4780232 Enable autograd wrt sparse CSR tensors
This pull request enables accumulating gradients for the CSR tensor.
Functions that work and are tested:
- tensor.abs()
- tensor.neg()
- tensor.conj_physical()
- torch.addmm

`torch.mm` also works, but tests will be added later.

In addition, this PR adds throwing an error when trying to access strides, storage, and contiguity info on a CSR tensor.

`tensor.to_sparse_csr().to_sparse_csr()` was failing and now fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75435
Approved by: https://github.com/cpuhrsch
2022-04-19 18:42:45 +00:00
Alban Desmaison
4d04ef62a1 Allow forking until a worker thread is created in autograd engine (#72689)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72689

Fix https://github.com/pytorch/pytorch/issues/69839

Should we add a private python binding to check if the bad fork guard has been set and add test in CI to make sure that it is never set on our CPU-only CI build? Not sure how flaky that will be out of CI for people that run CPU build on a machine that cuda installed...
EDIT: turns out, we already had such tests in test_multiprocessing. So should be tested and enforced now!

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D34180243

Pulled By: albanD

fbshipit-source-id: 3284db52dcf4568362244b60e3c5657153e64fa4
(cherry picked from commit 6e23f7a33a)
2022-02-12 01:52:57 +00:00
Alban Desmaison
a877441494 Clean up use of cpu ready queue in autograd engine (#72688)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72688

Refactor how we know what to run on the cpu queue.
The Lazy Tensor moved there as it is always present as a device guard and would make the number of devices 1 all the time (forcing the creation of a thread).

FYI wconstab  you most likely don't care about this unless you ever use multiple Lazy device?
This should slightly improve the perf if you run backward with Lazy Tensors as the work will be done in the main thread and not a worker thread.

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D34180245

Pulled By: albanD

fbshipit-source-id: 88c5d5bdd631ad01bf271d720d1eab69aba84fc0
(cherry picked from commit da7e9b902f)
2022-02-12 01:52:56 +00:00
Alban Desmaison
1312524759 Remove un-used function in autograd engine (#72687)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72687

Not sure who would use that. It is not used in the code base as far as I can see. And I don't know of anyone working with the engine directly out of tree. So tentatively removing it.

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D34180244

Pulled By: albanD

fbshipit-source-id: 678ba1c4a1cbd9a0458d33be97664d1e3d1bd86b
(cherry picked from commit 3968ca3a38)
2022-02-12 01:52:56 +00:00
rraminen
07ca1fc88b remove hasPrimaryContext workaround on ROCm (#71146)
Summary:
As issue https://github.com/pytorch/pytorch/issues/59750 is fixed, this PR is to remove the workaround implemented for it on ROCm.

Enabled hasPrimaryContext() related PyTorch unit tests on ROCm.

cc: amathews-amd, jithunnair-amd

cc jeffdaily sunway513 jithunnair-amd ROCmSupport KyleCZH

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71146

Reviewed By: anjali411

Differential Revision: D33754615

Pulled By: albanD

fbshipit-source-id: b3a5c65a20c6d52d5f2ffc9e6f9628c819329b5d
(cherry picked from commit cfdd12166c)
2022-01-25 20:32:12 +00:00
Peter Bell
b08d64202a Remove THGeneral (#69041)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69041

`TH_CONCAT_{N}` is still being used by THP so I've moved that into
it's own header but all the compiled code is gone.

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D32872477

Pulled By: ngimel

fbshipit-source-id: 06c82d8f96dbcee0715be407c61dfc7d7e8be47a
2021-12-13 16:14:28 -08:00
Veselin Petrov
065018d812 [pytorch][xros] Ensure all pytorch mobile operators build ok in XROS mode (#68266)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68266
* Use `if...endif` to adjust pyTorch internals towards XROS

Test Plan: CI

Reviewed By: kkosik20

Differential Revision: D32190771

fbshipit-source-id: cce073dea53c2b5681d913321101cd83c6472019
2021-11-15 19:52:45 -08:00
Peter Bell
5f45927d15 Autograd: Delay warnings until the end of backward execution (#66235)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50209

This adds a new warning handler that stores all warnings in a shared
queue, which can be "replayed" at a later time and, crucially, on
another thread. Then, I use this inside the autograd engine to ensure
that warnings are processed by the handler registered on the main
thread.

For testing, I also add an operator that always warns in the backward
pass and test that the warning is a normal Python warning.

cc ezyang albanD zou3519 gqchen pearu nikitaved soulitzer Lezcano Varal7

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66235

Reviewed By: ejguan

Differential Revision: D31505413

Pulled By: albanD

fbshipit-source-id: 1a7f60b038f55c20591c0748b9e86735b3fec2f9
2021-10-13 15:38:04 -07:00
Pruthvi Madugundu
085e2f7bdd [ROCm] Changes not to rely on CUDA_VERSION or HIP_VERSION (#65610)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65610

- Replace HIP_PLATFORM_HCC with USE_ROCM
- Dont rely on CUDA_VERSION or HIP_VERSION and use USE_ROCM and ROCM_VERSION.

- In the next PR
   - Will be removing the mapping from CUDA_VERSION to HIP_VERSION and CUDA to HIP in hipify.
   - HIP_PLATFORM_HCC is deprecated, so will add HIP_PLATFORM_AMD to support HIP host code compilation on gcc.

cc jeffdaily sunway513 jithunnair-amd ROCmSupport amathews-amd

Reviewed By: jbschlosser

Differential Revision: D30909053

Pulled By: ezyang

fbshipit-source-id: 224a966ebf1aaec79beccbbd686fdf3d49267e06
2021-09-29 09:55:43 -07:00
Alban Desmaison
b858993c97 Fix engine check for case where grad is a subclass (#65568)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65568

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D31158089

Pulled By: albanD

fbshipit-source-id: 2a77df9b6340107de02a043b57a36cb7ae68df34
2021-09-24 08:41:19 -07:00
anjali411
158393e1a1 Fix autograd engine checks and update InputMetadata (#65235)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65235

1. Updated the legacy type checks in `torch/csrc/autograd/engine.cpp` to individually validate the dtype, device, and layout equality for grad and tensor.
2. Removed device field from `InputMetadata` since it's already stored via storing options. Also, added `dtype()` and `layout()` methods to `InputMetadata`. To make this change, some calls had to be updated due to the change in constructor.
3. To fix https://github.com/pytorch/pytorch/issues/65016:
     a. Added a `is_tensor_subclass` field in `InputMetadata` to skip device checks for grad and tensor when the tensor has
         python key set on it (tensor subclass).

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D31117318

Pulled By: anjali411

fbshipit-source-id: 825401df98695c48bf9b320be54585f6aff500bd
2021-09-22 11:01:19 -07:00
Brian Hirsh
152f0236c3 Revert D31082693: Fix autograd engine checks and update InputMetadata
Test Plan: revert-hammer

Differential Revision:
D31082693 (9324d682fd)

Original commit changeset: cb551cd438c6

fbshipit-source-id: fc60f86b80fc70058984df6bccbf240d27f5843e
2021-09-22 10:00:08 -07:00
anjali411
9324d682fd Fix autograd engine checks and update InputMetadata (#65235)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65235

1. Updated the legacy type checks in `torch/csrc/autograd/engine.cpp` to individually validate the dtype, device, and layout equality for grad and tensor.
2. Removed device field from `InputMetadata` since it's already stored via storing options. Also, added `dtype()` and `layout()` methods to `InputMetadata`. To make this change, some calls had to be updated due to the change in constructor.
3. To fix https://github.com/pytorch/pytorch/issues/65016:
     a. Added a `is_tensor_subclass` field in `InputMetadata` to skip device checks for grad and tensor when the tensor has
         python key set on it (tensor subclass).

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D31082693

Pulled By: anjali411

fbshipit-source-id: cb551cd438c6ca40b0f18a4d0009e0861cf0fd4e
2021-09-22 07:49:52 -07:00
Rohan Varma
421d8f86b6 Add a record scope around autograd::engine::evaluate_function (#63619)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63619

Adds a RECORD_FUNCTION with the function that is being valuate as part
of backwards execution. This has been useful in picking up some operations
in the backwards pass that otherwise would not show up, for example custom cpp
functions that use custom C++ code.
ghstack-source-id: 137041723

Test Plan:
CI

benchmark:
buck run mode/opt //scripts/rvarm1/ddp:bench

Reviewed By: albanD

Differential Revision: D30439492

fbshipit-source-id: 955917770cdf2a2edb0303223ace710b668ba388
2021-09-01 12:32:30 -07:00
albanD
733755f72c remove special grad_mode tls handling (#63116)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63116

This PR removes the special flag to disable grad mode tracking on the ThreadLocalState and replaces it with an explicit setter that users can use.
This allows to reduce complexity of ThreadLocalState.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D30388098

Pulled By: albanD

fbshipit-source-id: 85641b3d711179fb78ff6a41ed077548dc821a2f
2021-08-26 07:51:30 -07:00
albanD
ab954cb0d1 clean up engine.cpp thread state (#63115)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63115

This actually changes:
- callbacks now run with proper grad mode even in worker threads
- graphtask's Future callbacks now run with proper TLS when erroring
  out from a worker thread

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D30388100

Pulled By: albanD

fbshipit-source-id: 7ae9c461c2f0040548dd9e1e314f25e8da0c2e67
2021-08-25 11:08:43 -07:00
Nikita Shulga
a9b0a921d5 Disable avoid-non-const-global-variables lint check (#62008)
Summary:
As GoogleTest `TEST` macro is non-compliant with it as well as `DEFINE_DISPATCH`

All changes but the ones to `.clang-tidy` are generated using following script:
```
for i in `find . -type f -iname "*.c*" -or -iname "*.h"|xargs grep cppcoreguidelines-avoid-non-const-global-variables|cut -f1 -d:|sort|uniq`;  do sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" $i; done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008

Reviewed By: driazati, r-barnes

Differential Revision: D29838584

Pulled By: malfet

fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
2021-07-22 18:04:40 -07:00
Richard Barnes
a91be24e2d Modernize make pointers (#61741)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61741

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D29717385

fbshipit-source-id: 4452b77981e49175f744bdaab12cd225bf75b90e
2021-07-22 15:54:37 -07:00
johnlu
f59ac5abc8 Add thread local state guards in autograd engine hooks. (#60067)
Summary:
The thread local state of backward thread is not aligned to the GraphTask's `thread_local_` when calling the hooks in backward.

This is required for profiling the statistics c10d operation of `DistributedDataParallel` module.

Is there any concern to add the thread local state guard when calling the hooks in backward? ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60067

Reviewed By: ezyang

Differential Revision: D29654599

Pulled By: albanD

fbshipit-source-id: 656c4f91017184fd40f1a184de24757a13387e37
2021-07-20 07:41:49 -07:00
Michael Carilli
2fa6c7627e [CUDA graphs][BC-breaking] Removes post-backward syncs on default stream (#60421)
Summary:
Before https://github.com/pytorch/pytorch/pull/57833, calls to backward() or grad() synced only the calling thread's default stream with autograd leaf streams at the end of backward. This made the following weird pattern safe:
```python
with torch.cuda.stream(s):
    # imagine forward used many streams, so backward leaf nodes may run on many streams
    loss.backward()
# no sync
use grads
```

but a more benign-looking pattern was unsafe:
```python
with torch.cuda.stream(s):
    # imagine forward used a lot of streams, so backward leaf nodes may run on many streams
    loss.backward()
    # backward() syncs the default stream with all the leaf streams, but does not sync s with anything,
    # so counterintuitively (even though we're in the same stream context as backward()!)
    # it is NOT SAFE to use grads here, and there's no easy way to make it safe,
    # unless you manually sync on all the streams you used in forward,
    # or move "use grads" back to default stream outside the context.
    use grads
```
mruberry ngimel and I decided backward() should have the [same user-facing stream semantics as any cuda op](https://pytorch.org/docs/master/notes/cuda.html#stream-semantics-of-backward-passes).** In other words, the weird pattern should be unsafe, and the benign-looking pattern should be safe. Implementationwise, this meant backward() should sync its calling thread's current stream, not default stream, with the leaf streams.

After https://github.com/pytorch/pytorch/pull/57833, backward syncs the calling thread's current stream AND default stream with all leaf streams at the end of backward. The default stream syncs were retained for temporary backward compatibility.

This PR finishes https://github.com/pytorch/pytorch/pull/57833's work by deleting syncs on the default stream.

With this PR, graph-capturing an entire backward() call should be possible (see the [test_graph_grad_scaling diffs](https://github.com/pytorch/pytorch/compare/master...mcarilli:streaming_backwards_remove_default_syncs?expand=1#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450R3641-R3642)).

** first paragraph has a formatting error which this PR should also fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60421

Reviewed By: albanD

Differential Revision: D29370344

Pulled By: ngimel

fbshipit-source-id: 3248bc5fb92fc517db0c15c897e5d7250f67d7fe
2021-06-24 17:34:02 -07:00
Luca Wehrstedt
bb9e1150ea Revert D29342234: [pytorch][PR] [CUDA graphs][BC-breaking] Removes post-backward syncs on default stream
Test Plan: revert-hammer

Differential Revision:
D29342234 (675cea1adb)

Original commit changeset: 98e6be7fdd85

fbshipit-source-id: 84022973248b2254210eee57402df2c4f4bc43c6
2021-06-24 04:49:28 -07:00
Michael Carilli
675cea1adb [CUDA graphs][BC-breaking] Removes post-backward syncs on default stream (#60421)
Summary:
Before https://github.com/pytorch/pytorch/pull/57833, calls to backward() or grad() synced only the calling thread's default stream with autograd leaf streams at the end of backward. This made the following weird pattern safe:
```python
with torch.cuda.stream(s):
    # imagine forward used many streams, so backward leaf nodes may run on many streams
    loss.backward()
# no sync
use grads
```

but a more benign-looking pattern was unsafe:
```python
with torch.cuda.stream(s):
    # imagine forward used a lot of streams, so backward leaf nodes may run on many streams
    loss.backward()
    # backward() syncs the default stream with all the leaf streams, but does not sync s with anything,
    # so counterintuitively (even though we're in the same stream context as backward()!)
    # it is NOT SAFE to use grads here, and there's no easy way to make it safe,
    # unless you manually sync on all the streams you used in forward,
    # or move "use grads" back to default stream outside the context.
    use grads
```
mruberry ngimel and I decided backward() should have the [same user-facing stream semantics as any cuda op](https://pytorch.org/docs/master/notes/cuda.html#stream-semantics-of-backward-passes).** In other words, the weird pattern should be unsafe, and the benign-looking pattern should be safe. Implementationwise, this meant backward() should sync its calling thread's current stream, not default stream, with the leaf streams.

After https://github.com/pytorch/pytorch/pull/57833, backward syncs the calling thread's current stream AND default stream with all leaf streams at the end of backward. The default stream syncs were retained for temporary backward compatibility.

This PR finishes https://github.com/pytorch/pytorch/pull/57833's work by deleting syncs on the default stream.

With this PR, graph-capturing an entire backward() call should be possible (see the [test_graph_grad_scaling diffs](https://github.com/pytorch/pytorch/compare/master...mcarilli:streaming_backwards_remove_default_syncs?expand=1#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450R3641-R3642)).

** first paragraph has a formatting error which this PR should also fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60421

Reviewed By: VitalyFedyunin, albanD

Differential Revision: D29342234

Pulled By: ngimel

fbshipit-source-id: 98e6be7fdd8550872f0a78f9a66cb8dfe75abf63
2021-06-23 23:35:24 -07:00
Michael Carilli
56481f9762 Ensure proper syncs for out-of-place grad creation (torch.autograd.grad) when backward ops run on side streams (#60127)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/59844.

Streaming backwards collects "leaf streams" for AccumulateGrad functions that stash or accumulate .grad attributes for autograd leaf tensors, and syncs those streams with some ambient stream(s) so later ops can safely consume the grads on the ambient stream(s).

But, currently, streaming backwards does not collect leaf streams for grads produced out-of-place (ie, not stashed onto a .grad attribute) by `torch.autograd.grad`, because these out-of-place grads are "captured" and returned before they reach an AccumulateGrad function. Some out-of-place grads might not even have an AccumulateGrad function to go to, because `torch.autograd.grad` can be told to make grads for non-leaf temporaries.[1]

The upshot is, when streaming backwards makes ops that produce out-of-place gradients run on side streams, no ambient stream is told to sync on these side streams, so `torch.autograd.grad` doesn't offer the same post-call safe-use guarantees for grads as the leaf accumulation of `torch.autograd.backward`.

This PR ensures `torch.autograd.grad` gives the same safe-use guarantees as `torch.autograd.backward` by also stashing leaf streams for grads created out-of-place.

I augmented a streaming backwards test to include a torch.autograd.grad attempt. The test fails on current master[2] and passes with the engine.cpp diffs.

I have no idea if this bug or its fix matter to distributed autograd. pritamdamania mrshenli should take a look before it's merged.

[1] example:
```python
leaf = torch.tensor(..., requires_grad=True)
tmp = leaf * 2
loss = tmp.sum()
torch.autograd.grad(loss, inputs=(tmp, leaf))
```
Technically, because `torch.autograd.grad` can be told to produce grads for non-leaf temporaries, these streams might NOT be "leaf streams". Maybe I should rename `leaf_streams`?

[2] the way the test currently fails is fun: it reports
```
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 0 element(s) (out of 25) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.0 (5.0 vs. 5.0), which occurred at index (0, 0).
```
I suspect this [kafka trap](https://en.wiktionary.org/wiki/Kafkatrap) happens because assertEqual does a comparison test on the device, syncs on some bool result, sees failure and prints the tensors post-sync at which point is IS safe to access the values.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60127

Reviewed By: mrshenli

Differential Revision: D29276581

Pulled By: albanD

fbshipit-source-id: a9f797e2fd76e2f884cce5a32ecf5d9b704c88ee
2021-06-23 07:14:01 -07:00
Michael Dagitses
91451369ed require non-empty inputs to grad() calls in the API (#52016)
Summary:
The grad() function needs to return the updated values, and hence
needs a non-empty inputs to populate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52016

Test Plan:
Passes Python and C++ unit tests, and added new tests to catch this behavior.

Fixes https://github.com/pytorch/pytorch/issues/47061

Reviewed By: albanD

Differential Revision: D26406444

Pulled By: dagitses

fbshipit-source-id: 023aeca9a40cd765c5bad6a1a2f8767a33b75a1a
2021-06-22 10:10:58 -07:00
Michael Carilli
be038d8989 [CUDA graphs] Make stream semantics of backward calls consistent with other cuda ops (ci-all edition) (#57833)
Summary:
ci-all resubmit of https://github.com/pytorch/pytorch/pull/54227.

Tests look good except for a few distributed autograd failures (pytorch_linux_xenial_cuda10_2_cudnn7_py3_multigpu_test) and rocm failures (pr/pytorch-linux-bionic-rocm4.1-py3.6).

The common denominator in rocm failures appears to be multi-gpu activity: some [multiprocess DDP failures](https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-bionic-rocm4.1-py3.6-test1/8115/console), some [single-process failures](https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-bionic-rocm4.1-py3.6-test2/8115/console) where the single process has autograd ops that span devices. jeffdaily jithunnair-amd sunway513, could one of you take a look? The streaming backward change is also beneficial to rocm, I expect.

For debugging rocm failures, I think we should ignore the multiprocess/DDP tests and focus on the single process cases. The root cause is probably the same and the single process cases are simpler.

----------------------------------

Update: Rocm failures are due to https://github.com/pytorch/pytorch/issues/59750.
2718a54032 is a workaround, to be updated once https://github.com/pytorch/pytorch/issues/59750 is fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57833

Reviewed By: mruberry

Differential Revision: D28942391

Pulled By: ngimel

fbshipit-source-id: d6047e971c5f1c6386334bf3641402a92f12e2f8
2021-06-13 12:09:56 -07:00
Richard Barnes
e3d75b8475 irange for PyTorch sans jit (#59481)
Summary:
Switches most of the simple for loops outside of `jit` directories to use `c10::irange`.

Generated with D28874212.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59481

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D28909681

fbshipit-source-id: ec9ab1bd602933238d9d0f73d4d8d027b75d9d85
2021-06-09 14:46:11 -07:00
Jeffrey Wan
1733d10399 Warn when backward() is called with create_graph=True (#59412)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/4661
- Add warnings in engine's `execute` function so it can be triggered through both cpp and python codepaths
- Adds an RAII guard version of `c10::Warning::set_warnAlways` and replaces all prior usages of the set_warnAlways with the new one

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59412

Reviewed By: jbschlosser

Differential Revision: D28969294

Pulled By: soulitzer

fbshipit-source-id: b03369c926a3be18ce1cf363b39edd82a14245f0
2021-06-08 17:19:04 -07:00
Richard Barnes
3979cb0656 irange for size_t (#55320)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55320

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D27572577

fbshipit-source-id: 97710fd2bb1303006b05828a0d1343b0b59ccb03
2021-06-03 01:04:13 -07:00
David Reiss
9c83e4160d Use some c10::ThreadLocal to avoid crashes on old Android toolchains (#59017)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59017

See the comment in ThreadLocal.h for context.
I used a slightly dirty preprocessor hack to minimize the number of changes.
The hope is that we'll be able to revert all of these soon.

Test Plan:
CI.
Built FB4A with gnustl and saw no references to cxa_thread_atexit
in the PyTorch libraries.

Reviewed By: ilia-cher

Differential Revision: D28720762

fbshipit-source-id: 0f13c7ac5a108b95f8fde6dbc63c6b8bdb8599de
2021-05-27 20:49:03 -07:00
Luca Wehrstedt
44daf1930b Migrate remaining shared_ptr<Future> to intrusive_ptr (#58420)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58420

In https://github.com/pytorch/pytorch/pull/57636 I migrated most uses of Future to an intrusive_ptr. I thought I had all of them but I missed a couple. These are the remaining ones. (The next PR will make it impossible to add new usages of shared_ptr).
ghstack-source-id: 129567071

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28477285

fbshipit-source-id: 75008276baa59e26b450e942c009ec7e78f89b13
2021-05-21 13:15:20 -07:00
Nikita Shulga
3a66a1cb99 [clang-tidy] Exclude cppcoreguidelines-avoid-magic-numbers (#57841)
Summary:
Add cppcoreguidelines-avoid-magic-numbers exclusion to clang-tidy
Remove existing nolint warnings using following script:
```
for file in `git ls-files | grep -v \.py`; do gsed '/^ *\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers)/d' -i  $file; done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57841

Reviewed By: samestep

Differential Revision: D28295045

Pulled By: malfet

fbshipit-source-id: 7c6e8d1213c9593f169ed3df6a916498f1a97163
2021-05-07 20:02:33 -07:00
Nikita Shulga
4cb534f92e Make PyTorch code-base clang-tidy compliant (#56892)
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os

def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files

def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname,"-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit","--all", "-m", f"NOLINT stubs for {fname}"])

def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)

if __name__ == "__main__":
    main()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892

Reviewed By: H-Huang

Differential Revision: D27991944

Pulled By: malfet

fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
2021-04-28 14:10:25 -07:00
Robin Cheng
5d940e2fbc [TSAN] Fix PythonEngine data-race-on-vptr. (#56808)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56808

For information about data-race-on-vptr in general, see https://www.internalfb.com/intern/wiki/TSAN/Common_Concurrency_Mistakes/Stopping_a_Thread_in_Destructor/

Engine::~Engine() was previously tasked with stopping the threads. This causes a data race on the object's vptr when PythonEngine is being destructed. This fixes the data race by making ~PythonEngine trigger the thread stopping before going down to the base class's destructor.

Test Plan:
Many tests are affected, but here's one example:

buck test mode/dev-tsan -c fbcode.tsan_strict_mode=true //oculus/research/orcoptics/deep_learning/srg_nn/tests:test_grating_net -- 'test_train (oculus.research.orcoptics.deep_learning.srg_nn.tests.test_grating_net.TestGratingNet)' --run-disabled

Reviewed By: walterddr, albanD

Differential Revision: D27972384

fbshipit-source-id: 8b70fec8d9326497c591a2777b355ea590a85082
2021-04-23 17:39:27 -07:00
Richard Zou
d2d1112513 Set ThreadLocalState correctly in the autograd engine (#56174)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56174

evaluate_function:
1. calls the autograd function (call_function)
2. accumulates gradients into buffers

Previously, ThreadLocalStateGuard only covered part of `call_function`.
However, it should cover all Tensor operations in `evaluate_function`,
so this PR moves it to do so.

One alternative would have been to move ThreadLocalStateGuard to here:
71f9e99e29/torch/csrc/autograd/engine.cpp (L394)

Unfortunately that adds 2% additional instructions according to the
instruction count benchmark in the next section. This is because
`evaluate_function` does an early return:
71f9e99e29/torch/csrc/autograd/engine.cpp (L732-L735)
If this is preferred, please let me know.

Test Plan:
- run existing tests. It's hard to actually come up with a test case for
this.

Benchmark plan:

TL;DR: Instruction count decreases by a little after this PR.
```
import torch
from torch.utils.benchmark import Timer

timer = Timer(
    stmt="""\
torch::autograd::grad({y}, {x}, {}, /*retain_grad=*/true);""",
    setup="""\
auto x = torch::ones({}, torch::requires_grad());
auto y = x * 2;""",
    language="cpp")

stats = timer.collect_callgrind()
print(stats)
```
This gave the following:
```
Before:
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7f4b28ce6a90>
torch::autograd::grad({y}, {x}, {}, /*retain_grad=*/true);
setup:
  auto x = torch::ones({}, torch::requires_grad());
  auto y = x * 2;

                           All          Noisy symbols removed
    Instructions:      3514184                    3514184
    Baseline:                0                          0
100 runs per measurement, 1 thread

After:
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7fdbc9d187d0>
torch::autograd::grad({y}, {x}, {}, /*retain_grad=*/true);
setup:
  auto x = torch::ones({}, torch::requires_grad());
  auto y = x * 2;

                           All          Noisy symbols removed
    Instructions:      3513884                    3513884
    Baseline:                0                          0
100 runs per measurement, 1 thread
```

Reviewed By: albanD

Differential Revision: D27799283

Pulled By: zou3519

fbshipit-source-id: 0a8213824e08c04748d38e66604c73f395285d63
2021-04-15 20:57:27 -07:00
Mike Ruberry
c0ac0fef4e Revert D27448156: irange for size_t
Test Plan: revert-hammer

Differential Revision:
D27448156 (041b4431b2)

Original commit changeset: 585da57d4de9

fbshipit-source-id: 8e047c29f391c0166e0a1a87c3fb2a0854377365
2021-04-03 19:14:00 -07:00
Richard Barnes
041b4431b2 irange for size_t (#55163)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55163

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D27448156

fbshipit-source-id: 585da57d4de91c692b6360d65f7b8a66deb0f8c1
2021-04-02 23:22:29 -07:00
Edward Yang
1f36ce6e4d Restore storage on meta tensors; increase meta coverage (#53973)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53973

Two parts to this PR; I had to put them together because adding support for X causes more test code to be exercised, which in turn may require a fix for Y.

The first part is restoring the concept of storage to meta tensors.  Previously, meta tensors had a nullptr storage (e.g., `meta_tensor.storage()` is an error.) As I was increasing the coverage of meta tensors, I started running into test cases (specifically memory overlap tests) that were failing because not having storage meant I couldn't check for memory overlap. After some discussion, we decided that it would make sense for meta tensors to model this as well (we already model strides, so getting accurate view information also seems useful). This PR does that by:

* Rewrite all of the factory functions in MetaTensor.cpp to use the generic versions (which are very carefully written to not actually poke at the data pointer, so everything works out). The key idea here is we give meta tensors a special allocator, MetaAllocator, which always returns a nullptr even if you ask for a nonzero number of bytes. resize_ is also made generic; the normal variant can be used directly rather than having to instruct it to avoid resizing storage
* Turn on memory overlap checking in TensorIterator even for meta tensors
* Although meta tensors now have storage, the concept of meta storage is NOT exposed to Python land (as it would imply I would have to codegen MetaFloatStorage, MetaDoubleStorage, etc. classes). So `x.storage()` still raises an error and I have a cludge in `__deepcopy__` to break storage sharing upon deep copy (this is wrong, but no tests exercise this at the moment).

The second part is adding more support for the most used functions in the test suite.

* Inplace operations have very simple meta functions. I added `fill_`, `zero_`, `random_`, `uniform_` and `normal_`. In the case of random, I take advantage of pbelevich's templates for defining random kernels, so that I can reuse the common scaffolding, and then just register a noop stub that actually does the RNG. (Look, another structured kernels tiny variant!)
* `copy_` is now implemented. Copying into a meta tensor is always OK, but copying out of a meta tensor raises an error (as we don't know what the "correct" data to copy out is in this case)
* `empty_strided` usage from structured kernels now is implemented (TBH, this could have been done as soon as `empty_strided` was added)
* Meta was missing in a few places in TensorOptions/DispatchKey utility functions, so I added them
* Autograd engine now correctly homes meta tensors with CPU tensors (they have -1 device index so CUDA queues wouldn't work anyway)
* `apply_`, `map_` and `map2_` are special cased to no-op on meta tensor self. These count as inplace operations too but they are implemented a little differently.

Getting more meta function support triggers a number of bugs in the test suite, which I then fix:

- Linear algebra functions sometimes don't report NotImplementedError because they get swallowed by catch all try blocks. This is tracked in https://github.com/pytorch/pytorch/issues/53739
- dlpack obviously doesn't work with meta tensors, I just disabled the test

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: D27036572

Test Plan: Imported from OSS

Reviewed By: agolynski, bdhirsh

Pulled By: ezyang

fbshipit-source-id: 7005ecf4feb92a643c37389fdfbd852dbf00ac78
2021-03-29 08:37:46 -07:00
Jeffrey Wan
4739d15a67 Skip some nodes during discovery using sequence number (#52180)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/12635

This change will help us speed up autograd's discovery algorithm in cases where we use `.grad` and we try to "unroll" the training loop. For example the example in the issue and also https://github.com/pytorch/pytorch/pull/52180#issuecomment-783400832 observe an unbounded multiple of speed-up.

We do this by adding a new sequence_nr-type numbering: for each node, we maintain the length of the longest path from it to any leaf node. How does this help us speed up discovery (dfs)? Previously the bottleneck was that the dfs that computes which nodes need to be executed always explored every node. With this change, before we run dfs, we first compute the mininum seq_nr among all the nodes passed as the `inputs`. If let this be some number N, intuitively this means that dfs should stay at least N units away from any leaf node. So, if we find ourselves too close to any leaf node, we should stop our search early.

Edit:
After some discussion offline, the plan is:
 - make old sequence_nr a construct of the profiler. This means we can avoid accessing thread local state in cases where the profiler is disabled. Note that we cannot replace sequence_nr as-is because profiler's use-case requires that thread-id + sequence_nr can uniquely identify a given node in order for downstream users/programs to correlate nodes from backward and forward passes. This means we must maintain two sequence_nr's and that we have an extra field in Node.
 - In a future PR, we can potentially remove sequence_nr entirely from the profiler as well, but we avoid doing it now because we haven't measured, and its a larger effort because we'd have to mess around with the dispatcher and profiler

Testing with this [code](https://gist.github.com/kyunghyuncho/5fb9991ce1233f909051854a84b7148e), we see that runtime no longer increases as we iterate.

Before:
```
100: Time taken: 0.47s, loss: 1.1e+06
200: Time taken: 0.064s, loss: 6.5e+05
300: Time taken: 0.088s, loss: 4.4e+05
400: Time taken: 0.1s, loss: 3.2e+05
500: Time taken: 0.12s, loss: 2.5e+05
600: Time taken: 0.15s, loss: 2e+05
700: Time taken: 0.18s, loss: 1.7e+05
800: Time taken: 0.2s, loss: 1.4e+05
900: Time taken: 0.22s, loss: 1.2e+05
1000: Time taken: 0.24s, loss: 1.1e+05
1100: Time taken: 0.27s, loss: 9.3e+04
1200: Time taken: 0.3s, loss: 8.3e+04
1300: Time taken: 0.34s, loss: 7.4e+04
1400: Time taken: 0.36s, loss: 6.7e+04
1500: Time taken: 0.38s, loss: 6.1e+04
1600: Time taken: 0.4s, loss: 5.6e+04
1700: Time taken: 0.42s, loss: 5.1e+04
1800: Time taken: 0.44s, loss: 4.7e+04
1900: Time taken: 0.47s, loss: 4.4e+04
2000: Time taken: 0.5s, loss: 4.1e+04
```
After:
```
100: Time taken: 0.49s, loss: 1.2e+06
200: Time taken: 0.031s, loss: 6.9e+05
300: Time taken: 0.031s, loss: 4.6e+05
400: Time taken: 0.031s, loss: 3.3e+05
500: Time taken: 0.031s, loss: 2.6e+05
600: Time taken: 0.031s, loss: 2.1e+05
700: Time taken: 0.031s, loss: 1.7e+05
800: Time taken: 0.031s, loss: 1.4e+05
900: Time taken: 0.031s, loss: 1.2e+05
1000: Time taken: 0.031s, loss: 1.1e+05
1100: Time taken: 0.031s, loss: 9.6e+04
1200: Time taken: 0.031s, loss: 8.6e+04
1300: Time taken: 0.031s, loss: 7.7e+04
1400: Time taken: 0.031s, loss: 7e+04
1500: Time taken: 0.031s, loss: 6.3e+04
1600: Time taken: 0.031s, loss: 5.8e+04
1700: Time taken: 0.031s, loss: 5.3e+04
1800: Time taken: 0.031s, loss: 4.9e+04
1900: Time taken: 0.031s, loss: 4.5e+04
2000: Time taken: 0.032s, loss: 4.2e+04

```
Testing w/ small graph to check for regression:
```
import torch
from torch.utils.benchmark import Timer

setup="""
a = torch.rand((2, 2), requires_grad=True)
b = torch.rand((2, 2), requires_grad=True)
gradient = torch.ones(2, 2)
"""

stmt="""
torch.autograd.grad(a*b, [a, b], gradient)
"""

timer = Timer(stmt, setup)

print(timer.timeit(10000))
print(timer.collect_callgrind(100))
```
Result: there doesn't seem to be any significant regression
```
Time before: 12.74 us
Time after: 13.12 us
Instruction count before:
                           All          Noisy symbols removed
    Instructions:      8078960                    8000882
    Baseline:             4226                       3838
Instruction count after:
                           All          Noisy symbols removed
    Instructions:      8091846                    8017940
    Baseline:             4336                       3838
100 runs per measurement, 1 thread
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52180

Reviewed By: gchanan, zhangguanheng66

Differential Revision: D26794387

Pulled By: soulitzer

fbshipit-source-id: c00d387a29f151109c33dc6f1b56a8f275cdec58
2021-03-04 16:13:53 -08:00
Nikita Shulga
72ec718373 Leak autograd threads after wait limit (#53170)
Summary:
Leak autograd threads if TORCH_AUTOGRAD_SHUTDOWN_WAIT_LIMIT is reached
(default to 10 seconds)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53170

Reviewed By: zhangguanheng66

Differential Revision: D26821983

Pulled By: malfet

fbshipit-source-id: 310960564da7cd8c9f475432a8efbee32cfe6009
2021-03-04 14:42:15 -08:00
Jeffrey Wan
2900cf2b94 Refactor autograd discovery code (#52057)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/34067 by using https://github.com/pytorch/pytorch/issues/34426 by hczhu
In addition to removing the unnecessary any() we do also:
 - Get rid of the outer loop since graph_root also needs to be checked
 - Update psuedo code description so it matches what the code does
 - Add some comments explaining the difference between assigning `info.needed_` and `info.captures_` in terms of how that affects discovery
 - [edit: another benefit is that exec_info entries are no longer created for all reachable nodes]

This PR is on top of https://github.com/pytorch/pytorch/issues/51940, so once that lands rebasing on top of master should get rid of the extra commits and changes

I'm not sure if this change will bring a lot of performance gains, but the main benefit is that the code is easier to read.

Trivial graph:
```
torch.autograd.grad(a*b, [a, b], gradient)
setup:
  a = torch.rand((2, 2), requires_grad=True)
  b = torch.rand((2, 2), requires_grad=True)
  gradient = torch.ones(2, 2)

Timer before:
  15.45 us
Time after:
  14.33 us
1 measurement, 10000 runs , 1 thread

Instructions after:
                           All          Noisy symbols removed
    Instructions:      8271213                    8193169
    Baseline:             4244                       3838
Instructions before:
                           All          Noisy symbols removed
    Instructions:      8142843                    8054463
    Baseline:             4280                       3838
100 runs per measurement, 1 thread
```
Small graph:
```
torch.autograd.grad((b*a.exp()+a*b.exp()).sum(), (a, b))
setup:
  a = torch.rand((2, 2), requires_grad=True)
  b = torch.rand((2, 2), requires_grad=True)

Time before:
  52.25 us
Time after:
  50.80 us
1 measurement, 10000 runs , 1 thread

Instruction count before:
                           All          Noisy symbols removed
    Instructions:     25601257                   25518229
    Baseline:             4228                       3838
Instruction count after:
                           All          Noisy symbols removed
    Instructions:     25606533                   25522797
    Baseline:             4228
100 runs per measurement, 1 thread
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52057

Reviewed By: ngimel

Differential Revision: D26432207

Pulled By: soulitzer

fbshipit-source-id: beef68344d66e9e286378e31e3311ba43c25c749
2021-02-12 16:22:35 -08:00
Jeffrey Wan
aa2fede201 Fix autograd when inputs contains tensors without materialized grad_fn (#51940)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/39784
At the time the issue was filed, there was only issue (1) below.

There are actually now two issues here:
1. We always set all inputs passed in through `inputs` arg as `needed = True` in exec_info. So if we pass in an input that has a grad_fn that is not materialized, we create an entry of exec_info with nullptr as key with `needed = True`. Coincidentally, when we perform simple arithmetic operations, such as "2 * x", one of the next edges of mul is an invalid edge, meaning that its grad_fn is also nullptr. This causes the discovery algorithm to set all grad_fns that have a path to this invalid_edge as `needed = True`.
2. Before the commit that enabled the engine skipped the dummy node, we knew that root node is always needed, i.e., we hardcode `exec_info[&graph_root]=true`. The issue was that this logic wasn't updated after the code was updated to skip the graph root.

To address (1), instead of passing in an invalid edge if an input in `inputs` has no grad_fn, we create a dummy grad_fn. This is done in both python and cpp entry points. The alternative is to add logic for both backward() and grad() cases to check whether the grad_fn is nullptr and set needed=false in that case (the .grad() case would be slightly more complicated than the .backward() case here).

For (2), we perform one final iteration of the discovery algorithm so that we really know whether we need to execute the graph root.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51940

Reviewed By: VitalyFedyunin

Differential Revision: D26369529

Pulled By: soulitzer

fbshipit-source-id: 14a01ae7988a8de621b967a31564ce1d7a00084e
2021-02-11 09:22:15 -08:00
Jeffrey Wan
7b9ca54ecf Reset checkpoint_valid flag when error happens during function execution (#51746)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/37874, https://github.com/pytorch/pytorch/issues/51743

Uses RAII to manage the flag so that it gets reset properly on exception

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51746

Reviewed By: izdeby

Differential Revision: D26319619

Pulled By: soulitzer

fbshipit-source-id: ea1235438ba516f99195c83fa23d5880f9977c93
2021-02-08 17:48:25 -08:00
Alban Desmaison
e160362837 Add range assert in autograd engine queue lookup (#50372)
Summary:
Follow up to  https://github.com/pytorch/pytorch/issues/49652

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50372

Reviewed By: zhangguanheng66

Differential Revision: D25872203

Pulled By: albanD

fbshipit-source-id: 8d6f30f17fba856c5c34c08372767349a250983d
2021-01-11 15:16:35 -08:00
Alban Desmaison
fc2ead0944 Autograd engine, only enqueue task when it is fully initialized (#50164)
Summary:
This solves a race condition where the worker thread might
see a partially initialized graph_task

Fixes https://github.com/pytorch/pytorch/issues/49652

I don't know how to reliably trigger the race so I didn't add any test. But the rocm build flakyness (it just happens to race more often on rocm builds) should disappear after this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50164

Reviewed By: zou3519

Differential Revision: D25824954

Pulled By: albanD

fbshipit-source-id: 6a3391753cb2afd2ab415d3fb2071a837cc565bb
2021-01-08 05:30:11 -08:00
Ansley Ussery
c619892482 Fix errata (#49903)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49903

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D25718411

Pulled By: ansley

fbshipit-source-id: 0cc365c5a53077752dc1c5a5c4a65b873baa3604
2020-12-28 20:40:41 -08:00
Jeffrey Wan
d20483a999 Skip dummy node creation for autograd engine when there is a single input and place on correct queue (#47592)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42890
 - Removes dummy node
 - Places graph root on the correct queue based on input buffer's device instead of cpu queue by default

cpu - no significant change in speed (too noisy to measure), but we see up to 7% reduction in small graphs
cuda - small reduction in speed (still very noisy) and up to ~20% reduction in instruction count for small graphs

**CPU**
Code:
```
import torch
from torch.utils.benchmark import Timer

setup="""
a = torch.rand((2, 2), requires_grad=True)
b = torch.rand((2, 2), requires_grad=True)
gradient = torch.ones(2, 2)
"""

stmt="""
torch.autograd.grad(a*b, [a, b], gradient)
"""

timer = Timer(stmt, setup)

print(timer.timeit(10000))
print(timer.collect_callgrind(100))
```

Before (when dummy node is not skipped):
```
torch.autograd.grad(a*b, [a, b], gradient)
setup:
  a = torch.rand((2, 2), requires_grad=True)
  b = torch.rand((2, 2), requires_grad=True)
  gradient = torch.ones(2, 2)

  26.62 us
  1 measurement, 10000 runs , 1 thread
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7efee44ad8e0>
torch.autograd.grad(a*b, [a, b], gradient)
setup:
  a = torch.rand((2, 2), requires_grad=True)
  b = torch.rand((2, 2), requires_grad=True)
  gradient = torch.ones(2, 2)

                           All          Noisy symbols removed
    Instructions:      9755488                    9659378
    Baseline:             4300                       3784
100 runs per measurement, 1 thread
```

After
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7f56961a7730>
torch.autograd.grad(a*b, [a, b], gradient)
setup:
  a = torch.rand((2, 2), requires_grad=True)
  b = torch.rand((2, 2), requires_grad=True)
  gradient = torch.ones(2, 2)

  26.78 us
  1 measurement, 10000 runs , 1 thread
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7f56961a78e0>
torch.autograd.grad(a*b, [a, b], gradient)
setup:
  a = torch.rand((2, 2), requires_grad=True)
  b = torch.rand((2, 2), requires_grad=True)
  gradient = torch.ones(2, 2)

                           All          Noisy symbols removed
    Instructions:      9045508                    8939872
    Baseline:             4280                       3784
100 runs per measurement, 1 thread
```
**Cuda**

Before
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7f84cbaa1ee0>
torch.autograd.grad(out, [x, y], gradient)
setup:
  x = torch.rand((2,2), requires_grad=True, device="cuda")
  y = torch.rand((2,2), requires_grad=True, device="cuda")
  out = x + y
  gradient = torch.ones(2, 2).cuda()

  70.49 us
  1 measurement, 10000 runs , 1 thread
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7f84cbaa1e50>
torch.autograd.grad(out, [x, y], gradient)
setup:
  x = torch.rand((2,2), requires_grad=True, device="cuda")
  y = torch.rand((2,2), requires_grad=True, device="cuda")
  out = x + y
  gradient = torch.ones(2, 2).cuda()

                           All          Noisy symbols removed
    Instructions:      5054581                    4951911
    Baseline:             4105                       3735
100 runs per measurement, 1 thread
```

Remove dummy node only
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7fbf29c67eb0>
torch.autograd.grad(out, [x, y], gradient)
setup:
  x = torch.rand((2,2), requires_grad=True, device="cuda")
  y = torch.rand((2,2), requires_grad=True, device="cuda")
  out = x + y
  gradient = torch.ones(2, 2).cuda()

  55.65 us
  1 measurement, 10000 runs , 1 thread
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7fbf29c67e20>
torch.autograd.grad(out, [x, y], gradient)
setup:
  x = torch.rand((2,2), requires_grad=True, device="cuda")
  y = torch.rand((2,2), requires_grad=True, device="cuda")
  out = x + y
  gradient = torch.ones(2, 2).cuda()

                           All          Noisy symbols removed
    Instructions:      5002105                    4900841
    Baseline:             4177                       3731
100 runs per measurement, 1 thread
```

Remove dummy node and put in correct queue
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7fb64438ce80>
torch.autograd.grad(out, [x, y], gradient)
setup:
  x = torch.rand((2,2), requires_grad=True, device="cuda")
  y = torch.rand((2,2), requires_grad=True, device="cuda")
  out = x + y
  gradient = torch.ones(2, 2).cuda()

  27.56 us
  1 measurement, 10000 runs , 1 thread
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7fb64438cdf0>
torch.autograd.grad(out, [x, y], gradient)
setup:
  x = torch.rand((2,2), requires_grad=True, device="cuda")
  y = torch.rand((2,2), requires_grad=True, device="cuda")
  out = x + y
  gradient = torch.ones(2, 2).cuda()

                           All          Noisy symbols removed
    Instructions:      4104433                    4007555
    Baseline:             4159                       3735
100 runs per measurement, 1 thread
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47592

Reviewed By: ailzhang

Differential Revision: D24890761

Pulled By: soulitzer

fbshipit-source-id: f457376e4a882f8a59476e8c1e708391b1a031a2
2020-11-16 11:33:35 -08:00