pytorch/docs/source/notes
Michael Carilli 5640b79bf8 Allow consumer ops to sync on GraphRoot's gradient (#45787)
Summary:
Currently, a GraphRoot instance doesn't have an associated stream. The streaming-backward synchronization logic therefore assumes the instance ran on the default stream and tells consumer ops to sync with the default stream. If the gradient that the GraphRoot instance passes to consumer backward ops was populated on a non-default stream, we have a race condition.

The race condition can occur even if the user doesn't supply a manually populated gradient:
```python
with torch.cuda.stream(side_stream):
    # loss.backward() implicitly synthesizes a one-element 1.0 tensor on side_stream
    # GraphRoot passes it to consumers, but consumers first sync on default stream, not side_stream.
    loss.backward()

    # Internally to backward(), the streaming-backward logic takes over: each backward op runs on the
    # same stream its matching forward op used, so the side_stream context is irrelevant.  GraphRoot's
    # interaction with its first consumer(s) is the spot where the side_stream context causes a problem.
```

This PR fixes the race condition by associating a GraphRoot instance, at construction time, with the current stream(s) on the device(s) of the grads it will pass to consumers. (I think this relies on GraphRoot executing in the main thread, before the backward thread(s) fork, because the grads were populated on the main thread.)
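
The fix itself lives in the C++ autograd engine, but the principle is the ordinary producer/consumer stream handshake PyTorch already exposes in Python: record which stream populated the tensor and have the consumer wait on that stream before reading it. A minimal sketch of that pattern (requires a CUDA device; the `producer_stream`/`consumer_stream` names are illustrative, not anything from the PR):

```python
import torch

# Sketch of the stream handshake the fix gives GraphRoot's consumers;
# this is the general pattern, not the engine's C++ implementation.
producer_stream = torch.cuda.Stream()
consumer_stream = torch.cuda.Stream()

with torch.cuda.stream(producer_stream):
    # Stand-in for the root gradient: populated on a non-default stream.
    grad = torch.ones(1, device="cuda")

with torch.cuda.stream(consumer_stream):
    # Order the consumer after the stream that actually populated grad,
    # rather than after the default stream.
    consumer_stream.wait_stream(producer_stream)
    out = grad * 2
    # Tell the caching allocator grad is also in use on consumer_stream.
    grad.record_stream(consumer_stream)
```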

The added test demonstrates the race condition: it fails reliably without the PR's GraphRoot diffs and passes with them.
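
Roughly, the scenario the test exercises looks like the following sketch (this is not the PR's actual test, and the real test presumably keeps the default stream busy enough to make the race reliably observable):

```python
import torch

# Not the PR's test -- just the shape of the scenario it exercises:
# forward on the default stream, backward kicked off from a side stream
# with no manual synchronization in between.
x = torch.randn(8, 8, device="cuda", requires_grad=True)
loss = (x * 2.0).sum()            # forward work on the default stream

side_stream = torch.cuda.Stream()
with torch.cuda.stream(side_stream):
    # Pre-fix: the root gradient is synthesized on side_stream, but its
    # first consumer only syncs with the default stream -> race.
    loss.backward()

torch.cuda.synchronize()
assert torch.allclose(x.grad, torch.full_like(x, 2.0))
```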

With the GraphRoot diffs, manually populating an incoming-gradient arg for `backward` (or `torch.autograd.grad`) and the actual call to `autograd.backward` will have the same stream-semantics relationship as any other pair of ops:
```python
# implicit population is safe
with torch.cuda.stream(side_stream):
    loss.backward()

# explicit population in side stream then backward in side stream is safe
with torch.cuda.stream(side_stream):
    kickoff_grad = torch.ones_like(loss)
    loss.backward(gradient=kickoff_grad)

# explicit population in one stream then backward kickoff in another stream
# is NOT safe, even with this PR's diffs, but that unsafety is consistent with
# stream-semantics relationship of any pair of ops
kickoff_grad = torch.ones_like(loss)
with torch.cuda.stream(side_stream):
    loss.backward(gradient=kickoff_grad)

# Safe, as you'd expect for any pair of ops
kickoff_grad = torch.ones_like(loss)
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream):
    loss.backward(gradient=kickoff_grad)
```
This PR also adds the last three examples above to the CUDA docs and references them from the autograd docstrings.
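
For reference, a self-contained version of the last (safe) example above; the tiny linear model and tensor sizes are placeholders chosen just to make the snippet runnable:

```python
import torch

# Placeholder model/inputs; the point is the stream handshake before backward.
model = torch.nn.Linear(16, 1).cuda()
inp = torch.randn(4, 16, device="cuda")
loss = model(inp).sum()                    # forward on the default stream

kickoff_grad = torch.ones_like(loss)       # populated on the default stream
side_stream = torch.cuda.Stream()

# Order side_stream after the stream that populated kickoff_grad...
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream):
    # ...so kicking off backward from side_stream is safe.
    loss.backward(gradient=kickoff_grad)
```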

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45787

Reviewed By: nairbv

Differential Revision: D24138376

Pulled By: albanD

fbshipit-source-id: bc4cd9390f9f0358633db530b1b09f9c1080d2a3
2020-10-07 08:53:53 -07:00
| File | Last commit | Date |
| --- | --- | --- |
| amp_examples.rst | Reference amp tutorial (recipe) from core amp docs (#44725) | 2020-09-16 11:37:58 -07:00 |
| autograd.rst | Doc note for complex (#41252) | 2020-07-16 08:53:27 -07:00 |
| broadcasting.rst | [docs] Update broadcasting and cuda semantics notes (#6904) | 2018-04-24 13:41:24 -04:00 |
| cpu_threading_runtimes.svg | Update CPU threading doc (#33083) | 2020-02-11 14:13:51 -08:00 |
| cpu_threading_torchscript_inference.rst | Upgrade MKL-DNN to DNNL v1.2 (#32422) | 2020-03-26 22:07:59 -07:00 |
| cpu_threading_torchscript_inference.svg | Threading and CPU Inference note | 2019-07-29 15:45:49 -07:00 |
| cuda.rst | Allow consumer ops to sync on GraphRoot's gradient (#45787) | 2020-10-07 08:53:53 -07:00 |
| ddp.rst | Fix wrong link in docs/source/notes/ddp.rst (#40484) | 2020-06-28 13:55:56 -07:00 |
| extending.rst | Don't materialize output grads (#41821) | 2020-08-11 04:27:07 -07:00 |
| faq.rst | Revert "Revert D21337640: [pytorch][PR] Split up documentation into subpages and clean up some warnings" (#37778) | 2020-05-04 14:32:35 -07:00 |
| large_scale_deployments.rst | Move ThreadLocalDebugInfo to c10 (#37774) | 2020-05-11 19:27:41 -07:00 |
| multiprocessing.rst | Update docs for master to remove Python 2 references (#36336) | 2020-04-16 10:15:48 -07:00 |
| randomness.rst | Update determinism documentation (#41692) | 2020-08-31 21:06:24 -07:00 |
| serialization.rst | Makes the use of the term "module" consistent through the serialization note (#41563) | 2020-07-16 14:59:49 -07:00 |
| windows.rst | Correct the windows docs (#43479) | 2020-08-25 13:41:24 -07:00 |