pytorch/docs/source/notes
Michael Carilli be038d8989 [CUDA graphs] Make stream semantics of backward calls consistent with other cuda ops (ci-all edition) (#57833)
Summary:
ci-all resubmit of https://github.com/pytorch/pytorch/pull/54227.

Tests look good except for a few distributed autograd failures (pytorch_linux_xenial_cuda10_2_cudnn7_py3_multigpu_test) and ROCm failures (pr/pytorch-linux-bionic-rocm4.1-py3.6).

The common denominator in the ROCm failures appears to be multi-GPU activity: some [multiprocess DDP failures](https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-bionic-rocm4.1-py3.6-test1/8115/console), some [single-process failures](https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-bionic-rocm4.1-py3.6-test2/8115/console) where the single process has autograd ops that span devices. jeffdaily jithunnair-amd sunway513, could one of you take a look? I expect the streaming backward change is also beneficial to ROCm.

For debugging the ROCm failures, I think we should ignore the multiprocess/DDP tests and focus on the single-process cases. The root cause is probably the same, and the single-process cases are simpler.

----------------------------------

Update: ROCm failures are due to https://github.com/pytorch/pytorch/issues/59750.
2718a54032 is a workaround, to be updated once https://github.com/pytorch/pytorch/issues/59750 is fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57833

Reviewed By: mruberry

Differential Revision: D28942391

Pulled By: ngimel

fbshipit-source-id: d6047e971c5f1c6386334bf3641402a92f12e2f8
2021-06-13 12:09:56 -07:00
amp_examples.rst Reference amp tutorial (recipe) from core amp docs (#44725) 2020-09-16 11:37:58 -07:00
autograd.rst Add no-grad inference mode note (#58513) 2021-05-25 13:06:54 -07:00
broadcasting.rst Fixes docs (#51439) 2021-01-31 22:00:26 -08:00
cpu_threading_runtimes.svg Update CPU threading doc (#33083) 2020-02-11 14:13:51 -08:00
cpu_threading_torchscript_inference.rst Upgrade MKL-DNN to DNNL v1.2 (#32422) 2020-03-26 22:07:59 -07:00
cpu_threading_torchscript_inference.svg Lint trailing newlines (#54737) 2021-03-30 13:09:52 -07:00
cuda.rst [CUDA graphs] Make stream semantics of backward calls consistent with other cuda ops (ci-all edition) (#57833) 2021-06-13 12:09:56 -07:00
ddp.rst Forbid trailing whitespace (#53406) 2021-03-05 17:22:55 -08:00
extending.rst Remove legacy constructor calls from pytorch codebase. (#54142) 2021-04-11 15:45:17 -07:00
faq.rst [DataLoader][doc] Randomness for base_seed generator and NumPy seed (#56528) 2021-04-22 09:40:45 -07:00
gradcheck.rst Add first draft of gradcheck note (#55966) 2021-04-27 14:33:42 -07:00
hip.rst Add HIP (ROCm) semantics doc (#57871) 2021-05-12 12:34:07 -07:00
large_scale_deployments.rst Move ThreadLocalDebugInfo to c10 (#37774) 2020-05-11 19:27:41 -07:00
modules.rst Note on Modules for 1.8 docs (#51536) 2021-02-04 11:28:11 -08:00
multiprocessing.rst Update docs for master to remove Python 2 references (#36336) 2020-04-16 10:15:48 -07:00
randomness.rst [DataLoader][doc] Randomness for base_seed generator and NumPy seed (#56528) 2021-04-22 09:40:45 -07:00
serialization.rst docs: reference links to serialization.html (#54659) 2021-03-29 10:15:07 -07:00
windows.rst Forbid trailing whitespace (#53406) 2021-03-05 17:22:55 -08:00