pytorch/docs
Michael Carilli be038d8989 [CUDA graphs] Make stream semantics of backward calls consistent with other cuda ops (ci-all edition) (#57833)
Summary:
ci-all resubmit of https://github.com/pytorch/pytorch/pull/54227.

Tests look good except for a few distributed autograd failures (pytorch_linux_xenial_cuda10_2_cudnn7_py3_multigpu_test) and rocm failures (pr/pytorch-linux-bionic-rocm4.1-py3.6).

The common denominator in rocm failures appears to be multi-gpu activity: some [multiprocess DDP failures](https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-bionic-rocm4.1-py3.6-test1/8115/console), some [single-process failures](https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-bionic-rocm4.1-py3.6-test2/8115/console) where the single process has autograd ops that span devices. jeffdaily jithunnair-amd sunway513, could one of you take a look? The streaming backward change is also beneficial to rocm, I expect.

For debugging rocm failures, I think we should ignore the multiprocess/DDP tests and focus on the single process cases. The root cause is probably the same and the single process cases are simpler.

----------------------------------

Update: Rocm failures are due to https://github.com/pytorch/pytorch/issues/59750.
2718a54032 is a workaround, to be updated once https://github.com/pytorch/pytorch/issues/59750 is fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57833

Reviewed By: mruberry

Differential Revision: D28942391

Pulled By: ngimel

fbshipit-source-id: d6047e971c5f1c6386334bf3641402a92f12e2f8
2021-06-13 12:09:56 -07:00
| Name | Last commit | Date |
| --- | --- | --- |
| caffe2 | Lint trailing newlines (#54737) | 2021-03-30 13:09:52 -07:00 |
| cpp | Add no-grad inference mode note (#58513) | 2021-05-25 13:06:54 -07:00 |
| source | [CUDA graphs] Make stream semantics of backward calls consistent with other cuda ops (ci-all edition) (#57833) | 2021-06-13 12:09:56 -07:00 |
| .gitignore | .gitignore for the docs folder | 2019-10-08 12:18:30 -07:00 |
| libtorch.rst | DOC: Building libtorch using CMake (#44196) | 2020-10-21 14:29:36 -07:00 |
| make.bat | Sphinx parallel build (#38785) | 2020-05-21 13:03:55 -07:00 |
| Makefile | DOC: fail to build if there are warnings (#41335) | 2020-07-28 22:33:44 -07:00 |
| README.md | Add docs/README.md to make existing doc build info more discoverable (#49286) | 2020-12-16 11:55:45 -08:00 |
| requirements.txt | [1/n][torch/elastic] Move torchelastic docs *.rst (#148) | 2021-05-04 00:57:56 -07:00 |

Please see the "Writing documentation" section of CONTRIBUTING.md for details on both writing and building the docs.