Summary:
ci-all resubmit of https://github.com/pytorch/pytorch/pull/54227.
Tests look good except for a few distributed autograd failures (pytorch_linux_xenial_cuda10_2_cudnn7_py3_multigpu_test) and ROCm failures (pr/pytorch-linux-bionic-rocm4.1-py3.6).
The common denominator in the ROCm failures appears to be multi-GPU activity: some [multiprocess DDP failures](https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-bionic-rocm4.1-py3.6-test1/8115/console), and some [single-process failures](https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-bionic-rocm4.1-py3.6-test2/8115/console) where the single process runs autograd ops that span devices. jeffdaily jithunnair-amd sunway513, could one of you take a look? I expect the streaming backward change will also benefit ROCm.
For debugging the ROCm failures, I think we should ignore the multiprocess/DDP tests and focus on the single-process cases: the root cause is probably the same, and the single-process cases are simpler. A minimal sketch of such a case is shown below.
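For context, here is a minimal sketch of the single-process pattern these tests exercise: an autograd graph whose ops span two devices, so the streaming backward has to coordinate across them. This is illustrative only, assuming a machine with at least two visible GPUs; the shapes and device indices are made up, not taken from the failing tests.

```python
import torch

def cross_device_backward():
    # Forward graph spans cuda:0 -> cuda:1; .to() is differentiable,
    # so backward must propagate gradients back across devices.
    a = torch.randn(4, 4, device="cuda:0", requires_grad=True)
    b = a.to("cuda:1")
    loss = (b * b).sum()   # compute on the second device
    loss.backward()        # streaming backward runs on both devices
    torch.cuda.synchronize()  # surface any async errors here
    print(a.grad.norm())

if __name__ == "__main__":
    assert torch.cuda.device_count() >= 2, "needs two GPUs"
    cross_device_backward()
```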
----------------------------------
Update: ROCm failures are due to https://github.com/pytorch/pytorch/issues/59750.