Summary:
ci-all resubmit of https://github.com/pytorch/pytorch/pull/54227.
Tests look good except for a few distributed autograd failures (pytorch_linux_xenial_cuda10_2_cudnn7_py3_multigpu_test) and ROCm failures (pr/pytorch-linux-bionic-rocm4.1-py3.6).
The common denominator in the ROCm failures appears to be multi-GPU activity: some are [multiprocess DDP failures](https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-bionic-rocm4.1-py3.6-test1/8115/console), others are [single-process failures](https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-bionic-rocm4.1-py3.6-test2/8115/console) where the process runs autograd ops that span devices. jeffdaily jithunnair-amd sunway513, could one of you take a look? I expect the streaming backward change will also benefit ROCm.
For debugging the ROCm failures, I think we should ignore the multiprocess/DDP tests and focus on the single-process cases: the root cause is probably the same, and the single-process cases are simpler.
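For context, here is a minimal sketch (my own illustration, not taken from the failing tests) of the single-process, cross-device autograd pattern referred to above; the device indices and tensor shapes are arbitrary:

```python
import torch

def cross_device_backward():
    # Needs at least two visible GPUs (CUDA or ROCm devices).
    if torch.cuda.device_count() < 2:
        return

    x = torch.randn(8, 8, device="cuda:0", requires_grad=True)
    # Moving the intermediate to a second device puts a device-crossing edge
    # in the autograd graph, so backward has to route gradients back across
    # devices (and their streams).
    y = (x * 2).to("cuda:1")
    loss = y.sum()
    loss.backward()
    # The gradient lands back on the original device.
    print(x.grad.device, x.grad.sum().item())

if __name__ == "__main__":
    cross_device_backward()
```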
----------------------------------
Update: the ROCm failures are due to https://github.com/pytorch/pytorch/issues/59750.
| File |
|---|
| amp_examples.rst |
| autograd.rst |
| broadcasting.rst |
| cpu_threading_runtimes.svg |
| cpu_threading_torchscript_inference.rst |
| cpu_threading_torchscript_inference.svg |
| cuda.rst |
| ddp.rst |
| extending.rst |
| faq.rst |
| gradcheck.rst |
| hip.rst |
| large_scale_deployments.rst |
| modules.rst |
| multiprocessing.rst |
| randomness.rst |
| serialization.rst |
| windows.rst |