Commit Graph

4 Commits

Author SHA1 Message Date
Pritam Damania
149c646b74 Detect and handle NCCL errors appropriately in ProcessGroupNCCL. (#25012)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25012

Resubmitting https://github.com/pytorch/pytorch/pull/22907 with build fix.

This change adds the following functionality:
1) The WorkNCCL isCompleted and isSuccess methods now check for NCCL errors and
set the appropriate exception.
2) Added a watchdog thread to ProcessGroupNCCL that checks for errors in the
cached communicators and removes errored communicators from the cache.
3) Use ncclCommAbort in the NCCLComm destructor, since ncclCommDestroy can block
forever waiting for work.
4) Added a simulate_nccl_errors.py script to simulate NCCL errors.
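
The watchdog in item 2 can be pictured with a rough sketch like the one below. It is an illustration in Python, not the ProcessGroupNCCL code (which is C++). FakeComm, CommCache, has_error, sweep and the polling interval are all hypothetical names and values chosen for the example.

```python
import threading
import time


class FakeComm:
    """Hypothetical stand-in for a cached communicator (not a PyTorch class)."""

    def __init__(self):
        self.failed = False

    def has_error(self):
        return self.failed


class CommCache:
    """Hypothetical cache that a background watchdog thread keeps clean."""

    def __init__(self, poll_interval=0.01):
        self._comms = {}
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._poll_interval = poll_interval
        self._thread = threading.Thread(target=self._watchdog, daemon=True)
        self._thread.start()

    def put(self, key, comm):
        with self._lock:
            self._comms[key] = comm

    def keys(self):
        with self._lock:
            return sorted(self._comms)

    def sweep(self):
        """Drop every cached communicator that reports an error."""
        with self._lock:
            bad = [k for k, c in self._comms.items() if c.has_error()]
            for k in bad:
                del self._comms[k]
        return bad

    def _watchdog(self):
        # Event.wait returns True once shutdown() sets the event, ending the loop.
        while not self._stop.wait(self._poll_interval):
            self.sweep()

    def shutdown(self):
        self._stop.set()
        self._thread.join()
```

Holding the lock during the sweep keeps the watchdog from racing with callers that add or look up communicators. A real implementation would also have to decide what happens to work already queued on a removed communicator, which this sketch ignores.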

https://github.com/pytorch/pytorch/issues/17882
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22907

Test Plan: 1) Run simulate_nccl_errors.py to verify that NCCL errors are caught.

Differential Revision: D16958078

fbshipit-source-id: 662b0b8b8ee250e2b6d15bdfc9306d71c4f66219
2019-08-22 16:12:41 -07:00
Michael Suo
b3008fad2e Revert D16220638: [pytorch][PR] Detect and handle NCCL errors appropriately in ProcessGroupNCCL.
Differential Revision: D16220638

Original commit changeset: fbc8881ea0c3

fbshipit-source-id: 10d2f3d446064adb3cf44e1f9911dcf259bbfbfb
2019-08-21 09:40:38 -07:00
Pritam Damania
0a23151293 Detect and handle NCCL errors appropriately in ProcessGroupNCCL. (#22907)
Summary:
This change adds the following functionality:
1) The WorkNCCL isCompleted and isSuccess methods now check for NCCL errors and
set the appropriate exception.
2) Added a watchdog thread to ProcessGroupNCCL that checks for errors in the
cached communicators and removes errored communicators from the cache.
3) Use ncclCommAbort in the NCCLComm destructor, since ncclCommDestroy can block
forever waiting for work.
4) Added a simulate_nccl_errors.py script to simulate NCCL errors.
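
The behavior in item 1 can be illustrated with a small sketch like the one below. It is written in Python for brevity and is not the WorkNCCL code. StubComm, Work, async_error and mark_finished are hypothetical names, and treating an errored operation as completed is an assumption made for the example.

```python
class StubComm:
    """Hypothetical communicator; async_error is None while it is healthy."""

    def __init__(self):
        self.async_error = None


class Work:
    """Hypothetical handle for one collective operation."""

    def __init__(self, comms):
        self._comms = comms
        self._exception = None
        self._finished = False

    def mark_finished(self):
        # Stands in for the device reporting that the operation is done.
        self._finished = True

    def _check_and_set_exception(self):
        # Record the first communicator error seen; keep it once set.
        if self._exception is not None:
            return
        for comm in self._comms:
            if comm.async_error is not None:
                self._exception = RuntimeError(
                    "communicator error: " + str(comm.async_error)
                )
                return

    def is_completed(self):
        self._check_and_set_exception()
        # Assumption: an errored operation counts as completed so that
        # callers polling this method stop waiting.
        return self._exception is not None or self._finished

    def is_success(self):
        self._check_and_set_exception()
        return self._exception is None and self._finished

    def exception(self):
        return self._exception
```

Checking for errors inside both query methods means a caller that only polls for completion still observes the failure, instead of waiting indefinitely on an operation that can no longer finish.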

https://github.com/pytorch/pytorch/issues/17882
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22907

Test Plan: 1) Run simulate_nccl_errors.py to verify that NCCL errors are caught.

Differential Revision: D16220638

fbshipit-source-id: fbc8881ea0c38a4d09a77045691e36557b7b0b25
2019-08-20 20:37:37 -07:00
Shen Li
725d6cd8ce Extract common classes and functions from test_c10d to common_distributed (#23660)
Summary:
MultiProcessTestCase will be useful for both c10d and rpc tests. So, this diff extracts that class and some common decorators to a separate file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/23660

Reviewed By: pietern

Differential Revision: D16602865

Pulled By: mrshenli

fbshipit-source-id: 85ad47dfb8ba187b7debeb3edeea5df08ef690c7
2019-08-02 09:19:32 -07:00