Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25012
Resubmits https://github.com/pytorch/pytorch/pull/22907 with a build fix.
This change adds the following functionality:
1) The WorkNCCL isCompleted and isSuccess methods now check for NCCL errors and
set the appropriate exception.
2) Adds a watchdog thread to ProcessGroupNCCL that checks the cached
communicators for errors and removes the failing communicators from the cache.
3) Uses ncclCommAbort in the NCCLComm destructor, since ncclCommDestroy can
block forever waiting for work.
4) Adds a simulate_nccl_errors.py script to simulate NCCL errors.
https://github.com/pytorch/pytorch/issues/17882
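The watchdog idea in item 2 can be illustrated with a minimal, library-free
sketch. This is not the ProcessGroupNCCL implementation (which is C++); the
names FakeComm, CommCache, check_for_error, sweep, and poll_interval are all
hypothetical and exist only for illustration. It assumes a cache keyed by
device string, polled by a background thread that aborts and evicts any
communicator reporting an error.

```python
import threading


class FakeComm:
    """Hypothetical stand-in for a cached NCCL communicator."""

    def __init__(self):
        self.error = None      # set to an exception to simulate an async error
        self.aborted = False

    def check_for_error(self):
        # Returns the pending error, or None if the communicator is healthy.
        return self.error

    def abort(self):
        # Models "abort instead of destroy" so that teardown cannot block.
        self.aborted = True


class CommCache:
    """Hypothetical communicator cache swept by a watchdog thread."""

    def __init__(self, poll_interval=0.01):
        self._comms = {}
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._poll_interval = poll_interval
        self._watchdog = threading.Thread(target=self._watch, daemon=True)
        self._watchdog.start()

    def put(self, key, comm):
        with self._lock:
            self._comms[key] = comm

    def keys(self):
        with self._lock:
            return sorted(self._comms)

    def sweep(self):
        # Abort and evict every communicator that currently reports an error.
        with self._lock:
            bad = [k for k, c in self._comms.items()
                   if c.check_for_error() is not None]
            for k in bad:
                self._comms.pop(k).abort()
        return bad

    def _watch(self):
        # Poll until shutdown() is called.
        while not self._stop.is_set():
            self.sweep()
            self._stop.wait(self._poll_interval)

    def shutdown(self):
        self._stop.set()
        self._watchdog.join()
```

In this sketch a later lookup for an evicted key would miss the cache, which
is where a caller could create a fresh communicator.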
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22907
Test Plan: 1) Run simulate_nccl_errors.py to verify that NCCL errors are caught.
Differential Revision: D16958078
fbshipit-source-id: 662b0b8b8ee250e2b6d15bdfc9306d71c4f66219
Summary:
This change adds the following functionality:
1) The WorkNCCL isCompleted and isSuccess methods now check for NCCL errors and
set the appropriate exception.
2) Adds a watchdog thread to ProcessGroupNCCL that checks the cached
communicators for errors and removes the failing communicators from the cache.
3) Uses ncclCommAbort in the NCCLComm destructor, since ncclCommDestroy can
block forever waiting for work.
4) Adds a simulate_nccl_errors.py script to simulate NCCL errors.
https://github.com/pytorch/pytorch/issues/17882
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22907
Test Plan: 1) Run simulate_nccl_errors.py to verify that NCCL errors are caught.
Differential Revision: D16220638
fbshipit-source-id: fbc8881ea0c38a4d09a77045691e36557b7b0b25
Summary:
MultiProcessTestCase is useful for both c10d and rpc tests, so this diff extracts that class and some common decorators into a separate file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23660
Reviewed By: pietern
Differential Revision: D16602865
Pulled By: mrshenli
fbshipit-source-id: 85ad47dfb8ba187b7debeb3edeea5df08ef690c7