mirror of
https://github.com/zebrajr/pytorch.git
synced 2025-12-06 12:20:52 +01:00
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError`, etc. This results in messy error-handling code, somewhat like this:

```
if "NCCL" in exception_str:
    ...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
    ...
if "The client socket has timed out after" in exception_str:
    ...
if "Broken pipe" in exception_str:
    ...
if "Connection reset by peer" in exception_str:
    ...
```

To address this issue, in this PR I've added these error types:

1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and referred to PG backend errors
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108191
Approved by: https://github.com/H-Huang

| Name |
|---|
| caffe2 |
| cpp |
| source |
| .gitignore |
| libtorch.rst |
| make.bat |
| Makefile |
| README.md |
| requirements.txt |
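
To illustrate how the error types named in the commit message could replace string matching on exception text, here is a minimal sketch. The classes below are local stand-ins that only mirror the described hierarchy so the example runs without PyTorch; deriving `DistError` from `RuntimeError` and the helper name `run_and_classify` are assumptions made for illustration, not details confirmed by the commit message.

```python
# Stand-in classes mirroring the hierarchy described in the commit message.
# Assumption: DistError derives from RuntimeError; the real types ship with
# PyTorch and are not imported here so the sketch stays self-contained.
class DistError(RuntimeError):
    """Base type of all distributed errors."""


class DistBackendError(DistError):
    """Errors from the process group (PG) backend."""


class DistStoreError(DistError):
    """Errors originating from the store."""


class DistNetworkError(DistError):
    """General network errors coming from the socket library."""


def run_and_classify(op):
    """Run `op` and return a label for the kind of failure, if any.

    `run_and_classify` is a hypothetical helper. The more specific
    subclasses are caught first; the DistError base acts as a catch-all
    for any other distributed error.
    """
    try:
        op()
    except DistBackendError:
        return "backend"
    except DistStoreError:
        return "store"
    except DistNetworkError:
        return "network"
    except DistError:
        return "distributed"
    return "ok"
```

Because each handler matches on type, the caller no longer depends on the exact wording of messages such as "Broken pipe" or "Connection reset by peer".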
Please see the Writing documentation section of CONTRIBUTING.md for details on both writing and building the docs.