pytorch/docs
Pritam Damania 704b0b3c67 [RESUBMIT] Standardize on error types for distributed errors. (#108191)
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc.

This results in messy code during error handling somewhat like this:
```
if "NCCL" in exception_str:
  ...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
  ...
if "The client socket has timed out after" in exception_str:
  ...
if "Broken pipe" in exception_str:
  ...
if "Connection reset by peer" in exception_str:
  ...
```

To address this issue, in this PR I've ensured added these error types:

1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and referred to PG backend errors
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108191
Approved by: https://github.com/H-Huang
2023-08-30 21:47:39 +00:00
..
caffe2 Enable UFMT on a bunch of low traffic Python files outside of main files (#106052) 2023-07-27 01:01:17 +00:00
cpp Enable UFMT on a bunch of low traffic Python files outside of main files (#106052) 2023-07-27 01:01:17 +00:00
source [RESUBMIT] Standardize on error types for distributed errors. (#108191) 2023-08-30 21:47:39 +00:00
.gitignore
libtorch.rst Replace master with main in links and docs/conf.py (#100176) 2023-05-02 18:20:32 +00:00
make.bat
Makefile [exportdb] Setup website (#104288) 2023-07-01 01:03:56 +00:00
README.md
requirements.txt update requirements.txt in /docs (#101092) 2023-05-12 03:19:36 +00:00

Please see the Writing documentation section of CONTRIBUTING.md for details on both writing and building the docs.