pytorch/test/distributed/elastic
Pritam Damania 704b0b3c67 [RESUBMIT] Standardize on error types for distributed errors. (#108191)
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc.

This results in messy error-handling code that matches on exception message strings, somewhat like this:
```
if "NCCL" in exception_str:
  ...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
  ...
if "The client socket has timed out after" in exception_str:
  ...
if "Broken pipe" in exception_str:
  ...
if "Connection reset by peer" in exception_str:
  ...
```

To address this issue, this PR standardizes on these error types:

1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and refers to process group (PG) backend errors
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library
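
With a typed hierarchy like the one listed above, the string matching can be replaced by ordinary `except` clauses. The following is only a minimal sketch: the four classes are local stand-ins defined so the example runs without PyTorch, and it assumes that `DistError` derives from `RuntimeError` and that the other three are its direct subclasses. `handle` is a hypothetical helper, not part of any library.

```python
# Local stand-ins mirroring the hierarchy described above (assumption: the
# base type derives from RuntimeError and the other three are siblings).
# In real code these would come from the distributed package instead.
class DistError(RuntimeError):
    """Base type of all distributed errors."""


class DistBackendError(DistError):
    """Errors raised by a process group backend."""


class DistStoreError(DistError):
    """Errors originating from the store."""


class DistNetworkError(DistError):
    """General network errors coming from the socket library."""


def handle(fn):
    """Hypothetical helper: call fn and report which error category it raised."""
    try:
        fn()
    except DistBackendError:
        return "backend"
    except DistStoreError:
        return "store"
    except DistNetworkError:
        return "network"
    except DistError:
        # Catch-all for any other distributed error; must come after the
        # more specific subclasses.
        return "distributed"
    return "ok"
```

The specific types are listed before the base type so that each error reaches its own branch; a non-distributed exception such as `ValueError` is not caught and propagates to the caller.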

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108191
Approved by: https://github.com/H-Huang
2023-08-30 21:47:39 +00:00

| Name | Last commit | Date |
| --- | --- | --- |
| agent/server/test | [BE] Enable ruff's UP rules and autoformat distributed/ (#105433) | 2023-07-19 14:27:11 +00:00 |
| events | Run tests in USE_PYTEST_LIST through run_tests (#95659) | 2023-02-28 22:09:01 +00:00 |
| metrics | | |
| multiprocessing | [BE] Enable ruff's UP rules and autoformat distributed/ (#105433) | 2023-07-19 14:27:11 +00:00 |
| rendezvous | | |
| timer | Fix "sandcastle_skip_if decorator name is confusing" (#95649) | 2023-03-03 09:29:40 +00:00 |
| utils | [RESUBMIT] Standardize on error types for distributed errors. (#108191) | 2023-08-30 21:47:39 +00:00 |