pytorch/torch/distributed/elastic
Kiuk Chung b08309ee0a (torch/elastic) skip logging structured error info if error_file is not set (#73477)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73477

resolves https://github.com/pytorch/pytorch/issues/73465

This `log.error` call is unnecessary (and its output is not human-friendly) because we end up re-raising the same exception after recording it into an error_file (if present). Python then handles the error the way it handles any other error and writes the traceback to the console. The extra logging produced duplicate error output on the console, affecting all users whose schedulers do not set the `TORCHELASTIC_ERROR_FILE` env var when calling `torch.distributed.run`.
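The record-and-reraise pattern can be sketched roughly as below. This is a hypothetical illustration, not the actual torchelastic implementation: the function name `record_and_reraise` and the JSON layout are assumptions for the example.

```python
import json
import traceback
from typing import Optional


def record_and_reraise(exc: BaseException, error_file: Optional[str]) -> None:
    """Persist structured error info if an error file is configured, then re-raise.

    Hypothetical sketch: when the error file (e.g. from TORCHELASTIC_ERROR_FILE)
    is unset, nothing is logged here; the exception simply propagates and
    Python prints the traceback to the console exactly once.
    """
    if error_file:
        # Only when an error file is configured do we write the structured
        # error info for the scheduler to pick up.
        with open(error_file, "w") as f:
            json.dump(
                {"message": str(exc), "traceback": traceback.format_exc()}, f
            )
    # In either case, re-raise so normal Python error handling prints the
    # traceback once -- no extra log.error duplicating it.
    raise exc
```

Called from inside an `except` block, this records the failure (when configured) and otherwise defers entirely to the interpreter's default error reporting.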

Test Plan:
Induce an error on the agent process with `kill -15 $AGENT_PID` while running:
```
python -m torch.distributed.run \
  --nproc_per_node 2 \
  --nnodes 1:1 \
  --rdzv_backend c10d \
  --rdzv_endpoint localhost:29500 \
  --monitor_interval 3 test.py
```

Produces

{F704936697}

In contrast to the duplicated error before:

{F704936729}

Reviewed By: d4l3k

Differential Revision: D34501852

fbshipit-source-id: 14fed18a9664130980205007ff104ff15a5fd4f8
(cherry picked from commit 0b7c51ba8834f4a4a5376f585c0795cb43be6521)
2022-03-01 19:31:44 +00:00
agent (torch/elastic) fix scale down bug caused by calling rdzv_handler.shutdown() on premature agent failures (#67749) 2021-11-05 12:18:46 -07:00
events fix torch.distributed.elastic event docs (#64974) 2021-09-17 12:27:09 -07:00
metrics [codemod][type-comments] Convert type comments in api.py (#73084) 2022-02-19 00:31:45 +00:00
multiprocessing (torch/elastic) skip logging structured error info if error_file is not set (#73477) 2022-03-01 19:31:44 +00:00
rendezvous Revise the socket implementation of c10d (#68226) 2021-11-16 20:49:25 -08:00
timer
utils Decouple MapDataPipe from Dataset (#70991) 2022-01-07 14:28:41 -08:00
__init__.py