mirror of
https://github.com/zebrajr/pytorch.git
synced 2025-12-07 12:21:27 +01:00
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33434

Reland of https://github.com/pytorch/pytorch/pull/33325, since the unit test was flaky and failed on land. To ensure that the test is not flaky, I bumped the timeout so the rendezvous does not time out (timing out the rendezvous in 1s led to the flakiness). I also generalized our mechanism for retrying on errors to include retrying on errors due to timeout in rendezvous.

ghstack-source-id: 98558377

Test Plan: Added UT test_tcp_store_timeout_set

Differential Revision: D19935390

fbshipit-source-id: 56ccf8c333dd2f954a33614d35cd1642d4e9473a
9 lines
368 B
Python
from datetime import timedelta

# Default process group wide timeout, if applicable.
# This only applies to the gloo and nccl backends
# (only if NCCL_BLOCKING_WAIT is set to 1). To make an attempt at
# backwards compatibility with THD, we use an extraordinarily high default
# timeout, given that THD did not have timeouts.
default_pg_timeout = timedelta(minutes=30)
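As a rough illustration of how a module-level `timedelta` default like this might be consumed, the sketch below uses it as a fallback argument and converts it to milliseconds. The helper `timeout_to_ms` is hypothetical and is not part of PyTorch; the constant is redefined here only so the snippet is self-contained.

```python
from datetime import timedelta

# Same value as the constant defined in the file above, repeated here
# so that this sketch runs on its own.
default_pg_timeout = timedelta(minutes=30)


def timeout_to_ms(timeout=default_pg_timeout):
    """Hypothetical helper (an assumption, not a PyTorch API).

    Validates a timeout and returns it as a whole number of milliseconds,
    falling back to the module-level default when none is given.
    """
    if not isinstance(timeout, timedelta):
        raise TypeError("timeout must be a datetime.timedelta")
    if timeout <= timedelta(0):
        raise ValueError("timeout must be positive")
    return int(timeout.total_seconds() * 1000)


print(timeout_to_ms())                      # 1800000
print(timeout_to_ms(timedelta(seconds=5)))  # 5000
```

Keeping the default as a `timedelta` rather than a bare number leaves the unit explicit at every call site; any conversion to seconds or milliseconds then happens in one place.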