Summary:
When workers finish their work, the TE agent starts the `synchronize_barrier` procedure. The barrier waits for the other agents at the end of execution.
A race condition may occur: the barrier uses a TCPStore that is located on Rank0. When Rank0 finishes its work, other ranks may still be executing the `get_all` method. Some of them will then fail because the TCPStore has been destroyed.
The fix adds an additional check on the Rank0 process: Rank0 now waits for all other ranks to finish before terminating.
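The ordering described above can be illustrated with a minimal sketch. It uses threads and a hypothetical `InMemoryStore` as a stand-in for the TCPStore hosted on Rank0; the key names and helper functions are assumptions for illustration, not the actual torchelastic implementation.

```python
import threading
import time


class InMemoryStore:
    """Hypothetical stand-in for the TCPStore that lives on rank 0."""

    def __init__(self):
        self._data = {}
        self._cond = threading.Condition()
        self.closed = False

    def set(self, key, value):
        with self._cond:
            if self.closed:
                raise RuntimeError("store already destroyed")
            self._data[key] = value
            self._cond.notify_all()

    def wait(self, keys, timeout=5.0):
        deadline = time.monotonic() + timeout
        with self._cond:
            while not all(k in self._data for k in keys):
                if self.closed:
                    raise RuntimeError("store already destroyed")
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    raise TimeoutError(f"timed out waiting for {keys}")
                self._cond.wait(remaining)

    def close(self):
        with self._cond:
            self.closed = True
            self._cond.notify_all()


def barrier(store, rank, world_size):
    # Phase 1: every rank announces arrival and waits for all the others.
    store.set(f"arrived/{rank}", b"1")
    store.wait([f"arrived/{r}" for r in range(world_size)])
    # Phase 2: every rank reports that it no longer needs the store.
    store.set(f"done/{rank}", b"1")
    if rank == 0:
        # Rank 0 hosts the store, so it leaves last.
        store.wait([f"done/{r}" for r in range(world_size)])
        store.close()


def run_barrier(world_size):
    store = InMemoryStore()
    errors = []

    def worker(rank):
        try:
            time.sleep(0.01 * rank)  # stagger arrival times
            barrier(store, rank, world_size)
        except Exception as exc:
            errors.append((rank, exc))

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store.closed, errors
```

Without the second phase, rank 0 could close the store while slower ranks are still inside `wait`, which is the failure mode described in this summary.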
Test Plan: unit tests
Differential Revision: D35227180
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74931
Approved by: https://github.com/kiukchung
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73982
Currently, torchelastic users have no way to override `NCCL_ASYNC_ERROR_HANDLING` (for example, by setting `NCCL_ASYNC_ERROR_HANDLING=0`). This PR enables this.
Test Plan:
Added unittests
Manual testing
```
$ torchx run fb.dist.ddp -- --img torchx_examples -m print_env_vars.py --env NCCL_ASYNC_ERROR_HANDLING=0
```
Validated that `NCCL_ASYNC_ERROR_HANDLING` in the process running `print_env_vars.py` is indeed `0`.
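The intended behavior can be sketched as "set a default only when the user has not provided a value". The helper name `build_worker_env` and the default value of `"1"` are assumptions made for illustration; the sketch is not the torchelastic code itself.

```python
def build_worker_env(base_env, default_async_error_handling="1"):
    """Return the environment for a worker process.

    A user-supplied NCCL_ASYNC_ERROR_HANDLING is kept as-is; the default
    (assumed to be "1" here) is applied only when the variable is absent.
    """
    env = dict(base_env)
    env.setdefault("NCCL_ASYNC_ERROR_HANDLING", default_async_error_handling)
    return env
```

For example, `build_worker_env({"NCCL_ASYNC_ERROR_HANDLING": "0"})` keeps the value `"0"`, while `build_worker_env({})` falls back to the assumed default.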
Reviewed By: mannatsingh, aivanou
Differential Revision: D34765786
fbshipit-source-id: 3f9f6d3b61e7d265adf689d387e020ab534c9259
(cherry picked from commit 2b787b46c6d37f049fe39eb64eecedf68799e75c)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73477
resolves https://github.com/pytorch/pytorch/issues/73465
This `log.error` is not necessary (and it's also not formatted in a human-friendly way) because we end up re-raising the same exception after recording it into an error_file (if present). Eventually Python should handle this error the way it handles any other error and write the trace info to the console. The additional logging produces duplicate error prints on the console, which affects all users whose schedulers do not set the `TORCHELASTIC_ERROR_FILE` env var when calling `torch.distributed.run`.
Test Plan:
Induce an error on the agent process by `kill -15 $AGENT_PID`
```
python -m torch.distributed.run \
--nproc_per_node 2 \
--nnodes 1:1 \
--rdzv_backend c10d \
--rdzv_endpoint localhost:29500 \
--monitor_interval 3 test.py
```
Produces
{F704936697}
In contrast to the duplicated error before:
{F704936729}
Reviewed By: d4l3k
Differential Revision: D34501852
fbshipit-source-id: 14fed18a9664130980205007ff104ff15a5fd4f8
(cherry picked from commit 0b7c51ba8834f4a4a5376f585c0795cb43be6521)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68226
**Note that this PR is unusually big due to the urgency of the changes. Please reach out to me in case you wish to have a "pair" review.**
This PR introduces a major refactoring of the socket implementation of the C10d library. A big portion of the logic is now contained in the `Socket` class and a follow-up PR will further consolidate the remaining parts. As of today the changes in this PR offer:
- significantly better error handling and much more verbose logging (see the example output below)
- explicit support for IPv6 and dual-stack sockets
- correct handling of signal interrupts
- better Windows support
A follow-up PR will consolidate `send`/`recv` logic into `Socket` and fully migrate to non-blocking sockets.
## Example Output
```
[I logging.h:21] The client socket will attempt to connect to an IPv6 address on (127.0.0.1, 29501).
[I logging.h:21] The client socket is attempting to connect to [localhost]:29501.
[W logging.h:28] The server socket on [localhost]:29501 is not yet listening (Error: 111 - Connection refused), retrying...
[I logging.h:21] The server socket will attempt to listen on an IPv6 address.
[I logging.h:21] The server socket is attempting to listen on [::]:29501.
[I logging.h:21] The server socket has started to listen on [::]:29501.
[I logging.h:21] The client socket will attempt to connect to an IPv6 address on (127.0.0.1, 29501).
[I logging.h:21] The client socket is attempting to connect to [localhost]:29501.
[I logging.h:21] The client socket has connected to [localhost]:29501 on [localhost]:42650.
[I logging.h:21] The server socket on [::]:29501 has accepted a connection from [localhost]:42650.
[I logging.h:21] The client socket has connected to [localhost]:29501 on [localhost]:42722.
[I logging.h:21] The server socket on [::]:29501 has accepted a connection from [localhost]:42722.
[I logging.h:21] The client socket will attempt to connect to an IPv6 address on (127.0.0.1, 29501).
[I logging.h:21] The client socket is attempting to connect to [localhost]:29501.
[I logging.h:21] The client socket has connected to [localhost]:29501 on [localhost]:42724.
[I logging.h:21] The server socket on [::]:29501 has accepted a connection from [localhost]:42724.
[I logging.h:21] The client socket will attempt to connect to an IPv6 address on (127.0.0.1, 29501).
[I logging.h:21] The client socket is attempting to connect to [localhost]:29501.
[I logging.h:21] The client socket has connected to [localhost]:29501 on [localhost]:42726.
[I logging.h:21] The server socket on [::]:29501 has accepted a connection from [localhost]:42726.
```
ghstack-source-id: 143501987
Test Plan: Run existing unit and integration tests on devserver, Fedora, Ubuntu, macOS Big Sur, Windows 10.
Reviewed By: Babar, wilson100hong, mrshenli
Differential Revision: D32372333
fbshipit-source-id: 2204ffa28ed0d3683a9cb3ebe1ea8d92a831325a
Summary:
It turns out my lint doesn't work on CI all the time because of shell differences. I'm working on a new more comprehensive lint in https://github.com/pytorch/pytorch/pull/66826 and it'd be nice if these could be cleared first.
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67583
Reviewed By: H-Huang, mruberry
Differential Revision: D32045155
Pulled By: janeyx99
fbshipit-source-id: ecfe9f008310c28e3b731e246c2b2ed0106d03b1
Summary:
Action following discussion with distributed and r2p team--the tests under elastic in distributed should be owned by oncall: r2p and not distributed.
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67293
Reviewed By: jbschlosser
Differential Revision: D31973779
Pulled By: janeyx99
fbshipit-source-id: 05875a7600c6eb1da1310a48e1e32a1a69461c55
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66182
closes https://github.com/pytorch/pytorch/issues/63174
Does a few things:
1. adds hostname to the error report
2. moves the "root cause" section to the end (presumably since the logs are being "tailed" we want the root cause to appear at the end)
3. moves redundant error info logging to debug
4. makes the border max 60 char in length and justifies left for the header
NOTE: YOU HAVE TO annotate your main function with torch.distributed.elastic.multiprocessing.errors.record, otherwise no traceback is printed (this is because python exception propagation does NOT work out of the box for IPC - hence the extra record annotation).
Test Plan:
Sample
```
============================================================
run_script_path FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2021-10-05_17:37:22
host : devvm4955.prn0.facebook.com
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 3296201)
error_file: /home/kiuk/tmp/elastic/none_3_lsytqe/attempt_0/0/error.json
traceback :
Traceback (most recent call last):
File "/tmp/jetter.xr3_x6qq/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 372, in wrapper
return f(*args, **kwargs)
File "main.py", line 28, in main
raise RuntimeError(args.throws)
RuntimeError: foobar
============================================================
```
Reviewed By: cbalioglu, aivanou
Differential Revision: D31416492
fbshipit-source-id: 0aeaf6e634e23ce0ea7f6a03b12c8a9ac57246e9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65149
Fixes #64789
There is a race condition between the time the free port is acquired and the time it is used to create the store: in between, the port may have been taken by another process. Since this test only checks that the timeout is triggered for tcpstore, we can bind to any port on tcpstore creation.
This only affects the test on the server (since that is where the port is used), but I changed both tests for clarity.
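The idea of letting the OS pick the port at bind time, instead of reserving a port first and using it later, can be sketched with plain sockets. This is an illustration with the standard library only, not the `TCPStore` code; `listen_on_any_port` is a hypothetical helper.

```python
import socket


def listen_on_any_port(host="127.0.0.1"):
    """Bind to port 0 so the OS assigns a free port atomically at bind time."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, 0))
    server.listen()
    # The port is known only after binding; there is no window in which
    # another process could grab it between "find" and "use".
    return server, server.getsockname()[1]
```

Because the socket stays bound from the moment the port is chosen, two servers created this way always end up on different ports.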
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang cbalioglu gcramer23
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D30993166
Pulled By: H-Huang
fbshipit-source-id: eac4f28d641ac87c4ebee89df83f90955144f2f1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63442
Continuation of https://github.com/pytorch/pytorch/pull/62051, I've
enabled elastic and launcher tests to run in opt-asan mode which is supported
with spawn multiprocessing.
This allows us to completely get rid of fork based tests from torch.distributed
and have all tests run in spawn mode.
ghstack-source-id: 136057123
Test Plan: waitforbuildbot
Reviewed By: cbalioglu
Differential Revision: D30384267
fbshipit-source-id: ad3447cfb9d6e31e7ec8332d64c8ff1054858dcb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61887
1) Introduced a `sandcastle_skip_if` decorator that ensures these
tests just get passed on sandcastle.
2) Fixed all test files under `test/distributed` to not use `unittest.skip`
Overall goal is to avoid using skips since sandcastle tags these tests as
continuously skipping.
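A decorator of this kind could look roughly like the sketch below. The `SANDCASTLE` environment variable check is an assumption made for illustration; the actual detection logic and the decorator body in the test utilities may differ.

```python
import functools
import os
import sys
import unittest


def _is_sandcastle():
    # Assumption: the CI system is detected through an environment variable.
    return os.getenv("SANDCASTLE") == "1"


def sandcastle_skip_if(condition, reason):
    """Skip the test normally, but report it as passed on sandcastle."""

    def decorator(func):
        if not condition:
            return func
        if _is_sandcastle():
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                # Return without running the body so the test counts as passed.
                print(f"Skipping {func.__name__} on sandcastle: {reason}", file=sys.stderr)
                return None

            return wrapper
        # Outside sandcastle, fall back to a regular unittest skip.
        return unittest.skip(reason)(func)

    return decorator
```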
ghstack-source-id: 134382237
Test Plan: waitforbuildbot
Reviewed By: SciPioneer
Differential Revision: D29784152
fbshipit-source-id: 17b4df6c5a55ff1d1e8e1de128fa679c3dfbcb7d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61822
Added scuba logging to the following files:
- dynamic_rendezvous.py
- c10d_rendezvous_backend.py
NOTE: This diff introduces the use of python's inspect module to easily allow for obtaining the calling method name and filename when logging. This module can mess with python's garbage collector, so special care was taken to never store references to results from inspect.stack() longer than absolutely needed.
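A minimal sketch of that precaution is shown below; `caller_info` is a hypothetical helper, not the function added in this diff. It extracts plain strings from the stack and drops the frame list before returning.

```python
import inspect
import os


def caller_info():
    """Return (filename, function name) of the caller as plain strings."""
    stack = inspect.stack()
    try:
        caller = stack[1]  # index 0 is caller_info itself
        return os.path.basename(caller.filename), caller.function
    finally:
        # Frame objects keep locals alive and can form reference cycles,
        # so drop the list explicitly instead of letting it linger.
        del stack
```

Only strings leave the function, so no frame object outlives the call.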
Test Plan:
The following tests can be run.
```
buck run mode/dev-nosan //caffe2/test/distributed/elastic/rendezvous:c10d_rendezvous_backend_test
```
```
buck run mode/dev-nosan //caffe2/test/distributed/elastic/rendezvous:dynamic_rendezvous_test
```
```
buck run mode/dev-nosan //caffe2/test/distributed/elastic/events:lib_test
```
Reviewed By: aivanou
Differential Revision: D29643774
fbshipit-source-id: f10cd5ebf8f6860856267bc2483c0b85faacb0fd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61602
The diff introduces signal handlers and a `SignalException` that is raised when the agent process receives SIGTERM or SIGINT.
When either of these signals is received, the termination handler raises the `SignalException`. The exception is then processed by the main agent loop, which invokes `shutdown(signum)` to propagate the received signal to the child processes. A default timeout of 30 seconds is introduced: if the child processes cannot terminate gracefully within this timeout, the agent process kills them via SIGKILL.
Test Plan: unittests, sandcastle
Reviewed By: cbalioglu
Differential Revision: D29671783
fbshipit-source-id: 3dbca2125676dc18d417cc3e3bb0301fdd42737a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60558
Fixes 1/2 flaky tests as described in: https://github.com/pytorch/pytorch/issues/60260
`test_create_store_timeout_on_server` tests whether trying to create a `c10d::TCPStore` server on an already taken port actually fails with an `IOError`. Prior to this change the `utils.get_socket_with_port()` util method was used to synthetically reserve a port, then try creating the `TCPStore` on that port to validate the `IOError`. The issue with this is that on a dual stack ip setup, `get_socket_with_port()` (since it uses `socket.AF_UNSPEC`) reserves an ipv6 port, while `TCPStore` will try binding to an ipv4 port, so an `IOError` is not observed.
This change makes the test create two `TCPStore` servers. The first chooses a free port (by passing `server_port=0`) while the second tries to create a `TCPStore` server on the port that the first store is already running on. This induces an `IOError` in the second store's constructor.
NOTE: this change does not solve another broader issue with `TCPStore` where the server and workers can listen and connect on ipv4 vs ipv6 when they are running on dual-stack ip hosts without an ipv4 DNS entry and/or a `/etc/gai.conf` specifying the preferred bind ordering. See: https://github.com/pytorch/pytorch/pull/49124
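The shape of the revised test can be mimicked with plain sockets: start one server on an OS-chosen port, then try to start a second one on the same port and expect a failure. This is an analogy using the standard library (where `IOError` is an alias of `OSError`), not the `TCPStore` test itself; `start_server` is a hypothetical helper.

```python
import socket


def start_server(port=0, host="127.0.0.1"):
    """Start a listening IPv4 socket; port=0 lets the OS choose a free port."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        server.bind((host, port))
    except OSError:
        server.close()
        raise
    server.listen()
    return server


def second_server_fails():
    first = start_server()  # picks a free port
    taken_port = first.getsockname()[1]
    try:
        second = start_server(taken_port)  # same address family, same port
    except OSError:
        return True
    else:
        second.close()
        return False
    finally:
        first.close()
```

Both sockets use the same address family, which avoids the ipv4 vs ipv6 mismatch that made the original test flaky.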
Test Plan:
```
buck test //caffe2/test/distributed/elastic/utils:distributed_test
```
Reviewed By: cbalioglu
Differential Revision: D29334947
fbshipit-source-id: 76b998c59082cb04c0e86b7a1f3b509367fa0136
Summary:
`IS_PYTORCH_CI` and `IN_CI` are used inconsistently; in some cases `IN_CI` is not set because it only exists in .circleci/scripts/setup_ci_environment.sh. This cleans up the two flags and uses only `IN_CI`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60279
Test Plan: CI
Reviewed By: seemethere
Differential Revision: D29239545
Pulled By: walterddr
fbshipit-source-id: a069424a2bb8790a3adfdaf0dc460301026bf8c7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59719
Added filestore functionality to the c10d backend. FileStore will create a temporary file in the /tmp directory to use if it is selected as the store type. Appropriate tests were added as well.
FileStore was modified to expose the path field for testing. It was also modified so that the numWorkers field in the constructor is optional (defaulting to -1). A negative value indicates that there is no fixed number of workers. In this case, no attempt is made to clean up the file at the end.
Test Plan: Unit tests for creating a c10d backend with filestore and simple error handling.
Reviewed By: cbalioglu, H-Huang
Differential Revision: D28997436
fbshipit-source-id: 24c9b2c9b13ea6c947e8b1207beda892bdca2217
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59712
When a worker process fails in fb due to a signal failure, the TerminationHandler writes an error reply file. Recently the error reply file was changed for mast jobs. The JSON value of ``timestamp`` is a string, even though in the thrift struct it is an int: https://fburl.com/diffusion/upa228u5
This diff adds support for casting str timestamp to int.
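The cast amounts to something like the sketch below. The flat `{"timestamp": ...}` layout and the `parse_timestamp` helper are assumptions for illustration; the real reply file nests the field differently.

```python
import json


def parse_timestamp(reply_json):
    """Read the timestamp from an error reply and return it as an int.

    Accepts both the int form and the str form of the value.
    """
    data = json.loads(reply_json)
    return int(data["timestamp"])
```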
Test Plan: buck test mode/dev-nosan //caffe2/test/distributed/elastic/multiprocessing/errors:api_test
Reviewed By: suphoff
Differential Revision: D28995827
fbshipit-source-id: 333448cfb4d062dc7fe751ef5839e66bfcb3ba00
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59663
This PR fixes an edge case bug in `DynamicRendezvousHandler` where the state of the rendezvous is not always entirely updated when one or more nodes are not alive anymore.
Test Plan: Run the existing and newly-introduced unit tests.
Reviewed By: tierex
Differential Revision: D28971809
fbshipit-source-id: ebbb6a5f2b04f045c3732d6cf0f8fdc7c2381a7c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59657
Introduce tests that exercise the elastic agent with the c10d and etcd-v2 rendezvous backends.
Added a port allocation method that uses sockets to find an available port for the c10d backend. This way, agents that are created will all share the specified address/port and can communicate.
Added a method that abstracts the backend to use when running a test. This way, any test can quickly be switched to run on the backend of choice (c10d, etcd, or etcd-v2).
Test Plan: Tests various components of the elastic agent with 3 different backends: etcd, etcd-v2, and c10d.
Reviewed By: tierex
Differential Revision: D28972604
fbshipit-source-id: fd4cff6417fefdf0de9d7a114820914b968006a8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58159
This PR includes the following changes:
- The `--standalone` option of `torch.distributed.run` now uses the `c10d` backend instead of `etcd` backend.
- The `import` statement for `EtcdServer` has been removed from the run script.
- The docstrings and parameter descriptions of the run script have been revised and improved.
- The default port number of `EtcdRendezvousBackend` has been changed from 29500 to 29400 to improve the user experience when used along with the run script which uses the port 29500 for the distributed job store (a.k.a. `MASTER_PORT`) by default.
ghstack-source-id: 128782267
Test Plan:
- Run existing tests.
- Visually verified the correct rendering of the docs.
Reviewed By: tierex
Differential Revision: D28383681
fbshipit-source-id: a4098f7c23c97a2376a9c4023e81f82fedd04b10
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58057
This PR refactors the store initialization logic and moves it to the `create_backend` function for both C10d and etcd backends.
ghstack-source-id: 128671579
Test Plan: Run the existing and revised tests.
Reviewed By: tierex
Differential Revision: D28356587
fbshipit-source-id: caf9416ab811eefe4834268d8a11a48f2236ed5b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57764
As discussed offline this PR renames etcd-experimental backend to etcd-v2 and c10d-experimental backend to c10d.
ghstack-source-id: 128342523
Test Plan: Run the existing unit tests.
Reviewed By: kiukchung
Differential Revision: D28263739
fbshipit-source-id: c3409037ecea5a8ff6daadeeb1f2fb4205cc3852
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57151
This PR introduces the implementation of `DynamicRendezvousHandler`, which mostly builds on the types introduced in previous PRs.
ghstack-source-id: 127685212
Test Plan: Run the existing and new unit tests.
Reviewed By: tierex
Differential Revision: D28060531
fbshipit-source-id: 844ff0e9c869f2bbb85fba05a16002d00eae130f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57150
This PR refactors the `__init__` method of `DynamicRendezvousHandler` to a `from_backend` static constructor for easier testing and future extensibility.
ghstack-source-id: 127685183
Test Plan: Run the updated unit tests.
Reviewed By: tierex
Differential Revision: D28060336
fbshipit-source-id: b07dcbb61e8ff5a536b7b021cd50438010c648dd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57149
This PR introduces the `_RendezvousJoinOp` type that represents a rendezvous join operation to be executed via a `_RendezvousOpExecutor`.
ghstack-source-id: 127685142
Test Plan: Run the existing and new unit tests.
Reviewed By: tierex
Differential Revision: D28059785
fbshipit-source-id: 6e67a54289eef1a2349fcc52f8841e49c139459a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57148
This PR introduces the `_RendezvousExitOp` type that represents a rendezvous exit operation to be executed via a `_RendezvousOpExecutor`.
ghstack-source-id: 127685094
Test Plan: Run the existing and new unit tests.
Reviewed By: tierex
Differential Revision: D28059764
fbshipit-source-id: 2da428885f1390957242fdd82d68cee2ac273c71
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57147
This PR introduces the `_RendezvousKeepAliveOp` type that represents a rendezvous keep-alive heartbeat operation to be executed via a `_RendezvousOpExecutor`.
ghstack-source-id: 127685037
Test Plan: Run the existing and new unit tests.
Reviewed By: tierex
Differential Revision: D28059733
fbshipit-source-id: 31fd8fc06f03d8f9cd21558b15a06dea7ad85bc6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57146
This PR introduces the `_RendezvousCloseOp` type that represents a rendezvous close operation to be executed via a `_RendezvousOpExecutor`.
ghstack-source-id: 127684991
Test Plan: Run the existing and new unit tests.
Reviewed By: tierex
Differential Revision: D28059693
fbshipit-source-id: 6c944d3b4f6a6ed2057ea2921ae8a42609998dd2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57145
This PR introduces the `_DistributedRendezvousOpExecutor` type that implements the `_RendezvousOpExecutor` interface for rendezvous shared via a `_RendezvousStateHolder`.
ghstack-source-id: 127684945
Test Plan: Run the existing and new unit tests.
Reviewed By: tierex
Differential Revision: D28059417
fbshipit-source-id: 7ef72ea16b54eaaa11a6ece7459d385d49692a84
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56538
This PR introduces the `_RendezvousStateHolder` interface and its accompanying `_BackendRendezvousStateHolder` type that is responsible for synchronizing the local rendezvous state with the other nodes.
ghstack-source-id: 127684796
Test Plan: Run the existing and new unit tests.
Reviewed By: tierex
Differential Revision: D27892600
fbshipit-source-id: a55d884a1f9b0d742787be4dff4271e076c08962
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57143
This PR introduces a `name` attribute in `_PeriodicTimer` for testing and debugging purposes.
ghstack-source-id: 127684751
Test Plan: Run the new and updated unit tests.
Reviewed By: tierex
Differential Revision: D28059045
fbshipit-source-id: 9eb067300aea21a99577e6cd8a354f7eb749f4a6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57142
This PR extends the return type of `RendezvousBackend`'s `set_state` method with an additional boolean flag that specifies whether the write attempt has succeeded.
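The extended contract can be sketched with a hypothetical in-memory backend that uses an integer fencing token. The class and its token scheme are assumptions for illustration; only the idea of returning a success flag alongside the state comes from this PR.

```python
from typing import Optional, Tuple


class InMemoryBackend:
    """Hypothetical backend with compare-and-set semantics."""

    def __init__(self):
        self._state: Optional[bytes] = None
        self._token = 0  # 0 means "no state written yet"

    def get_state(self) -> Optional[Tuple[bytes, int]]:
        if self._state is None:
            return None
        return self._state, self._token

    def set_state(
        self, state: bytes, token: Optional[int] = None
    ) -> Optional[Tuple[bytes, int, bool]]:
        expected = 0 if token is None else token
        if expected == self._token:
            # Token matches: the write succeeds and a new token is issued.
            self._state = state
            self._token += 1
            return self._state, self._token, True
        if self._state is None:
            return None
        # Stale token: return the current state and flag the write as failed.
        return self._state, self._token, False
```

The boolean spares the caller from comparing the returned state with the one it tried to write in order to find out whether the write went through.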
ghstack-source-id: 127629538
Test Plan: Run the updated unit tests.
Reviewed By: tierex
Differential Revision: D28058980
fbshipit-source-id: 26333790c39386891beb155b20ba1291d2cbdd03
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57141
Per feedback this PR renames `last_keep_alives` to `last_heartbeats` in `_RendezvousState`.
ghstack-source-id: 127629442
Test Plan: Run the updated unit tests.
Reviewed By: tierex
Differential Revision: D28058948
fbshipit-source-id: 0db12eac56a47a426a7a48fb5c93ac6a08b0d22e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57140
This PR introduces a new `heartbeat` attribute in `RendezvousTimeout`.
ghstack-source-id: 127626815
Test Plan: Run the updated unit tests.
Reviewed By: tierex
Differential Revision: D28058908
fbshipit-source-id: c6f8b3a06210cc59714fa841d9387eeb028dc02f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56537
This PR introduces the `RendezvousSettings` type to consolidate the arguments passed to `DynamicRendezvousHandler`.
ghstack-source-id: 127626738
Test Plan: Run the existing unit tests.
Reviewed By: tierex
Differential Revision: D27890155
fbshipit-source-id: 22060c25b6927cc832f18ae6c5f7ba0f7a9ef3cf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56536
This PR adds unit tests to ensure that the encoded byte length of `_RendezvousState` stays under a certain limit.
ghstack-source-id: 127626622
Test Plan: Run the newly-introduced unit tests.
Reviewed By: tierex
Differential Revision: D27890704
fbshipit-source-id: 24905c8bc9d985d5ee90d370f28739eb137ce0f0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56739
The diff makes several tiny changes:
* Add logs for each worker error file destination
* Make sure log_dir is propagated from the launcher
* Make ProcessFailure initialization error non-fatal.
Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed/elastic/multiprocessing/errors:api_test
https://fburl.com/tupperware/0nizb9z8
Reviewed By: borovsky-d, wilson100hong
Differential Revision: D27952596
fbshipit-source-id: 69582bf4be47758def4008f2abf82d123294cd1a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56534
This PR reorders the type definitions in dynamic_rendezvous.py to improve readability.
ghstack-source-id: 126979087
Test Plan: Run the existing unit tests.
Reviewed By: H-Huang
Differential Revision: D27889817
fbshipit-source-id: 04291af9b8f3170e4b33cb4f33e0dff0d2d3fb23
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56533
This PR introduces a small utility function to delay the execution of the current thread.
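Such a utility might look like the sketch below. The name `delay`, the optional `(min, max)` range with a random pick, and the returned value are assumptions for illustration; the PR description only states that the function delays the current thread.

```python
import random
import time
from typing import Tuple, Union


def delay(seconds: Union[float, Tuple[float, float]]) -> float:
    """Suspend the current thread and return the number of seconds requested.

    A (min, max) tuple picks a random duration in that range, which helps
    spread out retries from many nodes (assumed behavior).
    """
    if isinstance(seconds, tuple):
        seconds = random.uniform(*seconds)
    time.sleep(seconds)
    return seconds
```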
ghstack-source-id: 126979035
Test Plan: Run the associated unit tests.
Reviewed By: H-Huang
Differential Revision: D27889671
fbshipit-source-id: aae93b624bd4704da7a48004f50d130cec64969d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56532
This PR fixes a subtle issue with the finalizer implementation of `_PeriodicTimer`.
We avoid using a regular finalizer (a.k.a. `__del__`) for stopping the timer as joining a daemon thread during the interpreter shutdown can cause deadlocks. The `weakref.finalize` is a superior alternative that provides a consistent behavior regardless of the GC implementation.
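The pattern can be sketched as follows. This `PeriodicTimer` is a simplified illustration rather than the `_PeriodicTimer` source; the key point is that neither the thread nor the finalizer callback holds a reference to the timer object itself.

```python
import threading
import weakref


class PeriodicTimer:
    """Calls `function` every `interval` seconds until cancelled or collected."""

    def __init__(self, interval, function):
        self._stop_event = threading.Event()
        self._thread = threading.Thread(
            target=self._run,
            args=(interval, function, self._stop_event),
            daemon=True,
        )
        # The callback receives only the thread and the event, never `self`,
        # so the finalizer does not keep the timer alive.
        self._finalizer = weakref.finalize(
            self, self._stop_thread, self._thread, self._stop_event
        )
        self._thread.start()

    @staticmethod
    def _run(interval, function, stop_event):
        # Event.wait returns False on timeout, so this ticks until stopped.
        while not stop_event.wait(interval):
            function()

    @staticmethod
    def _stop_thread(thread, stop_event):
        stop_event.set()
        thread.join()

    def cancel(self):
        # Calling the finalizer runs the callback at most once.
        self._finalizer()
```

Using static methods for the thread target and the callback is what allows the timer object to become unreachable while its thread is still running.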
ghstack-source-id: 126978904
Test Plan: Run the existing unit tests as there is no behavioral change.
Reviewed By: H-Huang
Differential Revision: D27889289
fbshipit-source-id: a248cf6fd1abc4da8bef90e160fa9669a4961fa5