Tristan Rice
7c370d2fb0
expose set_thread_name to Python and set thread names (#128448)
This adds a new multiprocessing method `_set_thread_name` and calls it from the torchelastic and dataloader main functions. This allows better monitoring of processes, since the elastic and dataloading processes can now be distinguished from the main training process.
Threads named:
* torchrun/elastic
* PyTorch dataloader worker processes + pin memory thread
* TCPStore
* ProcessGroupNCCL background threads
* WorkerServer httpserver thread
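For illustration, a minimal sketch of calling the new hook directly from Python; it assumes the helper is exposed as `torch.multiprocessing._set_thread_name`, as the description above suggests, and that names are subject to the usual 15-character Linux thread-name limit.
```py
# Minimal sketch, assuming the hook is exposed as
# torch.multiprocessing._set_thread_name (per the description above).
import torch.multiprocessing as mp

# Rename the calling thread; Linux truncates thread names to 15 characters,
# hence the short "pt_" prefixes used throughout this change.
mp._set_thread_name("pt_example")
```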
Test plan:
```
$ torchrun --nnodes 1 --nproc_per_node 1 --no-python /bin/bash -c 'ps -eL | grep pt_'
3264281 3264281 pts/45 00:00:02 pt_elastic
3264281 3267950 pts/45 00:00:00 pt_elastic
```
Dataloading:
```py
import torch
import time
from torch.utils.data import (
    DataLoader,
    Dataset,
)


class NoopDataset(Dataset):
    def __getitem__(self, index):
        return index

    def __len__(self):
        return 10


dataloader = DataLoader(NoopDataset(), num_workers=2)

for i, x in enumerate(dataloader):
    print(i, x)
    # Sleep inside the loop so the worker processes stay alive and their
    # thread names can be inspected with `ps -eL`.
    time.sleep(10000)
```
```
$ python3 ~/scripts/dataloader_test.py
$ ps -eL | grep pt_
1228312 1228312 pts/45 00:00:02 pt_main_thread
1228312 1230058 pts/45 00:00:00 pt_main_thread
1228312 1230059 pts/45 00:00:00 pt_main_thread
1230052 1230052 pts/45 00:00:00 pt_data_worker
1230052 1230198 pts/45 00:00:00 pt_data_worker
1230052 1230740 pts/45 00:00:00 pt_data_worker
1230055 1230055 pts/45 00:00:00 pt_data_worker
1230055 1230296 pts/45 00:00:00 pt_data_worker
1230055 1230759 pts/45 00:00:00 pt_data_worker
```
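As a side note (not part of the original test plan), thread names can also be read straight from `/proc` instead of `ps -eL`, which can be handy when checking a single process:
```py
# Illustrative helper (not from this PR): print the comm name of every
# thread in a process by reading /proc/<pid>/task/<tid>/comm.
import os
import sys

pid = sys.argv[1] if len(sys.argv) > 1 else str(os.getpid())
for tid in sorted(os.listdir(f"/proc/{pid}/task")):
    with open(f"/proc/{pid}/task/{tid}/comm") as f:
        print(tid, f.read().strip())
```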
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128448
Approved by: https://github.com/c-p-i-o, https://github.com/andrewkho, https://github.com/rsdcastro
2024-06-13 16:38:23 +00:00