This adds a new multiprocessing method `_set_thread_name` and calls it from the torchelastic and dataloader main functions. This allows better monitoring of processes, since elastic and dataloader processes can be distinguished from the main training process.
Threads named:
* torchrun/elastic
* PyTorch dataloader worker processes + pin memory thread
* TCPStore
* ProcessGroupNCCL background threads
* WorkerServer httpserver thread
Test plan:
```
$ torchrun --nnodes 1 --nproc_per_node 1 --no-python /bin/bash -c 'ps -eL | grep pt_'
3264281 3264281 pts/45 00:00:02 pt_elastic
3264281 3267950 pts/45 00:00:00 pt_elastic
```
Dataloader:
```py
import torch
import time
from torch.utils.data import (
DataLoader,
Dataset,
)
class NoopDataset(Dataset):
    def __getitem__(self, index):
        return index

    def __len__(self):
        return 10


dataloader = DataLoader(NoopDataset(), num_workers=2)

for i, x in enumerate(dataloader):
    print(i, x)

time.sleep(10000)
```
```
$ python3 ~/scripts/dataloader_test.py
$ ps -eL | grep pt_
1228312 1228312 pts/45 00:00:02 pt_main_thread
1228312 1230058 pts/45 00:00:00 pt_main_thread
1228312 1230059 pts/45 00:00:00 pt_main_thread
1230052 1230052 pts/45 00:00:00 pt_data_worker
1230052 1230198 pts/45 00:00:00 pt_data_worker
1230052 1230740 pts/45 00:00:00 pt_data_worker
1230055 1230055 pts/45 00:00:00 pt_data_worker
1230055 1230296 pts/45 00:00:00 pt_data_worker
1230055 1230759 pts/45 00:00:00 pt_data_worker
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128448
Approved by: https://github.com/c-p-i-o, https://github.com/andrewkho, https://github.com/rsdcastro
14 lines · 186 B · C++
```cpp
#pragma once

#include <string>

#include <c10/macros/Export.h>

namespace c10 {

C10_API void setThreadName(std::string name);

C10_API std::string getThreadName();

} // namespace c10
```