Commit Graph

5 Commits

Author SHA1 Message Date
Will Constable
0344f95c2e Add missing #include <array> to thread_name.cpp (#128664)
I got local compile errors (using clang 14.0.6) due to this missing include after pulling the
latest pytorch main.  It's totally puzzling why CI appears to pass
without this fix.  Hopefully someone else will have an idea if we are
missing some CI coverage or if I am using a strange build setup locally.

The PR introducing the compile errors was https://github.com/pytorch/pytorch/pull/128448.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128664
Approved by: https://github.com/fduwjj, https://github.com/malfet, https://github.com/d4l3k
2024-06-14 07:49:09 +00:00
Tristan Rice
7c370d2fb0 expose set_thread_name to Python and set thread names (#128448)
This adds a new multiprocessing method `_set_thread_name` and calls it from torchelastic and dataloader main functions. This will allow better monitoring of processes as we can separate elastic and dataloading processes from the main training process.

Threads named:

* torchrun/elastic
* PyTorch dataloader worker processes + pin memory thread
* TCPStore
* ProcessGroupNCCL background threads
* WorkerServer httpserver thread

Test plan:

```
$ torchrun --nnodes 1 --nproc_per_node 1 --no-python /bin/bash -c 'ps -eL | grep pt_'
3264281 3264281 pts/45   00:00:02 pt_elastic
3264281 3267950 pts/45   00:00:00 pt_elastic
```

dataloading

```py
import torch
import time

from torch.utils.data import (
    DataLoader,
    Dataset,
)

class NoopDataset(Dataset):
    def __getitem__(self, index):
        return index

    def __len__(self):
        return 10

dataloader = DataLoader(NoopDataset(), num_workers=2)

for i, x in enumerate(dataloader):
    print(i, x)
    time.sleep(10000)
```

```
$ python3 ~/scripts/dataloader_test.py
$ ps -eL | grep pt_
1228312 1228312 pts/45   00:00:02 pt_main_thread
1228312 1230058 pts/45   00:00:00 pt_main_thread
1228312 1230059 pts/45   00:00:00 pt_main_thread
1230052 1230052 pts/45   00:00:00 pt_data_worker
1230052 1230198 pts/45   00:00:00 pt_data_worker
1230052 1230740 pts/45   00:00:00 pt_data_worker
1230055 1230055 pts/45   00:00:00 pt_data_worker
1230055 1230296 pts/45   00:00:00 pt_data_worker
1230055 1230759 pts/45   00:00:00 pt_data_worker
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128448
Approved by: https://github.com/c-p-i-o, https://github.com/andrewkho, https://github.com/rsdcastro
2024-06-13 16:38:23 +00:00
Maxim Belkin
13c975684a c10/util/thread_name.cpp: pthread_setname_np requires Glibc 2.12 (#55063)
Summary:
`pthread_setname_np` requires Glibc 2.12. The patch reproduces what numactl does: 93867c59b0/syscall.c (L132-L136)

Related to issue https://github.com/pytorch/pytorch/issues/23482 and the `pthread_setname_np.patch` patch that adamjstewart shared.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55063

Reviewed By: soulitzer

Differential Revision: D28577146

Pulled By: malfet

fbshipit-source-id: 85867b6f04795b1ae7bd46dbbc501cfd0ec9f163
2021-05-21 10:26:51 -07:00
Xiang Gao
15c7486416 Canonicalize includes in c10, and add tests for it (#36299)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36299

Test Plan: Imported from OSS

Differential Revision: D20943005

Pulled By: ezyang

fbshipit-source-id: 9dd0a58824bd0f1b5ad259942f92954ba1f63eae
2020-04-10 12:07:52 -07:00
James Reed
1d26a3ae7e Open registration for c10 thread pool (#17788)
Summary:
1. Move ATen threadpool & open registration mechanism to C10
2. Move the `global_work_queue` to use this open registration mechanism, to allow users to substitute in their own
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17788

Reviewed By: zdevito

Differential Revision: D14379707

Pulled By: jamesr66a

fbshipit-source-id: 949662d0024875abf09907d97db927f160c54d45
2019-03-08 15:38:41 -08:00