100 Commits

Author SHA1 Message Date
Ralf Gommers
bcab2d6848 Add type annotations for cpp_extension, utils.data, signal_handling (#42647)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42647

Reviewed By: ezyang

Differential Revision: D22967041

Pulled By: malfet

fbshipit-source-id: 35e124da0be56934faef56834a93b2b400decf66
2020-08-06 09:42:07 -07:00
yl-to
1b55e2b043 add prefetch_factor for multiprocessing prefetching process (#41130)
Summary:
fix https://github.com/pytorch/pytorch/issues/40604
Add a parameter to DataLoader to configure the per-worker prefetch number.
Before this change, the prefetching process always prefetched 2 * num_workers data items. This commit makes that number configurable; e.g., you can specify prefetching 10 * num_workers data items.
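
To illustrate the bound that this parameter controls, here is a minimal sketch of an ordered prefetcher that keeps at most `prefetch_factor * num_workers` loads in flight. It is not the DataLoader implementation: it uses a thread pool rather than worker processes, and `prefetching_iter` and `load_fn` are hypothetical names chosen for this example.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def prefetching_iter(load_fn, indices, num_workers=2, prefetch_factor=2):
    """Yield load_fn(i) in order while keeping a bounded number of loads in flight."""
    budget = prefetch_factor * num_workers  # maximum outstanding items
    pending = deque()
    index_iter = iter(indices)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Prime the pipeline up to the budget.
        for idx in index_iter:
            pending.append(pool.submit(load_fn, idx))
            if len(pending) >= budget:
                break
        while pending:
            result = pending.popleft().result()
            # Refill one slot for each item consumed.
            for idx in index_iter:
                pending.append(pool.submit(load_fn, idx))
                break
            yield result


# Hypothetical usage: 2 workers, 3 items prefetched per worker.
squares = list(prefetching_iter(lambda i: i * i, range(10), num_workers=2, prefetch_factor=3))
```

A larger `prefetch_factor` trades memory for a deeper buffer of ready items; the output order is unchanged either way.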

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41130

Reviewed By: izdeby

Differential Revision: D22705288

Pulled By: albanD

fbshipit-source-id: 2c483fce409735fef1351eb5aa0b033f8e596561
2020-07-24 08:38:13 -07:00
SsnL
1922f2212a Make IterableDataset dataloader.__len__ warning clearer (#41175)
Summary:
Based on discussion with jlucier (https://github.com/pytorch/pytorch/pull/38925#issuecomment-655859195). The `batch_size` change isn't made because the data loader only has the notion of `batch_sampler`, not batch size. If `batch_size`-dependent sharding is needed, users can still access it from their own code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41175

Differential Revision: D22456525

Pulled By: zou3519

fbshipit-source-id: 5281fcf14807f219de06e32107d5fe7d5b6a8623
2020-07-09 13:49:29 -07:00
Wojciech Baranowski
0e09511af9 type annotations for dataloader, dataset, sampler (#39392)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/38913

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39392

Reviewed By: anjali411

Differential Revision: D22102489

Pulled By: zou3519

fbshipit-source-id: acb68d9521145f0b047214d62b5bdc5a0d1b9be4
2020-07-07 07:16:18 -07:00
Tongzhou Wang
019eeb3183 Kill DataLoader worker when we can't join (#39869)
Summary:
There are still occasional reports of DataLoader workers not exiting (e.g., https://github.com/pytorch/pytorch/issues/39570). Before we figure out why, we should just kill them if the join times out to prevent hanging.
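
The join-then-kill pattern can be sketched as follows. This is an illustration only, not the code from the PR: it uses `subprocess.Popen` as a stand-in for a DataLoader worker process, and `stop_worker` is a hypothetical helper name.

```python
import subprocess
import sys


def stop_worker(proc, timeout):
    """Wait for a worker to exit; kill it if it does not exit within `timeout` seconds."""
    try:
        proc.wait(timeout=timeout)
        return "exited"
    except subprocess.TimeoutExpired:
        # The worker is stuck: kill it instead of hanging the parent forever.
        proc.kill()
        proc.wait()  # reap the killed process
        return "killed"


# A deliberately stuck child process standing in for a hung worker.
hung = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
outcome = stop_worker(hung, timeout=0.5)
```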
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39869

Differential Revision: D22018501

Pulled By: ezyang

fbshipit-source-id: 66a00d0f5b3e303b6106b336949176b3ff8ac8ae
2020-06-15 11:18:23 -07:00
ShawnZhong
c8c53c802e Add generator= kwarg for DataLoader & random samplers (#39737)
Summary:
Fix https://github.com/pytorch/pytorch/issues/39572

Add `generator=` kwarg for DataLoader & random samplers

cc: SsnL, deeppatel4557, albanD, mitar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39737

Differential Revision: D22019132

Pulled By: albanD

fbshipit-source-id: 835e08b86c5396bc0b0e41057661306b15394d6e
2020-06-15 07:01:20 -07:00
Daiming Yang
0b90b9cdd3 Allow shuffle when auto-batching disabled in DataLoader (#39865)
Summary:
Fix https://github.com/pytorch/pytorch/issues/35761
cc SsnL

Note: closed the other PR for this new branch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39865

Differential Revision: D22003612

Pulled By: ezyang

fbshipit-source-id: 26aecd1b298fe99d3924f4c8157cd6cae2561c7c
2020-06-11 15:17:46 -07:00
Donna Choi
3d2fce6bc3 Change len(DataLoader) for IterableDataset (#38925)
Summary:
Fix https://github.com/pytorch/pytorch/issues/36176

One-liner change to ensure that ```len(loader) == (len(dataset) // batch_size)``` for IterableDataset.
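
As a worked example of the arithmetic, the sketch below computes a batch count from a declared dataset length. The `drop_last` branch is an assumption added here for illustration (the summary above only states the floor-division form), and `loader_len` is a hypothetical helper, not a PyTorch function.

```python
import math


def loader_len(dataset_len, batch_size, drop_last=True):
    """Number of batches produced from `dataset_len` samples."""
    if drop_last:
        # Incomplete trailing batch is discarded: floor division.
        return dataset_len // batch_size
    # Incomplete trailing batch is kept: round up.
    return math.ceil(dataset_len / batch_size)


full_batches = loader_len(10, 3, drop_last=True)
all_batches = loader_len(10, 3, drop_last=False)
```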
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38925

Differential Revision: D21731587

Pulled By: ezyang

fbshipit-source-id: 59a086165a004c0c1c8a1ee0776b1444bd26de23
2020-05-27 11:56:41 -07:00
SsnL
b5868b2833 Relax sampler check in BatchSampler (#38403)
Summary:
Since the check was added in https://github.com/pytorch/pytorch/pull/6249, one cannot pass an iterable as a sampler to the data loader anymore, which was a very handy feature (e.g., https://github.com/pytorch/pytorch/issues/1337). I think the check should be removed for two reasons:
1. It is too strict. There is no reason that it should not be a general iterable.
2. It is inconsistent. In `DataLoader` (the main place where people use samplers), you can pass a general iterable as `batch_sampler` but not `sampler` due to this check.
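
To show why a general iterable is sufficient, here is a minimal sketch of batching logic that only needs to iterate over the sampler. `batch_indices` is a hypothetical stand-in written for this example, not the `BatchSampler` source.

```python
def batch_indices(sampler, batch_size, drop_last=False):
    """Group indices from any iterable into lists of `batch_size`."""
    batch = []
    for idx in sampler:  # only iteration is required, not a Sampler subclass
        batch.append(idx)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch and not drop_last:
        yield batch


# A plain iterator works as the "sampler".
batches = list(batch_indices(iter([4, 2, 0, 3, 1]), batch_size=2))
```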
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38403

Differential Revision: D21555958

Pulled By: soumith

fbshipit-source-id: c7267bb99a31edd8f2750689205d6edc5dab5cff
2020-05-13 22:24:29 -07:00
Wojciech Baranowski
69e3ee2d5f DataLoader: properly diagnose exceeding file descriptor limit (#34768)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/973

Common failure scenario:
* DataLoader creates workers and communicates with them through SHMs
* Workers send back through an AF_UNIX socket file descriptors to SHMs containing data
* The limit of open files gets fully used
* A FD gets stripped from a socket message coming back from a worker, without the worker knowing this.
* This causes a `RuntimeError: received 0 items of ancdata` in the standard `multiprocessing` package
* The exception is not handled by PyTorch and so is presented to the users.

After this change the user will see

```
Traceback (most recent call last):
  File "/home/wbaranowski/git/Quansight/pytorch/torch/utils/data/dataloader.py", line 761, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/wbaranowski/miniconda3/envs/pytorch-cuda-dev/lib/python3.6/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/home/wbaranowski/git/Quansight/pytorch/torch/multiprocessing/reductions.py", line 294, in rebuild_storage_fd
    fd = df.detach()
  File "/home/wbaranowski/miniconda3/envs/pytorch-cuda-dev/lib/python3.6/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/home/wbaranowski/miniconda3/envs/pytorch-cuda-dev/lib/python3.6/multiprocessing/reduction.py", line 184, in recv_handle
    return recvfds(s, 1)[0]
  File "/home/wbaranowski/miniconda3/envs/pytorch-cuda-dev/lib/python3.6/multiprocessing/reduction.py", line 162, in recvfds
    len(ancdata))
RuntimeError: received 0 items of ancdata

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/wbaranowski/git/Quansight/pytorch/torch/utils/data/dataloader.py", line 787, in _try_get_data
    fs = [tempfile.NamedTemporaryFile() for i in range(10)]
  File "/home/wbaranowski/git/Quansight/pytorch/torch/utils/data/dataloader.py", line 787, in <listcomp>
    fs = [tempfile.NamedTemporaryFile() for i in range(10)]
  File "/home/wbaranowski/miniconda3/envs/pytorch-cuda-dev/lib/python3.6/tempfile.py", line 551, in NamedTemporaryFile
    (fd, name) = _mkstemp_inner(dir, prefix, suffix, flags, output_type)
  File "/home/wbaranowski/miniconda3/envs/pytorch-cuda-dev/lib/python3.6/tempfile.py", line 262, in _mkstemp_inner
    fd = _os.open(file, flags, 0o600)
OSError: [Errno 24] Too many open files: '/tmp/tmpnx_f6v_f'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test_shm_leak.py", line 56, in <module>
    worker_init_fn=worker_init_fn
  File "/home/wbaranowski/git/Quansight/pytorch/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/wbaranowski/git/Quansight/pytorch/torch/utils/data/dataloader.py", line 861, in _next_data
    idx, data = self._get_data()
  File "/home/wbaranowski/git/Quansight/pytorch/torch/utils/data/dataloader.py", line 828, in _get_data
    success, data = self._try_get_data()
  File "/home/wbaranowski/git/Quansight/pytorch/torch/utils/data/dataloader.py", line 791, in _try_get_data
    "Too many open files. Communication with the"
RuntimeError: Too many open files. Communication with the workers is no longer possible. Please increase the limit using `ulimit -n` in the shell or change the sharing strategy by calling `torch.multiprocessing.set_sharing_strategy('file_system')` at the beginning of your code
```
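
The second traceback above shows the diagnosis step: after the `ancdata` error, the loader tries to open a few temporary files and treats `Errno 24` as proof that the descriptor limit is exhausted. A minimal sketch of that probe, with `fd_limit_reached` as a hypothetical helper name and the probe count of 10 taken from the traceback:

```python
import errno
import tempfile


def fd_limit_reached(probe_count=10):
    """Return True if opening a few temp files fails with EMFILE (too many open files)."""
    opened = []
    try:
        for _ in range(probe_count):
            opened.append(tempfile.NamedTemporaryFile())
    except OSError as exc:
        if exc.errno == errno.EMFILE:
            return True
        raise  # some other I/O problem; do not misreport it
    finally:
        # Always release whatever the probe managed to open.
        for handle in opened:
            handle.close()
    return False


limit_hit = fd_limit_reached()
```

A caller could raise a more descriptive `RuntimeError` when the probe returns `True`, which is the behavior the final traceback above demonstrates.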
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34768

Differential Revision: D20538053

Pulled By: ezyang

fbshipit-source-id: be4425cf2fa02aff61619b2b829c153cb1a867cb
2020-04-14 07:10:57 -07:00
Hong Xu
817e4f9ef1 Correct a ValueError in dataloader to TypeError (#36244)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36244

Differential Revision: D20963949

Pulled By: ezyang

fbshipit-source-id: 8c6aa4831021788052269e7aa8282d11eba4e085
2020-04-10 09:03:58 -07:00
Tongzhou Wang
4ef854b4b4 Fix potential hang when exiting main process (#33721)
Summary:
The following script reproduces the hang
```py
import multiprocessing, logging
logger = multiprocessing.log_to_stderr()
logger.setLevel(multiprocessing.SUBDEBUG)

import torch

class Dataset:
    def __len__(self):
        return 23425

    def __getitem__(self, idx):
        return torch.randn(3, 128, 128), idx % 100

ds = Dataset()
trdl = torch.utils.data.DataLoader(ds, batch_size=64, num_workers=300, pin_memory=True, shuffle=True)

for e in range(1000):
    for ii, (x, y) in enumerate(trdl):
        print(f'tr {e: 5d} {ii: 5d} avg y={y.mean(dtype=torch.double).item()}')
        if ii % 2 == 0:
            print("="*200 + "BEFORE ERROR" + "="*200)
            1/0
```

The process hangs when joining the putting thread of `data_queue` in the **main process**. The root cause is that too many items are put in the queue from the **worker processes**, so the `put` at 062ac6b472/torch/utils/data/dataloader.py (L928) blocks in a background thread. The `pin_memory_thread` exits because `pin_memory_thread_done_event` is set, without ever getting the `(None, None)`. Hence, the main process needs the same treatment as the workers received at
062ac6b472/torch/utils/data/_utils/worker.py (L198) .

After the patch, the script finishes correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33721

Differential Revision: D20089209

Pulled By: ezyang

fbshipit-source-id: e73fbfdd7631afe1ce5e1edd05dbdeb7b85ba961
2020-02-25 07:04:41 -08:00
Tongzhou Wang
c37de32b23 Enable len(dataloader) for iterable dataset (#23587)
Summary:
Copy-paste comment from code for reasoning:

```
            # NOTE [ IterableDataset and __len__ ]
            #
            # For `IterableDataset`, `__len__` could be inaccurate when one naively
            # does multi-processing data loading, since the samples will be duplicated.
            # However, no real use case should be actually using that behavior, so
            # it should count as a user error. We should generally trust user
            # code to do the proper thing (e.g., configure each replica differently
            # in `__iter__`), and give us the correct `__len__` if they choose to
            # implement it (this will still throw if the dataset does not implement
            # a `__len__`).
            #
            # To provide a further warning, we track if `__len__` was called on the
            # `DataLoader`, save the returned value in `self._len_called`, and warn
            # if the iterator ends up yielding more than this number of samples.
```
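
The warning described in the note can be sketched without PyTorch. The generator below is an illustration under stated assumptions: `iterate_with_length_check` is a hypothetical name, and the duplicated list simulates two workers that each replay an unsharded stream.

```python
import warnings


def iterate_with_length_check(iterable, declared_len):
    """Yield items, warning once if more than `declared_len` items are produced."""
    for count, item in enumerate(iterable, start=1):
        if count == declared_len + 1:
            warnings.warn(
                "Length was reported as {} but at least {} samples were yielded; "
                "each worker may be replaying the full stream.".format(declared_len, count)
            )
        yield item


# Two hypothetical workers that both iterate the same 3-item stream, unsharded.
duplicated = [0, 1, 2] + [0, 1, 2]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    seen = list(iterate_with_length_check(duplicated, declared_len=3))
```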

Fixes https://github.com/pytorch/pytorch/issues/30184
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23587

Differential Revision: D18852625

Pulled By: ailzhang

fbshipit-source-id: aea8d4d70c7f21aaa69b35908a6f43026493d826
2019-12-06 15:38:05 -08:00
Nathan Goldbaum
f522bde121 Replace references to _DataLoaderIter with _BaseDataLoaderIter (#27105)
Summary:
Back in April, malmaud added type annotations for `dataloader.py`. At about the same time, SsnL in https://github.com/pytorch/pytorch/issues/19228 replaced `_DataLoaderIter` with `_BaseDataLoaderIter` and two subclasses, `_SingleProcessDataLoaderIter` and `_MultiProcessingDataLoaderIter`. Probably because these changes happened in parallel, the type stubs and several other references in the codebase were never updated to match this refactoring.

I've gone ahead and done the updates to reflect the refactoring in https://github.com/pytorch/pytorch/issues/19228, which fixes the specific type stub/implementation mismatch pointed out in https://github.com/pytorch/pytorch/issues/26673, although not the broader problem that pytorch doesn't have a test to make sure that the `.pyi` type stub files match the real API defined in `.py` files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27105

Differential Revision: D17813641

Pulled By: ezyang

fbshipit-source-id: ed7ac025c8d6ad3f298dd073347ec83bb4b6600c
2019-10-08 12:09:02 -07:00
Michael Kuchnik
e5d9a5e5be Fix typo in docs.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26263

Differential Revision: D17397190

Pulled By: ezyang

fbshipit-source-id: 62e3c4c3021c728a3314262528579676d605a81e
2019-09-17 07:46:49 -07:00
SsnL
df9d8f9032 Fix no auto batching bugs: cannot bulk load; not work with namedtuple (#26065)
Summary:
see title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26065

Differential Revision: D17392851

Pulled By: soumith

fbshipit-source-id: 468cd41c8e03d689ff2e0261d948e28daad6bfaf
2019-09-16 07:22:31 -07:00
Tongzhou Wang
928754b67d make more iterator attributes private (#23744)
Summary:
1. Prefixed underscores to any `DataLoaderIter` attribute that is not part of the data loader ctor argument list.
2. Prefixed `DataLoader.dataset_kind` with underscore because it only makes sense with the private enum `_DatasetKind`, and is an implementation detail.
3. Disallow setting `DataLoader.dataset` and `DataLoader.batch_sampler` after initializing a `DataLoader` because they affect other attributes in `__init__`.

These changes should not have a major BC-breaking effect since the big changes are on the iterator class and most users don't even store it. I searched GitHub for `pin_memory_thread` and (while I didn't look through all result pages) the results I saw were forks of pytorch and blog posts on how the data loader works.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23744

Differential Revision: D16732507

Pulled By: ezyang

fbshipit-source-id: 9f04d000b4200b8047f31eaa3473780b66cebd26
2019-08-09 11:43:00 -07:00
SsnL
ed19580dc4 Fix dataloader._shutdown_workers if not all workers are started (#23761)
Summary:
Otherwise you may see errors like
```
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x000001F99F5CB9D8>
Traceback (most recent call last):
  File "C:\Users\Divyansh J\Anaconda3\envs\pytorch\lib\site-packages\torch\utils\data\dataloader.py", line 883, in __del__
    self._shutdown_workers()
  File "C:\Users\Divyansh J\Anaconda3\envs\pytorch\lib\site-packages\torch\utils\data\dataloader.py", line 860, in _shutdown_workers
    if self.workers_status[worker_id]:
IndexError: list index out of range
```

e.g. https://discuss.pytorch.org/t/how-to-construct-dataset-with-iterator-for-multi-process-dataloader/49612/5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23761

Differential Revision: D16644687

Pulled By: soumith

fbshipit-source-id: a60e847431264525079456ff422317af1ac2be4b
2019-08-07 09:06:11 -07:00
Tongzhou Wang
0539462ca2 Fix pin_memory_thread not exiting quickly (#23646)
Summary:
fixes https://github.com/pytorch/pytorch/issues/23642
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23646

Differential Revision: D16600874

Pulled By: soumith

fbshipit-source-id: 50f0828d774a558d6f21e9dd21135906bd5be128
2019-08-01 15:24:14 -07:00
SsnL
e982e46de3 Add multiprocessing_context= argument to DataLoader (#22990)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/22131
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22990

Differential Revision: D16539052

Pulled By: colesbury

fbshipit-source-id: b1c48ae2fb54065dd96a67be263254129e02eaa2
2019-07-29 12:58:40 -07:00
Jan Schlüter
0bc90194fb Catch and print exception traceback in parallel_apply() workers (#18055)
Summary:
When an exception occurs in one of the modules passed to `parallel_apply()`, it is caught and re-raised in the main thread. This preserves the original exception type and message, but has the traceback point at the position where it's re-raised, rather than the original point of failure.

This PR saves the exception information required to generate the traceback, and includes the original traceback in the message of the exception raised in the main thread.

Before:
```
  ...
  File ".../torch/nn/parallel/data_parallel.py", line 153, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File ".../torch/nn/parallel/parallel_apply.py", line 84, in parallel_apply
    raise output
RuntimeError: expected type torch.FloatTensor but got torch.cuda.FloatTensor
```

After:
```
  ...
  File ".../torch/nn/parallel/data_parallel.py", line 153, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File ".../torch/nn/parallel/parallel_apply.py", line 88, in parallel_apply
    ''.join(traceback.format_exception(*exc_info)))
RuntimeError: Caught exception in replica 0. Original traceback and message:
Traceback (most recent call last):
  ...
  File "../models/foo.py", line 319, in bar
    baz = asdf / ghij[:, np.newaxis]
RuntimeError: expected type torch.FloatTensor but got torch.cuda.FloatTensor
```

I took care to raise an exception of the original type (in case the main code checks for that), but replaced the message. It helped me find a bug that did not occur outside `data_parallel()`.
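
The idea can be sketched with plain threads. This is not the `parallel_apply()` source: `apply_in_threads` and `_Failure` are hypothetical names, and re-raising via `exc_type(message)` assumes the exception type accepts a single string argument.

```python
import sys
import threading
import traceback


class _Failure:
    """Holds the exception type and the formatted original traceback."""

    def __init__(self, exc_info):
        self.exc_type = exc_info[0]
        self.formatted = "".join(traceback.format_exception(*exc_info))


def apply_in_threads(fns):
    results = [None] * len(fns)

    def _worker(i, fn):
        try:
            results[i] = fn()
        except Exception:
            # Capture the traceback where the failure actually happened.
            results[i] = _Failure(sys.exc_info())

    threads = [threading.Thread(target=_worker, args=(i, fn)) for i, fn in enumerate(fns)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    for i, out in enumerate(results):
        if isinstance(out, _Failure):
            # Same exception type, but the message carries the original traceback.
            raise out.exc_type(
                "Caught exception in replica {}. Original traceback and message:\n{}".format(
                    i, out.formatted
                )
            )
    return results


def failing_replica():
    raise ValueError("expected float but got int")


try:
    apply_in_threads([lambda: 1, failing_replica])
except ValueError as exc:
    message = str(exc)
```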
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18055

Differential Revision: D16444972

Pulled By: zhangguanheng66

fbshipit-source-id: ec436c9d4677fad18106a8046cfa835a20a101ce
2019-07-26 11:41:22 -07:00
Tongzhou Wang
e4b75c6580 Fix typo in dataloader.py
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/23132

Differential Revision: D16402759

Pulled By: ezyang

fbshipit-source-id: 9500570f6b7492a67a2af853bfb63a5667e6b7b5
2019-07-23 08:45:47 -07:00
Arul
43d36415b9 torch.utils.data.Dataloader: documentation about RNG state consumption (#22540)
Summary:
the outcome from the pytorch forum issue: https://discuss.pytorch.org/t/dataloader-problem-problem-arises-when-shuffle-true/45631

The discussion is here: https://github.com/pytorch/pytorch/pull/20749
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22540

Differential Revision: D16131777

Pulled By: ezyang

fbshipit-source-id: 566deda1b44dc7fae54250e9b508d120851a2848
2019-07-08 08:22:04 -07:00
Tongzhou Wang
058beae411 Add IterableDataset (#19228)
Summary:
This is a modified version of https://github.com/pytorch/pytorch/pull/14705 since commit structure for that PR is quite messy.

1. Add `IterableDataset`.
2. So we have 2 data loader modes: `Iterable` and `Map`.

    1. `Iterable` if the `dataset` is an instance of `IterableDataset`
    2. `Map` otherwise.

3. Add better support for non-batch loading (i.e., `batch_size=None` and `batch_sampler=None`). This is useful in doing things like bulk loading.
4. Refactor `DataLoaderIter` into two classes, `_SingleProcessDataLoaderIter` and `_MultiProcessingDataLoaderIter`. Rename some methods to be more generic, e.g., `get_batch` -> `get_data`.
5. Add `torch.utils.data.get_worker_info` which returns worker information in a worker proc (e.g., worker id, dataset obj copy, etc.) and can be used in `IterableDataset.__iter__` and `worker_init_fn` to do per-worker configuration.
6. Add `ChainDataset`, which is the analog of `ConcatDataset` for `IterableDataset`.
7. Import `torch.utils.data` in `torch/__init__.py`
8. Data loader examples and documentation
9. Use `get_worker_info` to detect whether we are in a worker process in `default_collate`
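
The per-worker configuration mentioned in the list above usually amounts to giving each worker a disjoint slice of the stream. Here is a minimal sketch of that arithmetic, assuming a dataset that iterates over an integer range; `worker_shard` is a hypothetical helper, and in a real `IterableDataset.__iter__` the worker id and worker count would come from `get_worker_info()`.

```python
import math


def worker_shard(start, end, worker_id, num_workers):
    """Return the sub-range of [start, end) that one worker should iterate."""
    per_worker = math.ceil((end - start) / num_workers)
    shard_start = start + worker_id * per_worker
    shard_end = min(shard_start + per_worker, end)
    return range(shard_start, shard_end)  # empty if there are more workers than items


# Three hypothetical workers splitting the range [3, 10).
shards = [list(worker_shard(3, 10, worker_id, 3)) for worker_id in range(3)]
```

Without such a split, every worker would yield the full stream and samples would be duplicated.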

Closes https://github.com/pytorch/pytorch/issues/17909, https://github.com/pytorch/pytorch/issues/18096, https://github.com/pytorch/pytorch/issues/19946, and some of https://github.com/pytorch/pytorch/issues/13023
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19228

Reviewed By: bddppq

Differential Revision: D15058152

fbshipit-source-id: 9e081a901a071d7e4502b88054a34b450ab5ddde
2019-06-20 20:12:44 -07:00
jpgard
0556141339 fix small typo muliprocessing -> multiprocessing (#20998)
Summary:
Minor typo fix in docstring.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20998

Differential Revision: D15514698

Pulled By: soumith

fbshipit-source-id: a9ceb557251ff5868e810331195243b6a8717851
2019-05-27 21:36:13 -07:00
Dmytro Dzhulgakov
c25e33789e Lightweight at-most-once logging for API usage (#20745)
Summary:
Resubmit #20698 which got messed up.

The idea is that when PyTorch is used in a custom build environment (e.g. Facebook), it's useful to track usage of various APIs centrally. This PR introduces a simple, very lightweight mechanism to do so: only the first invocation of a trigger point is logged. This is significantly more lightweight than #18235, so we can afford to put logging in e.g. TensorImpl.

Also adds an initial list of trigger points. Trigger points are added in such a way that no static initialization triggers them, i.e. just linking with libtorch.so will not cause any logging. Further suggestions of what to log are welcomed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20745

Differential Revision: D15429196

Pulled By: dzhulgakov

fbshipit-source-id: a5e41a709a65b7ebccc6b95f93854e583cf20aca
2019-05-23 23:17:59 -07:00
Edward Z. Yang
9b1dbffba5 Re-sync with internal repository (#20702)
2019-05-20 09:22:57 -04:00
Dmytro Dzhulgakov
d3059b9c49 Lightweight logging for once-only API usage
2019-05-19 23:04:40 -07:00
Michael Antonov
698103cdd6 DataLoader docs update to describe how workers are managed, including Windows. (#18091)
Summary:
It's been hard to understand how workers are launched and what code runs in the worker vs. main process, especially on Windows, which leads to many of our samples failing. This explains when workers run and how to make code work on Windows as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18091

Differential Revision: D15083766

Pulled By: soumith

fbshipit-source-id: 8a7e60defc8a72ec63874f657d7d5267d951dccf
2019-04-26 16:01:30 -07:00
SsnL
5e62ee2b97 Fix no SIGCHLD checking in DataLoaderIter._shutdown_workers (#19421)
Summary:
Also

1. Bump multiprocessing test timeout following python core tests
2. Fix one type of flakiness in `test_proper_exit`.
3. Add trace reporting when loader process hangs in `test_proper_exit` using `faulthandler`.
4. Give `test_proper_exit` another try.

I'll heavily retest this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19421

Differential Revision: D15063728

Pulled By: ezyang

fbshipit-source-id: 4e0d992622e11053c44a9ec237b88b9a28a4472c
2019-04-24 08:06:58 -07:00
Stas Bekman
c0a2452ffe multiline KeyError msg python bug workaround (#18557)
Summary:
make multiline KeyError msg readable by working around a python bug https://bugs.python.org/issue2651
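
The underlying Python behavior is that `KeyError` displays the `repr` of its argument, so newlines show up as escaped `\n`. One way to work around it, sketched here under the assumption that the message is wrapped in a `str` subclass whose `repr` is the raw text (the class name `KeyErrorMessage` is chosen for this illustration):

```python
class KeyErrorMessage(str):
    """A str whose repr is the text itself, so KeyError prints it verbatim."""

    def __repr__(self):
        return str(self)


text = "Caught KeyError in worker.\nOriginal traceback follows."

plain = str(KeyError(text))                      # repr-escaped, on a single line
readable = str(KeyError(KeyErrorMessage(text)))  # real newlines preserved
```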

discussion: https://github.com/pytorch/pytorch/issues/16647
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18557

Differential Revision: D14681086

Pulled By: soumith

fbshipit-source-id: acbd13a823302c854c3d364028ed414fd8ce6bc8
2019-03-29 07:04:20 -07:00
Daniel
e5742494f6 Minor typo
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/16980

Differential Revision: D14033686

Pulled By: gchanan

fbshipit-source-id: 9f7967defc6795640e14157d0b701b185061741f
2019-02-12 08:02:04 -08:00
Michael Carilli
0742874643 Allow dataloader to accept a custom memory pinning function (#16743)
Summary:
Renewed attempt at https://github.com/pytorch/pytorch/pull/14171

From the original PR:
> Currently, the pin_memory_batch function in the dataloader will return a batch comprised of any unrecognized type without pinning the data, because it doesn't know how.
>
>This behavior was preventing us from overlapping data prefetching in Mask-RCNN, whose custom collate_fn returns a custom batch type.

The old PR allowed the user to implement batch pinning for custom batch and data types by passing a custom pin function to the dataloader.  slayton58 suggested a cleaner approach:  allow the user to define a `pin_memory` method on their custom types, and have `pin_memory_batch` [check for the presence of that method](https://github.com/pytorch/pytorch/pull/16743/files#diff-9f154cbd884fe654066b1621fad654f3R56) in the incoming batch as a fallback.  I've updated the test and docstrings accordingly.
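
The fallback can be sketched without PyTorch. The code below is an illustration, not the `pin_memory_batch` source: `pin_batch` and `PairBatch` are hypothetical names, and the `pin_memory` method here only sets a flag where a real one would pin the tensors it holds.

```python
from collections.abc import Mapping, Sequence


def pin_batch(batch):
    """Recurse through containers; fall back to a user-defined pin_memory() method."""
    if isinstance(batch, (str, bytes)):
        return batch
    if isinstance(batch, Mapping):
        return {key: pin_batch(value) for key, value in batch.items()}
    if isinstance(batch, Sequence):
        return [pin_batch(item) for item in batch]
    if hasattr(batch, "pin_memory"):
        # Custom batch type: let it decide how to pin its own contents.
        return batch.pin_memory()
    return batch  # unrecognized type: returned unchanged


class PairBatch:
    """Hypothetical custom batch type returned by a custom collate function."""

    def __init__(self, inputs, targets):
        self.inputs = inputs
        self.targets = targets
        self.pinned = False

    def pin_memory(self):
        self.pinned = True  # placeholder for pinning the underlying tensors
        return self


result = pin_batch({"pair": PairBatch([1, 2], [0, 1]), "ids": [7, 8]})
```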

The old PR was merged but then reverted due to weird cuda OOM errors on windows that may or may not have been related.  I have no idea why my changes would cause such errors (then or now) but it's something to keep an eye out for.

cc fmassa and yf225, who were my POCs on the old PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16743

Differential Revision: D13991745

Pulled By: ezyang

fbshipit-source-id: 74e71f62a03be453b4caa9f5524e9bc53467fa17
2019-02-10 19:37:53 -08:00
SsnL
4aae89fa7b Make test_proper_exit more robust (#16249)
Summary:
1. Improve error message for better debugging info
2. Increase timeout
3. Also apply the windows worker failure detection mechanism on non-Windows platforms, for better robustness

Attempt to fix #14501

cc ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16249

Differential Revision: D13784702

Pulled By: ezyang

fbshipit-source-id: 09a7cff83ab9edce561ed69f9fb555ab35d1275f
2019-01-25 08:25:05 -08:00
SsnL
9b5ec2a076 Fix TestDataLoader.test_proper_exit (#15665)
Summary:
Currently, in `test_proper_exit`,
1. we do not kill the correct input `pid` in the `kill_pid` function
fe15d6a2c2/test/test_dataloader.py (L325-L329)
2. the Windows command that detects process status doesn't actually work
fe15d6a2c2/test/test_dataloader.py (L641-L646)
3. `worker_error` and `worker_kill` cases (sometimes?) are not tested because the workers may exit naturally due to the pre-fetching mechanism and a too small `dataset size / batch size`.

In this PR, I, in separate commits:
1. Install `psutil` (a python package specifically built for process monitoring) on some CI builds. (Linux builds installation are done in https://github.com/pietern/pytorch-dockerfiles/pull/29 https://github.com/pietern/pytorch-dockerfiles/pull/30  https://github.com/pytorch/ossci-job-dsl/pull/36 and https://github.com/pytorch/pytorch/pull/15795).
2. Rewrite `test_proper_exit` with `psutil` so we

    1. do not rely on the hacky `is_process_alive` fe15d6a2c2/test/test_dataloader.py (L640-L653)
    2. increase the number of tasks per worker so `worker_error` and `worker_kill` properly trigger
    3. test error message content to ensure that the loader exits with the correct message corresponding to each exiting scenario.

3. Fix Windows data loader not having any mechanism to detect worker failures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15665

Differential Revision: D13615527

Pulled By: soumith

fbshipit-source-id: cfb2f67837d2d87928a53f00b4d20f09754b7949
2019-01-10 08:47:27 -08:00
SsnL
9217bde807 Refactor dataloader.py (#15331)
Summary:
Same as #14668, and was approved there.

ailzhang , please apply this patch to Horizon's `data_streamer.py`: https://gist.github.com/SsnL/020fdb3d6b7016d81b6ba1d04cc41459 Thank you!

Below is the original description at #14668:

As I am working on tasks in https://github.com/pytorch/pytorch/issues/13023, I realized how unreadable the code is because all functions to be run in multiprocessing must be at top global level. Adding more functionalities to `dataloader.py` will only make things worse.

So in this PR, I refactor `dataloader.py` and move much of it into `data._utils`. E.g., the `_worker_loop` and related methods are now in `data._utils.worker`, signal handling code in `data._utils.signal_handling`, collating code in `data._utils.collate`, etc. This split, IMHO, makes code much clearer. I will base my future changes to DataLoader on top of this.

No functionality is changed, except that  I added `torch._six.queue`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15331

Reviewed By: yf225

Differential Revision: D13503120

Pulled By: ailzhang

fbshipit-source-id: 94df16b4d80ad1102c437cde0d5a2e62cffe1f8e
2018-12-19 12:36:03 -08:00
Derek Kim
656b565a0f Trivial comment correction in dataloader (#15276)
Summary:
Trivial comment correction in dataloader
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15276

Differential Revision: D13477324

Pulled By: soumith

fbshipit-source-id: 2a74a014999655d129311d611f2a09411339cb13
2018-12-15 10:59:00 -08:00
Ailing Zhang
38eb1beff5 Revert D13289919: [pytorch][PR] [DataLoader] Refactor dataloader.py
Differential Revision:
D13289919

Original commit changeset: d701bc7bb48f

fbshipit-source-id: c350c491fefa98a0a7c0cf22cb832e78aeb15c3d
2018-12-04 20:25:16 -08:00
SsnL
16558a1e9d Refactor dataloader.py (#14668)
Summary:
As I am working on tasks in https://github.com/pytorch/pytorch/issues/13023, I realized how unreadable the code is because all functions to be run in multiprocessing must be at top global level. Adding more functionalities to `dataloader.py` will only make things worse.

So in this PR, I refactor `dataloader.py` and move much of it into `data._utils`. E.g., the `_worker_loop` and related methods are now in `data._utils.worker`, signal handling code in `data._utils.signal_handling`, collating code in `data._utils.collate`, etc. This split, IMHO, makes code much clearer. I will base my future changes to DataLoader on top of this.

No functionality is changed, except that  I added `torch._six.queue`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14668

Reviewed By: soumith

Differential Revision: D13289919

Pulled By: ailzhang

fbshipit-source-id: d701bc7bb48f5dd7b163b5be941a9d27eb277a4c
2018-12-04 09:53:41 -08:00
Will Feng
5918de8e84 Revert D13166669: [pytorch][PR] Allow dataloader to accept a custom memory pinning function
Differential Revision:
D13166669

Original commit changeset: ca965f9841d4

fbshipit-source-id: 0836b4f50f73ba01c97491a719660f02e36f20ad
2018-11-26 14:55:04 -08:00
Michael Carilli
7557a993ab Allow dataloader to accept a custom memory pinning function (#14171)
Summary:
Currently, the `pin_memory_batch` function in the dataloader will return a batch comprised of any unrecognized type without pinning the data, because it doesn't know how.

This behavior was preventing us from overlapping data prefetching in Mask-RCNN, whose custom `collate_fn` returns a custom batch type.

The present PR adds the ability for the user to pass a `pin_fn` alongside any custom `collate_fn` to handle such custom types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14171

Differential Revision: D13166669

Pulled By: soumith

fbshipit-source-id: ca965f9841d4a259b3ca4413c8bd0d8743d433ab
2018-11-23 08:12:43 -08:00
Tongzhou Wang
034c969f3c Simply exit DataLoader when Python is dying (#12700)
Summary:
I struggled with yet another DataLoader hang for the entire evening. After numerous experiments, I realized that it is unsafe to do anything when Python is shutting down. We also unfortunately implement our DataLoader clean-up logic in `__del__`, a function that may or may not be called during shutdown, and if called, may or may not be called before core library resources are freed.

Fortunately, we are already setting all our workers and pin_memory_thread as daemonic. So in case of Python shutting down, we can just do a no-op in `__del__` and rely on the automatic termination of daemonic children.

An `atexit` hook is used to detect Python exit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12700

Differential Revision: D10419027

Pulled By: SsnL

fbshipit-source-id: 5753e70d03e69eb1c9ec4ae2154252d51e2f79b0
2018-10-16 22:05:33 -07:00
Tongzhou Wang
11c31aef04 Prevent hanging in data loader altogether
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/11985

Differential Revision: D10202374

Pulled By: SsnL

fbshipit-source-id: 1ab1a07185f78a104f9b05930a87ef5a32f431e4
2018-10-09 09:54:19 -07:00
Tongzhou Wang
c30790797f Minor data loader doc improvements
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/11821

Differential Revision: D9948292

Pulled By: SsnL

fbshipit-source-id: 01c21c129423c0f7844b403e665a8fe021a9c820
2018-09-19 15:33:25 -07:00
Tongzhou Wang
8e76dcf173 Prevent raising KeyboardInterrupt in worker (#11718)
Summary:
The current behavior is that each process (main and workers) prints a trace from `KeyboardInterrupt`, and the main process also prints
```
RuntimeError: DataLoader worker (pid 46045) exited unexpectedly with exit code 1. Details are lost due to multiprocessing. Rerunning with num_workers=0 may give better error trace.
```
due to our SIGCHLD handler.
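
One way to get this effect, sketched under the assumption that the worker simply swallows the interrupt and exits cleanly; `worker_loop` and `handle` are hypothetical names, and the real worker loop does considerably more.

```python
def worker_loop(index_queue, handle):
    """Process items until a None sentinel arrives; exit quietly on Ctrl-C."""
    try:
        while True:
            item = index_queue.get()
            if item is None:
                break
            handle(item)
    except KeyboardInterrupt:
        # The main process receives the same SIGINT and reports it, so the
        # worker stops here without printing its own traceback.
        pass
```

Because the worker returns normally instead of dying with a traceback, it exits with code 0 and the main process has no unexpected worker exit to report.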
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11718

Differential Revision: D9840844

Pulled By: SsnL

fbshipit-source-id: 1a05060bb02907fef5aac3f274d2c84f9f42d187
2018-09-14 16:09:35 -07:00
Jeff Smith
05e06f7de2 migrating deprecated calls without abc module for containers (#11515)
Summary:
Implementing #10540.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11515

Reviewed By: apaszke

Differential Revision: D9771045

Pulled By: jeffreyksmithjr

fbshipit-source-id: 85ea39abaa9b465805a969f122b626b11fc85ef6
2018-09-13 15:09:22 -07:00
Tongzhou Wang
57f149a861 Only join pin_memory_thread after it started (#11599)
Summary:
Same reason as in #11432.

Example error:
```
Exception ignored in: <function _DataLoaderIter.__del__ at 0x7fa06963cf28>
Traceback (most recent call last):
  File "/private/home/ssnl/miniconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 405, in __del__
    self._shutdown_workers()
  File "/private/home/ssnl/miniconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 401, in _shutdown_workers
    self.pin_memory_thread.join()
AttributeError: '_DataLoaderIter' object has no attribute 'pin_memory_thread'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11599

Differential Revision: D9801143

Pulled By: SsnL

fbshipit-source-id: 520590a21f56fa381fcac621457a7544d3fba47e
2018-09-13 09:40:49 -07:00
Tongzhou Wang
560d6efd3a Only join started dataloader workers (#11432)
Summary:
`Process.start()` actually takes some time, as it needs to start a
process and pass the arguments over via a pipe. Therefore, we
only add a worker to the `self.workers` list after it has started.
Otherwise, if the program dies before the worker starts,
`__del__` tries to `.join()` it and gets:
    AssertionError: can only join a started process.

Example trace when such error happens:
```py
[unrelated]
  File "/private/home/ssnl/miniconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 500, in __iter__
    return _DataLoaderIter(self)
  File "/private/home/ssnl/miniconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 292, in __init__
    w.start()
  File "/private/home/ssnl/miniconda3/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/private/home/ssnl/miniconda3/lib/python3.7/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/private/home/ssnl/miniconda3/lib/python3.7/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/private/home/ssnl/miniconda3/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/private/home/ssnl/miniconda3/lib/python3.7/multiprocessing/popen_fork.py", line 70, in _launch
    self.pid = os.fork()
KeyboardInterrupt
Exception ignored in: <function _DataLoaderIter.__del__ at 0x7fa704d5aa60>
Traceback (most recent call last):
  File "/private/home/ssnl/miniconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 398, in __del__
    self._shutdown_workers()
  File "/private/home/ssnl/miniconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 392, in _shutdown_workers
    w.join()
  File "/private/home/ssnl/miniconda3/lib/python3.7/multiprocessing/process.py", line 139, in join
    assert self._popen is not None, 'can only join a started process'
AssertionError: can only join a started process
```

No test is included because this is hard to trigger reliably.
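
A minimal sketch of the ordering fix, assuming a small hypothetical `WorkerPool` wrapper rather than the actual `_DataLoaderIter` code: a worker is recorded only after `start()` returns, so cleanup never joins a process that never started.

```python
class WorkerPool:
    def __init__(self):
        # Invariant: holds only processes whose start() has returned.
        self.workers = []

    def launch(self, process):
        # `process` is expected to behave like multiprocessing.Process.
        process.daemon = True
        process.start()  # may be interrupted, e.g. by KeyboardInterrupt
        # Record the worker only once it has really started.
        self.workers.append(process)

    def shutdown(self):
        for w in self.workers:
            w.join()
```

If `start()` raises, the append is skipped, and a later `shutdown()` joins only the workers that are actually running.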
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11432

Reviewed By: ezyang

Differential Revision: D9735430

Pulled By: SsnL

fbshipit-source-id: a8912d9bb4063f210d6236267b178173810e2351
2018-09-09 12:55:51 -07:00
Tongzhou Wang
04f381650e Resubmit: Fix dataloader hang when it is not completely iterated (#10366)
Summary:
https://github.com/pytorch/pytorch/pull/9655
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10366

Differential Revision: D9237393

Pulled By: SsnL

fbshipit-source-id: fabfad7f371ba33300098f6b885c0e3f26c3e14a
2018-08-09 00:10:24 -07:00
Tongzhou Wang
a7f183f971 Revert "Fix dataloader hang when it is not completely iterated (#9655)" (#9804)
Summary:
This reverts commit 9ee5133651.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9804

Reviewed By: ezyang

Differential Revision: D8987780

Pulled By: SsnL

fbshipit-source-id: 75ad70b0b8d672d0b35235fa248b187be64b68e5
2018-07-25 10:10:30 -07:00