Commit Graph

87 Commits

Author SHA1 Message Date
Tongzhou Wang
c30790797f Minor data loader doc improvements
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/11821

Differential Revision: D9948292

Pulled By: SsnL

fbshipit-source-id: 01c21c129423c0f7844b403e665a8fe021a9c820
2018-09-19 15:33:25 -07:00
Tongzhou Wang
8e76dcf173 Prevent raising KeyboardInterrupt in worker (#11718)
Summary:
Current behavior is that each process (main and workers) will print a trace from `KeyboardInterrupt`, and the main process will also print
```
RuntimeError: DataLoader worker (pid 46045) exited unexpectedly with exit code 1. Details are lost due to multiprocessing. Rerunning with num_workers=0 may give better error trace.
```
due to our SIGCHLD handler.
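The gist of the fix, as a minimal sketch (illustrative names; `fetch` is a hypothetical stand-in for the dataset indexing + collate step, not the actual dataloader code):
```py
def fetch(index):
    # Hypothetical stand-in for dataset indexing + collate.
    return index

def _worker_loop(index_queue, data_queue):
    try:
        while True:
            r = index_queue.get()
            if r is None:  # shutdown sentinel from the main process
                break
            data_queue.put(fetch(r))
    except KeyboardInterrupt:
        # Swallow the interrupt in the worker; only the main process
        # should report it and drive shutdown.
        pass
```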
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11718

Differential Revision: D9840844

Pulled By: SsnL

fbshipit-source-id: 1a05060bb02907fef5aac3f274d2c84f9f42d187
2018-09-14 16:09:35 -07:00
Jeff Smith
05e06f7de2 migrating deprecated calls without abc module for containers (#11515)
Summary:
Implementing #10540.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11515

Reviewed By: apaszke

Differential Revision: D9771045

Pulled By: jeffreyksmithjr

fbshipit-source-id: 85ea39abaa9b465805a969f122b626b11fc85ef6
2018-09-13 15:09:22 -07:00
Tongzhou Wang
57f149a861 Only join pin_memory_thread after it started (#11599)
Summary:
Same reason as in #11432.

Example error:
```
Exception ignored in: <function _DataLoaderIter.__del__ at 0x7fa06963cf28>
Traceback (most recent call last):
  File "/private/home/ssnl/miniconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 405, in __del__
    self._shutdown_workers()
  File "/private/home/ssnl/miniconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 401, in _shutdown_workers
    self.pin_memory_thread.join()
AttributeError: '_DataLoaderIter' object has no attribute 'pin_memory_thread'
```
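A pared-down sketch of the guard (hypothetical class, showing only the joining logic; assumes `__del__` may run after a partially completed `__init__`):
```py
import threading

class _IterSketch(object):
    # Hypothetical, pared-down illustration of the guard.
    def __init__(self, pin_memory):
        if pin_memory:
            self.pin_memory_thread = threading.Thread(target=lambda: None)
            self.pin_memory_thread.start()

    def _shutdown_workers(self):
        # __init__ may have died before the thread was created, so only
        # join the thread if it actually exists and has started.
        if hasattr(self, 'pin_memory_thread'):
            self.pin_memory_thread.join()
```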
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11599

Differential Revision: D9801143

Pulled By: SsnL

fbshipit-source-id: 520590a21f56fa381fcac621457a7544d3fba47e
2018-09-13 09:40:49 -07:00
Teng Li
0988bbad2d C10d release to torch.distributed for PT1 (#11405)
Summary:
The old `torch.distributed` will go to `torch.distributed.deprecated`
The old DDP will go to `torch.nn.parallel.deprecated`

Now `torch.nn.parallel.DDP` will use c10d DDP
Now `torch.distributed` will use C10d frontend API
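For illustration, assuming only the module paths named above, migration looks roughly like:
```py
# Illustrative imports only; real initialization arguments depend on setup.
import torch.distributed as dist                        # now the c10d frontend
from torch.nn.parallel import DistributedDataParallel   # now the c10d DDP

# Code pinned to the old implementations switches to:
#   import torch.distributed.deprecated as dist
#   from torch.nn.parallel.deprecated import DistributedDataParallel
```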
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11405

Reviewed By: pietern

Differential Revision: D9733733

Pulled By: teng-li

fbshipit-source-id: d6a3f3e73f8d3a7fcb1f4baef53c78063b8cbb08
2018-09-10 23:27:22 -07:00
Tongzhou Wang
560d6efd3a Only join started dataloader workers (#11432)
Summary:
`Process.start()` actually takes some time, as it needs to start a
process and pass the arguments over via a pipe. Therefore, we only
add a worker to the self.workers list after it has started, so that
we do not call `.join()` on it if the program dies before the start
finishes and `__del__` tries to join it, which would raise:
    AssertionError: can only join a started process.
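The pattern, as a hedged sketch (illustrative helper, not the dataloader code itself):
```py
import multiprocessing

def start_workers(num_workers, worker_fn):
    # Append each worker only after start() returns, so cleanup code
    # never joins a process that was never started.
    workers = []
    for _ in range(num_workers):
        w = multiprocessing.Process(target=worker_fn)
        w.daemon = True
        w.start()          # this is where a KeyboardInterrupt can land
        workers.append(w)  # only reached once the worker has started
    return workers
```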

Example trace when such error happens:
```py
[unrelated]
  File "/private/home/ssnl/miniconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 500, in __iter__
    return _DataLoaderIter(self)
  File "/private/home/ssnl/miniconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 292, in __init__
    w.start()
  File "/private/home/ssnl/miniconda3/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/private/home/ssnl/miniconda3/lib/python3.7/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/private/home/ssnl/miniconda3/lib/python3.7/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/private/home/ssnl/miniconda3/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/private/home/ssnl/miniconda3/lib/python3.7/multiprocessing/popen_fork.py", line 70, in _launch
    self.pid = os.fork()
KeyboardInterrupt
Exception ignored in: <function _DataLoaderIter.__del__ at 0x7fa704d5aa60>
Traceback (most recent call last):
  File "/private/home/ssnl/miniconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 398, in __del__
    self._shutdown_workers()
  File "/private/home/ssnl/miniconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 392, in _shutdown_workers
    w.join()
  File "/private/home/ssnl/miniconda3/lib/python3.7/multiprocessing/process.py", line 139, in join
    assert self._popen is not None, 'can only join a started process'
AssertionError: can only join a started process
```

No test, because this is hard to trigger reliably.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11432

Reviewed By: ezyang

Differential Revision: D9735430

Pulled By: SsnL

fbshipit-source-id: a8912d9bb4063f210d6236267b178173810e2351
2018-09-09 12:55:51 -07:00
Wei Yang
7f9fd1cc26 allow RandomSampler to sample with replacement (#9911)
Summary:
fixes #7908
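With this change, a sampler can draw more indices than the dataset has elements, e.g.:
```py
import torch
from torch.utils.data import DataLoader, RandomSampler, TensorDataset

dataset = TensorDataset(torch.arange(10))
# Draw 100 indices with replacement instead of a single permutation.
sampler = RandomSampler(dataset, replacement=True, num_samples=100)
loader = DataLoader(dataset, sampler=sampler, batch_size=4)
```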
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9911

Reviewed By: yf225

Differential Revision: D9023223

Pulled By: weiyangfb

fbshipit-source-id: 68b199bef3940b7205d0fdad75e7c46e6fe65ba7
2018-08-28 10:52:25 -07:00
Chetter2
5ca2713a8b Fix performance of WeightedRandomSampler (#10636)
Summary:
Since https://github.com/pytorch/pytorch/pull/8958 was merged, the BatchSampler samples 0d tensors from WeightedRandomSampler instead of integers, which significantly reduces performance. This PR fixes it the same way that https://github.com/pytorch/pytorch/pull/10361 fixed DistributedSampler.
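A hedged sketch of the fix, mirroring the #10361 approach (illustrative class name, not the verbatim source):
```py
import torch

class _WeightedSamplerSketch(object):
    def __init__(self, weights, num_samples, replacement=True):
        self.weights = torch.as_tensor(weights, dtype=torch.double)
        self.num_samples = num_samples
        self.replacement = replacement

    def __iter__(self):
        samples = torch.multinomial(self.weights, self.num_samples,
                                    self.replacement)
        # .tolist() hands out plain Python ints instead of 0d tensors.
        return iter(samples.tolist())
```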
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10636

Differential Revision: D9423869

Pulled By: zou3519

fbshipit-source-id: f94da2d4cccf70e63beea6cfc3d1230b5610ae44
2018-08-22 13:15:48 -07:00
Tongzhou Wang
108b657159 Import DistributedSampler in utils/data/__init__ (#10671)
Summary:
There is no reason that users should need an extra import to use DistributedSampler.
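After this change:
```py
# Previously an extra import was needed:
#   from torch.utils.data.distributed import DistributedSampler
# Now the re-export suffices:
from torch.utils.data import DistributedSampler
```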
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10671

Differential Revision: D9395189

Pulled By: SsnL

fbshipit-source-id: 8f41d93813c8fb52fe012f76980c6a261a8db9b2
2018-08-19 16:55:13 -07:00
Alex Sergeev
18d2fcde7a Fix performance of DistributedSampler per #8958
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/10361

Differential Revision: D9240798

Pulled By: ezyang

fbshipit-source-id: dc4cfe79612f711bbcff34a147877df6a5f7b89f
2018-08-09 12:54:37 -07:00
Tongzhou Wang
04f381650e Resubmit: Fix dataloader hang when it is not completely iterated (#10366)
Summary:
https://github.com/pytorch/pytorch/pull/9655
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10366

Differential Revision: D9237393

Pulled By: SsnL

fbshipit-source-id: fabfad7f371ba33300098f6b885c0e3f26c3e14a
2018-08-09 00:10:24 -07:00
Tongzhou Wang
a7f183f971 Revert "Fix dataloader hang when it is not completely iterated (#9655)" (#9804)
Summary:
This reverts commit 9ee5133651.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9804

Reviewed By: ezyang

Differential Revision: D8987780

Pulled By: SsnL

fbshipit-source-id: 75ad70b0b8d672d0b35235fa248b187be64b68e5
2018-07-25 10:10:30 -07:00
Tongzhou Wang
9ee5133651 Fix dataloader hang when it is not completely iterated (#9655)
Summary:
second trial of https://github.com/pytorch/pytorch/pull/7140

cc csarofeen Let's see if this works. It passes everything locally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9655

Differential Revision: D8940177

Pulled By: SsnL

fbshipit-source-id: 8d6340fc9f7355c71e1e26b262da166402faa158
2018-07-22 20:38:27 -07:00
Dmitriy Serdyuk
ba8e133844 Refactor batch sampler (#8958)
Summary:
Fixes #8652, fixes #8957
Closes https://github.com/pytorch/pytorch/pull/8958

Reviewed By: ezyang

Differential Revision: D8668253

Pulled By: soumith

fbshipit-source-id: 663d461621511166f29cfcc902e6c2a71befa647
2018-06-27 16:06:47 -07:00
tvn
146b951ec5 Fix seeding random module in DataLoader (#7886)
* fix seeding random module

* make base seed int

* follow 0.4 idiom

* add a test for random seeding
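A minimal sketch of the seeding scheme (illustrative names; the per-worker derivation follows the base_seed + worker_id idiom the loader uses):
```py
import random
import torch

def _worker_init(worker_id, base_seed):
    # Derive a per-worker seed from an int base seed, and seed Python's
    # `random` module too, which the old code left unseeded.
    seed = base_seed + worker_id
    torch.manual_seed(seed)
    random.seed(seed)
```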
2018-05-29 15:55:04 -04:00
Gao, Xiang
d7c32df67f move Subset, random_split to data, use sequence at some places. (#7816) 2018-05-25 12:50:50 +02:00
Gao, Xiang
42e5e12750 make BatchSampler subclass of Sampler, and expose (#7707) 2018-05-19 21:29:03 +02:00
Maxim Berman
03767b66db Add FileNotFoundError to torch._six (#7524)
Add FileNotFoundError for compatibility with Python 2 and use in
dataloader. Fixes pytorch/pytorch#6932
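Roughly what such a shim looks like (a sketch, not the verbatim torch._six source):
```py
import sys

if sys.version_info[0] == 2:
    # Python 2 has no FileNotFoundError; IOError is the closest match.
    FileNotFoundError = IOError
else:
    import builtins
    FileNotFoundError = builtins.FileNotFoundError
```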
2018-05-12 20:54:26 -04:00
vfdev
6363faf184 Fix issue #7209 in DataLoader (#7265) 2018-05-04 10:51:46 +02:00
Thomas Viehmann
1b0ad8678b import *Sampler to utils.data (Better fix than #6982) (#7007) 2018-04-27 10:18:29 +02:00
Thomas Viehmann
5dc5a71d74 Improve error message (Sampler location) Fixes #6917 (#6982)
Thank you @ruotianluo for reporting!
2018-04-26 10:58:27 -04:00
Will Feng
e8bdbdaa27 Terminate dataloader workers properly when parent process is SIGKILL'ed (#6779)
Reopening #6606 with a fix for the TEST_CUDA import issue on Windows and an improvement to how we wait for manager exit in test_manager_unclean_exit. Loop-tested on the Windows CI multiple times to make sure this actually fixes the CUDA OOM issue.
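The POSIX half of the idea, as a hedged sketch (Windows uses WaitForSingleObject(), per the notes below):
```py
import os

class ManagerWatchdog(object):
    def __init__(self):
        self.manager_pid = os.getppid()

    def is_alive(self):
        # If the parent died, the worker is re-parented, so getppid()
        # no longer matches the remembered manager pid.
        return os.getppid() == self.manager_pid

# In the worker loop, checked only when index_queue.get() times out:
#     if not watchdog.is_alive():
#         break  # manager is gone; exit instead of hanging forever
```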

* Terminate dataloader workers properly when parent process is SIGKILL'ed

* Wait for worker processes to finish before shutting down manager process

* Add test for checking proper worker exit

* cosmetic change

* Test only if CUDA exists

* Don't call multiprocessing.set_start_method() in Python 2

* import TEST_CUDA only when we are in __main__

* Tune JOIN_TIMEOUT

* handle os.getppid() == 0 case

* Reset to original JOIN_TIMEOUT

* Use WaitForSingleObject() to check parent process status on Windows

* Fix TEST_CUDA import

* clean up

* Check main process only when index_queue.get() times out

* Change index_queues to multiprocessing.Queue

* Move manager checking logic to watchdog class

* Fix bugs in dataloader

* Fix TEST_CUDA import issue

* Don't import TEST_CUDA from common_nn

* Use event to signal manager exit in test

* fix lint

* Add comments
2018-04-22 23:03:54 -04:00
gchanan
4c5b95a433
Revert "Terminate dataloader workers properly when parent process is SIGKILL'ed (#6606)" (#6772)
This reverts commit 8d6a50aaeb.
2018-04-19 14:28:48 -04:00
Tongzhou Wang
072d49f787 Fix import error sometimes happening in dataloader when exiting Python (#6671)
* Fix import error sometimes happening in dataloader when exiting Python

* address comments
2018-04-19 06:56:39 -04:00
Will Feng
8d6a50aaeb
Terminate dataloader workers properly when parent process is SIGKILL'ed (#6606)
* Terminate dataloader workers properly when parent process is SIGKILL'ed

* Wait for worker processes to finish before shutting down manager process

* Add test for checking proper worker exit

* cosmetic change

* Test only if CUDA exists

* Don't call multiprocessing.set_start_method() in Python 2

* import TEST_CUDA only when we are in __main__

* Tune JOIN_TIMEOUT

* handle os.getppid() == 0 case

* Reset to original JOIN_TIMEOUT

* Use WaitForSingleObject() to check parent process status on Windows

* Fix TEST_CUDA import

* clean up

* Check main process only when index_queue.get() times out

* Change index_queues to multiprocessing.Queue

* Move manager checking logic to watchdog class

* Fix bugs in dataloader

* Fix TEST_CUDA import issue
2018-04-18 20:41:33 -04:00
Tongzhou Wang
1c01eabd3c
Codemod to update our codebase to 0.4 standard (#6641)
* Codemod to update our codebase to 0.4 standard

* Update some of the test scripts

* remove Variable in test_clip_grad_value

* fix _symbolic_override_wrapper_maker
2018-04-17 22:06:54 -04:00
Sanjeev Satheesh
f15f3ca1af Scope variables inside the dataloader (#6673)
* Scope variables inside the dataloader

This clears up the memory consumed by batches inside the dataloader. It's pretty useful for long-living data loaders.

* Update dataloader.py
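One way to realize the scoping, as a hedged sketch (illustrative names, not the PR's code): moving the loop body into a helper lets the batch reference die with the helper's frame each iteration instead of lingering for the lifetime of the loader.
```py
def _pin_memory_loop(in_queue, out_queue):
    while _transfer_one(in_queue, out_queue):
        pass

def _transfer_one(in_queue, out_queue):
    batch = in_queue.get()
    if batch is None:
        return False
    out_queue.put(batch)
    return True  # `batch` is freed as this frame is popped
```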
2018-04-17 17:48:12 -04:00
Tongzhou Wang
1f0b07cddc fix typos in sampler.py (#6525) 2018-04-11 17:27:25 -04:00
Tongzhou Wang
6b7ec95abb Link relevant FAQ section in DataLoader docs (#6476)
* Link FAQ section on workers returning same random numbers in DataLoader docs

* explicitly mention section names
2018-04-11 13:41:46 -04:00
Tongzhou Wang
efc91d8c6d Add arg checks in torch.utils.data.Sampler classes (#6249)
Fixes #6168

* add arg checks in torch.utils.data.Sampler

* add check for positive-ness
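A sketch of the kind of validation added (illustrative helper; exact messages differ):
```py
def _check_positive_int(name, value):
    # Reject bools explicitly: isinstance(True, int) is True in Python.
    if not isinstance(value, int) or isinstance(value, bool) or value <= 0:
        raise ValueError("{} should be a positive integer value, but got "
                         "{}={!r}".format(name, name, value))

_check_positive_int('num_samples', 100)   # ok
# _check_positive_int('num_samples', -1)  # raises ValueError
```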
2018-04-04 23:07:31 -04:00
Tongzhou Wang
60a16e5663 Set dataloader.batch_size = None when batch_sampler is given (#6108) 2018-03-30 10:01:09 +02:00
Jason Park
64e2c03bea Enable TensorDataset to get any number of tensors (#6038)
Keeping compatibility, this enables TensorDataset to take any number of tensors.

* Enable TensorDataset to get any number of tensors

* Update dataset.py

Fix syntax error on python 2.7

* Add several test for tensordataset

* Fix whitespaces

* Simplify args

* Update dataset.py
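A hedged sketch of the variadic form (close in spirit to the change, not the verbatim source): any number of tensors that share their first dimension.
```py
import torch
from torch.utils.data import Dataset

class _TensorDatasetSketch(Dataset):
    def __init__(self, *tensors):
        # All tensors must agree on the size of dimension 0.
        assert all(tensors[0].size(0) == t.size(0) for t in tensors)
        self.tensors = tensors

    def __getitem__(self, index):
        return tuple(t[index] for t in self.tensors)

    def __len__(self):
        return self.tensors[0].size(0)
```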
2018-03-28 11:20:50 -04:00
peterjc123
ae4362bc6a Fix memory leak when using multiple workers on Windows (#5585) 2018-03-28 10:35:28 +02:00
AlexanderRadionov
831780390c Fixed non-determinate preprocessing on DataLoader (#4640)
Added an ind_worker_queue parameter to data.DataLoader. It makes preprocessing deterministic.

DataLoader in multiprocessing mode may cause a non-determinism issue: even if the random seed is frozen, each subprocess may get tasks in an unstable order, because I/O time varies while data loads. If you use augmentation during data loading, this makes results unreproducible. See https://discuss.pytorch.org/t/deterministic-non-deterministic-results-with-pytorch/9087

To fix this issue, I have added an individual queue for each worker, so that each worker gets tasks in a stable order. As a result, each subprocess produces stable results.

To reproduce the issue, you may change ind_worker_queue to False and run the script several times.
Code to reproduce the issue is in the corresponding PR.
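The core of the scheme, as a hedged sketch (illustrative function, not the PR's code):
```py
def _assign_round_robin(batches, num_workers):
    # Batch i always goes to worker i % num_workers, so each worker sees
    # a stable task order regardless of how I/O timing varies across runs.
    queues = [[] for _ in range(num_workers)]
    for i, batch in enumerate(batches):
        queues[i % num_workers].append(batch)
    return queues
```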

* TestIndividualWorkerQueue added to DataLoader tests

* Review fixes

* "Simplify" code by removing itertools

* Rebase conflicts fix

* Review fixes

* Fixed shutdown behavior

* Removed ind_worker_queue flag.

* Rebase on master

* Disable tests that use DataLoader with multiple workers (#5322)
2018-03-23 17:43:59 -04:00
Tongzhou Wang
04461fa289 Prefix DataLoaderIter with underscore to discourage subclassing (#5619) 2018-03-08 11:09:51 +01:00
Will Feng
a90b695590 Disallow num_workers > 0 for DataLoader on Windows (#5591)
Using DataLoader with num_workers > 0 is known to cause a CUDA out-of-memory issue on Windows.

This issue has already been noted in #4092.
2018-03-07 10:21:03 -05:00
Tongzhou Wang
392fc8885c add faq on cuda memory management and dataloder (#5378) 2018-02-27 18:35:30 -05:00
Achal Dave
8327982904 Set python random seed in workers (#5415)
* Set python random seed in workers

* Import random
2018-02-27 03:16:10 -05:00
Tongzhou Wang
1ff537ca71 Ignore FileNotFoundError when shutting down in data_queue.get (#5380)
* Ignore FileNotFoundError when shutting down in data_queue.get

* Address @apaszke comments
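A minimal sketch of the idea (illustrative names): during shutdown, the queue's backing file can already be gone, so FileNotFoundError there is expected noise rather than a real failure.
```py
def _get_ignoring_shutdown_noise(data_queue, shutting_down):
    try:
        return data_queue.get()
    except FileNotFoundError:
        if shutting_down:
            return None  # expected while tearing down
        raise
```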
2018-02-24 13:32:13 -05:00
Sam Gross
30ec06c140
Merge Variable and Tensor classes (#5225)
This replaces the torch.Tensor constructors with factories that produce
Variables. Similarly, functions on the torch module (e.g. torch.randn)
now return Variables.

To keep the PR to a reasonable size, I've left most of the unused tensor
code. Subsequent PRs will remove the dead code, clean-up calls to
torch.autograd.Variable, and rename Variable to Tensor everywhere.

There are some breaking changes because Variable and Tensors had
slightly different semantics. There's a list of those changes here:

 https://github.com/pytorch/pytorch/wiki/Breaking-Changes-from-Variable-and-Tensor-merge
2018-02-23 18:03:31 -05:00
Tongzhou Wang
64a9ecae02 Dataloader issues (#4643)
* EINTR and kill by loader fix

* addressed @apaszke 's comments

* remove EINTR handling and add test if we are in main thread before setting SIGCHLD
2018-01-29 01:18:17 +01:00
Tongzhou Wang
0ac58d53b8 ATen conv param expansion; InstanceNorm use_running_stats fix (#4544)
* fix instancenorm and aten conv param expansion

* addressed colesbury 's comments

* improve conv input shape check
2018-01-10 17:36:26 -05:00
Alykhan Tejani
18a866aedd Add random_split to torch.utils.data.dataset (#4435) 2018-01-02 18:56:49 +01:00
Christian Sarofeen
bc6bd62bd6 Fix distributed dataloader so it pins memory to the current GPU, not GPU 0. 2017-12-19 13:39:06 +01:00
Tongzhou Wang
5cc26c0c90 Add default PyTorch seeding and worker_init_fn to DataLoader (#4018)
* Add default PyTorch seeding and worker_init_fn to DataLoader

* generate seed using current RNG each time

* worker_seed <- main_proc_RNG_generated_seed + worker_id
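Typical usage of the new hook (numpy seeding shown as an example of what worker_init_fn is for):
```py
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(8))

def worker_init_fn(worker_id):
    # torch is already seeded per worker (base_seed + worker_id); use the
    # hook to seed other libraries. torch.initial_seed() returns this
    # worker's seed; numpy requires a value below 2**32.
    np.random.seed(torch.initial_seed() % 2 ** 32)

loader = DataLoader(dataset, num_workers=2, worker_init_fn=worker_init_fn)
```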
2017-12-18 02:19:08 -05:00
Jon Crall
5c13c6962c Raise errors when num_workers == 0 in DataLoader (#4019) 2017-12-05 11:07:43 -08:00
Alykhan Tejani
5571d0187e Accept longs in default_collate for dataloader in python 2 (#4001) 2017-12-04 09:50:57 -08:00
SsnL
1661370ac5 Signal handling in DataLoader workers; Timeout option (#3474) 2017-11-29 23:52:14 +01:00
Mikhail Korobov
754f3d3fe8 fixed a typo in ConcatDataset.cumulative_sizes attribute name 2017-11-24 11:07:51 +01:00
Ozan Çağlayan
dd6d04ddf2 doc: Normalize all true/false in docstrings to `True|False` (#3593)
* doc: Normalize all true/false in docstrings to ``True|False``

This makes them more apparent in the documentation.

* doc: fix flake8
2017-11-09 08:12:29 -05:00