Commit Graph

32 Commits

Author SHA1 Message Date
PyTorch MergeBot
ecde622649 Revert "reseed all Generators in Dataloader's _worker_loop() -- via GC (#107131)"
This reverts commit 42625da5e1.

Reverted https://github.com/pytorch/pytorch/pull/107131 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/107131#issuecomment-1690325745))
2023-08-23 17:08:07 +00:00
Nicolas Hug
42625da5e1 reseed all Generators in Dataloader's _worker_loop() -- via GC (#107131)
Alternative to https://github.com/pytorch/pytorch/pull/107034, implements @ezyang 's suggestion from https://github.com/pytorch/pytorch/pull/107034#discussion_r1292857201.

This PR addresses https://fb.workplace.com/groups/pytorch.oss.dev/posts/1699944830430051 and does a bunch of stacked changes:

- Make `Generator` class support GC;this makes all `Generator` instances tracked and accessile through Python's GC.
- Use the GC to retrieve all existing Generator instances in Dataloader's `_worker_loop` and re-seed them: this extends what is already applied to the global/default Generator, which is already re-seeded.

~TODO: a bit of docs and justification, which I'll do if this PR is mergeable.~ -- Done

CC @albanD @ezyang  as previously discussed

BC-Breaking Note
-------------------

We now re-seed all `Generator` instances within the `Dataloader` workers' loop to ensure that their RNG is different across workers.
Previously, the RNG of user-defined `Generators` would be the same across workers, which could lead to wrong training procedures. This only affects user-defined `Generators`, not the default `Generator` (which was already re-seeded).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107131
Approved by: https://github.com/ezyang
2023-08-18 10:23:23 +00:00
Ramil Nugmanov
2ae87a1f87 missed StackDataset documentation (#101927)
New dataset class added by #101338 missed in documentation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101927
Approved by: https://github.com/kit1980
2023-05-22 21:12:16 +00:00
Kevin Tse
be8d88f8d0 [DataLoader] Removing DataLoader2 related code (#88848)
Removing these lines of code as `DataLoader2` has been added to [TorchData](https://github.com/pytorch/data). I'm importing this to confirm it will not impact internal codes.

Differential Revision: [D41201578](https://our.internmc.facebook.com/intern/diff/D41201578)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88848
Approved by: https://github.com/ejguan
2022-11-11 22:27:01 +00:00
Kazuaki Ishizaki
7d2f1cd211 Fix typos under docs directory (#88033)
This PR fixes typos in `.rst` and `.Doxyfile` files under docs directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88033
Approved by: https://github.com/soulitzer
2022-10-31 19:31:56 +00:00
erjia
b13b10a8fa Extend collate function that can register collate functions to handle specific types (#85748)
As per request from Vision team, adding `collate` function with an extra argument of `collate_fn_map` to dispatch custom collate functions for non-collection objects and specific objects.
If the type of batch element is not present in`collate_fn_map`, it will go through all keys in the insertion order to check if the type is a subclass of the key. If so, it will invoke the corresponding collate functions.

And, `default_collate` will utilize the `collate` function with a few by default collate function for `int`, `float`, `str` and `numpy object`.

Benefit:
- Domain teams can register their own `collate` function to handle their specific type of objects
- Easier for users to extend from the `collate` function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85748
Approved by: https://github.com/NivekT, https://github.com/pmeier
2022-09-30 13:30:18 +00:00
Kevin Tse
51ecc366e1 [DataLoader] Minor documentation improvement
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78404

Approved by: https://github.com/ejguan
2022-05-31 15:59:46 +00:00
Alban Desmaison
734281c3d6 Cleanup all module references in doc (#73983)
Summary:
Working towards https://docs.google.com/document/d/10yx2-4gs0gTMOimVS403MnoAWkqitS8TUHX73PN8EjE/edit?pli=1#

This PR:
- Ensure that all the submodules are listed in a rst file (that ensure they are considered by the coverage tool)
- Remove some long deprecated code that just error out on import
- Remove the allow list altogether to ensure nothing gets added back there

Pull Request resolved: https://github.com/pytorch/pytorch/pull/73983

Reviewed By: anjali411

Differential Revision: D34787908

Pulled By: albanD

fbshipit-source-id: 163ce61e133b12b2f2e1cbe374f979e3d6858db7
(cherry picked from commit c9edfead7a01dc45bfc24eaf7220d2a84ab1f62e)
2022-03-10 22:26:29 +00:00
Kevin Tse
b67eaec853 [DateLoader] more clearly expose 'default_collate' and 'default_convert' to users (#69862)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69862

Fixes #69445

cc SsnL VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: ejguan, ngimel

Differential Revision: D33068792

Pulled By: NivekT

fbshipit-source-id: ef9791acdc23d014b8761fa7420062d454ce8969
2021-12-14 11:18:26 -08:00
Edward Yang
71e149834b Add a warning about DataLoader num_workers > 0 "memory leak" (#64337)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64337

See https://github.com/pytorch/pytorch/issues/13246

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D30690320

Pulled By: ezyang

fbshipit-source-id: 2751aca05a94e63d25162599f458855988516fad
2021-09-01 21:49:41 -07:00
Erjia Guan
8cf85a1152 [DataLoader][doc] Randomness for base_seed generator and NumPy seed (#56528)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56528

Tried to search across internal and external usage of DataLoader. People haven't started to use `generator` for `DataLoader`.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D27908487

Pulled By: ejguan

fbshipit-source-id: 14c83ed40d4ba4dc988b121968a78c2732d8eb93
2021-04-22 09:40:45 -07:00
Sam Estep
8c798e0622 Forbid trailing whitespace (#53406)
Summary:
Context: https://github.com/pytorch/pytorch/pull/53299#discussion_r587882857

These are the only hand-written parts of this diff:
- the addition to `.github/workflows/lint.yml`
- the file endings changed in these four files (to appease FB-internal land-blocking lints):
  - `GLOSSARY.md`
  - `aten/src/ATen/core/op_registration/README.md`
  - `scripts/README.md`
  - `torch/csrc/jit/codegen/fuser/README.md`

The rest was generated by running this command (on macOS):
```
git grep -I -l ' $' -- . ':(exclude)**/contrib/**' ':(exclude)third_party' | xargs gsed -i 's/ *$//'
```

I looked over the auto-generated changes and didn't see anything that looked problematic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53406

Test Plan:
This run (after adding the lint but before removing existing trailing spaces) failed:
- https://github.com/pytorch/pytorch/runs/2043032377

This run (on the tip of this PR) succeeded:
- https://github.com/pytorch/pytorch/runs/2043296348

Reviewed By: walterddr, seemethere

Differential Revision: D26856620

Pulled By: samestep

fbshipit-source-id: 3f0de7f7c2e4b0f1c089eac9b5085a58dd7e0d97
2021-03-05 17:22:55 -08:00
Erjia Guan
89b1053413 [DataLoader] Move BufferedShuffle from Dataset to DataPipe (#52141)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52141

Remove BufferShuffleDataSet, as it's not being used anywhere within PyTorch (no usage on Github based on a search) and it's not included in the release of PyTorch 1.7.1.

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D26710940

Pulled By: ejguan

fbshipit-source-id: 90023b4bfb105d6aa392753082100f9181ecebd0
2021-03-01 12:54:44 -08:00
Vitaly Fedyunin
31ee5d8d8b Adding information how to control randomness with DataLoader (#45749)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45749

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D24088407

Pulled By: VitalyFedyunin

fbshipit-source-id: 398b73ec5e8c83000ebc692001da847fc0aaa48f
2020-10-12 16:57:58 -07:00
Erjia Guan
96540e918c Add ShuffleDataset with buffer (#45290)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45290

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D24001084

Pulled By: erjia-guan

fbshipit-source-id: d8a7455cf3f18e1f8c1edc53c42c1a99c8573c51
2020-09-30 07:58:15 -07:00
Emilio Castillo
5472426b9f Reset DataLoader workers instead of creating new ones (#35795)
Summary:
This PR needs discussion as it changes the behavior of `DataLoader`. It can be closed if its not considered a good practice.

Currently, the `DataLoader` spawns a new `_BaseDataLoaderIter` object every epoch,
In the case of the multiprocess DataLoader, every epoch the worker processes are re-created and they make a copy of the original `Dataset` object.
If users want to cache data or do some tracking on their datasets, all their data will be wiped out every epoch. Notice that this doesn't happen when the number of workers is 0. giving some inconsistencies with the multiprocess and serial data loaders.

This PR keeps the `_BaseDataLoaderIter` object alive and just resets it within epochs, so the workers remain active and so their own `Dataset` objects. People seem to file issues about this often.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/35795

Reviewed By: ailzhang

Differential Revision: D23426612

Pulled By: VitalyFedyunin

fbshipit-source-id: e16950036bae35548cd0cfa78faa06b6c232a2ea
2020-09-01 11:48:00 -07:00
yl-to
1b55e2b043 add prefetch_factor for multiprocessing prefetching process (#41130)
Summary:
fix https://github.com/pytorch/pytorch/issues/40604
Add parameter to Dataloader to configure the per-worker prefetch number.
Before this edit, the prefetch process always prefetch 2 * num_workers data items, this commit help us make this configurable, e.x. you can specify to prefetch 10 * num_workers data items.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41130

Reviewed By: izdeby

Differential Revision: D22705288

Pulled By: albanD

fbshipit-source-id: 2c483fce409735fef1351eb5aa0b033f8e596561
2020-07-24 08:38:13 -07:00
Samuel
b039bca4db Fix typo in data.rst (#34624)
Summary:
Fix minor typo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34624

Differential Revision: D20401946

Pulled By: ngimel

fbshipit-source-id: 0c6a7d838aa15120b3ecb8b9ba4b57550c9bcd32
2020-03-11 19:40:18 -07:00
Elliot Waite
c63f8e5ebe Fix typo in data.rst docs
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31395

Differential Revision: D19160010

Pulled By: zou3519

fbshipit-source-id: cbc4e719e69117e8747617729d240c72e7a4e3dd
2019-12-18 09:52:10 -08:00
Tongzhou Wang
336c9be7f4 Slightly improve dataloader docs on when auto-batching is disabled (#23671)
Summary:
cc gchanan
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23671

Differential Revision: D16604387

Pulled By: soumith

fbshipit-source-id: 0ebc120bcaa0f6fa09158b1d0459a72ab11a53d6
2019-08-01 12:10:17 -07:00
Arul
43d36415b9 torch.utils.data.Dataloader: documentation about RNG state consumption (#22540)
Summary:
the outcome from the pytorch forum issue: https://discuss.pytorch.org/t/dataloader-problem-problem-arises-when-shuffle-true/45631

The discussion is here: https://github.com/pytorch/pytorch/pull/20749
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22540

Differential Revision: D16131777

Pulled By: ezyang

fbshipit-source-id: 566deda1b44dc7fae54250e9b508d120851a2848
2019-07-08 08:22:04 -07:00
Tongzhou Wang
058beae411 Add IterableDataset (#19228)
Summary:
This is a modified version of https://github.com/pytorch/pytorch/pull/14705 since commit structure for that PR is quite messy.

1. Add `IterableDataset`.
3. So we have 2 data loader mods: `Iterable` and `Map`.

    1. `Iterable` if the `dataset` is an instance of `IterableDataset`
    2. `Map` o.w.

3. Add better support for non-batch loading (i.e., `batch_size=None` and `batch_sampler=None`). This is useful in doing things like bulk loading.
3. Refactor `DataLoaderIter` into two classes, `_SingleProcessDataLoaderIter` and `_MultiProcessingDataLoaderIter`. Rename some methods to be more generic, e.g., `get_batch` -> `get_data`.
4. Add `torch.utils.data.get_worker_info` which returns worker information in a worker proc (e.g., worker id, dataset obj copy, etc.) and can be used in `IterableDataset.__iter__` and `worker_init_fn` to do per-worker configuration.
5. Add `ChainDataset`, which is the analog of `ConcatDataset` for `IterableDataset`.
7. Import torch.utils.data in `torch/__init__.py`
9. data loader examples and documentations
10. Use `get_worker_info` to detect whether we are in a worker process in `default_collate`

Closes https://github.com/pytorch/pytorch/issues/17909, https://github.com/pytorch/pytorch/issues/18096, https://github.com/pytorch/pytorch/issues/19946, and some of https://github.com/pytorch/pytorch/issues/13023
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19228

Reviewed By: bddppq

Differential Revision: D15058152

fbshipit-source-id: 9e081a901a071d7e4502b88054a34b450ab5ddde
2019-06-20 20:12:44 -07:00
Thomas Viehmann
0ae8b6c027 add fold example and add nn.Fold/nn.Unfold and F.fold/F.unfold to doc (#8600)
* add fold example and add nn.Fold/nn.Unfold and F.fold/F.unfold to doc

and a few drive-by doc fixes

* typo
2018-06-18 09:36:42 -04:00
Gao, Xiang
d7c32df67f move Subset, random_split to data, use sequence at some places. (#7816) 2018-05-25 12:50:50 +02:00
Gao, Xiang
42e5e12750 make BatchSampler subclass of Sampler, and expose (#7707) 2018-05-19 21:29:03 +02:00
Richard Zou
e37da05bd5 Expose documentation for random_split (#7676)
Fixes #7640
2018-05-18 17:16:25 +02:00
Thomas Viehmann
1b0ad8678b import *Sampler to utils.data (Better fix than #6982) (#7007) 2018-04-27 10:18:29 +02:00
Sasank Chilamkurthy
5caa42b538 Add ConcatDataset to docs (#2337) 2017-08-08 07:16:04 -04:00
Sam Gross
9c53c6dcb9 Fix errors and warnings when building docs (#1806) 2017-06-14 13:50:14 -04:00
Adam Paszke
12813b88f6 Add DistributedDataParallel 2017-06-12 22:00:22 -04:00
Soumith Chintala
22b3600f19 add samplers to documentation 2017-03-29 00:33:07 -04:00
Sam Gross
126a1cc398 Add Sphinx docs 2016-12-28 00:03:39 +01:00