Torch's list of wrapping datasets includes:
- `TensorDataset`
- `ConcatDataset`
- `ChainDataset`
`TensorDataset` is useful for stacking sets of tensors, but it cannot work with objects that lack a `.size()` method.
This PR proposes `StackDataset`, which is similar to `TensorDataset` but handles the general case, as `ConcatDataset` does.
Possible uses of `StackDataset` include multimodal networks with different inputs, such as image+text, and stacking a non-tensor input with the property to predict.
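The stacking idea can be sketched in plain Python. This is only an illustration under stated assumptions, not the code from the PR: the class name `StackSketch` and the sample data are hypothetical, and the sketch relies only on `len()` and indexing rather than `.size()`.

```python
from typing import Any, Sequence, Tuple


class StackSketch:
    """Illustrative stand-in for the proposed `StackDataset` (hypothetical name).

    Wraps several equally sized, indexable datasets and returns one tuple
    per index. Only `len()` and `[]` are required of the inputs.
    """

    def __init__(self, *datasets: Sequence[Any]) -> None:
        if not datasets:
            raise ValueError("at least one dataset is required")
        length = len(datasets[0])
        if any(len(d) != length for d in datasets):
            raise ValueError("all datasets must have the same length")
        self.datasets = datasets
        self._length = length

    def __getitem__(self, index: int) -> Tuple[Any, ...]:
        # One element from every wrapped dataset, in argument order.
        return tuple(d[index] for d in self.datasets)

    def __len__(self) -> int:
        return self._length


# Hypothetical multimodal inputs: image paths paired with captions.
images = ["img0.png", "img1.png", "img2.png"]
captions = ["a cat", "a dog", "a bird"]
pairs = StackSketch(images, captions)
```

Indexing `pairs[1]` returns one item from each wrapped dataset, which is the behavior the description asks for with non-tensor inputs.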
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101338
Approved by: https://github.com/ejguan, https://github.com/NivekT
As requested by the Vision team, this adds a `collate` function with an extra `collate_fn_map` argument to dispatch custom collate functions for non-collection objects and specific objects.
If the type of the batch element is not present in `collate_fn_map`, `collate` goes through all keys in insertion order to check whether the type is a subclass of the key. If so, it invokes the corresponding collate function.
`default_collate` now uses the `collate` function with a few default collate functions for `int`, `float`, `str`, and NumPy objects.
Benefits:
- Domain teams can register their own `collate` functions to handle their specific object types.
- Users can extend the `collate` function more easily.
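The lookup order described above (exact type first, then a subclass check over the keys in insertion order) can be sketched as follows. This is a simplified illustration, not the PyTorch implementation: `dispatch_collate` and `example_map` are hypothetical names, and the handling of nested collections is omitted.

```python
from typing import Any, Callable, Dict, List, Type

CollateFn = Callable[[List[Any]], Any]


def dispatch_collate(batch: List[Any], collate_fn_map: Dict[Type, CollateFn]) -> Any:
    """Pick a collate function based on the type of the first batch element."""
    elem_type = type(batch[0])
    # 1. Exact type match.
    if elem_type in collate_fn_map:
        return collate_fn_map[elem_type](batch)
    # 2. Fallback: walk the keys in insertion order and take the first key
    #    that the element type is a subclass of.
    for key, fn in collate_fn_map.items():
        if issubclass(elem_type, key):
            return fn(batch)
    raise TypeError(f"no collate function registered for {elem_type.__name__}")


# Hypothetical registry; dicts preserve insertion order in Python 3.7+.
example_map: Dict[Type, CollateFn] = {
    float: lambda b: sum(b) / len(b),
    int: lambda b: sum(b),
    str: lambda b: " ".join(b),
}

total = dispatch_collate([1, 2, 3], example_map)                # exact match on int
flag_count = dispatch_collate([True, True, False], example_map)  # bool falls back to int
```

Here `bool` has no entry of its own, so the batch of booleans is handled by the `int` function because `bool` is a subclass of `int`.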
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85748
Approved by: https://github.com/NivekT, https://github.com/pmeier
This is the first PR to make DataPipe deterministic.
Users should be able to use `torch.manual_seed(seed)` to control the shuffle order for the following cases:
- Iterating directly over a `DataPipe`
- Single-process `DataLoader`
- Multiprocessing `DataLoader`
Unfortunately, for distributed training, users have to run `apply_shuffle_seed` manually to make sure all distributed processes have the same shuffle order.
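Why a shared seed matters can be shown with a small standard-library sketch. It is an analogy only: Python's `random.Random` stands in for the torch random number generator, the helper `shuffled` is hypothetical, and neither `torch.manual_seed` nor `apply_shuffle_seed` is called here.

```python
import random
from typing import List, Sequence


def shuffled(items: Sequence[int], seed: int) -> List[int]:
    """Return a shuffled copy of `items`, fully determined by `seed`."""
    rng = random.Random(seed)  # private generator, so global state is untouched
    out = list(items)
    rng.shuffle(out)
    return out


# Four simulated processes that all received the same seed.
rank_orders = [shuffled(range(8), seed=1234) for _rank in range(4)]
```

Every simulated process ends up with the same permutation, which is the property that distributed processes need in order to agree on the shuffle order.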
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77741
Approved by: https://github.com/VitalyFedyunin, https://github.com/NivekT
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73396
Separating DataPipes from Dataset into different files. This makes the code more maintainable and simplifies some of the code generation.
I have also tried to move `datapipe.py` into `torch.utils.data.datapipes`, but that leads to circular imports and requires rewriting many import statements. Should I spend more time on that path?
Fixes https://github.com/pytorch/data/issues/213
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D34481962
Pulled By: NivekT
fbshipit-source-id: 42fb26fe7fc334636852cfd8719fc807bdaa7912
(cherry picked from commit 81e76a64e297cb5c58caa951c554e49526173936)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62768
This is part of the preparation for TorchArrow DF support, split into multiple PRs to simplify the review process.
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D30149090
Pulled By: VitalyFedyunin
fbshipit-source-id: a36b5ff56e2ac6b06060014d4cd41b487754acb8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55836
Change `construct_time_validation` to `argument_validation`, since users should have the flexibility to apply this decorator to any function that requires type validation.
It can still work as construct-time validation:
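```py
from torch.utils.data import IterDataPipe, argument_validation


class ExampleDataPipe(IterDataPipe):
    # Validate `dp` against its type hint when the DataPipe is constructed.
    @argument_validation
    def __init__(self, dp: IterDataPipe[int]):
        self.dp = dp

    ...
```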
Notebook is also updated.
Test Plan: Imported from OSS
Reviewed By: VitalyFedyunin
Differential Revision: D27743478
Pulled By: ejguan
fbshipit-source-id: 49743152d121028cd7d72d89dc7df5c7c7b94c41
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54544
## Feature
- Add `subinstance(data, type)` to check whether `data` is a subtype instance of `type`.
- Add a `runtime_validation` decorator to validate that the data returned from `__iter__` is a subtype instance of the type hint.
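The runtime check can be sketched with a plain decorator around `__iter__`. This is a simplified illustration, not the PR's code: `validate_iter` and `Numbers` are hypothetical names, the expected type is passed explicitly instead of being read from a type hint, and a plain `isinstance` check replaces the subtype logic.

```python
import functools
from typing import Any, Callable, Iterable, Iterator


def validate_iter(expected_type: type) -> Callable:
    """Wrap `__iter__` so every yielded item is checked against `expected_type`."""

    def decorator(iter_fn: Callable[[Any], Iterator[Any]]) -> Callable[[Any], Iterator[Any]]:
        @functools.wraps(iter_fn)
        def wrapper(self: Any) -> Iterator[Any]:
            for item in iter_fn(self):
                if not isinstance(item, expected_type):
                    raise RuntimeError(
                        f"expected {expected_type.__name__}, got {type(item).__name__}"
                    )
                yield item

        return wrapper

    return decorator


class Numbers:
    def __init__(self, data: Iterable[Any]) -> None:
        self.data = list(data)

    @validate_iter(int)
    def __iter__(self) -> Iterator[int]:
        yield from self.data


valid_items = list(Numbers([1, 2, 3]))
```

Because the check runs inside the generator, an invalid item raises only when iteration reaches it, not when the object is constructed.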
Test Plan: Imported from OSS
Reviewed By: VitalyFedyunin
Differential Revision: D27327234
Pulled By: ejguan
fbshipit-source-id: fb6a332762b0fe75284bb2b52a13ed171b42558c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54066
## Feature
- Add a decorator `construct_time_validation` to validate each input datapipe according to the corresponding type hint.
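A construct-time check driven by type hints might look roughly like the sketch below. It is not the PR's implementation: `check_arguments`, `SourcePipe`, and `DoublerPipe` are hypothetical names, and only plain classes are checked with `isinstance`, whereas the real decorator compares DataPipe type hints.

```python
import functools
import inspect
from typing import Any, Callable, Iterable, get_type_hints


def check_arguments(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Validate call arguments against the function's type hints."""
    hints = get_type_hints(fn)
    signature = inspect.signature(fn)

    @functools.wraps(fn)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            expected = hints.get(name)
            # Only plain classes are checked in this sketch; unannotated
            # parameters such as `self` are skipped.
            if isinstance(expected, type) and not isinstance(value, expected):
                raise TypeError(
                    f"{name!r} must be {expected.__name__}, got {type(value).__name__}"
                )
        return fn(*args, **kwargs)

    return wrapper


class SourcePipe:
    def __init__(self, data: Iterable[Any]) -> None:
        self.data = list(data)


class DoublerPipe:
    @check_arguments
    def __init__(self, source: SourcePipe) -> None:
        self.source = source


pipe = DoublerPipe(SourcePipe([1, 2]))
```

Passing anything other than a `SourcePipe` to `DoublerPipe` fails at construction, before any data is read.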
Test Plan: Imported from OSS
Reviewed By: VitalyFedyunin
Differential Revision: D27327236
Pulled By: ejguan
fbshipit-source-id: a9d4c6edb5b05090bd5a369eee50a6fb4d7cf957
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53271
- [x] Add `set_determinism` context manager
- [x] Add `non_deterministic` decorator for `DataPipe`
  - Raise an error at construction time for a non-deterministic DataPipe when `determinism` is set to `True`
- [ ] Support `non_deterministic` with option
  - When `GreedyJoin` only contains one datapipe, it should still be deterministic.
Note: The test is in the [PR](https://github.com/facebookexternal/torchdata/pull/15), as the main repo doesn't have a non-deterministic DataPipe yet.
Test Plan: Imported from OSS
Reviewed By: anjali411
Differential Revision: D26823023
Pulled By: ejguan
fbshipit-source-id: 51bb92fc3d18d1fc9536c1229363c536ad120876
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52141
Remove BufferShuffleDataSet, as it's not being used anywhere within PyTorch (no usage on GitHub based on a search) and it's not included in the PyTorch 1.7.1 release.
Test Plan: Imported from OSS
Reviewed By: H-Huang
Differential Revision: D26710940
Pulled By: ejguan
fbshipit-source-id: 90023b4bfb105d6aa392753082100f9181ecebd0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51126
BucketBatch:
Get a chunk of data as a bucket, sort the bucket by the specified key, then batch it.
If the sort key is not specified, use batchIterableDS directly.
1. Implement BucketBatch for bucket sampler
2. Improve BatchDS tests
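The bucket-then-batch behavior described above can be sketched with generators. This is an illustration under assumptions, not the PR's implementation: the function names and the `bucket_num` parameter (how many batches make up one bucket) are hypothetical.

```python
from itertools import islice
from typing import Any, Callable, Iterable, Iterator, List, Optional


def batched(items: Iterable[Any], batch_size: int) -> Iterator[List[Any]]:
    """Yield consecutive lists of up to `batch_size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def bucket_batch(
    items: Iterable[Any],
    batch_size: int,
    bucket_num: int = 2,
    sort_key: Optional[Callable[[Any], Any]] = None,
) -> Iterator[List[Any]]:
    """Read a bucket of `batch_size * bucket_num` items, sort it, then batch it."""
    if sort_key is None:
        # Without a sort key this reduces to plain batching.
        yield from batched(items, batch_size)
        return
    for bucket in batched(items, batch_size * bucket_num):
        yield from batched(sorted(bucket, key=sort_key), batch_size)


words = ["ccc", "a", "bb", "dddd", "e", "ff"]
by_length = list(bucket_batch(words, batch_size=2, sort_key=len))
plain = list(bucket_batch(words, batch_size=2))
```

With a length key, items of similar length land in the same batch within each bucket, which is the usual motivation for bucket sampling (less padding per batch).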
Test Plan: Imported from OSS
Reviewed By: H-Huang
Differential Revision: D26209890
Pulled By: ejguan
fbshipit-source-id: 8519e2e49da158b3fe32913c8f3cadfa6f3ff1fc
Summary:
This is a modified version of https://github.com/pytorch/pytorch/pull/14705, since the commit structure of that PR is quite messy.
1. Add `IterableDataset`.
2. So we have 2 data loader modes: `Iterable` and `Map`.
   1. `Iterable` if the `dataset` is an instance of `IterableDataset`
   2. `Map` otherwise
3. Add better support for non-batch loading (i.e., `batch_size=None` and `batch_sampler=None`). This is useful for things like bulk loading.
4. Refactor `DataLoaderIter` into two classes, `_SingleProcessDataLoaderIter` and `_MultiProcessingDataLoaderIter`. Rename some methods to be more generic, e.g., `get_batch` -> `get_data`.
5. Add `torch.utils.data.get_worker_info`, which returns worker information in a worker process (e.g., worker id, dataset object copy, etc.) and can be used in `IterableDataset.__iter__` and `worker_init_fn` to do per-worker configuration.
6. Add `ChainDataset`, which is the analog of `ConcatDataset` for `IterableDataset`.
7. Import `torch.utils.data` in `torch/__init__.py`
8. Add data loader examples and documentation
9. Use `get_worker_info` to detect whether we are in a worker process in `default_collate`
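The per-worker configuration mentioned in the `get_worker_info` item can be illustrated with a plain-Python sketch. It does not import torch: `WorkerInfoSketch` and `RangeStream` are hypothetical stand-ins, and the worker information is passed in explicitly instead of being obtained from `torch.utils.data.get_worker_info()`.

```python
from itertools import islice
from typing import Iterator, NamedTuple, Optional


class WorkerInfoSketch(NamedTuple):
    """Hypothetical stand-in carrying the two fields this sketch needs."""

    id: int
    num_workers: int


class RangeStream:
    """Iterable-style dataset over [start, end) that can shard itself per worker."""

    def __init__(self, start: int, end: int) -> None:
        self.start = start
        self.end = end

    def iter_for(self, info: Optional[WorkerInfoSketch]) -> Iterator[int]:
        full = range(self.start, self.end)
        if info is None:
            # Single-process loading: yield everything.
            return iter(full)
        # Worker process: take every num_workers-th item, offset by worker id,
        # so that workers do not yield duplicate samples.
        return islice(full, info.id, None, info.num_workers)


stream = RangeStream(0, 7)
everything = list(stream.iter_for(None))
worker0 = list(stream.iter_for(WorkerInfoSketch(id=0, num_workers=2)))
worker1 = list(stream.iter_for(WorkerInfoSketch(id=1, num_workers=2)))
```

Without such sharding, each worker would iterate its own full copy of an iterable-style dataset and the loader would return every sample once per worker.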
Closes https://github.com/pytorch/pytorch/issues/17909, https://github.com/pytorch/pytorch/issues/18096, https://github.com/pytorch/pytorch/issues/19946, and some of https://github.com/pytorch/pytorch/issues/13023
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19228
Reviewed By: bddppq
Differential Revision: D15058152
fbshipit-source-id: 9e081a901a071d7e4502b88054a34b450ab5ddde
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18598
ghimport-source-id: c74597e5e7437e94a43c163cee0639b20d0d0c6a
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18598 Turn on F401: Unused import warning.**
This was requested by someone at Facebook; this lint is turned
on for Facebook by default. "Sure, why not."
I had to noqa a number of imports in __init__. Hypothetically
we're supposed to use __all__ in this case, but I was too lazy
to fix it. Left for future work.
Be careful! flake8-2 and flake8-3 behave differently with
respect to import resolution for # type: comments. flake8-3 will
report an import unused; flake8-2 will not. For now, I just
noqa'd all these sites.
All the changes were done by hand.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D14687478
fbshipit-source-id: 30d532381e914091aadfa0d2a5a89404819663e3
Summary:
Same as #14668, and was approved there.
ailzhang, please apply this patch to Horizon's `data_streamer.py`: https://gist.github.com/SsnL/020fdb3d6b7016d81b6ba1d04cc41459 Thank you!
Below is the original description at #14668:
As I am working on tasks in https://github.com/pytorch/pytorch/issues/13023, I realized how unreadable the code is because all functions to be run in multiprocessing must be at top global level. Adding more functionalities to `dataloader.py` will only make things worse.
So in this PR, I refactor `dataloader.py` and move much of it into `data._utils`. E.g., the `_worker_loop` and related methods are now in `data._utils.worker`, signal handling code in `data._utils.signal_handling`, collating code in `data._utils.collate`, etc. This split, IMHO, makes code much clearer. I will base my future changes to DataLoader on top of this.
No functionality is changed, except that I added `torch._six.queue`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15331
Reviewed By: yf225
Differential Revision: D13503120
Pulled By: ailzhang
fbshipit-source-id: 94df16b4d80ad1102c437cde0d5a2e62cffe1f8e
Summary:
As I am working on tasks in https://github.com/pytorch/pytorch/issues/13023, I realized how unreadable the code is because all functions to be run in multiprocessing must be at top global level. Adding more functionalities to `dataloader.py` will only make things worse.
So in this PR, I refactor `dataloader.py` and move much of it into `data._utils`. E.g., the `_worker_loop` and related methods are now in `data._utils.worker`, signal handling code in `data._utils.signal_handling`, collating code in `data._utils.collate`, etc. This split, IMHO, makes code much clearer. I will base my future changes to DataLoader on top of this.
No functionality is changed, except that I added `torch._six.queue`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14668
Reviewed By: soumith
Differential Revision: D13289919
Pulled By: ailzhang
fbshipit-source-id: d701bc7bb48f5dd7b163b5be941a9d27eb277a4c
Summary:
There is no reason that users should need an extra import to use `DistributedSampler`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10671
Differential Revision: D9395189
Pulled By: SsnL
fbshipit-source-id: 8f41d93813c8fb52fe012f76980c6a261a8db9b2