Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66277
Previously, it was grouped together with tests related to `MapDataPipe`, but it belongs with the `IterDataPipe` tests.
cc VitalyFedyunin ejguan NivekT
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D31485823
Pulled By: NivekT
fbshipit-source-id: d13d8c28cbfc305da0e3033d4109a0f971281a02
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66275
Once this is added to Core, TorchData's PR will not need a custom class and can use this wrapper instead.
cc VitalyFedyunin ejguan NivekT
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D31485822
Pulled By: NivekT
fbshipit-source-id: 790de27629c89c0ca7163a8ee5a09ee8b8233340
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65220
Fixes #65221
- Remove deepcopy from Mapper to support file handles
- Make `IterableWrapper` deepcopy the wrapped iterable instance within each iterator to prevent in-place modification (which would yield different data per epoch)
- Convert `IDP` to `IterableWrapper` in test_datapipe.py
- Refine the variable names (avoid `dp`, which is already a module reference)
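The deepcopy-per-iterator behavior described above can be sketched in plain Python. This is only an illustration under assumptions: `DeepCopyIterableWrapper` and its `deepcopy` flag are hypothetical names, not the actual `IterableWrapper` implementation.

```python
import copy


class DeepCopyIterableWrapper:
    """Hypothetical stand-in for a wrapper that copies its source per iterator."""

    def __init__(self, iterable, deepcopy=True):
        self.iterable = iterable
        self.deepcopy = deepcopy

    def __iter__(self):
        # Copy the source for every new iterator so that in-place changes made
        # by a consumer during one epoch do not leak into the next epoch.
        source = copy.deepcopy(self.iterable) if self.deepcopy else self.iterable
        yield from source


wrapped = DeepCopyIterableWrapper([[0, 1], [2, 3]])
for item in wrapped:
    item.append(-1)  # in-place modification by a consumer

second_epoch = list(wrapped)  # unaffected by the mutation above
```

With `deepcopy=False` the same loop would mutate the wrapped lists, and the second epoch would see the appended values.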
Test Plan: Imported from OSS
Reviewed By: malfet
Differential Revision: D31021886
Pulled By: ejguan
fbshipit-source-id: 72a9eee66c758e2717d591cd0942892bddedc223
Summary:
ghstack is not working for the second commit so I'm manually creating this PR for now. Please only look at changes related to the second commit in this PR (there is a PR for the first commit).
This PR removes TarArchiveReader's dependency on the FileLoader DataPipe by allowing it to take an IterDataPipe of path names as input, rather than tuples of a path name and a stream.
It also adds tests to ensure that the DataPipe functions properly when it is read multiple times or reset halfway through reading.
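As a rough illustration of reading archives from path names alone, the sketch below uses only the standard library. `read_tar_members` is a hypothetical helper, and it yields bytes rather than streams for simplicity, so it should not be read as TarArchiveReader's actual behavior.

```python
import io
import os
import tarfile
import tempfile


def read_tar_members(path_names):
    """Yield (member path, bytes) pairs for every regular file in each archive."""
    for path in path_names:
        # The archive is opened here from its path, so no upstream file-loading
        # step has to hand over an already opened stream.
        with tarfile.open(path, "r") as tar:
            for member in tar:
                if not member.isfile():
                    continue
                with tar.extractfile(member) as extracted:
                    yield os.path.join(path, member.name), extracted.read()


# Build a tiny archive to exercise the generator.
tmp_dir = tempfile.mkdtemp()
archive_path = os.path.join(tmp_dir, "sample.tar")
with tarfile.open(archive_path, "w") as tar:
    payload = b"hello"
    info = tarfile.TarInfo(name="a.txt")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

first_pass = list(read_tar_members([archive_path]))
second_pass = list(read_tar_members([archive_path]))  # a second read sees the same data
```

Because each archive is opened and closed inside the generator, reading twice or abandoning a read halfway does not leave a handle open in this sketch.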
The whole stack fixes https://github.com/pytorch/pytorch/issues/64281 - issues related to unclosed buffer stream.
Stack:
* __->__ https://github.com/pytorch/pytorch/issues/64788
* https://github.com/pytorch/pytorch/issues/64786
cc VitalyFedyunin ejguan
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64788
Reviewed By: jbschlosser, ejguan
Differential Revision: D30901176
Pulled By: NivekT
fbshipit-source-id: 59746a8d0144fc6d3ce0feb2d76445b82e6d414e
Summary:
There are two warnings produced by `test_fork_datapipe`. This PR addresses the issues raised by those warnings without impacting the test cases.
cc VitalyFedyunin ejguan
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64827
Reviewed By: ejguan
Differential Revision: D30870528
Pulled By: NivekT
fbshipit-source-id: 580a001c6fa3ff6f8b04a7e5183e58861938204b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64404
This PR removes `filter`'s inheritance from `map`. As a result, `filter` no longer has a `__len__` function, which is the behavior we want.
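A minimal sketch of why the two classes are kept separate, assuming hypothetical `MapperSketch` and `FilterSketch` classes rather than the real DataPipes: a mapper emits one output per input, so its length is known, while a filter cannot know how many items survive until it is iterated.

```python
class MapperSketch:
    """Hypothetical mapper: one output per input, so the length is known."""

    def __init__(self, source, fn):
        self.source = source
        self.fn = fn

    def __iter__(self):
        for item in self.source:
            yield self.fn(item)

    def __len__(self):
        return len(self.source)


class FilterSketch:
    """Hypothetical filter: it deliberately does not inherit from MapperSketch."""

    def __init__(self, source, predicate):
        self.source = source
        self.predicate = predicate

    def __iter__(self):
        for item in self.source:
            if self.predicate(item):
                yield item

    # No __len__: the number of surviving items is unknown until iteration.


data = list(range(10))
mapped = MapperSketch(data, lambda x: x * 2)
evens = FilterSketch(data, lambda x: x % 2 == 0)
```

Had `FilterSketch` inherited from `MapperSketch`, `len(evens)` would report 10 even though only five items are yielded.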
cc VitalyFedyunin ejguan
Test Plan: Imported from OSS
Reviewed By: gchanan
Differential Revision: D30713120
Pulled By: NivekT
fbshipit-source-id: 4d5d07555297ee2bd4b49842c0d26cdc00638f6c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64220
Remove `ByKeyGrouperIterDataPipe` due to duplicated functionality.
Fix a bug in `GrouperIterDataPipe` using the existing test.
Test Plan: Imported from OSS
Reviewed By: VitalyFedyunin
Differential Revision: D30650542
Pulled By: ejguan
fbshipit-source-id: 666b4d28282fb4f49f3ff101b8d08be16a50d836
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63422
Fixes #63095
Make `DataChunk` delegate to the list methods, so that it supports in-place operations:
- `sort`
- `reverse`
- `append`
- `extend`
- `random.shuffle`
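The operations listed above can be illustrated with a small list subclass. `ChunkSketch` is a hypothetical name used only for this sketch; it is not the actual `DataChunk` class.

```python
import random


class ChunkSketch(list):
    """Hypothetical chunk type that inherits list behavior instead of wrapping a list."""


chunk = ChunkSketch([3, 1, 2])
chunk.sort()            # [1, 2, 3]
chunk.reverse()         # [3, 2, 1]
chunk.append(0)         # [3, 2, 1, 0]
chunk.extend([5, 4])    # [3, 2, 1, 0, 5, 4]
random.shuffle(chunk)   # order is now random, contents are unchanged
```

Each call mutates `chunk` itself, and the object stays a `ChunkSketch` throughout.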
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D30379027
Pulled By: ejguan
fbshipit-source-id: d176bd0cc8b89b915c7bb184ff243ab1f605616d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62768
This is part of the preparation for TorchArrow DF support; the work is split into multiple PRs to simplify the review process.
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D30149090
Pulled By: VitalyFedyunin
fbshipit-source-id: a36b5ff56e2ac6b06060014d4cd41b487754acb8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61312
Sort imports according to isort output. Alphabetically ordered, one-per-line imports make merging easier.
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D29588833
Pulled By: VitalyFedyunin
fbshipit-source-id: 4c80c3086132b50894e734ad6c5799d78d689e42
Summary:
As part of https://github.com/pytorch/pytorch/issues/57031, this PR adds the ConcatMapDataPipe functional datapipe for the MapDataPipe class.
We may need to discuss how to treat datapipes with no valid length. For now, I treat them as if they had infinite length, so `__getitem__` cannot go past them.
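The indexing rule described above might look roughly like the following. `ConcatMapSketch` and `Unsized` are hypothetical names for illustration, and the handling of sources without a length is an assumption based on the description, not the merged implementation.

```python
class ConcatMapSketch:
    """Hypothetical concatenation of map-style sources with integer indexing."""

    def __init__(self, *sources):
        if not sources:
            raise ValueError("expected at least one source")
        self.sources = sources

    def __getitem__(self, index):
        offset = 0
        for source in self.sources:
            try:
                size = len(source)
            except TypeError:
                # Assumption for this sketch: a source without a valid length
                # behaves as if it were infinite, so the lookup ends here and
                # never reaches the sources that follow it.
                return source[index - offset]
            if index - offset < size:
                return source[index - offset]
            offset += size
        raise IndexError("index {} is out of range".format(index))

    def __len__(self):
        return sum(len(source) for source in self.sources)


class Unsized:
    """Map-style source with no __len__; used only for the demonstration."""

    def __getitem__(self, index):
        return "unsized-{}".format(index)


pipe = ConcatMapSketch([10, 11, 12], Unsized(), [99])
```

In this example the trailing `[99]` is unreachable, because every index of 3 or more is absorbed by the `Unsized` source.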
Thank you for your time on reviewing this~
cc ejguan
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61010
Reviewed By: soulitzer
Differential Revision: D29587679
Pulled By: ejguan
fbshipit-source-id: 5eb97fa727209bec6c534520057c64a78000626e
Summary:
Fixes issues that are discussed with ezyang in the comments of PR https://github.com/pytorch/pytorch/issues/59498
Improved code and documentation clarity, and refactored `.filter` to use `nesting_level` directly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60423
Reviewed By: ezyang
Differential Revision: D29281599
Pulled By: NivekT
fbshipit-source-id: a9bbaf52f492db0741c00f3ceb4022b08ddb1506
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59816
Add two new DataPipes: one that takes web file URLs and yields streams, and one that takes streams and yields bytes.
Test Plan:
Add `test_web_iterable_datapipe` in test/test_datapipes.py. The test starts a local HTTP server to serve test files. The following tests passed locally:
1. Create and load 16M localhost file URLs (each file is 10 bytes)
2. Create and load a 64GB localhost file
In the unit test, both the stress test and the large file test are disabled to keep testing time down.
Imported from OSS
Reviewed By: VitalyFedyunin
Differential Revision: D29051186
fbshipit-source-id: f8e44491e670560bf445af96f94d98230436f396
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58938
When running `test_datapipe.py`, Python's `gc` reports many `ResourceWarning`s due to unclosed streams. Besides being annoying, this points to two potential problems:
- Performance regression, because `gc` requires additional memory and computation to track references
- Python's `gc` runs periodically, so we may encounter a "too many open files" error due to OS limits
To reduce the warnings:
- Explicitly close byte stream
- Modify `test_datapipe.py` to use context manager
Small fix:
- Reorder import in `test_datapipe.py`
Further investigation:
Can we directly use a context manager in `LoadFileFromDisk` and `ReadFileFromTar` to eliminate this error?
- Probably not. It is feasible only if the pipeline is synchronous and has no prefetching. Once asynchronous execution and prefetching are enabled, the scope guard of the context manager no longer works.
- We may need to attach a reference counter to these file byte streams so that each stream closes itself.
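The consumer-side fix (closing each stream with a context manager) can be sketched as follows. `load_streams` is a hypothetical generator, and `io.BytesIO` stands in for real file handles, so this sketch itself would not trigger a `ResourceWarning`; it only shows the ownership pattern.

```python
import io


def load_streams(payloads):
    """Yield (name, stream) pairs; the consumer owns each stream and must close it."""
    for name, data in payloads.items():
        yield name, io.BytesIO(data)


payloads = {"a.bin": b"abc", "b.bin": b"de"}
results = {}
seen_streams = []

for name, stream in load_streams(payloads):
    seen_streams.append(stream)
    # The context manager closes the stream as soon as the block exits, rather
    # than leaving the cleanup to the garbage collector.
    with stream:
        results[name] = stream.read()
```

The `with` block lives in the consumer, not in the generator, which matches the observation above that a scope guard inside the producing DataPipe would not survive prefetching.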
Test Plan: Imported from OSS
Reviewed By: jbschlosser
Differential Revision: D28689862
Pulled By: ejguan
fbshipit-source-id: bb2a85defb8a4ab5384db902ef6ad062185c2653
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55836
Rename `construct_time_validation` to `argument_validation`, since users should have the flexibility to apply this decorator to any function that requires type validation.
It can still work as a construct-time validation:
```py
class ExampleDataPipe(IterDataPipe):
    @argument_validation
    def __init__(self, dp: IterDataPipe[int]):
        self.dp = dp
    ...
```
Notebook is also updated.
Test Plan: Imported from OSS
Reviewed By: VitalyFedyunin
Differential Revision: D27743478
Pulled By: ejguan
fbshipit-source-id: 49743152d121028cd7d72d89dc7df5c7c7b94c41
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57824
Implement type check for string type. Re-raise detailed exception at compile time.
```py
>>> class InvalidData(Generic[T_co], NamedTuple):  # Invalid generic namedtuple in Python typing
...     name: str
...     data: T_co
>>> class DP(IterDataPipe['InvalidData[int]']):
...     pass
TypeError: InvalidData[int] is not supported by Python typing
```
Add `__type_class__` attribute to class, which optimizes the static checking flow by reducing checking times.
```py
>>> class DP1(IterDataPipe[Union[int, str]]):
...     pass
>>> class DP2(DP1[int]):
...     pass
>>> list((cls, getattr(cls, '__type_class__', None)) for cls in DP2.__mro__)
[(<class '__main__.DP2'>, False), (<class 'abc.DP1[int]'>, True), (<class '__main__.DP1'>, False), (<class 'abc.IterableDataset[typing.Union[int, str]]'>, True), (<class 'torch.utils.data.dataset.IterableDataset'>, False), (<class 'torch.utils.data.dataset.Dataset'>, None), (<class 'typing.Generic'>, None), (<class 'object'>, None)]
```
Among the classes in `DP2`'s MRO, only `DP2` and `DP1` are statically checked, because their `__type_class__` is `False`. `abc.DP1[int]` and `abc.IterableDataset[typing.Union[int, str]]` are ignored since they are just classes with typing.
## Future
When Python 3.6 is deprecated, using TypeAlias rather than TypeMeta can eliminate the usage of the `__type_class__` attribute.
Test Plan: Imported from OSS
Reviewed By: VitalyFedyunin
Differential Revision: D28289104
Pulled By: ejguan
fbshipit-source-id: 1da97460c8bfc48cea7396033fde484a24caba7c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54544
## Feature
- Add `subinstance(data, type)` to check whether `data` is an instance of a subtype of `type`
- Add a `runtime_validation` decorator to validate that the data returned from `__iter__` is an instance of a subtype of the type hint.
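A heavily simplified sketch of validating yielded items against the return hint of `__iter__` is shown below. `validate_yielded_items` and `Numbers` are hypothetical names, only the `Iterator[SomeClass]` case is handled, and the real `runtime_validation` and `subinstance` may differ in both behavior and error type.

```python
import functools
from typing import Iterator, get_type_hints


def validate_yielded_items(iter_method):
    """Hypothetical decorator: check each yielded item against the Iterator[...] hint."""
    hint = get_type_hints(iter_method).get("return")
    # Only the simple case Iterator[SomeClass] is handled in this sketch.
    (item_type,) = hint.__args__

    @functools.wraps(iter_method)
    def wrapper(self):
        for item in iter_method(self):
            if not isinstance(item, item_type):
                raise RuntimeError(
                    "expected an instance of {}, got {!r}".format(item_type.__name__, item)
                )
            yield item

    return wrapper


class Numbers:
    def __init__(self, items):
        self.items = items

    @validate_yielded_items
    def __iter__(self) -> Iterator[int]:
        yield from self.items
```

Iterating `Numbers([1, 2, 3])` passes through unchanged, while `Numbers([1, "x"])` raises once the string is reached.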
Test Plan: Imported from OSS
Reviewed By: VitalyFedyunin
Differential Revision: D27327234
Pulled By: ejguan
fbshipit-source-id: fb6a332762b0fe75284bb2b52a13ed171b42558c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54066
## Feature
- Add a decorator `construct_time_validation` to validate each input datapipe according to the corresponding type hint.
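The idea of validating inputs against type hints at construction time can be sketched with the standard library alone. `validate_arguments_sketch` and `Repeater` are hypothetical, and the sketch only checks plain-class hints with `isinstance`; it does not reproduce how `construct_time_validation` inspects input datapipes.

```python
import functools
import inspect
from typing import get_type_hints


def validate_arguments_sketch(fn):
    """Hypothetical decorator: check call arguments against plain-class type hints."""
    signature = inspect.signature(fn)
    hints = get_type_hints(fn)

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            expected = hints.get(name)
            # Only plain classes are checked here; generic hints are skipped.
            if isinstance(expected, type) and not isinstance(value, expected):
                raise TypeError(
                    "argument '{}' expected {}, got {}".format(
                        name, expected.__name__, type(value).__name__
                    )
                )
        return fn(*args, **kwargs)

    return wrapper


class Repeater:
    @validate_arguments_sketch
    def __init__(self, source: list, times: int):
        self.source = source
        self.times = times
```

Applied to `__init__`, the check runs before any attribute is set, so an invalid argument never produces a half-built instance.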
Test Plan: Imported from OSS
Reviewed By: VitalyFedyunin
Differential Revision: D27327236
Pulled By: ejguan
fbshipit-source-id: a9d4c6edb5b05090bd5a369eee50a6fb4d7cf957
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54020
## Feature
- Add `issubtype` to check whether one type is a subtype of another type.
- Add `_DataPipeMeta` (mimic Python typing 3.6)
- Add `type` attribute for each DataPipe
- Save original `__init__` function for each DataPipe
- Validate return hint of `__iter__`
- Replace `__init__` function based on `type`
  - Fixed type: Put original `__init__` back if it exists, or use a plain `__init__`
  - Non-fixed type: Add new `__init__` with the functionality to copy `cls.type` for each instance. (Optimized for memory)
No errors for the main repo, `torchvision`, `torchaudio` and `torchtext`.
## Future
- Add same thing for `__getitem__`.
- When DataFrame comes out, add another type for DataFrame with column names and types.
Test Plan: Imported from OSS
Reviewed By: VitalyFedyunin
Differential Revision: D27327232
Pulled By: ejguan
fbshipit-source-id: fd3a6029c16f5d814b1d7e1b1566fdcd8fd1ad9a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54299
## Feature
- Check whether a type is a subtype of another type
Prerequisite for the DataPipe typing system.
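To make the notion of a subtype check concrete, here is a simplified sketch covering plain classes, `Union`, and subscripted `List`. `is_subtype_sketch` is a hypothetical function, it treats `List` element types covariantly as a simplification, and it is not the `issubtype` implementation from this PR.

```python
from typing import List, Union, get_args, get_origin


def is_subtype_sketch(left, right):
    """Hypothetical, simplified subtype check for plain classes, Union and List[X]."""
    # Every member of a Union on the left must fit the right-hand type.
    if get_origin(left) is Union:
        return all(is_subtype_sketch(arg, right) for arg in get_args(left))
    # A Union on the right is satisfied if any member accepts the left type.
    if get_origin(right) is Union:
        return any(is_subtype_sketch(left, arg) for arg in get_args(right))
    # Two List[X] hints are compared by their element types.
    if get_origin(left) is list and get_origin(right) is list:
        return is_subtype_sketch(get_args(left)[0], get_args(right)[0])
    # List[X] fits the bare builtin list, but not the other way around.
    if get_origin(left) is list:
        return right is list
    if get_origin(right) is list:
        return False
    return issubclass(left, right)
```

For example, `int` is accepted where `Union[int, str]` is expected, but `Union[int, str]` is not accepted where `int` is expected.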
Test Plan: Imported from OSS
Reviewed By: VitalyFedyunin
Differential Revision: D27327235
Pulled By: ejguan
fbshipit-source-id: 8f50a663a86540677c9e132ac7c5216fdac46f70
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52141
Remove BufferShuffleDataSet, as it is not used anywhere within PyTorch (a GitHub search found no usage) and it is not included in the PyTorch 1.7.1 release.
Test Plan: Imported from OSS
Reviewed By: H-Huang
Differential Revision: D26710940
Pulled By: ejguan
fbshipit-source-id: 90023b4bfb105d6aa392753082100f9181ecebd0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52104
Make the API of `SamplerIterDataPipe` more reasonable with `sampler_args` and `sampler_kwargs`.
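The shape of such an API might look like the sketch below, in which the pipe receives a sampler class plus the positional and keyword arguments to forward to it. `SamplerPipeSketch` and `RandomOrderSampler` are hypothetical stand-ins, not the `SamplerIterDataPipe` code itself.

```python
import random


class RandomOrderSampler:
    """Hypothetical sampler that yields the source items in a seeded random order."""

    def __init__(self, source, seed=0):
        self.source = source
        self.seed = seed

    def __iter__(self):
        items = list(self.source)
        random.Random(self.seed).shuffle(items)
        return iter(items)


class SamplerPipeSketch:
    """Hypothetical pipe that builds its sampler from a class plus forwarded arguments."""

    def __init__(self, source, sampler=RandomOrderSampler, sampler_args=None, sampler_kwargs=None):
        self.sampler_args = () if sampler_args is None else sampler_args
        self.sampler_kwargs = {} if sampler_kwargs is None else sampler_kwargs
        # The source is always passed first; extra arguments are forwarded as given.
        self.sampler = sampler(source, *self.sampler_args, **self.sampler_kwargs)

    def __iter__(self):
        return iter(self.sampler)


pipe = SamplerPipeSketch(range(5), sampler_kwargs={"seed": 123})
```

Passing the sampler class rather than an instance lets the pipe supply the data source itself, so callers only describe how the sampler should be configured.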
Test Plan: Imported from OSS
Reviewed By: glaringlee
Differential Revision: D26401494
Pulled By: ejguan
fbshipit-source-id: ee5b5c414782d0880b12968bc9c8aa470b753f6a