Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70367
This PR renames the `FileLoaderIterDataPipe` to `FileOpenerIterDataPipe`. For the sake of not breaking many CI tests immediately, it still preserves `FileLoader` as an alias. This will allow downstream libraries/users to migrate their use cases before we fully remove all references to `FileLoader` from PyTorch.
Fixes https://github.com/pytorch/data/issues/103. More detailed discussion about this decision is also in the linked issue.
cc VitalyFedyunin ejguan NivekT pmeier Nayef211
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D33301648
Pulled By: NivekT
fbshipit-source-id: 59278dcd44e372df0ba2001a4eecbf9792580d0b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69391
As part of the efforts to unify the APIs across different data backends (e.g. TorchData, TorchArrow), we are making changes to different DataPipes' APIs. In this PR, we are removing the input argument `nesting_level` from `FilterIterDataPipe`.
cc VitalyFedyunin ejguan NivekT
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D32849462
Pulled By: NivekT
fbshipit-source-id: 91cf1dc03dd3d3cbd7a9c6ccbd791ade91355f30
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69390
As part of the efforts to unify the APIs across different data backends (e.g. TorchData, TorchArrow), we are making changes to different DataPipes' APIs. In this PR, we are removing the input argument `nesting_level` from `MapperIterDataPipe`.
cc VitalyFedyunin ejguan NivekT
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D32849465
Pulled By: NivekT
fbshipit-source-id: 963ce70b84a7658331d126e5ed9fdb12273c8e1f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66277
Previously, it is grouped together with tests related to `MapDataPipe`, but it should be with `IterDataPipe`.
cc VitalyFedyunin ejguan NivekT
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D31485823
Pulled By: NivekT
fbshipit-source-id: d13d8c28cbfc305da0e3033d4109a0f971281a02
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66275
Once this is added to Core, TorchData's PR will not need a custom class and can use this wrapper instead.
cc VitalyFedyunin ejguan NivekT
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D31485822
Pulled By: NivekT
fbshipit-source-id: 790de27629c89c0ca7163a8ee5a09ee8b8233340
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65220Fixes#65221
- Remove deepcopy from Mapper to support file handles
- Convert `IterableWrapper` to deepcopy iterable instance within each iterator to prevent in-place modification (different data per epoch)
- Convert `IDP` to `IterableWrapper` in test_datapipe.py
- Refine the variable names (prevent using `dp` that is module reference)
Test Plan: Imported from OSS
Reviewed By: malfet
Differential Revision: D31021886
Pulled By: ejguan
fbshipit-source-id: 72a9eee66c758e2717d591cd0942892bddedc223
Summary:
ghstack is not working for the second commit so I'm manually creating this PR for now. Please only look at changes related to the second commit in this PR (there is a PR for the first commit).
This PR removes TarArchiveReader's dependency on FileLoader DataPipe, by allowing it to use a IterDataPipe of path names as input rather than a tuple of path name and a stream.
It also adds additional tests to ensure that the DataPipe is functioning properly when it is read multiple times or reset half way through reading.
The whole stack fixes https://github.com/pytorch/pytorch/issues/64281 - issues related to unclosed buffer stream.
Stack:
* __->__ https://github.com/pytorch/pytorch/issues/64788
* https://github.com/pytorch/pytorch/issues/64786
cc VitalyFedyunin ejguan
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64788
Reviewed By: jbschlosser, ejguan
Differential Revision: D30901176
Pulled By: NivekT
fbshipit-source-id: 59746a8d0144fc6d3ce0feb2d76445b82e6d414e
Summary:
There are two warnings produced by `test_fork_datapipe`. This PR addresses the issues raised by those warnings without impacting the test cases.
cc VitalyFedyunin ejguan
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64827
Reviewed By: ejguan
Differential Revision: D30870528
Pulled By: NivekT
fbshipit-source-id: 580a001c6fa3ff6f8b04a7e5183e58861938204b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64404
This PR remove `filter`'s inheritance from `map`. This allows `filter` to not have a `__len__` function and that behavior is what we would like.
cc VitalyFedyunin ejguan
Test Plan: Imported from OSS
Reviewed By: gchanan
Differential Revision: D30713120
Pulled By: NivekT
fbshipit-source-id: 4d5d07555297ee2bd4b49842c0d26cdc00638f6c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64220
Remove `ByKeyGrouperIterDataPipe` due to duplicated functionality.
Fix a bug in `GrouperIterDataPipe` using the existing test.
Test Plan: Imported from OSS
Reviewed By: VitalyFedyunin
Differential Revision: D30650542
Pulled By: ejguan
fbshipit-source-id: 666b4d28282fb4f49f3ff101b8d08be16a50d836
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63422Fixes#63095
Make `DataChunk` delegate to list method. Then it will support in-place operations:
- `sort`
- `reverse`
- `append`
- `extend`
- `random.shuffle`
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D30379027
Pulled By: ejguan
fbshipit-source-id: d176bd0cc8b89b915c7bb184ff243ab1f605616d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62768
This is part of TorchArrow DF support preparation, separating it to multiple PRs to simplify review process.
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D30149090
Pulled By: VitalyFedyunin
fbshipit-source-id: a36b5ff56e2ac6b06060014d4cd41b487754acb8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61312
Sorting according to isort output. Alphabetically ordered one per line imports help merging.
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D29588833
Pulled By: VitalyFedyunin
fbshipit-source-id: 4c80c3086132b50894e734ad6c5799d78d689e42
Summary:
As part of https://github.com/pytorch/pytorch/issues/57031, this PR adds the ConcatMapDataPipe functional datapipe for the MapDataPipe class.
We may need to discuss how to treat the datapipes with no valid length. For now, I just use them as if they have infinite length and the `__getitem__` could not go pass them.
Thank you for your time on reviewing this~
cc ejguan
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61010
Reviewed By: soulitzer
Differential Revision: D29587679
Pulled By: ejguan
fbshipit-source-id: 5eb97fa727209bec6c534520057c64a78000626e
Summary:
Fixes issues that are discussed with ezyang in the comments of PR https://github.com/pytorch/pytorch/issues/59498
Improved code and documentation clarity, and refactored .filter to nesting_level directly
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60423
Reviewed By: ezyang
Differential Revision: D29281599
Pulled By: NivekT
fbshipit-source-id: a9bbaf52f492db0741c00f3ceb4022b08ddb1506
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59816
Add two new DataPipes, one for getting web file urls to yield streams and one for getting streams to yield bytes.
Test Plan:
Add test_web_iterable_datapipe in test/test_datapipes.py. The test initiates a local http server for serving test files. Test below locally ok.
1. create and load 16M localhost file urls (each of size 10 Bytes)
2. create and load a 64GB localhost file
in the unit test, for sake of testing time, disabling both stress test and large file test
Imported from OSS
Reviewed By: VitalyFedyunin
Differential Revision: D29051186
fbshipit-source-id: f8e44491e670560bf445af96f94d98230436f396