165 Commits

Author SHA1 Message Date
Kevin Tse
8ebe1a924d [DataPipe] moving mux IterDataPipe test to the right location (#66277)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66277

Previously, this test was grouped with the tests for `MapDataPipe`, but it belongs with the `IterDataPipe` tests.

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D31485823

Pulled By: NivekT

fbshipit-source-id: d13d8c28cbfc305da0e3033d4109a0f971281a02
2021-10-08 08:32:29 -07:00
Kevin Tse
ed17851642 [DataPipe] adding test for IterableWrapperIterDataPipe (#66276)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66276

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D31485824

Pulled By: NivekT

fbshipit-source-id: c7b21636e4b17e264bfb5dbea69cd3c477472f0b
2021-10-08 08:32:26 -07:00
Kevin Tse
e808e3d3d6 [DataPipe] adding SequenceWrapperMapDataPipe (#66275)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66275

Once this is added to Core, TorchData's PR will not need a custom class and can use this wrapper instead.

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D31485822

Pulled By: NivekT

fbshipit-source-id: 790de27629c89c0ca7163a8ee5a09ee8b8233340
2021-10-08 08:32:24 -07:00
Erjia Guan
a1216061c1 [DataPipe] Fix deepcopy filehandle for Mapper and in-place modification for IterableWrapper (#65220)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65220

Fixes #65221

- Remove deepcopy from Mapper to support file handles
- Convert `IterableWrapper` to deepcopy iterable instance within each iterator to prevent in-place modification (different data per epoch)
- Convert `IDP` to `IterableWrapper` in test_datapipe.py
- Refine the variable names (prevent using `dp` that is module reference)
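
The deepcopy change to `IterableWrapper` can be illustrated with a minimal, PyTorch-free sketch. The class name `DeepCopyIterableWrapper` is hypothetical and this is not the actual implementation; it only shows why copying the wrapped iterable inside `__iter__` keeps every epoch's data identical even when a consumer mutates items in place.

```python
import copy


class DeepCopyIterableWrapper:
    """Hypothetical stand-in for IterableWrapper, not the PyTorch class."""

    def __init__(self, iterable):
        self.iterable = iterable

    def __iter__(self):
        # Copy per iterator so in-place edits never reach the source data.
        for item in copy.deepcopy(self.iterable):
            yield item


source = [[1, 2], [3, 4]]
dp = DeepCopyIterableWrapper(source)

first_epoch = []
for row in dp:
    row.append(0)  # in-place modification by a downstream consumer
    first_epoch.append(row)

# A second pass starts again from an untouched copy of the source.
second_epoch = list(dp)
```

Without the copy, the second pass would see the rows that the first pass had already modified.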

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D31021886

Pulled By: ejguan

fbshipit-source-id: 72a9eee66c758e2717d591cd0942892bddedc223
2021-09-21 14:29:40 -07:00
Erjia Guan
cf60d24028 [DataPipe] Unlimited buffer for Forker and Demultiplexer (#64994)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64994

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D30934362

Pulled By: ejguan

fbshipit-source-id: d3b774d7e28c0b9659e999511e5a68c3929857d4
2021-09-20 09:30:39 -07:00
Kevin Tse
c625f971d3 [DataPipe] Make TarArchiveReader and ZipArchiveReader accept FileStream with attempt to close and additional warning (#64788)
Summary:
ghstack is not working for the second commit so I'm manually creating this PR for now. Please only look at changes related to the second commit in this PR (there is a PR for the first commit).

This PR removes TarArchiveReader's dependency on the FileLoader DataPipe by allowing it to take an IterDataPipe of path names as input rather than a tuple of path name and stream.
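
As a rough illustration of reading archives from path names alone, the following standard-library sketch opens each tar file itself and closes every handle it creates. The function name `read_tar_members` and the sample file names are hypothetical, and this is not the `TarArchiveReader` code.

```python
import io
import os
import tarfile
import tempfile


def read_tar_members(paths):
    """Yield (member path, bytes) for every regular file in each archive."""
    for path in paths:
        # The archive is opened from the path and closed when the block exits.
        with tarfile.open(path) as archive:
            for member in archive:
                if member.isfile():
                    with archive.extractfile(member) as extracted:
                        yield os.path.join(path, member.name), extracted.read()


with tempfile.TemporaryDirectory() as tmp:
    tar_path = os.path.join(tmp, "sample.tar")
    # Build a one-file archive to read back.
    with tarfile.open(tar_path, "w") as archive:
        payload = b"hello"
        info = tarfile.TarInfo(name="a.txt")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    results = list(read_tar_members([tar_path]))
```

Because the reader owns the file handles, no stream has to be passed in from an upstream loader and left for the garbage collector to close.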

It also adds tests to ensure that the DataPipe functions properly when it is read multiple times or reset halfway through reading.

The whole stack fixes https://github.com/pytorch/pytorch/issues/64281 - issues related to unclosed buffer stream.

Stack:
* __->__ https://github.com/pytorch/pytorch/issues/64788
* https://github.com/pytorch/pytorch/issues/64786

cc VitalyFedyunin ejguan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64788

Reviewed By: jbschlosser, ejguan

Differential Revision: D30901176

Pulled By: NivekT

fbshipit-source-id: 59746a8d0144fc6d3ce0feb2d76445b82e6d414e
2021-09-15 07:34:29 -07:00
Erjia Guan
c65128679b [DataPipe] Improve Mapper to accept input/output index when apply fn (#64951)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64951

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D30910035

Pulled By: ejguan

fbshipit-source-id: d687fe10939920a3617a60552fe743e8526438a0
2021-09-14 15:46:42 -07:00
Vitaly Fedyunin
ab5e1c69a7 [WIP] Example of DataPipes and DataFrames integration (#60840)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60840

Test Plan: Imported from OSS

Reviewed By: wenleix, ejguan

Differential Revision: D29461080

Pulled By: VitalyFedyunin

fbshipit-source-id: 4909394dcd39e97ee49b699fda542b311b7e0d82
2021-09-13 18:50:15 -07:00
Kevin Tse
f3f410880a [DataPipe] Remove ZipArchiveReader's dependency on FileLoader (#64786)
Summary:
Stack from [ghstack](https://github.com/ezyang/ghstack):
* https://github.com/pytorch/pytorch/issues/64788
* __->__ https://github.com/pytorch/pytorch/issues/64786

This PR removes ZipArchiveReader's dependency on the FileLoader DataPipe by allowing it to take an IterDataPipe of path names as input rather than a tuple of path name and stream.

It also adds tests to ensure that the DataPipe functions properly when it is read multiple times or reset halfway through reading.

The whole stack fixes issues related to unclosed buffer stream (see https://github.com/pytorch/pytorch/issues/64281).

cc VitalyFedyunin ejguan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64786

Reviewed By: ngimel

Differential Revision: D30870968

Pulled By: NivekT

fbshipit-source-id: 64b04d1697b99772f2fa20fc141668e6b8e18c41
2021-09-10 16:49:17 -07:00
Kevin Tse
5060b69d62 [DataPipe] fixing tests related to fork() to remove warnings (#64827)
Summary:
There are two warnings produced by `test_fork_datapipe`. This PR addresses the issues raised by those warnings without impacting the test cases.

cc VitalyFedyunin ejguan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64827

Reviewed By: ejguan

Differential Revision: D30870528

Pulled By: NivekT

fbshipit-source-id: 580a001c6fa3ff6f8b04a7e5183e58861938204b
2021-09-10 11:01:42 -07:00
Kevin Tse
4ce9c530d6 [DataPipe] removing filter's inheritance from map (#64404)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64404

This PR removes `filter`'s inheritance from `map`, so `filter` no longer has a `__len__` function, which is the intended behavior.
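
Why a filter should not expose `__len__` can be seen in a small PyTorch-free sketch. `SketchMapper` and `SketchFilter` are hypothetical names, not the real classes: a mapper emits one output per input and can forward the source length, while a filter's output count depends on the predicate and is unknown until the data has been consumed.

```python
class SketchMapper:
    def __init__(self, source, fn):
        self.source, self.fn = source, fn

    def __iter__(self):
        return (self.fn(x) for x in self.source)

    def __len__(self):
        # One output per input, so the source length is still valid.
        return len(self.source)


class SketchFilter:
    # Deliberately does not inherit from SketchMapper: no __len__ is defined.
    def __init__(self, source, predicate):
        self.source, self.predicate = source, predicate

    def __iter__(self):
        return (x for x in self.source if self.predicate(x))


mapped = SketchMapper(range(10), lambda x: x * 2)
evens = SketchFilter(range(10), lambda x: x % 2 == 0)
```

Calling `len(evens)` raises `TypeError`, whereas an inherited `__len__` would have reported the length of the unfiltered source.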

cc VitalyFedyunin ejguan

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D30713120

Pulled By: NivekT

fbshipit-source-id: 4d5d07555297ee2bd4b49842c0d26cdc00638f6c
2021-09-02 13:09:47 -07:00
Kevin Tse
4f43480186 [DataPipe] adding/removing __len__ for different DataPipe (#64398)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64398

cc VitalyFedyunin ejguan

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D30710437

Pulled By: NivekT

fbshipit-source-id: 524eda43a2faa0db0c1a662bf9bb4283f0ade83c
2021-09-02 13:08:32 -07:00
Kevin Tse
491bf7cb74 [DataPipe] adding description, __len__, tests for mux() (#64224)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64224

cc VitalyFedyunin ejguan

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D30651551

Pulled By: NivekT

fbshipit-source-id: f8af98ba71a592900b992a8077432062ec57bb48
2021-08-31 14:34:28 -07:00
Kevin Tse
0ef8760bf6 [DataPipe] implementing __len__ for fork (no valid length for demux) (#64215)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64215

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D30648672

Pulled By: NivekT

fbshipit-source-id: 4780f2f6a79ae15a4009092475e7d92f96dd09a2
2021-08-31 08:32:31 -07:00
Kevin Tse
0deb7a0bc0 [DataPipe] implementing demux() (#63650)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63650

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D30493944

Pulled By: NivekT

fbshipit-source-id: 0aa06dee8c7fb1744975b8f6a0694b90c11ef80d
2021-08-31 08:32:29 -07:00
Kevin Tse
eee054e6ea [DataPipe] implementing fork() (#63649)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63649

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D30493945

Pulled By: NivekT

fbshipit-source-id: 40db7d4134facd266d86bc0dc2edf2729c4e5842
2021-08-31 08:32:27 -07:00
Erjia Guan
af85bc5ffd Replace group_by_key by group_by IterDataPipe (#64220)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64220

Remove `ByKeyGrouperIterDataPipe` due to duplicated functionality.
Fix a bug in `GrouperIterDataPipe` using the existing test.

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D30650542

Pulled By: ejguan

fbshipit-source-id: 666b4d28282fb4f49f3ff101b8d08be16a50d836
2021-08-30 18:45:44 -07:00
Erjia Guan
7946f8a9f6 Rename DataPipe to Op-er (#63325)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63325

Rename each DataPipe to an operation name ending in "er". The functional API should remain a verb, such as `read_from_tar`, `shuffle`, ... (discussed [here](https://github.com/facebookexternal/torchdata/pull/97#discussion_r688553905))
- Batch -> Batcher
- Collate -> Collator
- Concat -> Concater
- GroupByKey -> ByKeyGrouper ?
- ListDirFiles -> FileLister
- LoadFilesFromDisk -> FileLoader
- Map -> Mapper
- ReadFilesFromTar -> TarArchiveReader
- ReadFilesFromZip -> ZipArchiveReader
- ReadLinesFromFile -> LineReader
- Shuffle -> Shuffler
- ToBytes -> StreamReader
- Transforms -> Transformer
- Zip -> Zipper

Let me know if you have a better name for any DataPipe.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D30466950

Pulled By: ejguan

fbshipit-source-id: 72909dca7b3964ab83b965891f96cc1ecf62d049
2021-08-23 14:36:10 -07:00
Erjia Guan
383a33a0eb Make DataChunk support list in-place ops (#63422)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63422

Fixes #63095

Make `DataChunk` delegate to list methods, so that it supports in-place operations:
- `sort`
- `reverse`
- `append`
- `extend`
- `random.shuffle`
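
One way to get this behavior is to subclass `list`, as in the sketch below. Whether the PR does exactly this is an assumption; `SketchChunk` is a hypothetical stand-in, not the real `DataChunk`.

```python
import random


class SketchChunk(list):
    """Hypothetical stand-in for DataChunk; subclassing list is an assumption."""


chunk = SketchChunk([3, 1, 2])
chunk.sort()          # [1, 2, 3]
chunk.reverse()       # [3, 2, 1]
chunk.append(0)       # [3, 2, 1, 0]
chunk.extend([5, 4])  # [3, 2, 1, 0, 5, 4]
snapshot = list(chunk)

# random.shuffle works on any mutable sequence, so it reorders the chunk in place.
random.shuffle(chunk)
```

Every operation mutates the same object, so the chunk keeps its own type throughout.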

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D30379027

Pulled By: ejguan

fbshipit-source-id: d176bd0cc8b89b915c7bb184ff243ab1f605616d
2021-08-18 08:48:47 -07:00
Erjia Guan
d1cbee7b2b Refactor BucketBatch (#63185)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63185

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D30288893

Pulled By: ejguan

fbshipit-source-id: b88b792d12a83c99d8ea9e516e3b4c54a82100f6
2021-08-16 06:42:56 -07:00
Erjia Guan
56d609d93e Replace str by repr for DataChunk (#63184)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63184

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D30288892

Pulled By: ejguan

fbshipit-source-id: 45c88fdd3987e234f2c22ebbbfd8d5044983c34c
2021-08-16 06:41:38 -07:00
Shen Li
1022443168 Revert D30279364: [codemod][lint][fbcode/c*] Enable BLACK by default
Test Plan: revert-hammer

Differential Revision:
D30279364 (b004307252)

Original commit changeset: c1ed77dfe43a

fbshipit-source-id: eab50857675c51e0088391af06ec0ecb14e2347e
2021-08-12 11:45:01 -07:00
Zsolt Dollenstein
b004307252 [codemod][lint][fbcode/c*] Enable BLACK by default
Test Plan: manual inspection & sandcastle

Reviewed By: zertosh

Differential Revision: D30279364

fbshipit-source-id: c1ed77dfe43a3bde358f92737cd5535ae5d13c9a
2021-08-12 10:58:35 -07:00
Vitaly Fedyunin
d3bdf345cb Introducing DataChunk for DataPipes batching (#62768)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62768

This is part of the preparation for TorchArrow DF support, split into multiple PRs to simplify the review process.

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D30149090

Pulled By: VitalyFedyunin

fbshipit-source-id: a36b5ff56e2ac6b06060014d4cd41b487754acb8
2021-08-06 08:38:33 -07:00
Vitaly Fedyunin
4ef640d6f6 Sort imports of test_datapipe.py (#61312)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61312

Sorted according to isort output. Alphabetically ordered, one-per-line imports make merging easier.

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D29588833

Pulled By: VitalyFedyunin

fbshipit-source-id: 4c80c3086132b50894e734ad6c5799d78d689e42
2021-07-12 15:33:20 -07:00
Vitaly Fedyunin
fd13e925ec Adding backward compatibility for sharding support in old DataLoader (#61237)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61237

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D29588832

Pulled By: VitalyFedyunin

fbshipit-source-id: 3bfa4417f6a04450f656ecf28fc95322d2cf076a
2021-07-12 14:53:45 -07:00
Vitaly Fedyunin
d3cb065b2f Implement usage of is_shardable and apply_sharding (#61236)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61236

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D29588835

Pulled By: VitalyFedyunin

fbshipit-source-id: 00c3042f96af498637b2dcf6e3f842c1fc05ddd8
2021-07-12 14:23:20 -07:00
Vitaly Fedyunin
f2857883c4 Add DataPipes Graph Functions (#61235)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61235

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D29588834

Pulled By: VitalyFedyunin

fbshipit-source-id: e0331d6e1fc2a3f8b6211aac83965bcf13165161
2021-07-12 10:28:35 -07:00
Vitaly Fedyunin
99959fe3f5 [DataLoader] Adding demux and mux DataPipe-s (#61234)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61234

* **#61234 [WIP] Adding demux and mux DataPipe API examples**

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D29588836

Pulled By: VitalyFedyunin

fbshipit-source-id: 523d12ea6be7507d706b4c6d8827ec1ac4ccabc3
2021-07-12 10:04:03 -07:00
zilinzhu
c19adfff54 [DataLoader] Introduce ConcatMapDataPipe functional datapipe (#61010)
Summary:
As part of https://github.com/pytorch/pytorch/issues/57031, this PR adds the ConcatMapDataPipe functional datapipe for the MapDataPipe class.

We may need to discuss how to treat datapipes with no valid length. For now, I treat them as if they had infinite length, so `__getitem__` cannot go past them.

Thank you for your time reviewing this.

cc ejguan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61010

Reviewed By: soulitzer

Differential Revision: D29587679

Pulled By: ejguan

fbshipit-source-id: 5eb97fa727209bec6c534520057c64a78000626e
2021-07-09 09:29:18 -07:00
Vitaly Fedyunin
a652398465 [DataLoader] Rename transform DataPipe to legacy_transform (#60670)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60670

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D29461081

Pulled By: VitalyFedyunin

fbshipit-source-id: 57f53a91db9032a6126e86243ddea9149c473060
2021-06-30 09:49:14 -07:00
Kevin Tse
df8a8fbc1b Improve code and documentation clarity for DataPipes APIs (#60423)
Summary:
Fixes issues discussed with ezyang in the comments of PR https://github.com/pytorch/pytorch/issues/59498

Improved code and documentation clarity, and refactored .filter to nesting_level directly

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60423

Reviewed By: ezyang

Differential Revision: D29281599

Pulled By: NivekT

fbshipit-source-id: a9bbaf52f492db0741c00f3ceb4022b08ddb1506
2021-06-22 11:19:08 -07:00
Jiong Gu
a120a12ab4 [Bootcamp][pytorch]Add WebIterDataPipe and ToBytesIterDataPipe to the datapipes. (#59816)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59816

Add two new DataPipes: one takes web file URLs and yields streams, and the other takes streams and yields bytes.

Test Plan:
Add test_web_iterable_datapipe in test/test_datapipes.py. The test starts a local HTTP server to serve test files. The following tests passed locally:
1. Create and load 16M localhost file URLs (each of size 10 bytes)
2. Create and load a 64GB localhost file

Both the stress test and the large-file test are disabled in the unit test to keep testing time short.

Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D29051186

fbshipit-source-id: f8e44491e670560bf445af96f94d98230436f396
2021-06-15 11:43:26 -07:00
Erjia Guan
e7ad82eb2f [DataLoader] Add option to refine type during runtime validation for DP instance (#56066)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56066

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D27776646

Pulled By: ejguan

fbshipit-source-id: 695ff7775177653d809c5917d938c706281e1298
2021-06-10 14:04:24 -07:00
Kevin Tse
fa030d1213 [DataPipes] Add simple unbatch to DataPipe (#59610)
Summary:
Implements the simple unbatch feature for DataPipe https://github.com/pytorch/pytorch/issues/58148

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59610

Reviewed By: VitalyFedyunin

Differential Revision: D28994180

Pulled By: NivekT

fbshipit-source-id: 4bafe6e26c4f95a808c489b147369413a196fa1c
2021-06-09 16:53:31 -07:00
Kevin Tse
12b4e8996f [DataLoader] Add nesting_level argument to map and filter (#59498)
Summary:
This PR implements the .map and .filter APIs for IterDataPipe.

[DataPipes] Make .map of DataPipe sensitive to nested_level argument https://github.com/pytorch/pytorch/issues/58145
[DataPipes] Make .filter of DataPipe sensitive to nested_level argument https://github.com/pytorch/pytorch/issues/58147

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59498

Reviewed By: ejguan

Differential Revision: D28964280

Pulled By: NivekT

fbshipit-source-id: b1ee6cafa3953093ebd7bf30eacc80c3ef7cd190
2021-06-09 07:40:53 -07:00
Erjia Guan
5c7e14d2bc [DataLoader] Switch NotImplementedError to TypeError for len (#59464)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59464

Fixes #59378

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D28944447

Pulled By: ejguan

fbshipit-source-id: 8b3d53a1863b41e578d56f219e452d18d7eae0d8
2021-06-08 07:16:18 -07:00
Erjia Guan
1b578c4bf5 [DataLoader] Close byte stream explicitly (#58938)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58938

When running `test_datapipe.py`, Python's `gc` reports many `ResourceWarning`s due to unclosed streams. Besides being annoying, this has two potential problems:
- Performance regression, because `gc` requires additional memory and computation to track references
- Python's `gc` runs periodically, so we may hit a too-many-open-files error due to OS limits
To reduce the warnings:
- Explicitly close byte streams
- Modify `test_datapipe.py` to use a context manager
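
Both measures can be sketched with the standard library alone. The helper `read_all` is a hypothetical name and the in-memory `io.BytesIO` objects stand in for real file streams; this is an illustration, not the code from the PR.

```python
import io


def read_all(streams):
    """Yield each stream's bytes, closing the stream once it is consumed."""
    for stream in streams:
        try:
            yield stream.read()
        finally:
            stream.close()  # explicit close instead of waiting for gc


streams = [io.BytesIO(b"abc"), io.BytesIO(b"def")]
payloads = list(read_all(streams))

# Test-side pattern: a context manager closes the stream on exit.
with io.BytesIO(b"xyz") as handle:
    data = handle.read()
```

After both snippets run, every stream reports `closed == True`, so nothing is left for the garbage collector to warn about.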

Small fix:
- Reorder import in `test_datapipe.py`

Further investigation:
Can we use a context manager directly in `LoadFileFromDisk` and `ReadFileFromTar` to eliminate this error?
- Probably not. It is feasible only if the pipeline is synchronous and has no prefetching. When those features are enabled, the scope guard of the context manager no longer works.
- We may need to attach a reference counter to these file byte streams so that they close themselves.

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D28689862

Pulled By: ejguan

fbshipit-source-id: bb2a85defb8a4ab5384db902ef6ad062185c2653
2021-06-08 07:15:08 -07:00
Erjia Guan
0e16087064 [DataLoader] Fix bugs for typing (#58450)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58450

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D28507877

Pulled By: ejguan

fbshipit-source-id: f4051ff51ce77ef45214f11cba10c8a7e1da4dad
2021-05-24 07:14:40 -07:00
Marcio Porto
4942fe0290 [DataLoader] Introduce MapMapDataPipe functional datapipe (#58258)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58258

As part of https://github.com/pytorch/pytorch/issues/57031, this PR adds the `MapMapDataPipe` functional datapipe for the `MapDataPipe` class.

Usage:
```
def fn(x):
    return x * 10

dp = CountingDataset(n=10)
dp.map(fn)
```
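
The usage above does not define `CountingDataset`. A self-contained sketch of the idea, with hypothetical classes that are not the PyTorch implementation, applies the function lazily on each `__getitem__` call and forwards the source length:

```python
class CountingDataset:
    """Hypothetical map-style source: index i maps to value i."""

    def __init__(self, n):
        self.n = n

    def __getitem__(self, index):
        if not 0 <= index < self.n:
            raise IndexError(index)
        return index

    def __len__(self):
        return self.n


class SketchMapMap:
    """Hypothetical map-over-map wrapper: applies fn on every lookup."""

    def __init__(self, source, fn):
        self.source, self.fn = source, fn

    def __getitem__(self, index):
        return self.fn(self.source[index])

    def __len__(self):
        return len(self.source)


def fn(x):
    return x * 10


mapped = SketchMapMap(CountingDataset(n=10), fn)
values = [mapped[i] for i in range(len(mapped))]
```

Nothing is computed until an index is requested, which keeps random access cheap.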

Reviewed By: ejguan

Differential Revision: D28394510

fbshipit-source-id: 8d71b1f5723dff52385c3ce753944304896af678
2021-05-20 09:00:21 -07:00
Erjia Guan
3b977b3b4d [DataLoader] Add context manager for runtime type validation (#55936)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55936

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D27743476

Pulled By: ejguan

fbshipit-source-id: 8f0454ccf3ec37807598056433bff91013fa9bb9
2021-05-12 11:59:16 -07:00
Erjia Guan
5c696443c7 [DataLoader] Modify construct_time_validation to argument_validation (#55836)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55836

Change construct_time_validation to argument_validation, since users should have the flexibility to apply this decorator to any function that requires type validation.

It can still serve as construct-time validation:
```py
class ExampleDataPipe(IterDataPipe):
    @argument_validation
    def __init__(self, dp: IterDataPipe[int]):
        self.dp = dp

    ...
```
Notebook is also updated.

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D27743478

Pulled By: ejguan

fbshipit-source-id: 49743152d121028cd7d72d89dc7df5c7c7b94c41
2021-05-12 11:58:05 -07:00
Erjia Guan
b58a7c95aa [DataLoader] Raise detailed Error for ForwardRef type (#57824)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57824

Implement type checking for string (forward-reference) types. Re-raise a detailed exception at compile time.
```py
>>> class InvalidData(Generic[T_co], NamedTuple):  # Invalid generic namedtuple in Python typing
...     name: str
...     data: T_co

>>> class DP(IterDataPipe['InvalidData[int]']):
...     pass
TypeError: InvalidData[int] is not supported by Python typing
```

Add a `__type_class__` attribute to each class, which speeds up the static checking flow by reducing the number of checks.
```py
>>> class DP1(IterDataPipe[Union[int, str]]):
...     pass
>>> class DP2(DP1[int]):
...     pass
>>> list((cls, getattr(cls, '__type_class__', None)) for cls in DP2.__mro__)
[(<class '__main__.DP2'>, False), (<class 'abc.DP1[int]'>, True), (<class '__main__.DP1'>, False), (<class 'abc.IterableDataset[typing.Union[int, str]]'>, True), (<class 'torch.utils.data.dataset.IterableDataset'>, False), (<class 'torch.utils.data.dataset.Dataset'>, None), (<class 'typing.Generic'>, None), (<class 'object'>, None)]
```
Among the classes in `DP2`'s MRO, only `DP2` and `DP1`, whose `__type_class__` is `False`, are statically checked. `abc.DP1[int]` and `abc.IterableDataset[typing.Union[int, str]]` are skipped since they are merely classes carrying typing information.

## Future
When Python 3.6 is deprecated, using TypeAlias rather than TypeMeta can eliminate the use of the `__type_class__` attribute.

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D28289104

Pulled By: ejguan

fbshipit-source-id: 1da97460c8bfc48cea7396033fde484a24caba7c
2021-05-11 13:38:30 -07:00
Erjia Guan
ece15f6902 [DataLoader] Change Decoder signature and add MatHandler (#57391)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57391

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D28151601

Pulled By: ejguan

fbshipit-source-id: 34814197d2f068cab0c7ca2330152ad588eb1ef0
2021-05-10 06:29:00 -07:00
Sam Estep
75024e228c Add lint for unqualified type: ignore (#56290)
Summary:
The other half of https://github.com/pytorch/pytorch/issues/56272.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56290

Test Plan:
CI should pass on the tip of this PR, and we know that the lint works because the following CI runs (before this PR was finished) failed:

- https://github.com/pytorch/pytorch/runs/2384511062
- https://github.com/pytorch/pytorch/actions/runs/765036024

Reviewed By: seemethere

Differential Revision: D27867219

Pulled By: samestep

fbshipit-source-id: e648f07b6822867e70833e23ddafe7fb7eaca235
2021-04-21 08:07:23 -07:00
Erjia Guan
0b1c3dfae4 [DataLoader] Typing Enforcement for DataPipe at runtime (#54544)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54544

## Feature
- Add `subinstance(data, type)` to check whether `data` is an instance of a subtype of `type`
- Add a `runtime_validation` decorator to validate that the data returned from `__iter__` is a subtype instance of the hint
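
The idea of validating yielded data can be sketched as a small decorator. Everything here is a simplified assumption: `runtime_validation_sketch` is a hypothetical name, the expected type is passed explicitly instead of being read from a type hint, and a plain `isinstance` check stands in for `subinstance`.

```python
import functools


def runtime_validation_sketch(expected_type):
    """Hypothetical decorator: check every item yielded by __iter__."""

    def decorator(iter_fn):
        @functools.wraps(iter_fn)
        def wrapper(self):
            for item in iter_fn(self):
                if not isinstance(item, expected_type):
                    raise RuntimeError(
                        f"expected {expected_type.__name__}, "
                        f"got {type(item).__name__}"
                    )
                yield item

        return wrapper

    return decorator


class Numbers:
    def __init__(self, items):
        self.items = items

    @runtime_validation_sketch(int)
    def __iter__(self):
        yield from self.items


valid = list(Numbers([1, 2, 3]))
```

Iterating `Numbers([1, "a"])` yields `1` and then raises `RuntimeError` at the first item of the wrong type.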

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D27327234

Pulled By: ejguan

fbshipit-source-id: fb6a332762b0fe75284bb2b52a13ed171b42558c
2021-04-02 15:22:32 -07:00
Erjia Guan
1535520f08 [DataLoader] Typing Enforcement for DataPipe at construct-time (#54066)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54066

## Feature
- Add a decorator `construct_time_validation` to validate each input datapipe according to the corresponding type hint.
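
A rough sketch of checking constructor arguments against their annotations follows. It is not the decorator from this PR: `argument_validation_sketch`, `Source`, and `Example` are hypothetical names, and only plain-class annotations are checked, whereas the real decorator works with DataPipe type hints.

```python
import functools
import inspect


def argument_validation_sketch(fn):
    """Hypothetical decorator: isinstance-check annotated arguments."""
    signature = inspect.signature(fn)

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            hint = signature.parameters[name].annotation
            if hint is inspect.Parameter.empty:
                continue  # unannotated parameters such as self are skipped
            # Only plain classes are checked in this sketch.
            if isinstance(hint, type) and not isinstance(value, hint):
                raise TypeError(f"{name} must be {hint.__name__}")
        return fn(*args, **kwargs)

    return wrapper


class Source:
    pass


class Example:
    @argument_validation_sketch
    def __init__(self, dp: Source, batch_size: int = 1):
        self.dp = dp
        self.batch_size = batch_size


ok = Example(Source(), batch_size=2)
```

Constructing `Example("not a source")` fails immediately with `TypeError`, before any data is iterated.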

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D27327236

Pulled By: ejguan

fbshipit-source-id: a9d4c6edb5b05090bd5a369eee50a6fb4d7cf957
2021-04-02 15:22:29 -07:00
Erjia Guan
44edf8c421 [DataLoader] Typing Enforcement for DataPipe at Compile-time (#54020)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54020

## Feature
- Add `issubtype` to check whether one type is a subtype of another.
- Add `_DataPipeMeta` (mimic Python typing 3.6)
  - Add `type` attribute for each DataPipe
  - Save original `__init__` function for each DataPipe
  - Validate return hint of `__iter__`
  - Replace `__init__` function based on `type`
    - Fixed type: Put the original `__init__` back if it exists, or use a plain `__init__`
    - Non-fixed type: Add a new `__init__` that copies `cls.type` for each instance. (Optimized for memory)

No errors for the main repo, `torchvision`, `torchaudio` and `torchtext`.

## Future
- Add the same thing for `__getitem__`.
- When DataFrame comes out, add another type for DataFrame with column names and types.

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D27327232

Pulled By: ejguan

fbshipit-source-id: fd3a6029c16f5d814b1d7e1b1566fdcd8fd1ad9a
2021-04-02 15:22:27 -07:00
Erjia Guan
560e3be587 [DataLoader] Implement issubtype for type hints (#54299)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54299

## Feature
- Check whether a type is a subtype of another type
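
To give a feel for what a subtype check over type hints involves, here is a heavily simplified sketch that handles only plain classes and `Union`. The function `issubtype_sketch` is hypothetical and covers far less than a real implementation would need (no generics, no `Any`, no `TypeVar`).

```python
from typing import Union, get_args, get_origin


def issubtype_sketch(left, right):
    """Hypothetical, simplified subtype check for plain classes and Unions."""
    if get_origin(left) is Union:
        # Every member of the left Union must fit somewhere on the right.
        return all(issubtype_sketch(member, right) for member in get_args(left))
    if get_origin(right) is Union:
        options = get_args(right)
    else:
        options = (right,)
    return isinstance(left, type) and any(
        isinstance(option, type) and issubclass(left, option)
        for option in options
    )


int_in_union = issubtype_sketch(int, Union[int, str])
union_in_int = issubtype_sketch(Union[int, str], int)
bool_in_int = issubtype_sketch(bool, int)
```

A `Union` on the left is a subtype only if all of its members are, while a `Union` on the right needs just one matching member.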

Prerequisite for the DataPipe typing system.

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D27327235

Pulled By: ejguan

fbshipit-source-id: 8f50a663a86540677c9e132ac7c5216fdac46f70
2021-04-02 15:20:55 -07:00
Erjia Guan
fff0a3f906 [DataLoader] ZipIterDataPipe (#53554)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53554

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D26913406

Pulled By: ejguan

fbshipit-source-id: 24604b41d08eb6f7689add152229049a4c65c06e
2021-03-12 08:26:21 -08:00
Erjia Guan
1ba80264f4 [DataLoader] ConcatDataPipe (#53301)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53301

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D26829322

Pulled By: ejguan

fbshipit-source-id: eeea42fd9ab267d10f39ad7debc279eaded23570
2021-03-06 07:32:02 -08:00
Erjia Guan
c957e2ab42 Add more datapipe to functional API (#53123)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53123

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D26756638

Pulled By: ejguan

fbshipit-source-id: 6ff0eb6c7ee702056ff19eeb723949e4642f2784
2021-03-03 07:01:00 -08:00
Erjia Guan
89b1053413 [DataLoader] Move BufferedShuffle from Dataset to DataPipe (#52141)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52141

Remove BufferShuffleDataSet, as it's not being used anywhere within PyTorch (no usage on GitHub based on a search) and it's not included in the release of PyTorch 1.7.1.

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D26710940

Pulled By: ejguan

fbshipit-source-id: 90023b4bfb105d6aa392753082100f9181ecebd0
2021-03-01 12:54:44 -08:00
Erjia Guan
b534466f01 [DataLoader] TransformsIterDataPipe (#52604)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52604

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D26581511

Pulled By: ejguan

fbshipit-source-id: c927726b7afba14586f16cde0237f2cef9080079
2021-02-23 15:47:27 -08:00
Erjia Guan
4ee5bc74d3 [DataLoader] Change signature of Functional DataPipe (#52458)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52458

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D26523282

Pulled By: ejguan

fbshipit-source-id: c7358fc351f859617754a27b8a701d11ada5d61a
2021-02-18 23:30:58 -08:00
Erjia Guan
059c564ba4 [DataLoader] Fix module import (#52224)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52224

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D26429871

Pulled By: ejguan

fbshipit-source-id: fcf2e5435658ecb92af1079def953b08cebb1f7f
2021-02-16 16:12:33 -08:00
Erjia Guan
425a5dc3f7 [DataLoader] Modify SamplerIDP signature (#52104)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52104

Make the API of `SamplerIterDataPipe` more reasonable with `sampler_args` and `sampler_kwargs`.
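
A minimal sketch of this calling convention, assuming the sampler class is constructed inside the DataPipe from the source plus the forwarded arguments. `SamplerSketch` and `StepSampler` are hypothetical names, not the PyTorch classes.

```python
class StepSampler:
    """Hypothetical sampler: yields every `step`-th index of the source."""

    def __init__(self, data_source, step=1):
        self.data_source = data_source
        self.step = step

    def __iter__(self):
        return iter(range(0, len(self.data_source), self.step))


class SamplerSketch:
    """Hypothetical stand-in for SamplerIterDataPipe."""

    def __init__(self, source, sampler, sampler_args=None, sampler_kwargs=None):
        args = () if sampler_args is None else sampler_args
        kwargs = {} if sampler_kwargs is None else sampler_kwargs
        # The sampler class is instantiated here, with the source passed first.
        self.sampler = sampler(source, *args, **kwargs)

    def __iter__(self):
        return iter(self.sampler)


indices = list(SamplerSketch("abcdef", StepSampler, sampler_kwargs={"step": 2}))
```

Passing the sampler class together with its arguments lets the DataPipe bind the sampler to its own source, so callers do not have to construct the sampler themselves.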

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D26401494

Pulled By: ejguan

fbshipit-source-id: ee5b5c414782d0880b12968bc9c8aa470b753f6a
2021-02-11 09:29:52 -08:00
Erjia Guan
9eb70c3c78 [DataLoader] Rename Callable to Map IterDataPipe (#51879)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51879

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D26314775

Pulled By: ejguan

fbshipit-source-id: ee77909eae97092155ed6a6c794540e68a04d754
2021-02-09 17:09:06 -08:00
Erjia Guan
104371e1dc [DataLoader] Implement FilterIterDataPipe (#51783)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51783

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D26277688

Pulled By: ejguan

fbshipit-source-id: 25ed7da9da88c030b29627142c2f04fed26cdcda
2021-02-09 17:06:06 -08:00
lixinyu
015cabf82a move GroupByFilename Dataset to DataPipe (#51709)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51709

Move GroupByFilename Dataset to DataPipe

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D26263585

Pulled By: glaringlee

fbshipit-source-id: 00e3e13b47b89117f1ccfc4cd6239940a40d071e
2021-02-09 03:34:56 -08:00
lixinyu
482b94ae51 move RoutedDecoder Dataset to DataPipe (#51704)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51704

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D26245910

Pulled By: glaringlee

fbshipit-source-id: 91e3c9f8a6c1209c1a1a752ba29a80dbd9bf4119
2021-02-09 03:31:07 -08:00
lixinyu
1ee0c42d6d move ZipDataset to Zip DataPipe (#51599)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51599

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D26212859

Pulled By: glaringlee

fbshipit-source-id: 3fabcf8876d3c9c24339dbf6a12e0bb04b400108
2021-02-03 15:42:59 -08:00
Erjia Guan
52de407b4b [DataLoader] Rename Functional DataSet to DataPipe (#51488)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51488

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D26209888

Pulled By: ejguan

fbshipit-source-id: cb8bc852b1e4d72be81e0297308a43954cd95332
2021-02-03 07:01:09 -08:00
lixinyu
c0d58bce0d move Tar Dataset to Tar DataPipe (#51398)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51398

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D26162319

Pulled By: glaringlee

fbshipit-source-id: a84879fe4ca044e34238d5e1d31a245d4b80ae8e
2021-02-02 07:46:53 -08:00
lixinyu
5ed0ad4b6a DataPipe naming convention update (#51262)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51262

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D26120628

Pulled By: glaringlee

fbshipit-source-id: 6855a0dd6d4a93ff93adce1039960ffd7057a827
2021-01-28 17:44:36 -08:00