114 Commits

Author SHA1 Message Date
Robert Xiu
9fca008809 [DataPipe] Adding functional API for FileLister (#78419)
Fixes #78263

Follow-up from pytorch/data#387. This adds a functional API `list_files()` to `FileListerDataPipe`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/78419
Approved by: https://github.com/NivekT, https://github.com/ejguan
2022-06-06 17:26:19 +00:00
erjia
9b6cb83b0c Make ShufflerDataPipe deterministic for persistent DL and distributed DL (#78765)
Fixes https://github.com/pytorch/data/issues/426

This PR introduces two main changes:
- It ensures that `ShufflerDataPipe` shares the same seed across distributed processes.
- Users can reset `shuffle` for persistent workers per epoch.

Detail:
- `shared_seed` is shared across distributed and worker processes. It will seed a `shared_rng` to provide seeds to each `ShufflerDataPipe` in the pipeline
- `worker_loop` now accepts a new argument of `shared_seed` to accept this shared seed.
- The `shared_seed` is attached to `_ResumeIteration` for resetting seed per epoch for `persistent worker`
- I chose not to touch `base_seed`, purely for backward-compatibility (BC) reasons
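
The seeding scheme in the detail list above can be sketched in plain Python. This is a minimal illustration and not the PyTorch implementation: `ToyShuffler`, `seed_pipeline`, `run_rank`, and the use of `random.Random` are assumptions made for the example.

```python
import random


class ToyShuffler:
    """Hypothetical stand-in for a shuffling DataPipe (not the PyTorch class)."""

    def __init__(self, data):
        self.data = list(data)
        self.rng = random.Random()

    def set_seed(self, seed):
        self.rng.seed(seed)

    def __iter__(self):
        items = list(self.data)
        self.rng.shuffle(items)
        return iter(items)


def seed_pipeline(shufflers, shared_seed):
    """Seed one shared RNG, then draw a separate seed for each shuffler."""
    shared_rng = random.Random(shared_seed)
    for shuffler in shufflers:
        shuffler.set_seed(shared_rng.getrandbits(64))


def run_rank(shared_seed):
    """Simulate one process: build the pipeline, seed it, read one epoch."""
    pipeline = [ToyShuffler(range(10)), ToyShuffler(range(100, 110))]
    seed_pipeline(pipeline, shared_seed)
    return [list(shuffler) for shuffler in pipeline]


# Two simulated processes that receive the same shared seed agree on the order.
print(run_rank(shared_seed=1234) == run_rank(shared_seed=1234))  # True
```

Because every process derives its per-shuffler seeds from the same shared seed, no process needs to communicate with another after the seed has been distributed.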

I used this [script](https://gist.github.com/ejguan/d88f75fa822cb696ab1bc5bc25844f47) to test the result with `world_size=4`. Please check the result in: https://gist.github.com/ejguan/6ee2d2de12ca57f9eb4b97ef5a0e300b

The results show no duplicated or missing elements in any epoch. With the same seed, the order of data also remains the same across epochs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78765
Approved by: https://github.com/VitalyFedyunin
2022-06-06 17:24:00 +00:00
PyTorch MergeBot
129d9dbb15 Revert "Make ShufflerDataPipe deterministic for persistent DL and distributed DL (#78765)"
This reverts commit b769a0e18b.

Reverted https://github.com/pytorch/pytorch/pull/78765 on behalf of https://github.com/janeyx99 because it broke lint on trunk
2022-06-06 14:24:51 +00:00
erjia
b769a0e18b Make ShufflerDataPipe deterministic for persistent DL and distributed DL (#78765)
Fixes https://github.com/pytorch/data/issues/426

This PR introduces two main changes:
- It ensures that `ShufflerDataPipe` shares the same seed across distributed processes.
- Users can reset `shuffle` for persistent workers per epoch.

Detail:
- `shared_seed` is shared across distributed and worker processes. It will seed a `shared_rng` to provide seeds to each `ShufflerDataPipe` in the pipeline
- `worker_loop` now accepts a new argument of `shared_seed` to accept this shared seed.
- The `shared_seed` is attached to `_ResumeIteration` for resetting seed per epoch for `persistent worker`
- I chose not to touch `base_seed`, purely for backward-compatibility (BC) reasons

I used this [script](https://gist.github.com/ejguan/d88f75fa822cb696ab1bc5bc25844f47) to test the result with `world_size=4`. Please check the result in: https://gist.github.com/ejguan/6ee2d2de12ca57f9eb4b97ef5a0e300b

The results show no duplicated or missing elements in any epoch. With the same seed, the order of data also remains the same across epochs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78765
Approved by: https://github.com/VitalyFedyunin
2022-06-06 13:36:37 +00:00
Kevin Tse
b4a6730ce1 [DataPipe] Refactor 'mux' to have buffer as an instance variable
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77775

Approved by: https://github.com/ejguan
2022-05-19 19:55:27 +00:00
erjia
99f6e614e8 Seed Shuffler for MP DataLoader without explicit manual_seed. (#77855)
Follow up on https://github.com/pytorch/pytorch/pull/77741

This PR guarantees that, in the first iteration with a multiprocessing (MP) DataLoader, the `Shuffler` has the same seed across worker processes when users don't specify a seed.
See the newly added tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77855
Approved by: https://github.com/NivekT
2022-05-19 17:28:26 +00:00
erjia
365ce350cb Make ShufflerDataPipe deterministic for SP & MP DataLoader (#77741)
This is the first PR to make DataPipe deterministic.

Users should be able to use `torch.manual_seed(seed)` to control the shuffle order for the following cases:
- Directly over `DataPipe`
- For single-process DataLoader
- Multiprocessing DataLoader

Unfortunately, for distributed training, users have to run `apply_shuffle_seed` manually to make sure all distributed processes have the same shuffle order.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77741
Approved by: https://github.com/VitalyFedyunin, https://github.com/NivekT
2022-05-18 23:32:07 +00:00
Ning Li (Seattle)
4d1ead6dff [DataPipe] Update mux data pipe (#76384) (#77145)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76384

OSS issue discussion: https://github.com/pytorch/data/issues/346
This diff updates `mux` and `mux_longest` data pipe.
`mux`: Yields one element at a time from each of the input Iterable DataPipes (functional name: ``mux``). As in, one element from the 1st input DataPipe, then one element from the 2nd DataPipe in the next iteration, and so on. It ends when the shortest input DataPipe is exhausted.

`mux` example:

```
>>> from torchdata.datapipes.iter import IterableWrapper
>>> dp1, dp2, dp3 = IterableWrapper(range(3)), IterableWrapper(range(10, 15)), IterableWrapper(range(20, 25))
>>> list(dp1.mux(dp2, dp3))
[0, 10, 20, 1, 11, 21, 2, 12, 22]
```
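
The round-robin behavior described above can be sketched without any PyTorch dependency. This is an illustrative generator and not the library code: the standalone `mux` function is hypothetical, and discarding an incomplete final round is an assumption chosen so that the output matches the example above.

```python
def mux(*iterables):
    """Yield one element from each input in turn; stop at the shortest input."""
    iterators = [iter(it) for it in iterables]
    while iterators:
        round_buffer = []
        for it in iterators:
            try:
                round_buffer.append(next(it))
            except StopIteration:
                # Assumption: elements of an incomplete round are discarded.
                return
        yield from round_buffer


print(list(mux(range(3), range(10, 15), range(20, 25))))
# [0, 10, 20, 1, 11, 21, 2, 12, 22]
```

Buffering a full round before yielding keeps the output length a multiple of the number of inputs, even when an input other than the first runs out.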

Test Plan:
buck test mode/opt //caffe2/test:datapipe

https://www.internalfb.com/intern/testinfra/testrun/4785074706282345

Differential Revision: D36017945

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77145
Approved by: https://github.com/NivekT, https://github.com/ejguan
2022-05-18 16:23:07 +00:00
Kevin Tse
bbaefdf6b5 [DataPipe] Enforcing single valid iterator for IterDataPipes with multiple DataPipes as outputs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75995

Approved by: https://github.com/VitalyFedyunin
2022-05-18 01:31:39 +00:00
Kevin Tse
7c52f204e0 [DataPipe] Enforcing single valid iterator for IterDataPipes without multiple outputs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70479

Approved by: https://github.com/ejguan
2022-05-18 01:31:38 +00:00
Vitaly Fedyunin
edffd595c2 [DataLoader] Adding ability to use dill to pass DataPipes in multiprocessing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77288

Approved by: https://github.com/ejguan, https://github.com/NivekT
2022-05-15 23:04:03 +00:00
Kevin Tse
a008d19ff7 [DataPipe] Revamp serialization logic of DataPipes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74984

Approved by: https://github.com/ejguan
2022-05-10 16:16:46 +00:00
zengk95
ef63408853 Revert [DataPipe] Update mux data pipe
Reverts #76384

This is breaking the test test_demux_mux_datapipe (__main__.TestIterableDataPipeBasic). See logs: a997046017
It was red on the PR as well: https://hud.pytorch.org/pytorch/pytorch/pull/76384
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76507
Approved by: https://github.com/kit1980
2022-04-28 00:06:30 +00:00
Ning Li (Seattle)
a997046017 [DataPipe] Update mux data pipe (#76384)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76384

OSS issue discussion: https://github.com/pytorch/data/issues/346
This diff updates `mux` and `mux_longest` data pipe.
`mux`: Yields one element at a time from each of the input Iterable DataPipes (functional name: ``mux``). As in, one element from the 1st input DataPipe, then one element from the 2nd DataPipe in the next iteration, and so on. It ends when the shortest input DataPipe is exhausted.

`mux` example:

```
>>> from torchdata.datapipes.iter import IterableWrapper
>>> dp1, dp2, dp3 = IterableWrapper(range(3)), IterableWrapper(range(10, 15)), IterableWrapper(range(20, 25))
>>> list(dp1.mux(dp2, dp3))
[0, 10, 20, 1, 11, 21, 2, 12, 22]
```

Test Plan:
buck test mode/dev //pytorch/data/test:tests -- --exact 'pytorch/data/test:tests - test_mux_longest_iterdatapipe (test_datapipe.TestDataPipe)'

https://www.internalfb.com/intern/testinfra/testrun/3096224791148107

Reviewed By: ejguan

Differential Revision: D35799965

fbshipit-source-id: 320e71a342ec27e6e9200624aad42f4b99f97c3a
(cherry picked from commit 741ed595275df6c05026ed6f0e78d7052328fb7d)
2022-04-27 22:10:42 +00:00
erjia
0ff05b1e97 [DataPipe] Add functional API docstring and fix typo in test
Per title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76272
Approved by: https://github.com/ishaan-mehta, https://github.com/NivekT
2022-04-25 14:16:53 +00:00
Kevin Tse
383f026791 [DataPipe] Enabling graph traversal for MapDataPipe
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74851

Approved by: https://github.com/ejguan
2022-04-22 18:06:16 +00:00
erjia
ec591087fb [DataPipe] Add input_col to filter and add deprecation warning for DataPipe arguments
Last patch to align DataPipe API with TorchArrow DataFrame

For deprecation warning of DataPipe argument:
```
The argument `drop_empty_batches` of `FilterIterDataPipe()` is deprecated since 1.12 and will be removed in 1.14.
See https://github.com/pytorch/data/issues/163 for details.
```
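
A warning in the format shown above could be produced along these lines. This is a sketch under stated assumptions: `argument_deprecation_message` and `ToyFilter` are hypothetical names, and the choice of `FutureWarning` as the category is an assumption, not something the commit message states.

```python
import warnings


def argument_deprecation_message(arg, cls_name, since, removal, issue_url):
    """Build a message in the format shown above."""
    return (
        f"The argument `{arg}` of `{cls_name}()` is deprecated since {since} "
        f"and will be removed in {removal}.\n"
        f"See {issue_url} for details."
    )


class ToyFilter:
    """Hypothetical filter that still accepts a deprecated argument."""

    def __init__(self, source, filter_fn, drop_empty_batches=None):
        if drop_empty_batches is not None:
            # Assumption: FutureWarning is used so end users see the message.
            warnings.warn(
                argument_deprecation_message(
                    "drop_empty_batches", "FilterIterDataPipe", "1.12", "1.14",
                    "https://github.com/pytorch/data/issues/163",
                ),
                FutureWarning,
                stacklevel=2,
            )
        self.source = source
        self.filter_fn = filter_fn

    def __iter__(self):
        return (x for x in self.source if self.filter_fn(x))


print(list(ToyFilter(range(6), lambda x: x % 2 == 0)))  # [0, 2, 4]
```

Defaulting the deprecated argument to `None` lets the class warn only when a caller passes it explicitly.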
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76060
Approved by: https://github.com/NivekT
2022-04-22 17:49:39 +00:00
erjia
b8cce8847f [DataPipe] Add functional API to StreamReader and FileOpener
Per title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76233
Approved by: https://github.com/NivekT
2022-04-22 17:49:26 +00:00
erjia
841a7f5187 [DataPipe] apply dill serialization for _Demux and add cache to traverse
- Fix `_Demux` failing to pickle when dill is installed https://github.com/pytorch/pytorch/pull/74958#issuecomment-1084637227
- Add a cache to the traverse function to prevent infinite recursion on circular references between DataPipes (Fixes https://github.com/pytorch/data/issues/237)
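
The idea behind the traversal cache can be illustrated with a small sketch. It is not the PyTorch `traverse` function: `Node`, the `sources` attribute, and the nested-dict return shape are assumptions made for the example.

```python
class Node:
    """Hypothetical pipeline node that keeps references to its sources."""

    def __init__(self, name):
        self.name = name
        self.sources = []


def traverse(node, cache=None):
    """Return a nested dict of the graph reachable from `node`.

    `cache` holds the ids of nodes already visited, so a circular
    reference ends the recursion instead of looping forever.
    """
    if cache is None:
        cache = set()
    if id(node) in cache:
        return {}
    cache.add(id(node))
    graph = {}
    for src in node.sources:
        graph.update(traverse(src, cache))
    return {node.name: graph}


a, b = Node("a"), Node("b")
a.sources.append(b)
b.sources.append(a)  # circular reference back to `a`
print(traverse(a))  # {'a': {'b': {}}}
```

Without the `cache` check, the call on `a` would recurse through `b` back into `a` until Python raised `RecursionError`.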
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75034
Approved by: https://github.com/wenleix
2022-04-04 19:45:14 +00:00
Kevin Tse
4c5d532728 [DataPipe] only apply special serialization when dill is installed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74958

Approved by: https://github.com/ejguan
2022-03-30 20:38:05 +00:00
Nicolas Hug
5667c4ea21 Remove default parameter of ShufflerIterDataPipe (#74370)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74370

Closes https://github.com/pytorch/data/issues/298. This PR:

- removes the `default` parameter of `ShufflerIterDataPipe`
- renames `set_shuffle_setting()` to `set_shuffle()`
- lets `set_shuffle()` return `self`.
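
Returning `self` from a setter allows the call to be chained. The sketch below shows the pattern on a toy class; `ToyShufflerPipe` is hypothetical and only mirrors the `set_shuffle()` name from the list above.

```python
import random


class ToyShufflerPipe:
    """Hypothetical shuffling iterable (not the PyTorch class)."""

    def __init__(self, data):
        self.data = list(data)
        self._enabled = True

    def set_shuffle(self, shuffle=True):
        self._enabled = shuffle
        return self  # returning self allows call chaining

    def __iter__(self):
        items = list(self.data)
        if self._enabled:
            random.shuffle(items)
        return iter(items)


# Construction and configuration fit in a single expression.
print(list(ToyShufflerPipe(range(5)).set_shuffle(False)))  # [0, 1, 2, 3, 4]
```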

Test Plan: Imported from OSS

Reviewed By: george-qi

Differential Revision: D35073666

Pulled By: NicolasHug

fbshipit-source-id: 9847b037e70f44f36eaf4471f2c12fa8ec2ed73c
(cherry picked from commit b07ab646f308532886e8daddd57e937a53edb153)
2022-03-28 12:47:24 +00:00
Kevin Tse
eec994fc16 [DataPipe] Separating DataPipes from Dataset into different files (#73396)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73396

Separating DataPipes from Dataset into different files. This makes the code more maintainable and simplifies some of the code generation.

I have also tried to move `datapipe.py` into `torch.utils.data.datapipes`, but that leads to circular imports and requires rewriting many import statements. Should I invest more time in that approach?

Fixes https://github.com/pytorch/data/issues/213

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D34481962

Pulled By: NivekT

fbshipit-source-id: 42fb26fe7fc334636852cfd8719fc807bdaa7912
(cherry picked from commit 81e76a64e297cb5c58caa951c554e49526173936)
2022-03-15 14:46:34 +00:00
Kevin Tse
8811d217ed [DataPipe] Slight refactoring IterDataPipe serialization test (#73922)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73922

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D34732288

Pulled By: NivekT

fbshipit-source-id: f31229332fe4eac85cc2085484f6e1b1d802987d
(cherry picked from commit ace20054e4f3f9bd9610640755400fbde82650c3)
2022-03-09 15:33:12 +00:00
Kevin Tse
0821154072 [DataPipe] Adding serialization test for all MapDataPipe (#73921)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73921

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D34732286

Pulled By: NivekT

fbshipit-source-id: 893af2fbb83feb1bae226d3205105de5d3836378
(cherry picked from commit f44fd3c5210d0afdbf826e3b7e7fbe2ec216c3b7)
2022-03-09 15:33:12 +00:00
Kevin Tse
f85309e478 [DataPipe] Adding serialization test at different stages of reading for IterDataPipes (#73119)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73119

Test if a DataPipe is serializable after its contents are partially read and completely read. This is especially important for DataPipes with buffers.
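
A test of this kind might look like the following sketch, which pickles an object before reading, after a partial read, and after a complete read. `BufferedReader` and `check_serializable` are hypothetical, and the standard `pickle` module is used here as an assumption about the serializer.

```python
import pickle


class BufferedReader:
    """Hypothetical iterable with an internal buffer (not a PyTorch class)."""

    def __init__(self, data):
        self.data = list(data)
        self.buffer = []

    def __iter__(self):
        for item in self.data:
            self.buffer.append(item)  # the buffer fills as items are read
            yield item


def check_serializable(obj):
    """Round-trip `obj` through pickle and return the copy."""
    return pickle.loads(pickle.dumps(obj))


reader = BufferedReader(range(4))

# Stage 1: before any read.
check_serializable(reader)

# Stage 2: after a partial read.
it = iter(reader)
next(it)
partial = check_serializable(reader)

# Stage 3: after a complete read.
for _ in it:
    pass
full = check_serializable(reader)

print(partial.buffer, full.buffer)  # [0] [0, 1, 2, 3]
```

The buffer contents differ at each stage, which is why a DataPipe that serializes cleanly when fresh can still fail once its buffer holds data.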

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D34354496

Pulled By: NivekT

fbshipit-source-id: 36971d68b9ca1de81fb254e9a459b8f54fe0f9ff
(cherry picked from commit e8f39a7aa364bd2b19145788f7e67c06f948f81b)
2022-02-23 16:31:21 +00:00
Kevin Tse
cd4ecce1bb [DataPipe] Fix issue with DataPipe serialization with dill (#72896)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72896

Fixing the issue described here: https://github.com/pytorch/data/issues/214

There will be a follow-up PR in TorchData as well

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D34258669

Pulled By: NivekT

fbshipit-source-id: 6dd88250ed14ebe779915dc46139be7e012e9d1b
(cherry picked from commit 025b8ed98019e576bfef04c33a3f33ed1a426a66)
2022-02-23 16:31:20 +00:00
Erjia Guan
6297aa114f [DataPipe] Extend FileLister to support loading multiple directories (#72260)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72260

Test Plan: Imported from OSS

Reviewed By: dagitses, NivekT

Differential Revision: D33979744

Pulled By: ejguan

fbshipit-source-id: 5733d20382642fc2274afd838b33c98150d81e91
(cherry picked from commit f70537ae76)
2022-02-04 07:55:00 +00:00
Erjia Guan
7b014cc645 [DataPipe] Disable Typing for DataPipe before branch cut (#72123)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72123

There is a bug in the DataPipe typing system that would take more than a week to fix. I will follow up on it later this month. Because the branch cut is today, this PR disables typing to make sure the release works.

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D33920610

Pulled By: ejguan

fbshipit-source-id: febff849ab2272fd3b1c5127a20f27eb82992d9c
(cherry picked from commit ee103e62e7)
2022-02-02 05:00:41 +00:00
Santiago Castro
5024c1bc7b Make get_file_pathnames_from_root output order deterministic (#70435)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/70103

I used an argument so that the behavior can be disabled. I called it `deterministic_order` because `sort` could be confusing: the output is sorted, but per directory level rather than globally.
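
The difference between per-level ordering and a global sort can be seen in a small sketch built on `os.walk`. This is an assumption about how such ordering can be achieved, not the code from the PR; `walk_files` is a hypothetical helper that only borrows the `deterministic_order` argument name.

```python
import os
import tempfile


def walk_files(root, deterministic_order=True):
    """Yield file paths under `root`, one directory level at a time."""
    for path, dirs, files in os.walk(root):
        if deterministic_order:
            files.sort()
            dirs.sort()  # sorting in place fixes the order os.walk descends in
        for name in files:
            yield os.path.join(path, name)


with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "a"))
    for rel in ("z.txt", "b.txt", os.path.join("a", "c.txt")):
        open(os.path.join(root, rel), "w").close()
    print([os.path.relpath(p, root) for p in walk_files(root)])
    # On POSIX: ['b.txt', 'z.txt', 'a/c.txt']
```

A global sort would place `a/c.txt` first. Here the files in the root directory come before anything in a subdirectory.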

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70435

Reviewed By: albanD

Differential Revision: D33899755

Pulled By: ejguan

fbshipit-source-id: e8a08f03a49120333b2d27f332cd21a3240a02a9
(cherry picked from commit 4616e43ec3)
2022-02-01 18:12:23 +00:00
Vitaly Fedyunin
b36b11cbc1 Separating CaptureDataFrame out of DFIterDataPipe (#71776)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71776

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D33771602

Pulled By: VitalyFedyunin

fbshipit-source-id: 59d85bc707a9568f1f0960fc184113a4f422d2df
(cherry picked from commit 93522768ef)
2022-01-26 03:25:02 +00:00
Erjia Guan
bb157dd4eb Make methods of internal file_obj visible from StreamWrapper (#71653)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71653

Test Plan: Imported from OSS

Reviewed By: NivekT

Differential Revision: D33718749

Pulled By: ejguan

fbshipit-source-id: f3a8244f22ca37049b8678afa0e329b23c957a9d
(cherry picked from commit a4d12ca48e)
2022-01-25 15:34:24 +00:00
Kevin Tse
13ea2cb330 [DataPipe] Make GroupBy serializable with lambda function (#71497)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71497

Related to https://github.com/pytorch/data/issues/172

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D33668749

Pulled By: NivekT

fbshipit-source-id: 6506614e9d4389dc645d8985c00fdb3402122d9b
(cherry picked from commit 458e76fcb1)
2022-01-21 16:04:45 +00:00
Kevin Tse
36b4c95e74 [DataPipe] adding serialization test for all core IterDataPipes (#71456)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71456

Related to https://github.com/pytorch/data/issues/172

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D33668748

Pulled By: NivekT

fbshipit-source-id: ea2085d5ed47533ca49258cc52471373c6ae1847
(cherry picked from commit d5f6fde1d0)
2022-01-21 16:04:45 +00:00
Kevin Tse
011fd1d933 [DataPipe] improving DataPipe unit tests (#70215)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70215

A few renames, formatting changes, and additional tests to improve the unit tests.

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D33344610

Pulled By: NivekT

fbshipit-source-id: bb36f7452bdc44964c9ce0650c7ae308ba2c5aa5
(cherry picked from commit 0aae20cb27)
2022-01-20 15:49:53 +00:00
Erjia Guan
fd9e08df5d Make Demux serializable with lambda function (#71311)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71311

Test Plan: Imported from OSS

Reviewed By: NivekT

Differential Revision: D33584552

Pulled By: ejguan

fbshipit-source-id: 52324faf5547f9f77582ec170ec91ce3114cfc61
2022-01-18 06:47:54 -08:00
Kevin Tse
1e3893ecbb [DataPipe] Removing deprecated DataPipes (#71161)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71161

Users should import these DataPipes from [TorchData](https://github.com/pytorch/data) if they would like to use them. We will be checking for any downstream library usage before landing this PR.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D33532272

Pulled By: NivekT

fbshipit-source-id: 9dbfb21baf2d1183e0aa379049ad8304753e08a1
2022-01-13 07:37:48 -08:00
Kevin Tse
8dcfdf39e7 [DataPipe] Renaming FileLoader to FileOpener with deprecation warning for FileLoader (#70367)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70367

This PR renames the `FileLoaderIterDataPipe` to `FileOpenerIterDataPipe`. For the sake of not breaking many CI tests immediately, it still preserves `FileLoader` as an alias. This will allow downstream libraries/users to migrate their use cases before we fully remove all references to `FileLoader` from PyTorch.
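
One way to keep an old name working while steering users to the new one is sketched below. The two toy classes only borrow the names from the description above. Using `__new__` to warn and return an instance of the new class is an assumption about the technique, not a claim about how the PR implements the alias.

```python
import warnings


class FileOpener:
    """Toy stand-in for the renamed class."""

    def __init__(self, paths, mode="b"):
        self.paths = list(paths)
        self.mode = mode


class FileLoader:
    """Deprecated alias: warns, then hands back a FileOpener."""

    def __new__(cls, paths, mode="b"):
        warnings.warn(
            "FileLoader is deprecated; use FileOpener instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        # Returning a different type means FileLoader.__init__ is never run.
        return FileOpener(paths, mode)


opener = FileLoader(["a.txt"])
print(type(opener).__name__)  # FileOpener
```

Existing call sites continue to work unchanged and receive an object of the new type, so they can be migrated gradually.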

Fixes https://github.com/pytorch/data/issues/103. More detailed discussion about this decision is also in the linked issue.

cc VitalyFedyunin ejguan NivekT pmeier Nayef211

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D33301648

Pulled By: NivekT

fbshipit-source-id: 59278dcd44e372df0ba2001a4eecbf9792580d0b
2022-01-04 09:14:50 -08:00
Kevin Tse
ad0cd8a76e [DataPipe] Improve inline doc and testing for CollatorIterDataPipe (#70139)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70139

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D33199107

Pulled By: NivekT

fbshipit-source-id: f96d77490998ac9bc3da8d4ff1a9caa08e9e7f27
2021-12-20 08:05:21 -08:00
Kevin Tse
3d51c88032 [DataPipe] Unifying API - removing options to have fn_args and fn_kwargs from MapDataPipes (#69561)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69561

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D32952099

Pulled By: NivekT

fbshipit-source-id: 95b725774a9d04d655e2542760726908f33043f4
2021-12-16 18:11:00 -08:00
Kevin Tse
b89c283c80 [DataPipe] Unifying API - removing options to have fn_args and fn_kwargs from IterDataPipes (#69560)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69560

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D32952100

Pulled By: NivekT

fbshipit-source-id: e0cc31408c7cf3220fe274feed1c7202a1aaae70
2021-12-16 18:09:52 -08:00
Vitaly Fedyunin
d90012689f [DataPipe] Control shuffle settings from DataLoader2 (#65756)
Summary:
Makes `shuffle` DataPipe sensitive to DataLoader(2) `shuffle` kwarg.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65756

Reviewed By: albanD

Differential Revision: D31344867

Pulled By: VitalyFedyunin

fbshipit-source-id: e0084e0ac193ac784d6298328ca1222745681347
2021-12-14 07:35:26 -08:00
Kevin Tse
81a60b9813 [DataPipe] Adding output types to DataPipe interface file (#69647)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69647

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D32989067

Pulled By: NivekT

fbshipit-source-id: 2c2e71e9e514e0d584affaa0b71b7b0d07a2ddbf
2021-12-10 12:04:45 -08:00
Kevin Tse
357160e68e [DataPipe] Unifying API - removing nesting_level argument from FilterIterDataPipe (#69391)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69391

As part of the efforts to unify the APIs across different data backends (e.g. TorchData, TorchArrow), we are making changes to different DataPipes' APIs. In this PR, we are removing the input argument `nesting_level` from `FilterIterDataPipe`.

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D32849462

Pulled By: NivekT

fbshipit-source-id: 91cf1dc03dd3d3cbd7a9c6ccbd791ade91355f30
2021-12-07 11:40:46 -08:00
Kevin Tse
4478b14e4c [DataPipe] Unifying API - removing nesting_level argument from MapperIterDataPipe (#69390)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69390

As part of the efforts to unify the APIs across different data backends (e.g. TorchData, TorchArrow), we are making changes to different DataPipes' APIs. In this PR, we are removing the input argument `nesting_level` from `MapperIterDataPipe`.

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D32849465

Pulled By: NivekT

fbshipit-source-id: 963ce70b84a7658331d126e5ed9fdb12273c8e1f
2021-12-07 11:39:08 -08:00
Kevin Tse
6baaec30cd [DataPipe] Adding ShufflerMapDataPipe (#68606)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68606

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D32813290

Pulled By: NivekT

fbshipit-source-id: 8d1ebd5bc776563c23250f76a2efc1d395f1af9c
2021-12-03 11:36:33 -08:00
Kevin Tse
0465f64bb8 [DataPipe] Adding BatcherMapDataPipe (#68197)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68197

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D32440963

Pulled By: NivekT

fbshipit-source-id: 277cbe8d735afe341a7c189be20e1d334ecf9d4a
2021-12-02 07:27:17 -08:00
Kevin Tse
61a94495d9 [DataPipe] adding ZipperMapDataPipe (#68032)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68032

Part of #57031

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D32263058

Pulled By: NivekT

fbshipit-source-id: 13a30ee9d9779284a9fd9bb7222fc41253c6fe3b
2021-11-11 10:36:05 -08:00
Kevin Tse
803e88d418 [DataPipe] Fixing pickling issues with fork and demux (#67930)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67930

Fixes #67848

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D32222184

Pulled By: NivekT

fbshipit-source-id: 48871c45a855d92cd599e21f3b53827dd32c91ef
2021-11-09 07:54:02 -08:00
Jane Xu
39215ddf84 [skip ci] Set test owners for dataloader tests (#66839)
Summary:
Action following https://github.com/pytorch/pytorch/issues/66232

cc SsnL VitalyFedyunin ejguan NivekT

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66839

Reviewed By: ejguan

Differential Revision: D31761722

Pulled By: janeyx99

fbshipit-source-id: 8315ac03352c11b3215d89856b3cfda6cd78fa0c
2021-10-19 08:31:16 -07:00
Kevin Tse
8ebe1a924d [DataPipe] moving mux IterDataPipe test to the right location (#66277)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66277

Previously, it was grouped with the tests related to `MapDataPipe`, but it belongs with the `IterDataPipe` tests.

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D31485823

Pulled By: NivekT

fbshipit-source-id: d13d8c28cbfc305da0e3033d4109a0f971281a02
2021-10-08 08:32:29 -07:00