Fixes https://github.com/pytorch/data/issues/426
This PR introduces two main changes:
- It ensures the `ShufflerDataPipe` would share the same seed across distributed processes.
- Users can reset `shuffle` for persistent workers per epoch.
Detail:
- `shared_seed` is shared across distributed and worker processes. It will seed a `shared_rng` to provide seeds to each `ShufflerDataPipe` in the pipeline
- `worker_loop` now accepts a new argument of `shared_seed` to accept this shared seed.
- The `shared_seed` is attached to `_ResumeIteration` for resetting seed per epoch for `persistent worker`
- I choose not to touch `base_seed` simply for BC issue
I used this [script](https://gist.github.com/ejguan/d88f75fa822cb696ab1bc5bc25844f47) to test the result with `world_size=4`. Please check the result in: https://gist.github.com/ejguan/6ee2d2de12ca57f9eb4b97ef5a0e300b
You can see there isn't any duplicated/missing element for each epoch. And, with the same seed, the order of data remains the same across epochs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78765
Approved by: https://github.com/VitalyFedyunin
This is the first PR to make DataPipe deterministic.
Users should be able to use `torch.manual_seed(seed)` to control the shuffle order for the following cases:
- Directly over `DataPipe`
- For single-process DataLoader
- Multiprocessing DataLoader
Unfortunately, for distributed training, users have to run `apply_shuffle_seed` manually to make sure all distributed processes having the same order of shuffle.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77741
Approved by: https://github.com/VitalyFedyunin, https://github.com/NivekT
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76384
OSS issue discussion: https://github.com/pytorch/data/issues/346
This diff updates `mux` and `mux_longest` data pipe.
`mux`: Yields one element at a time from each of the input Iterable DataPipes (functional name: ``mux``). As in, one element from the 1st input DataPipe, then one element from the 2nd DataPipe in the next iteration, and so on. It ends when the shortest input DataPipe is exhausted.
`mux` example:
```
>>> from torchdata.datapipes.iter import IterableWrapper
>>> dp1, dp2, dp3 = IterableWrapper(range(3)), IterableWrapper(range(10, 15)), IterableWrapper(range(20, 25))
>>> list(dp1.mux(dp2, dp3))
[0, 10, 20, 1, 11, 21, 2, 12, 22]
```
Test Plan:
buck test mode/opt //caffe2/test:datapipe
https://www.internalfb.com/intern/testinfra/testrun/4785074706282345
Differential Revision: D36017945
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77145
Approved by: https://github.com/NivekT, https://github.com/ejguan
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76384
OSS issue discussion: https://github.com/pytorch/data/issues/346
This diff updates `mux` and `mux_longest` data pipe.
`mux`: Yields one element at a time from each of the input Iterable DataPipes (functional name: ``mux``). As in, one element from the 1st input DataPipe, then one element from the 2nd DataPipe in the next iteration, and so on. It ends when the shortest input DataPipe is exhausted.
`mux` example:
```
>>> from torchdata.datapipes.iter import IterableWrapper
>>> dp1, dp2, dp3 = IterableWrapper(range(3)), IterableWrapper(range(10, 15)), IterableWrapper(range(20, 25))
>>> list(dp1.mux(dp2, dp3))
[0, 10, 20, 1, 11, 21, 2, 12, 22]
```
Test Plan:
buck test mode/dev //pytorch/data/test:tests -- --exact 'pytorch/data/test:tests - test_mux_longest_iterdatapipe (test_datapipe.TestDataPipe)'
https://www.internalfb.com/intern/testinfra/testrun/3096224791148107
Reviewed By: ejguan
Differential Revision: D35799965
fbshipit-source-id: 320e71a342ec27e6e9200624aad42f4b99f97c3a
(cherry picked from commit 741ed595275df6c05026ed6f0e78d7052328fb7d)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73396
Separating DataPipes from Dataset into different files. This makes the code more maintainable and simplifies some of the code generation.
I have also tried to move `datapipe.py` into `torch.utils.data.datapipes`, but that will lead to circular import and rewriting many import statements. Should I put more time and go down that path some more?
Fixes https://github.com/pytorch/data/issues/213
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D34481962
Pulled By: NivekT
fbshipit-source-id: 42fb26fe7fc334636852cfd8719fc807bdaa7912
(cherry picked from commit 81e76a64e297cb5c58caa951c554e49526173936)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73119
Test if a DataPipe is serializable after its contents are partially read and completely read. This is especially important for DataPipes with buffers.
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D34354496
Pulled By: NivekT
fbshipit-source-id: 36971d68b9ca1de81fb254e9a459b8f54fe0f9ff
(cherry picked from commit e8f39a7aa364bd2b19145788f7e67c06f948f81b)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72896
Fixing the issue described here: https://github.com/pytorch/data/issues/214
There will be a follow-up PR in TorchData as well
Test Plan: Imported from OSS
Reviewed By: gchanan
Differential Revision: D34258669
Pulled By: NivekT
fbshipit-source-id: 6dd88250ed14ebe779915dc46139be7e012e9d1b
(cherry picked from commit 025b8ed98019e576bfef04c33a3f33ed1a426a66)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72123
There is a bug to fix the typing system in DataPipe, which would take more than 1 week to fix. I will follow up on it later this month. As branch cut is today, add this PR to disable typing to make sure release works.
Test Plan: Imported from OSS
Reviewed By: VitalyFedyunin
Differential Revision: D33920610
Pulled By: ejguan
fbshipit-source-id: febff849ab2272fd3b1c5127a20f27eb82992d9c
(cherry picked from commit ee103e62e7)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/70103
I used an argument so it can be disabled. I called it `deterministic_order` because `sort` can be confusing, as it's actually sorted but by dir levels.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70435
Reviewed By: albanD
Differential Revision: D33899755
Pulled By: ejguan
fbshipit-source-id: e8a08f03a49120333b2d27f332cd21a3240a02a9
(cherry picked from commit 4616e43ec3)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70215
A few renaming, formatting, and additional tests to make the unit tests better.
cc VitalyFedyunin ejguan NivekT
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D33344610
Pulled By: NivekT
fbshipit-source-id: bb36f7452bdc44964c9ce0650c7ae308ba2c5aa5
(cherry picked from commit 0aae20cb27)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71161
Users should import these DataPipes from [TorchData](https://github.com/pytorch/data) if they would like to use them. We will be checking for any downstream library usage before landing this PR.
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D33532272
Pulled By: NivekT
fbshipit-source-id: 9dbfb21baf2d1183e0aa379049ad8304753e08a1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70367
This PR renames the `FileLoaderIterDataPipe` to `FileOpenerIterDataPipe`. For the sake of not breaking many CI tests immediately, it still preserves `FileLoader` as an alias. This will allow downstream libraries/users to migrate their use cases before we fully remove all references to `FileLoader` from PyTorch.
Fixes https://github.com/pytorch/data/issues/103. More detailed discussion about this decision is also in the linked issue.
cc VitalyFedyunin ejguan NivekT pmeier Nayef211
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D33301648
Pulled By: NivekT
fbshipit-source-id: 59278dcd44e372df0ba2001a4eecbf9792580d0b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69391
As part of the efforts to unify the APIs across different data backends (e.g. TorchData, TorchArrow), we are making changes to different DataPipes' APIs. In this PR, we are removing the input argument `nesting_level` from `FilterIterDataPipe`.
cc VitalyFedyunin ejguan NivekT
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D32849462
Pulled By: NivekT
fbshipit-source-id: 91cf1dc03dd3d3cbd7a9c6ccbd791ade91355f30
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69390
As part of the efforts to unify the APIs across different data backends (e.g. TorchData, TorchArrow), we are making changes to different DataPipes' APIs. In this PR, we are removing the input argument `nesting_level` from `MapperIterDataPipe`.
cc VitalyFedyunin ejguan NivekT
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D32849465
Pulled By: NivekT
fbshipit-source-id: 963ce70b84a7658331d126e5ed9fdb12273c8e1f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66277
Previously, it is grouped together with tests related to `MapDataPipe`, but it should be with `IterDataPipe`.
cc VitalyFedyunin ejguan NivekT
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D31485823
Pulled By: NivekT
fbshipit-source-id: d13d8c28cbfc305da0e3033d4109a0f971281a02
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66275
Once this is added to Core, TorchData's PR will not need a custom class and can use this wrapper instead.
cc VitalyFedyunin ejguan NivekT
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D31485822
Pulled By: NivekT
fbshipit-source-id: 790de27629c89c0ca7163a8ee5a09ee8b8233340