126 Commits

Author SHA1 Message Date
Kevin Tse
14b660fcc0 [DataPipe] Correct the type of exception that is being raised by ShufflerMapDataPipe (#82666)
Fixes https://github.com/pytorch/data/issues/708

The following code snippet used to fail; it has now been added as a test case:
```python
import torch.utils.data.datapipes as dp

dp1 = dp.map.SequenceWrapper(range(10))
shuffle_dp1 = dp1.shuffle()
dp2 = dp.map.SequenceWrapper(range(10))
shuffle_dp2 = dp2.shuffle()
zip_dp = shuffle_dp1.zip(shuffle_dp2)
list(zip_dp)  # This used to fail
```

The issue was that `ShufflerMapDataPipe` raises a `KeyError` when an out-of-bounds index is passed to it, but `zip_dp`'s `__getitem__` handled only `IndexError`. With this change, it handles both.
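
The idea behind the fix can be illustrated without PyTorch. The sketch below is an assumption-laden stand-in, not the actual patch: `DictBackedMap` and `ZippedMap` are hypothetical classes, and the sketch assumes the zipped object is iterated through Python's legacy `__getitem__` protocol, which stops only on `IndexError`.

```python
class DictBackedMap:
    """Hypothetical map-style source that raises KeyError for a missing index."""

    def __init__(self, values):
        self._data = dict(enumerate(values))

    def __getitem__(self, index):
        return self._data[index]  # KeyError when the index is out of range

    def __len__(self):
        return len(self._data)


class ZippedMap:
    """Hypothetical zip over map-style sources, indexed in lockstep."""

    def __init__(self, *sources):
        self.sources = sources

    def __getitem__(self, index):
        try:
            return tuple(source[index] for source in self.sources)
        except (IndexError, KeyError) as exc:
            # Treat both as "out of range" so that list() and for-loops,
            # which stop on IndexError, terminate cleanly.
            raise IndexError(f"Index {index} is out of range") from exc

    def __len__(self):
        return min(len(source) for source in self.sources)


zipped = ZippedMap(DictBackedMap("abc"), [10, 20, 30, 40])
result = list(zipped)
```

If only `IndexError` were caught, the `KeyError` from the dict-backed source would escape `list(zipped)` instead of ending the iteration.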
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82666
Approved by: https://github.com/ejguan
2022-08-03 19:05:17 +00:00
Kevin Tse
35d97e21c8 [DataPipe] Simple graph snapshotting (#79479)
This mostly completes the "poor man's snapshotting" implementation (named "simple snapshotting"). This is the most basic version of snapshotting, but it should work for all DataPipes. I will be adding more efficient implementations for different types of DataPipes in future PRs.

### Implementation

The general idea of the simple snapshot is that we will:
1. Create a new iterator
2. Move that iterator forward by `n_iterations`
3. Save that as the `_fast_forward_iterator` of the DataPipe
4. The next time `iter` is called on the DataPipe, use the `_fast_forward_iterator`
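
The four steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the PR's code: `SnapshotableSource` and `simple_snapshot` are hypothetical names, only the `_fast_forward_iterator` attribute name is taken from the description, and RNG handling is ignored.

```python
from itertools import islice


class SnapshotableSource:
    """Hypothetical iterable that can resume from a saved iteration count."""

    def __init__(self, data):
        self.data = list(data)
        self._fast_forward_iterator = None

    def __iter__(self):
        if self._fast_forward_iterator is not None:
            # Step 4: hand out the pre-advanced iterator exactly once.
            it, self._fast_forward_iterator = self._fast_forward_iterator, None
            return it
        return iter(self.data)


def simple_snapshot(source, n_iterations):
    it = iter(source)  # Step 1: create a new iterator
    # Step 2: move it forward by n_iterations, counting what was consumed
    consumed = sum(1 for _ in islice(it, n_iterations))
    if consumed < n_iterations:
        raise RuntimeError("source was exhausted before n_iterations")
    source._fast_forward_iterator = it  # Step 3: store it on the source


source = SnapshotableSource(range(10))
simple_snapshot(source, 4)
remaining = list(source)   # resumes after the first four items
fresh = list(source)       # a later iteration starts from the beginning
```

The cost of this approach is that restoring replays `n_iterations` items, which is why it is described as the basic fallback.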

### Usage
As of this implementation, the usage will be something like:
```python
import pickle

import torch

rng = torch.Generator()
initial_rng_state = rng.get_state()
datapipe: IterDataPipe = ...
# Some usage of the DataPipe, here maybe yielding the first 5 values
n_iter = 5
it = iter(datapipe)
for _ in range(n_iter):
    next(it)
serialized_graph = pickle.dumps(datapipe)

# The serialized object has most of the sufficient information for simple snapshot (except for initial RNG state)
# It can be deserialized at a later point in time or by a different process
deserialized_graph = pickle.loads(serialized_graph)
# I think `DataLoader2` or `ReadingService` should store `initial_rng_state` that can be saved by the API that we later use
rng_for_deserialized = torch.Generator()
rng_for_deserialized.set_state(initial_rng_state)
n_iterations = deserialized_graph._number_of_samples_yielded

_simple_snapshot_graph(deserialized_graph, n_iterations, rng=rng_for_deserialized)
# The whole DataPipe graph should have the same state as before serialization, such that:
self.assertEqual(list(it), list(deserialized_graph))  # True
```

### Next Steps
If this looks acceptable, my next step is to modify `DataLoader2`'s prototype ReadingService (the one with queues) to remember things like `initial_rng_state`, and to add the methods `save_snapshot`, which will return `(serialized graph, initial_rng)`, and `restore_snapshot`. This should work for single-worker data loading.

Note that, in the long term, `initial_rng_state` may not be necessary if we are able to directly save/restore the buffer and RNG state of `Shuffler` (that is work in progress). However, `initial_rng_state` and simple snapshot are still a good fallback option for edge cases where the buffer can't be stored.

Differential Revision: [D37943406](https://our.internmc.facebook.com/intern/diff/D37943406)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79479
Approved by: https://github.com/ejguan
2022-07-23 02:53:15 +00:00
Kevin Tse
428e44ffa1 [DataPipe] Fixes various warnings, exceptions, and clean up testing (#81833)
I went through most of the warnings and exceptions raised in our tests to find these issues.

Changes:
1. In testing, `self.assertEquals` is deprecated; convert to `self.assertEqual` to get rid of the warning
2. Small changes for cleanliness and to get rid of warnings (no actual change to results)
3. Correct `is_every_instance_exhausted` logic for `_Forker`
4. Catch `RuntimeError` raised by invalidated iterator during clean up
5. Check if attribute `parent_stream` exists before trying to access it

Differential Revision: [D38020122](https://our.internmc.facebook.com/intern/diff/D38020122)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81833
Approved by: https://github.com/ejguan
2022-07-21 18:59:40 +00:00
erjia
ccbf04dd5f [DataPipe] Fix fork/unzip with a single child (#81502)
When `Forker` or `Unzipper` only contains a single child, the buffer should be cleaned up. This is one of the root causes for the issue reported internally. See: https://fburl.com/2k0et1gv
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81502
Approved by: https://github.com/VitalyFedyunin, https://github.com/NivekT
2022-07-18 16:53:19 +00:00
Erjia Guan
782f18e9b5 [DLv2] Make graph traverse working with unhashable DataPipe (#80509)
Summary:
This diff removes the requirement that a `DataPipe` be hashable in order to be passed to the `traverse` function. `traverse` now uses the `id` of the `DataPipe` instance, rather than the `DataPipe` itself, as the key for both the `cache` and the graph.

However, this requires changing the type of `DataPipeGraph` from `Dict[DataPipe, "DataPipeGraph"]` to `Dict[int, Tuple[DataPipe, "DataPipeGraph"]]`.
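
A minimal sketch of keying a graph by `id` is shown below. It is an illustration only: `Node` and this `traverse` are hypothetical stand-ins for a `DataPipe` and the real function, and the way inputs are discovered (an explicit `inputs` list) is an assumption.

```python
from typing import Any, Dict, Tuple

# Mirrors the new shape described above: id -> (node, sub-graph).
Graph = Dict[int, Tuple[Any, "Graph"]]


class Node:
    """Hypothetical unhashable pipeline node holding references to its inputs."""

    __hash__ = None  # instances cannot be used as dict keys

    def __init__(self, name, *inputs):
        self.name = name
        self.inputs = list(inputs)


def traverse(node, cache=None) -> Graph:
    cache = set() if cache is None else cache
    if id(node) in cache:
        return {}  # already visited: avoids infinite recursion on cycles
    cache.add(id(node))
    subgraph: Graph = {}
    for parent in node.inputs:
        subgraph.update(traverse(parent, cache))
    return {id(node): (node, subgraph)}


source = Node("source")
mapped = Node("mapped", source)
graph = traverse(mapped)
```

Because the node itself is stored in the tuple, callers can still recover the object from its `id` key.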

Differential Revision: D37354153

Ref PR in TorchData: https://github.com/pytorch/data/pull/559
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80509
Approved by: https://github.com/VitalyFedyunin
2022-07-12 14:47:42 +00:00
Vitaly Fedyunin
bcab5257de Expanding DataPipe to support DataFrames (#71931)
Differential Revision: [D37500516](https://our.internmc.facebook.com/intern/diff/D37500516)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71931
Approved by: https://github.com/ejguan
2022-07-08 18:46:10 +00:00
Kevin Tse
b8e50f512f [DataPipe] Count number of successful yields for IterDataPipe (#79657)
This PR adds an attribute and logic to count the number of successful yields from `IterDataPipe`. This information can be useful to fast-forward a DataPipe (or the entire graph) back to a certain state.
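
As a rough illustration of the counting idea (not the PR's implementation), the sketch below wraps an iterable and tracks how many items have been handed out. `CountingIterable` and `number_of_samples_yielded` are hypothetical names chosen for this example.

```python
class CountingIterable:
    """Hypothetical wrapper that counts the items it has yielded."""

    def __init__(self, source):
        self.source = source
        self.number_of_samples_yielded = 0

    def __iter__(self):
        self.number_of_samples_yielded = 0  # a new iteration restarts the count
        for item in self.source:
            # Incremented only after the source produced an item without error.
            self.number_of_samples_yielded += 1
            yield item


pipe = CountingIterable(range(10))
it = iter(pipe)
first_three = [next(it) for _ in range(3)]
count_after_three = pipe.number_of_samples_yielded
```

A saved count like this is what a fast-forward step would replay to return to the same position.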
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79657
Approved by: https://github.com/VitalyFedyunin
2022-06-28 17:30:33 +00:00
Erjia Guan
3d218e1c87 Raise warning for unpickable local function (#547) (#80232)
Summary:
X-link: https://github.com/pytorch/data/pull/547

Fixes https://github.com/pytorch/data/issues/538
- Improve the validation function to raise a warning about an unpicklable function when either a lambda or a local function is provided to DataPipe.
- The inner function of a functools.partial object is extracted for validation as well.
- Mimic the behavior of the pickle module for a local lambda function: it raises the error for the local function rather than for the lambda function, so we raise the warning about the local function, not the lambda function.
```py
>>> import pickle
>>> def fn():
...     lf = lambda x: x
...     pickle.dumps(lf)
...
>>> fn()
AttributeError: Can't pickle local object 'fn.<locals>.<lambda>'
```
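
A validation check along these lines could look like the sketch below. It is a simplified illustration and not the function added by this change: all helper names are hypothetical, and detecting local functions through `"<locals>"` in `__qualname__` is an assumption about one reasonable way to do it.

```python
import functools
import warnings


def _unwrap_partial(fn):
    """Return the callable underneath any functools.partial wrappers."""
    while isinstance(fn, functools.partial):
        fn = fn.func
    return fn


def is_lambda(fn) -> bool:
    return getattr(_unwrap_partial(fn), "__name__", "") == "<lambda>"


def is_local_function(fn) -> bool:
    # Functions defined inside another function carry "<locals>" in __qualname__.
    return "<locals>" in getattr(_unwrap_partial(fn), "__qualname__", "")


def warn_if_unpicklable(fn) -> None:
    if is_lambda(fn) or is_local_function(fn):
        warnings.warn(f"{fn!r} may not be picklable by the standard pickle module")


def make_adder(offset):
    def add(x):  # local function: qualname is "make_adder.<locals>.add"
        return x + offset
    return add
```

Unwrapping `functools.partial` first matters because the partial object itself has no `__qualname__`; only the wrapped callable does.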

This Diff also fixes the Error introduced by https://github.com/pytorch/pytorch/pull/79344

Test Plan:
CI on PyTorch and TorchData
Manually validated the tests from TorchVision

Differential Revision: D37417556

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80232
Approved by: https://github.com/NivekT
2022-06-27 21:47:09 +00:00
PyTorch MergeBot
fcdaf35114 Revert "Add validation for mapper function in datapipes with input_col (#79344)"
This reverts commit 787ac4edf8.

Reverted https://github.com/pytorch/pytorch/pull/79344 on behalf of https://github.com/ejguan due to This PR breaks multiple use cases and the CI from TorchVision becomes red
2022-06-24 17:17:33 +00:00
PyTorch MergeBot
79ba65c0f2 Revert "Raise warning for unpickable local function (#80140)"
This reverts commit 4b75b7d3c1.

Reverted https://github.com/pytorch/pytorch/pull/80140 on behalf of https://github.com/ejguan due to It will break the CI for TorchData
2022-06-24 14:49:06 +00:00
erjia
4b75b7d3c1 Raise warning for unpickable local function (#80140)
Fixes https://github.com/pytorch/data/issues/538

- Improve the validation function to raise a warning about an unpicklable function when either a lambda or a local function is provided to `DataPipe`.
- The inner function of a `functools.partial` object is extracted for validation as well.
- Mimic the behavior of the `pickle` module for a local lambda function: it raises the error for the local function rather than for the `lambda` function, so we raise the warning about the local function, not the lambda function.
```py
>>> import pickle
>>> def fn():
...     lf = lambda x: x
...     pickle.dumps(lf)
...
>>> fn()
AttributeError: Can't pickle local object 'fn.<locals>.<lambda>'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80140
Approved by: https://github.com/VitalyFedyunin, https://github.com/NivekT
2022-06-24 13:50:51 +00:00
Robert
787ac4edf8 Add validation for mapper function in datapipes with input_col (#79344)
As linked in https://github.com/pytorch/data/issues/362
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79344
Approved by: https://github.com/ejguan, https://github.com/NivekT
2022-06-23 18:49:35 +00:00
Robert Xiu
9fca008809 [DataPipe] Adding functional API for FileLister (#78419)
Fixes #78263

Follow-up from pytorch/data#387. This adds a functional API `list_files()` to `FileListerDataPipe`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/78419
Approved by: https://github.com/NivekT, https://github.com/ejguan
2022-06-06 17:26:19 +00:00
erjia
9b6cb83b0c Make ShufflerDataPipe deterministic for persistent DL and distributed DL (#78765)
Fixes https://github.com/pytorch/data/issues/426

This PR introduces two main changes:
- It ensures the `ShufflerDataPipe` would share the same seed across distributed processes.
- Users can reset `shuffle` for persistent workers per epoch.

Detail:
- `shared_seed` is shared across distributed and worker processes. It will seed a `shared_rng` to provide seeds to each `ShufflerDataPipe` in the pipeline
- `worker_loop` now accepts a new argument of `shared_seed` to accept this shared seed.
- The `shared_seed` is attached to `_ResumeIteration` for resetting seed per epoch for `persistent worker`
- I chose not to touch `base_seed`, simply to avoid backward-compatibility (BC) issues
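
The seeding scheme in the first bullet can be sketched with the standard library alone. This is only an illustration under assumptions: it uses Python's `random.Random` rather than a torch generator, and `shuffle_orders` is a hypothetical helper that simulates what one process would compute.

```python
import random


def shuffle_orders(shared_seed, num_shufflers, data):
    """Return the order each shuffler in one process would produce."""
    shared_rng = random.Random(shared_seed)
    orders = []
    for _ in range(num_shufflers):
        # Each shuffler draws its own seed from the shared generator.
        shuffler_rng = random.Random(shared_rng.getrandbits(64))
        order = list(data)
        shuffler_rng.shuffle(order)
        orders.append(order)
    return orders


# Two simulated processes that received the same shared seed agree exactly.
rank0 = shuffle_orders(shared_seed=1234, num_shufflers=2, data=range(8))
rank1 = shuffle_orders(shared_seed=1234, num_shufflers=2, data=range(8))
```

Since every process derives the per-shuffler seeds in the same order from the same shared seed, no further communication is needed to keep the shuffle orders aligned.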

I used this [script](https://gist.github.com/ejguan/d88f75fa822cb696ab1bc5bc25844f47) to test the result with `world_size=4`. Please check the result in: https://gist.github.com/ejguan/6ee2d2de12ca57f9eb4b97ef5a0e300b

You can see there isn't any duplicated/missing element for each epoch. And, with the same seed, the order of data remains the same across epochs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78765
Approved by: https://github.com/VitalyFedyunin
2022-06-06 17:24:00 +00:00
PyTorch MergeBot
129d9dbb15 Revert "Make ShufflerDataPipe deterministic for persistent DL and distributed DL (#78765)"
This reverts commit b769a0e18b.

Reverted https://github.com/pytorch/pytorch/pull/78765 on behalf of https://github.com/janeyx99 due to broke lint on trunk
2022-06-06 14:24:51 +00:00
erjia
b769a0e18b Make ShufflerDataPipe deterministic for persistent DL and distributed DL (#78765)
Fixes https://github.com/pytorch/data/issues/426

This PR introduces two main changes:
- It ensures the `ShufflerDataPipe` would share the same seed across distributed processes.
- Users can reset `shuffle` for persistent workers per epoch.

Detail:
- `shared_seed` is shared across distributed and worker processes. It will seed a `shared_rng` to provide seeds to each `ShufflerDataPipe` in the pipeline
- `worker_loop` now accepts a new argument of `shared_seed` to accept this shared seed.
- The `shared_seed` is attached to `_ResumeIteration` for resetting seed per epoch for `persistent worker`
- I chose not to touch `base_seed`, simply to avoid backward-compatibility (BC) issues

I used this [script](https://gist.github.com/ejguan/d88f75fa822cb696ab1bc5bc25844f47) to test the result with `world_size=4`. Please check the result in: https://gist.github.com/ejguan/6ee2d2de12ca57f9eb4b97ef5a0e300b

You can see there isn't any duplicated/missing element for each epoch. And, with the same seed, the order of data remains the same across epochs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78765
Approved by: https://github.com/VitalyFedyunin
2022-06-06 13:36:37 +00:00
Kevin Tse
b4a6730ce1 [DataPipe] Refactor 'mux' to have buffer as an instance variable
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77775

Approved by: https://github.com/ejguan
2022-05-19 19:55:27 +00:00
erjia
99f6e614e8 Seed Shuffler for MP DataLoader without explicit manual_seed. (#77855)
Follow up on https://github.com/pytorch/pytorch/pull/77741

This PR guarantees that, in the first iteration with the multiprocessing (MP) DataLoader, the `Shuffler` has the same seed across worker processes when users don't specify a seed.
See the newly added tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77855
Approved by: https://github.com/NivekT
2022-05-19 17:28:26 +00:00
erjia
365ce350cb Make ShufflerDataPipe deterministic for SP & MP DataLoader (#77741)
This is the first PR to make DataPipe deterministic.

Users should be able to use `torch.manual_seed(seed)` to control the shuffle order for the following cases:
- Directly over `DataPipe`
- For single-process DataLoader
- Multiprocessing DataLoader

Unfortunately, for distributed training, users have to run `apply_shuffle_seed` manually to make sure all distributed processes have the same shuffle order.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77741
Approved by: https://github.com/VitalyFedyunin, https://github.com/NivekT
2022-05-18 23:32:07 +00:00
Ning Li (Seattle)
4d1ead6dff [DataPipe] Update mux data pipe (#76384) (#77145)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76384

OSS issue discussion: https://github.com/pytorch/data/issues/346
This diff updates `mux` and `mux_longest` data pipe.
`mux`: Yields one element at a time from each of the input Iterable DataPipes (functional name: ``mux``). As in, one element from the 1st input DataPipe, then one element from the 2nd DataPipe in the next iteration, and so on. It ends when the shortest input DataPipe is exhausted.

`mux` example:

```
>>> from torchdata.datapipes.iter import IterableWrapper
>>> dp1, dp2, dp3 = IterableWrapper(range(3)), IterableWrapper(range(10, 15)), IterableWrapper(range(20, 25))
>>> list(dp1.mux(dp2, dp3))
[0, 10, 20, 1, 11, 21, 2, 12, 22]
```
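
For readers without TorchData installed, the round-robin behavior described above can be approximated in plain Python. The generator below is a hypothetical sketch, not the DataPipe's implementation; it assumes an incomplete final round is dropped, which is consistent with the example output above.

```python
def mux(*iterables):
    """Round-robin over the inputs, stopping once the shortest is exhausted."""
    iterators = [iter(it) for it in iterables]
    while iterators:
        round_items = []
        for it in iterators:
            try:
                round_items.append(next(it))
            except StopIteration:
                return  # drop the incomplete round and stop
        yield from round_items


result = list(mux(range(3), range(10, 15), range(20, 25)))
```

Collecting a full round before yielding is what keeps the output length a multiple of the number of inputs.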

Test Plan:
buck test mode/opt //caffe2/test:datapipe

https://www.internalfb.com/intern/testinfra/testrun/4785074706282345

Differential Revision: D36017945

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77145
Approved by: https://github.com/NivekT, https://github.com/ejguan
2022-05-18 16:23:07 +00:00
Kevin Tse
bbaefdf6b5 [DataPipe] Enforcing single valid iterator for IterDataPipes with multiple DataPipes as outputs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75995

Approved by: https://github.com/VitalyFedyunin
2022-05-18 01:31:39 +00:00
Kevin Tse
7c52f204e0 [DataPipe] Enforcing single valid iterator for IterDataPipes without multiple outputs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70479

Approved by: https://github.com/ejguan
2022-05-18 01:31:38 +00:00
Vitaly Fedyunin
edffd595c2 [DataLoader] Adding ability to use dill to pass DataPipes in multiprocessing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77288

Approved by: https://github.com/ejguan, https://github.com/NivekT
2022-05-15 23:04:03 +00:00
Kevin Tse
a008d19ff7 [DataPipe] Revamp serialization logic of DataPipes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74984

Approved by: https://github.com/ejguan
2022-05-10 16:16:46 +00:00
zengk95
ef63408853 Revert [DataPipe] Update mux data pipe
Reverts #76384

This is breaking the test test_demux_mux_datapipe (__main__.TestIterableDataPipeBasic). See logs: a997046017
It was red on the PR as well: https://hud.pytorch.org/pytorch/pytorch/pull/76384
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76507
Approved by: https://github.com/kit1980
2022-04-28 00:06:30 +00:00
Ning Li (Seattle)
a997046017 [DataPipe] Update mux data pipe (#76384)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76384

OSS issue discussion: https://github.com/pytorch/data/issues/346
This diff updates `mux` and `mux_longest` data pipe.
`mux`: Yields one element at a time from each of the input Iterable DataPipes (functional name: ``mux``). As in, one element from the 1st input DataPipe, then one element from the 2nd DataPipe in the next iteration, and so on. It ends when the shortest input DataPipe is exhausted.

`mux` example:

```
>>> from torchdata.datapipes.iter import IterableWrapper
>>> dp1, dp2, dp3 = IterableWrapper(range(3)), IterableWrapper(range(10, 15)), IterableWrapper(range(20, 25))
>>> list(dp1.mux(dp2, dp3))
[0, 10, 20, 1, 11, 21, 2, 12, 22]
```

Test Plan:
buck test mode/dev //pytorch/data/test:tests -- --exact 'pytorch/data/test:tests - test_mux_longest_iterdatapipe (test_datapipe.TestDataPipe)'

https://www.internalfb.com/intern/testinfra/testrun/3096224791148107

Reviewed By: ejguan

Differential Revision: D35799965

fbshipit-source-id: 320e71a342ec27e6e9200624aad42f4b99f97c3a
(cherry picked from commit 741ed595275df6c05026ed6f0e78d7052328fb7d)
2022-04-27 22:10:42 +00:00
erjia
0ff05b1e97 [DataPipe] Add functional API docstring and fix typo in test
Per title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76272
Approved by: https://github.com/ishaan-mehta, https://github.com/NivekT
2022-04-25 14:16:53 +00:00
Kevin Tse
383f026791 [DataPipe] Enabling graph traversal for MapDataPipe
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74851

Approved by: https://github.com/ejguan
2022-04-22 18:06:16 +00:00
erjia
ec591087fb [DataPipe] Add input_col to filter and add deprecation warning for DataPipe arguments
Last patch to align DataPipe API with TorchArrow DataFrame

For deprecation warning of DataPipe argument:
```
The argument `drop_empty_batches` of `FilterIterDataPipe()` is deprecated since 1.12 and will be removed in 1.14.
See https://github.com/pytorch/data/issues/163 for details.
```
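
A generic way to deprecate a keyword argument is sketched below. It is not the mechanism used in this patch: `filter_items` is a hypothetical function, and the sentinel default of `None` is an assumption made so that the warning fires only when the caller passes the argument.

```python
import warnings


def filter_items(items, predicate, drop_empty_batches=None):
    """Hypothetical function with one deprecated keyword argument."""
    if drop_empty_batches is not None:
        warnings.warn(
            "The argument `drop_empty_batches` of `filter_items()` is deprecated "
            "and will be removed in a future release.",
            FutureWarning,
            stacklevel=2,  # point the warning at the caller's line
        )
    return [item for item in items if predicate(item)]


evens = filter_items([1, 2, 3, 4], lambda x: x % 2 == 0)
```

Callers that never pass the deprecated argument see no warning, so existing code keeps running quietly until it is migrated.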
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76060
Approved by: https://github.com/NivekT
2022-04-22 17:49:39 +00:00
erjia
b8cce8847f [DataPipe] Add functional API to StreamReader and FileOpener
Per title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76233
Approved by: https://github.com/NivekT
2022-04-22 17:49:26 +00:00
erjia
841a7f5187 [DataPipe] apply dill serialization for _Demux and add cache to traverse
- Fix `_Demux` not being picklable when dill is present: https://github.com/pytorch/pytorch/pull/74958#issuecomment-1084637227
- Add a cache to the `traverse` function to prevent infinite recursion on circular references between DataPipes (Fixes https://github.com/pytorch/data/issues/237)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75034
Approved by: https://github.com/wenleix
2022-04-04 19:45:14 +00:00
Kevin Tse
4c5d532728 [DataPipe] only apply special serialization when dill is installed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74958

Approved by: https://github.com/ejguan
2022-03-30 20:38:05 +00:00
Nicolas Hug
5667c4ea21 Remove default parameter of ShufflerIterDataPipe (#74370)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74370

Closes https://github.com/pytorch/data/issues/298. This PR:

- removes the `default` parameter of `ShufflerIterDataPipe`
- renames `set_shuffle_setting()` to `set_shuffle()`
- lets `set_shuffle()` return `self`.
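
Returning `self` from a setter allows call chaining. The class below is a hypothetical stand-in that shows only this pattern; it is not `ShufflerIterDataPipe`, and the `_enabled` attribute is an invented name.

```python
class Shuffler:
    """Hypothetical stand-in showing only the chainable setter."""

    def __init__(self, source):
        self.source = source
        self._enabled = True

    def set_shuffle(self, shuffle=True):
        self._enabled = shuffle
        return self  # returning self allows call chaining


# Construct and configure in a single expression.
shuffler = Shuffler(range(5)).set_shuffle(False)
```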

Test Plan: Imported from OSS

Reviewed By: george-qi

Differential Revision: D35073666

Pulled By: NicolasHug

fbshipit-source-id: 9847b037e70f44f36eaf4471f2c12fa8ec2ed73c
(cherry picked from commit b07ab646f308532886e8daddd57e937a53edb153)
2022-03-28 12:47:24 +00:00
Kevin Tse
eec994fc16 [DataPipe] Separating DataPipes from Dataset into different files (#73396)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73396

Separating DataPipes from Dataset into different files. This makes the code more maintainable and simplifies some of the code generation.

I have also tried to move `datapipe.py` into `torch.utils.data.datapipes`, but that leads to circular imports and would require rewriting many import statements. Should I put more time into going down that path?

Fixes https://github.com/pytorch/data/issues/213

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D34481962

Pulled By: NivekT

fbshipit-source-id: 42fb26fe7fc334636852cfd8719fc807bdaa7912
(cherry picked from commit 81e76a64e297cb5c58caa951c554e49526173936)
2022-03-15 14:46:34 +00:00
Kevin Tse
8811d217ed [DataPipe] Slight refactoring IterDataPipe serialization test (#73922)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73922

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D34732288

Pulled By: NivekT

fbshipit-source-id: f31229332fe4eac85cc2085484f6e1b1d802987d
(cherry picked from commit ace20054e4f3f9bd9610640755400fbde82650c3)
2022-03-09 15:33:12 +00:00
Kevin Tse
0821154072 [DataPipe] Adding serialization test for all MapDataPipe (#73921)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73921

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D34732286

Pulled By: NivekT

fbshipit-source-id: 893af2fbb83feb1bae226d3205105de5d3836378
(cherry picked from commit f44fd3c5210d0afdbf826e3b7e7fbe2ec216c3b7)
2022-03-09 15:33:12 +00:00
Kevin Tse
f85309e478 [DataPipe] Adding serialization test at different stages of reading for IterDataPipes (#73119)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73119

Test if a DataPipe is serializable after its contents are partially read and completely read. This is especially important for DataPipes with buffers.
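
The kind of check described here can be sketched with the standard `pickle` module. `BufferedReader` is a hypothetical buffered iterable invented for this example, not one of the DataPipes under test; the point is that the object is round-tripped before reading, mid-read while its buffer is non-empty, and after exhaustion.

```python
import pickle


class BufferedReader:
    """Hypothetical iterable that stages items in an internal buffer."""

    def __init__(self, data, buffer_size=3):
        self.data = list(data)
        self.buffer_size = buffer_size
        self.buffer = []

    def __iter__(self):
        self.buffer = []
        for item in self.data:
            self.buffer.append(item)
            if len(self.buffer) == self.buffer_size:
                yield from self._drain()
        yield from self._drain()

    def _drain(self):
        while self.buffer:
            yield self.buffer.pop(0)


def roundtrip(obj):
    return pickle.loads(pickle.dumps(obj))


reader = BufferedReader(range(7))
unread = roundtrip(reader)            # before any reads
it = iter(reader)
partial = [next(it) for _ in range(2)]
partially_read = roundtrip(reader)    # the buffer is non-empty here
rest = list(it)
fully_read = roundtrip(reader)        # after exhaustion
```

The mid-read case is the interesting one: whatever sits in the buffer at that moment must itself be serializable for the round trip to succeed.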

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D34354496

Pulled By: NivekT

fbshipit-source-id: 36971d68b9ca1de81fb254e9a459b8f54fe0f9ff
(cherry picked from commit e8f39a7aa364bd2b19145788f7e67c06f948f81b)
2022-02-23 16:31:21 +00:00
Kevin Tse
cd4ecce1bb [DataPipe] Fix issue with DataPipe serialization with dill (#72896)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72896

Fixing the issue described here: https://github.com/pytorch/data/issues/214

There will be a follow-up PR in TorchData as well

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D34258669

Pulled By: NivekT

fbshipit-source-id: 6dd88250ed14ebe779915dc46139be7e012e9d1b
(cherry picked from commit 025b8ed98019e576bfef04c33a3f33ed1a426a66)
2022-02-23 16:31:20 +00:00
Erjia Guan
6297aa114f [DataPipe] Extend FileLister to support load multiple directories (#72260)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72260

Test Plan: Imported from OSS

Reviewed By: dagitses, NivekT

Differential Revision: D33979744

Pulled By: ejguan

fbshipit-source-id: 5733d20382642fc2274afd838b33c98150d81e91
(cherry picked from commit f70537ae76)
2022-02-04 07:55:00 +00:00
Erjia Guan
7b014cc645 [DataPipe] Disable Typing for DataPipe before branch cut (#72123)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72123

There is a bug in the DataPipe typing system that would take more than a week to fix. I will follow up on it later this month. As the branch cut is today, this PR disables typing to make sure the release works.

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D33920610

Pulled By: ejguan

fbshipit-source-id: febff849ab2272fd3b1c5127a20f27eb82992d9c
(cherry picked from commit ee103e62e7)
2022-02-02 05:00:41 +00:00
Santiago Castro
5024c1bc7b Make get_file_pathnames_from_root output order deterministic (#70435)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/70103

I used an argument so that it can be disabled. I called it `deterministic_order` because `sort` can be confusing: the output is sorted, but per directory level.
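
One way to get a reproducible listing order with `os.walk` is sketched below. It is an assumption about the general technique, not the code of `get_file_pathnames_from_root`: `walk_files` is a hypothetical helper, and only the `deterministic_order` flag name is borrowed from the description.

```python
import os
import tempfile


def walk_files(root, deterministic_order=False):
    """Yield file paths under root, optionally in a reproducible order."""
    for dirpath, dirnames, filenames in os.walk(root):
        if deterministic_order:
            dirnames.sort()   # in-place sort controls the order os.walk descends
            filenames.sort()
        for name in filenames:
            yield os.path.join(dirpath, name)


with tempfile.TemporaryDirectory() as root:
    for sub in ("b", "a"):
        os.makedirs(os.path.join(root, sub))
    for parts in (("z.txt",), ("b", "1.txt"), ("a", "2.txt"), ("c.txt",)):
        open(os.path.join(root, *parts), "w").close()
    found = [os.path.relpath(p, root) for p in walk_files(root, deterministic_order=True)]
```

The result is ordered within each directory level rather than globally: the files at the root come first, followed by those in each subdirectory in sorted order.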

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70435

Reviewed By: albanD

Differential Revision: D33899755

Pulled By: ejguan

fbshipit-source-id: e8a08f03a49120333b2d27f332cd21a3240a02a9
(cherry picked from commit 4616e43ec3)
2022-02-01 18:12:23 +00:00
Vitaly Fedyunin
b36b11cbc1 Separating CaptureDataFrame out of DFIterDataPipe (#71776)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71776

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D33771602

Pulled By: VitalyFedyunin

fbshipit-source-id: 59d85bc707a9568f1f0960fc184113a4f422d2df
(cherry picked from commit 93522768ef)
2022-01-26 03:25:02 +00:00
Erjia Guan
bb157dd4eb Make methods of internal file_obj visible from StreamWrapper (#71653)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71653

Test Plan: Imported from OSS

Reviewed By: NivekT

Differential Revision: D33718749

Pulled By: ejguan

fbshipit-source-id: f3a8244f22ca37049b8678afa0e329b23c957a9d
(cherry picked from commit a4d12ca48e)
2022-01-25 15:34:24 +00:00
Kevin Tse
13ea2cb330 [DataPipe] Make GroupBy serializable with lambda function (#71497)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71497

Related to https://github.com/pytorch/data/issues/172

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D33668749

Pulled By: NivekT

fbshipit-source-id: 6506614e9d4389dc645d8985c00fdb3402122d9b
(cherry picked from commit 458e76fcb1)
2022-01-21 16:04:45 +00:00
Kevin Tse
36b4c95e74 [DataPipe] adding serialization test for all core IterDataPipes (#71456)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71456

Related to https://github.com/pytorch/data/issues/172

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D33668748

Pulled By: NivekT

fbshipit-source-id: ea2085d5ed47533ca49258cc52471373c6ae1847
(cherry picked from commit d5f6fde1d0)
2022-01-21 16:04:45 +00:00
Kevin Tse
011fd1d933 [DataPipe] improving DataPipe unit tests (#70215)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70215

A few renames, formatting changes, and additional tests to improve the unit tests.

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D33344610

Pulled By: NivekT

fbshipit-source-id: bb36f7452bdc44964c9ce0650c7ae308ba2c5aa5
(cherry picked from commit 0aae20cb27)
2022-01-20 15:49:53 +00:00
Erjia Guan
fd9e08df5d Make Demux serializable with lambda function (#71311)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71311

Test Plan: Imported from OSS

Reviewed By: NivekT

Differential Revision: D33584552

Pulled By: ejguan

fbshipit-source-id: 52324faf5547f9f77582ec170ec91ce3114cfc61
2022-01-18 06:47:54 -08:00
Kevin Tse
1e3893ecbb [DataPipe] Removing deprecated DataPipes (#71161)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71161

Users should import these DataPipes from [TorchData](https://github.com/pytorch/data) if they would like to use them. We will be checking for any downstream library usage before landing this PR.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D33532272

Pulled By: NivekT

fbshipit-source-id: 9dbfb21baf2d1183e0aa379049ad8304753e08a1
2022-01-13 07:37:48 -08:00
Kevin Tse
8dcfdf39e7 [DataPipe] Renaming FileLoader to FileOpener with deprecation warning for FileLoader (#70367)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70367

This PR renames the `FileLoaderIterDataPipe` to `FileOpenerIterDataPipe`. For the sake of not breaking many CI tests immediately, it still preserves `FileLoader` as an alias. This will allow downstream libraries/users to migrate their use cases before we fully remove all references to `FileLoader` from PyTorch.
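
A common pattern for keeping an old name alive during a rename is sketched below. It is not necessarily how this PR implements the alias: both classes are simplified hypothetical stand-ins, and the choice of `DeprecationWarning` is an assumption.

```python
import warnings


class FileOpener:
    """Hypothetical stand-in for the renamed class."""

    def __init__(self, paths, mode="r"):
        self.paths = list(paths)
        self.mode = mode


class FileLoader(FileOpener):
    """Deprecated alias kept so existing imports continue to work."""

    def __init__(self, *args, **kwargs):
        warnings.warn(
            "FileLoader is deprecated; use FileOpener instead.",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller
        )
        super().__init__(*args, **kwargs)


opener = FileOpener(["a.txt"], mode="rb")
```

Because the alias subclasses the new class, `isinstance` checks against `FileOpener` keep passing for code that still constructs `FileLoader`.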

Fixes https://github.com/pytorch/data/issues/103. More detailed discussion about this decision is also in the linked issue.

cc VitalyFedyunin ejguan NivekT pmeier Nayef211

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D33301648

Pulled By: NivekT

fbshipit-source-id: 59278dcd44e372df0ba2001a4eecbf9792580d0b
2022-01-04 09:14:50 -08:00
Kevin Tse
ad0cd8a76e [DataPipe] Improve inline doc and testing for CollatorIterDataPipe (#70139)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70139

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D33199107

Pulled By: NivekT

fbshipit-source-id: f96d77490998ac9bc3da8d4ff1a9caa08e9e7f27
2021-12-20 08:05:21 -08:00