Commit Graph

501 Commits

Author SHA1 Message Date
Kevin Tse
64a526d4af [DataLoader] Replacing traverse function with traverse_datapipes (#85667)
This PR deprecates the `traverse` function and replaces it with `traverse_datapipes`.

While using `DataLoader`, I realized that it raises a `FutureWarning` even though I am not explicitly calling `traverse`. What is happening is that `DataLoader` invokes `traverse(dp, only_datapipe=True)`, and the use of that keyword argument causes the `only_datapipe` warning to be raised.

```
/home/ubuntu/miniconda3/lib/python3.8/site-packages/torch/utils/data/graph.py:102: FutureWarning: `only_datapipe` is deprecated from `traverse` function and will be removed after 1.13.
  warnings.warn(msg, FutureWarning)
```

A few things we'd like to do:
1. Deprecate the keyword argument `only_datapipe`
2. Change the default behavior from `only_datapipe=False` to `only_datapipe=True` in the future
3. Do not raise a warning when users are using the function correctly

This creates a paradox: it is impossible for users to change their code to match the future default behavior (i.e. call `traverse(dp)` without `only_datapipe`):
  - they cannot do so because the default behavior of `traverse` hasn't changed yet, so they must use `only_datapipe=True`
  - if they use `only_datapipe=True`, eventually the kwarg will go away and cause a runtime error; they also get a `FutureWarning` in the present

IIUC, there doesn't seem to be a way to accomplish those 3 goals without replacing the function with a new one that has a different name; hence, this PR. Let me know if there is a better alternative.
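
To make the direction concrete, here is a minimal sketch of the rename idea (illustrative only, not the exact `torch.utils.data.graph` code): the old name always warns and forwards, while the new name implements the future default without warning.

```python
import warnings

def traverse_datapipes(datapipe):
    # New entry point: behaves like the old `traverse(dp, only_datapipe=True)`
    # and never warns when used correctly.
    return {id(datapipe): (datapipe, {})}  # placeholder for the real graph walk

def traverse(datapipe, only_datapipe=None):
    # Old entry point: always deprecated, regardless of how it is called.
    warnings.warn(
        "`traverse` is deprecated; please use `traverse_datapipes` instead.",
        FutureWarning,
    )
    return traverse_datapipes(datapipe)
```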

If this looks right, I will send a follow up PR in `TorchData`.

Differential Revision: [D39832183](https://our.internmc.facebook.com/intern/diff/D39832183)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85667
Approved by: https://github.com/ejguan
2022-09-27 19:58:15 +00:00
Erjia Guan
f1a6f32b72 [DataLoader] Make distributed lazily initialized & share seed via PG (#85279)
Fixes #84492 https://github.com/pytorch/data/issues/772

## Changes
- Move the distributed sharding logic from the constructor of `DataLoader` to the constructor of the `DataLoader` iterator. This prevents the error caused by the distributed process group being initialized lazily.
- Replace the distributed store with a process group (`gloo`) for sharing the random seed, because the `mpi` backend doesn't provide a distributed store (a minimal sketch of the idea is included below).
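
A minimal sketch of the seed-sharing idea, assuming the default process group is already initialized (the helper name `share_random_seed` is illustrative, not the actual DataLoader code):

```python
import torch
import torch.distributed as dist

def share_random_seed():
    # A dedicated gloo group keeps the exchange CPU-only, so it works even when
    # the default backend (e.g. mpi) provides no distributed store.
    pg = dist.new_group(backend="gloo")
    seed = [torch.empty((), dtype=torch.int64).random_().item()]
    dist.broadcast_object_list(seed, src=0, group=pg)  # rank 0's seed wins
    return seed[0]
```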

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85279
Approved by: https://github.com/NivekT, https://github.com/VitalyFedyunin
2022-09-23 18:52:52 +00:00
Erjia Guan
ea72a0991c Add support to traverse all python collection objects (#84079)
Fixes https://github.com/pytorch/data/issues/752

This PR makes the `traverse` function support more collection data structures from Python. The `getstate_hook` is invoked after the custom `__getstate__` function, which guarantees that `traverse` works as long as the `DataPipe` works properly with multiprocessing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84079
Approved by: https://github.com/NivekT, https://github.com/VitalyFedyunin
2022-09-23 16:21:25 +00:00
Xu Zhao
52a2b61203 Fix fetch function which breaks user code (#85099)
The [fastNLP](https://github.com/fastnlp/fastNLP/blob/v0.6.0/fastNLP/core/batch.py#L51) model uses DataSetGetter to fetch data from the dataset. The following code breaks because of https://github.com/pytorch/pytorch/pull/84301:

```
import os

import torch
from fastNLP.core.batch import DataSetGetter
from fastNLP.io.pipe.qa import CMRC2018BertPipe

input_dir = os.path.join(os.path.dirname(os.path.abspath(__file__)), ".data", "cmrc2018-sim")
data_bundle = CMRC2018BertPipe().process_from_file(paths=input_dir)
data_bundle.rename_field('chars', 'words')
dev_data = data_bundle.get_dataset('dev')
dataset = DataSetGetter(dev_data, as_numpy=False)
dataiter = torch.utils.data.DataLoader(dataset=dataset)
for batch in dataiter:
    pass  # data-processing...
```

This is because for the `DataSetGetter` class, the following condition holds:
```
# hasattr(dataset_getter, '__getitems__') == True
# dataset_getter.__getitems__ == None
```

This PR adds an additional check to make sure `__getitems__` is only called when it is not None.
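
A hedged sketch of the guard (simplified; the real check lives in `torch/utils/data/_utils/fetch.py`):

```python
def fetch(dataset, possibly_batched_index):
    getitems = getattr(dataset, "__getitems__", None)
    if getitems is not None:
        # Batched path: only taken when the attribute exists AND is not None.
        return getitems(possibly_batched_index)
    # Fallback: fetch items one by one, as before #84301.
    return [dataset[idx] for idx in possibly_batched_index]
```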

This error was found by the torchbench nightly CI; the original error stack trace:
```
ERROR: test_fastNLP_Bert_train_cuda (__main__.TestBenchmark)
----------------------------------------------------------------------
components._impl.workers.subprocess_rpc.ChildTraceException: Traceback (most recent call last):
  File "/home/circleci/project/components/_impl/workers/subprocess_rpc.py", line 470, in _run_block
    exec(  # noqa: P204
  File "<subprocess-worker>", line 35, in <module>
  File "<subprocess-worker>", line 12, in _run_in_worker_f
  File "/home/circleci/project/torchbenchmark/util/model.py", line 16, in __call__
    obj = type.__call__(cls, *args, **kwargs)
  File "/home/circleci/project/torchbenchmark/models/fastNLP_Bert/__init__.py", line 93, in __init__
    self.example_inputs = self._prefetch(example_inputs)
  File "/home/circleci/project/torchbenchmark/models/fastNLP_Bert/__init__.py", line 133, in _prefetch
    for batch_x, batch_y in example_inputs:
  File "/home/circleci/miniconda3/lib/python3.8/site-packages/fastNLP/core/batch.py", line 266, in __iter__
    for indices, batch_x, batch_y in self.dataiter:
  File "/home/circleci/miniconda3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 681, in __next__
    data = self._next_data()
  File "/home/circleci/miniconda3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 719, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/circleci/miniconda3/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 56, in fetch
    data = self.dataset.__getitems__(possibly_batched_index)
TypeError: 'NoneType' object is not callable
```

Full error log: https://app.circleci.com/pipelines/github/pytorch/benchmark/5143/workflows/0676f36d-0ab4-42bd-adb4-90e6b0df76d1/jobs/5293
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85099
Approved by: https://github.com/ejguan
2022-09-15 21:48:28 +00:00
erjia
33bb8ae350 Set shuffle to DataPipes with set_shuffle API (#83741)
This PR requires https://github.com/pytorch/pytorch/pull/83202 to be landed first.

## Changes
- For `apply_shuffle_setting` and `apply_shuffle_seed`, make sure the shuffle setting is applied to every DataPipe that has a `set_shuffle` or `set_seed` method.
- Rename the API from `apply_shuffle_seed` to `apply_random_seed`.
- Fix a bug where `apply_shuffle_seed` only accepted hashable DataPipes. After this PR, the function uses `id` to prevent seeding the same DataPipe multiple times per epoch (see the sketch below).
- Fix another bug in `Shuffler` where `reset` with `_enable=False` would also reset `_seed`.
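
A rough sketch of the id-based bookkeeping described above (illustrative; not the actual `torch.utils.data.graph_settings` code):

```python
import torch

def apply_random_seed(datapipes, rng: torch.Generator):
    seeded_ids = set()  # `id(dp)` works even for unhashable DataPipes
    for pipe in datapipes:
        if id(pipe) in seeded_ids or not hasattr(pipe, "set_seed"):
            continue
        seed = int(torch.empty((), dtype=torch.int64).random_(generator=rng).item())
        pipe.set_seed(seed)
        seeded_ids.add(id(pipe))  # never seed the same DataPipe twice per epoch
    return datapipes
```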
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83741
Approved by: https://github.com/NivekT
2022-09-13 13:38:58 +00:00
Kevin Tse
27e5299ee3 [DataPipe] Fix mishandling of exception message when error is not iterable (#84676)
We sometimes get an exception message like this:
```
This exception is thrown by __iter__ of TarArchiveLoaderIterDataPipe(datapipe=FileOpenerIterDataPipe, length=-1, mode='r:')    elif msg not in e.args[0] and single_iterator_msg not in e.args[0]:

TypeError: argument of type 'int' is not iterable
```

The `TypeError` raised by the mishandling of the error message obfuscates the true exception, which will now be shown as:
```
FileNotFoundError: [Errno 2] No such file or directory:
```
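
A hedged sketch of the kind of guard that avoids the secondary `TypeError` (simplified from the DataPipe iterator hook): only do substring matching when the exception actually carries a string message.

```python
def message_matches(exc: Exception, *known_msgs: str) -> bool:
    first_arg = exc.args[0] if exc.args else None
    if not isinstance(first_arg, str):
        return False  # e.g. FileNotFoundError(2, "No such file..."); don't mangle it
    return any(msg in first_arg for msg in known_msgs)
```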
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84676
Approved by: https://github.com/ejguan
2022-09-09 14:34:13 +00:00
Junteng Jia
335033f718 asyncio increase throughput (pytorch change) (#84301)
Summary: This diff adds a check in the fetcher: if the dataset to be fetched has a `__getitems__` function, use it to fetch a batch of elements in one call, as opposed to one by one. This is beneficial for I/O-bound usage.
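
A hedged sketch of the dataset-side protocol this enables (the fetcher change itself is in `torch/utils/data/_utils/fetch.py`): a dataset may expose `__getitems__` to serve a whole batch of indices in one call, which helps when every item involves I/O.

```python
class BatchedDataset:
    def __init__(self, data):
        self.data = data

    def __getitem__(self, idx):
        return self.data[idx]

    def __getitems__(self, indices):
        # One round trip for the whole batch instead of len(indices) calls.
        return [self.data[i] for i in indices]

    def __len__(self):
        return len(self.data)
```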

Differential Revision: D39145980

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84301
Approved by: https://github.com/VitalyFedyunin
2022-09-08 17:00:45 +00:00
Kevin Tse
cfb9d0d233 [DataPipe] Fixing map function signature validation (#84279)
As @pmeier [points out](https://github.com/pytorch/pytorch/pull/80267#discussion_r958423241), #80267 introduces a bug where an exception is thrown when a built-in function (or a function implemented in C) is used with `.map` because `inspect.signature(fn)` cannot find the function's signature.

This PR skips over a function when its signature cannot be found. I believe this case is rare, and if the `fn` is truly incompatible with the usage of `input_col`/`output_col`, an exception will be raised at run time such that users will be able to examine what is wrong.
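
A hedged sketch of the skip (simplified from the datapipe callable validation): built-ins and C-implemented functions often have no retrievable signature, so the validation falls back to trusting the callable and letting any mismatch surface at run time.

```python
import inspect

def validate_fn(fn, input_col):
    try:
        sig = inspect.signature(fn)
    except (ValueError, TypeError):
        return  # signature unavailable (e.g. a built-in); defer errors to run time
    # ... otherwise compare sig.parameters against input_col/output_col here ...
```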
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84279
Approved by: https://github.com/pmeier, https://github.com/janeyx99
2022-08-31 19:55:01 +00:00
erjia
3f94726453 [DataPipe] Convert MapDataPipe.shuffle to IterDataPipe (#83202)
Fixes: https://github.com/pytorch/data/issues/718

This is an alternative PR against https://github.com/pytorch/pytorch/pull/82974

This PR changes the behavior of both types to match `IterDataPipe.shuffle`:
- Lazily generate the seed per iteration
- Each iterator gets a new seed
- Convert `MapDataPipe.shuffle` to an `IterDataPipe`

## BC-breaking Note:
This PR changes the return type of `MapDataPipe.shuffle` from a `MapDataPipe` to an `IterDataPipe`.

### 1.12
Output as `MapDataPipe`
```
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
True
>>> isinstance(dp, IterDataPipe)
False
```

### This PR:
Output as `IterDataPipe`
```
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
False
>>> isinstance(dp, IterDataPipe)
True
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83202
Approved by: https://github.com/NivekT
2022-08-29 08:57:17 +00:00
PyTorch MergeBot
d50aa517b5 Revert "Add support to traverse all python collection objects (#84079)"
This reverts commit e0f0c8e7b9.

Reverted https://github.com/pytorch/pytorch/pull/84079 on behalf of https://github.com/weiwangmeta due to Diff reverted internally
2022-08-29 06:34:50 +00:00
PyTorch MergeBot
7244a3737c Revert "[DataPipe] Convert MapDataPipe.shuffle to IterDataPipe (#83202)"
This reverts commit a423c966a7.

Reverted https://github.com/pytorch/pytorch/pull/83202 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally
2022-08-28 18:00:17 +00:00
erjia
a423c966a7 [DataPipe] Convert MapDataPipe.shuffle to IterDataPipe (#83202)
Fixes: https://github.com/pytorch/data/issues/718

This is an alternative PR against https://github.com/pytorch/pytorch/pull/82974

This PR changes the behavior of both types to match `IterDataPipe.shuffle`:
- Lazily generate the seed per iteration
- Each iterator gets a new seed
- Convert `MapDataPipe.shuffle` to an `IterDataPipe`

## BC-breaking Note:
This PR changes the return type of `MapDataPipe.shuffle` from a `MapDataPipe` to an `IterDataPipe`.

### 1.12
Output as `MapDataPipe`
```
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
True
>>> isinstance(dp, IterDataPipe)
False
```

### This PR:
Output as `IterDataPipe`
```
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
False
>>> isinstance(dp, IterDataPipe)
True
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83202
Approved by: https://github.com/NivekT
2022-08-26 23:33:20 +00:00
erjia
e0f0c8e7b9 Add support to traverse all python collection objects (#84079)
Fixes https://github.com/pytorch/data/issues/752

This PR makes the `traverse` function support more collection data structures from Python. Please let me know if anyone has a better idea about how to elegantly check whether an object is a collection, so that we can dive into it to see whether any DataPipe is wrapped inside.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84079
Approved by: https://github.com/NivekT
2022-08-26 21:02:43 +00:00
erjia
4c19981316 [DataPipe] Reset Shuffler's iterator when NotStarted (#83535)
This PR changes the behavior of `IterDataPipe` to always invoke `reset` from the `NotStarted` state. The main reason is that we normally put lazy initialization code into the `reset` function; even in the `NotStarted` state we should invoke `reset` to initialize those lazy variables. Otherwise, we would have to manually determine in `__iter__` whether the state is `NotStarted` or `Iterating`, and invoke `reset` only in the `NotStarted` state.

This PR also makes `Shuffler` able to serialize its `buffer` and `rng_state`.

The following part is removed:

~I am also add `_snapshot_state` into serialization state and during `__setstate__` only change the state to `Restored` if the original state is `Iterating`. Especially, for the case of deserializing/serializing `NotStarted` DataPipe (multiprocessing), we would invoke `set_seed` for `Shuffler`. We need the `DataPipe` remains as `NotStarted` to properly `reset`.~

I am listing all the expected state transitions below (a small sketch follows the list):
- Initial state: `NotStarted`
  - `iter` -> Call `reset` and change the state to `Iterating`
  - serialize/deserialize -> Keep the state as `NotStarted` (will `reset` if `iter` is called afterwards)
- Initial state: `Iterating`
  - `iter` -> Call `reset` and keep the state as `Iterating`
  - serialize/deserialize -> Change the state to `Restored`
- Initial state: `Restored`
  - `iter` -> Only change the state to `Iterating`
  - serialize/deserialize -> Not allowed
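
A small sketch of these transitions (names simplified; not the actual `torch.utils.data` internals):

```python
from enum import Enum, auto

class SnapshotState(Enum):
    NotStarted = auto()
    Iterating = auto()
    Restored = auto()

class SketchPipe:
    def __init__(self):
        self._snapshot_state = SnapshotState.NotStarted

    def reset(self):
        pass  # lazy initialization lives here (e.g. Shuffler buffer / RNG seeding)

    def __iter__(self):
        if self._snapshot_state is not SnapshotState.Restored:
            self.reset()  # NotStarted and Iterating both reset
        self._snapshot_state = SnapshotState.Iterating
        return iter(())

    def __getstate__(self):
        if self._snapshot_state is SnapshotState.Restored:
            raise RuntimeError("Serializing a Restored DataPipe is not allowed")
        return self.__dict__.copy()  # a NotStarted pipe stays NotStarted

    def __setstate__(self, state):
        self.__dict__.update(state)
        if self._snapshot_state is SnapshotState.Iterating:
            self._snapshot_state = SnapshotState.Restored
```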

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83535
Approved by: https://github.com/NivekT
2022-08-25 19:45:41 +00:00
erjia
56fef4e6ee fix NoneType object has no attribute python_exit_status (#83985)
Fixes #83791

Prevents the error raised when the `_utils` module has already been cleared by Python before `__del__` is invoked.
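
A minimal sketch of the kind of guard this adds (names are illustrative, not the exact `torch/utils/data/dataloader.py` code): during interpreter shutdown, module-level globals may already be `None`, so check them before use in `__del__`.

```python
import types

_utils = types.SimpleNamespace(python_exit_status=False)  # stand-in for torch.utils.data._utils

class _IteratorSketch:
    def __del__(self):
        # Guard: the module (or its exit flag) may already be gone at interpreter exit.
        if _utils is None or _utils.python_exit_status is True or _utils.python_exit_status is None:
            return  # Python is shutting down; skip the normal cleanup path
        # ... normal worker shutdown would happen here ...
```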
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83985
Approved by: https://github.com/NivekT
2022-08-25 16:05:14 +00:00
Robert
5c49c7bbba [WIP] Validating input_col for certain datapipes (#80267)
Follow up from #79344.

Currently WIP due to multiple test failures.

Waiting for #80140 to land
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80267
Approved by: https://github.com/ejguan
2022-08-24 17:34:28 +00:00
joncrall
b136f3f310 More doctest refinements. (#83317)
Follow up to #82797

Now that the doctests themselves are in a better state, we should be able to enable xdoctest on the CI so they stay that way.

@ezyang @vadimkantorov
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83317
Approved by: https://github.com/ezyang
2022-08-22 20:07:26 +00:00
albanD
3834836260 [DataLoader] Move loop content into a function to ensure we don't preserve anything (#83595)
Can lead to CPU memory savings, as we no longer hold onto the pin-memory buffer as long as we used to.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83595
Approved by: https://github.com/ejguan, https://github.com/NivekT
2022-08-18 20:54:47 +00:00
joncrall
4618371da5 Integrate xdoctest - Rebased (#82797)
This is a new version of #15648 based on the latest master branch.

Unlike the previous PR where I fixed a lot of the doctests in addition to integrating xdoctest, I'm going to reduce the scope here. I'm simply going to integrate xdoctest, and then I'm going to mark all of the failing tests as "SKIP". This will let xdoctest run on the dashboards, provide some value, and still let the dashboards pass. I'll leave fixing the doctests themselves to another PR.

In my initial commit, I do the bare minimum to get something running with failing dashboards. The few tests that I marked as skip are causing segfaults. Running xdoctest results in 293 failed, 201 passed tests. The next commits will be to disable those tests. (unfortunately I don't have a tool that will insert the `#xdoctest: +SKIP` directive over every failing test, so I'm going to do this mostly manually.)

Fixes https://github.com/pytorch/pytorch/issues/71105

@ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82797
Approved by: https://github.com/ezyang
2022-08-12 02:08:01 +00:00
Kevin Tse
14b660fcc0 [DataPipe] Correct the type of exception that is being raised by ShufflerMapDataPipe (#82666)
Fixes https://github.com/pytorch/data/issues/708

The following code snippet used to fail, now it has been added as a test case:
```python
import torch.utils.data.datapipes as dp

dp1 = dp.map.SequenceWrapper(range(10))
shuffle_dp1 = dp1.shuffle()
dp2 = dp.map.SequenceWrapper(range(10))
shuffle_dp2 = dp2.shuffle()
zip_dp = shuffle_dp1.zip(shuffle_dp2)
list(zip_dp)  # This used to fail
```

The issue was that `ShufflerMapDataPipe` raises a `KeyError` when an out-of-bounds index is passed to it, but that was not handled by `zip_dp`'s `__getitem__`, which only handled `IndexError`. With this change, it handles both.
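
A hedged sketch of what "handles both" means here (illustrative, not the exact `ZipperMapDataPipe.__getitem__`):

```python
def zip_getitem(datapipes, index):
    items = []
    for pipe in datapipes:
        try:
            items.append(pipe[index])
        except (IndexError, KeyError) as e:  # shuffled map pipes raise KeyError for bad keys
            raise IndexError(f"Index {index} is out of range for one of the inputs") from e
    return tuple(items)
```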
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82666
Approved by: https://github.com/ejguan
2022-08-03 19:05:17 +00:00
ProGamerGov
71d50f4f89 Change docstring type callable to Callable for consistency (#82487)
### Description

Across PyTorch's docstrings, both `callable` and `Callable` are used for variable types. `Callable` should be capitalized, as we are referring to the `Callable` type and not the Python `callable()` function.

### Testing

There shouldn't be any testing required.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82487
Approved by: https://github.com/albanD
2022-08-01 17:26:09 +00:00
Kevin Tse
35d97e21c8 [DataPipe] Simple graph snapshotting (#79479)
This mostly completes the "poor man's snapshotting" implementation (named "simple snapshotting"). This is the most basic version of snapshotting, but it should work for all DataPipes. I will be adding more efficient implementations for different types of DataPipes in future PRs.

### Implementation

The general idea of the simple snapshot is that we will:
1. Create a new iterator
2. Move that iterator forward by `n_iterations`
3. Save that as the `_fast_forward_iterator` of the DataPipe
4. The next time `iter` is called on the DataPipe, use the `_fast_forward_iterator`

### Usage
As of this implementation, the usage will look something like:
```python
rng = torch.Generator()
initial_rng_state = rng.get_state()
datapipe: IterDataPipe = ...
# Some usage of the DataPipe, here maybe yielding the first 5 values
n_iter = 5
it = iter(datapipe)
for _ in range(n_iter):
    next(it)
serialized_graph = pickle.dumps(datapipe)

# The serialized object has most of the sufficient information for simple snapshot (except for initial RNG state)
# It can be deserialized at a later point in time or by a different process
deserialized_graph = pickle.loads(serialized_graph)
# I think `DataLoader2` or `ReadingService` should store `initial_rng_state` that can be saved by the API that we later use
rng_for_deserialized = torch.Generator()
rng_for_deserialized.set_state(initial_rng_state)
n_iterations = deserialized_graph._number_of_samples_yielded

_simple_snapshot_graph(deserialized_graph, n_iterations, rng=rng_for_deserialized)
# The whole DataPipe graph should have the same state as before serialization, such that:
self.assertEqual(list(it), list(deserialized_graph))  # True
```

### Next Steps
If this looks acceptable, the next step is that I will modify `DataLoader2`'s prototype ReadingService (the one with queues) to remember things like `initial_rng_state`, and to add a `save_snapshot` method that returns the `(serialized graph, initial_rng)` pair as well as a `restore_snapshot` method. This should work for single-worker data loading.

Note that, in the long term, `initial_rng_state` may not be necessary if we are able to directly save/restore the buffer and RNG state of `Shuffler` (that is work in progress). However, `initial_rng_state` and simple snapshot is still a good fall-back option for some edge cases where the buffer can't be stored.

Differential Revision: [D37943406](https://our.internmc.facebook.com/intern/diff/D37943406)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79479
Approved by: https://github.com/ejguan
2022-07-23 02:53:15 +00:00
Kevin Tse
428e44ffa1 [DataPipe] Fixes various warnings, exceptions, and clean up testing (#81833)
I went through most of the warnings and exceptions raised in our tests to find these issues.

Changes:
1. In testing, `self.assertEquals` is deprecated, converting to `self.assertEqual` to get rid of the warning
2. Small changes for cleanliness and get rid of warnings (no actual change to result)
3. Correct `is_every_instance_exhausted` logic for `_Forker`
4. Catch the `RuntimeError` raised by an invalidated iterator during cleanup
5. Check if attribute `parent_stream` exists before trying to access it

Differential Revision: [D38020122](https://our.internmc.facebook.com/intern/diff/D38020122)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81833
Approved by: https://github.com/ejguan
2022-07-21 18:59:40 +00:00
erjia
aa1466d542 Raise proper timeout when sharing the distributed shared seed (#81666)
Fixes https://github.com/pytorch/data/issues/659

- This fixes the problem where a slow DataLoader on rank 0 would cause a `TimeoutError`, as the `wait` operation on the other ranks has been removed.
- This PR also adds a [default timeout](f6a45f7984/torch/csrc/distributed/c10d/ProcessGroup.hpp (L26-L27)) of 30 * 60 seconds (taking the value from the distributed team's implementation). When sharing the distributed seed gets stuck on any rank, a proper timeout error with a detailed message is raised.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81666
Approved by: https://github.com/NivekT
2022-07-19 17:21:02 +00:00
erjia
ccbf04dd5f [DataPipe] Fix fork/unzip with a single child (#81502)
When `Forker` or `Unzipper` only contains a single child, the buffer should be cleaned up. This is one of the root causes for the issue reported internally. See: https://fburl.com/2k0et1gv
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81502
Approved by: https://github.com/VitalyFedyunin, https://github.com/NivekT
2022-07-18 16:53:19 +00:00
Matthew Caseres
00359ff886 Fix docstring on FileOpenerIterDataPipe (#81407)
The docstring said the default mode argument was `b` when it is actually `r`.

Fixes #81406
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81407
Approved by: https://github.com/kit1980
2022-07-16 01:01:39 +00:00
erjia
2f5d4cf90c Fix mypy for IterDataPipe.collate (#81275)
Add `default_collate` to mypy stub file to make sure `default_collate` is imported for `IterDataPipe.collate`

Sister PR from TorchData: https://github.com/pytorch/data/pull/645
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81275
Approved by: https://github.com/NivekT
2022-07-13 15:54:14 +00:00
Erjia Guan
782f18e9b5 [DLv2] Make graph traverse working with unhashable DataPipe (#80509)
Summary:
This diff removes the requirement from the `traverse` function that the `DataPipe` be hashable. The `traverse` function now uses the `id` of the `DataPipe` instance, rather than the `DataPipe` itself, as the key for both the `cache` and the graph.

However, this requires changing the type of `DataPipeGraph` from `Dict[DataPipe, "DataPipeGraph"]` to `Dict[int, Tuple[DataPipe, "DataPipeGraph"]]`.
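
A hedged sketch of the reshaped type (with `Any` standing in for `DataPipe`): keys are `id(datapipe)`, so the DataPipe itself no longer has to be hashable.

```python
from typing import Any, Dict, Tuple

DataPipeGraph = Dict[int, Tuple[Any, "DataPipeGraph"]]

source, wrapper = object(), object()  # stand-ins for DataPipe instances
graph: DataPipeGraph = {id(wrapper): (wrapper, {id(source): (source, {})})}
```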

Differential Revision: D37354153

Ref PR in TorchData: https://github.com/pytorch/data/pull/559
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80509
Approved by: https://github.com/VitalyFedyunin
2022-07-12 14:47:42 +00:00
Vitaly Fedyunin
e9b3bc2ead [DataLoader] Locking lower ranks seed recepients (#81071)
Exit the seed-receiving section only when all ranks have received the seed; otherwise we risk the current rank reaching the same section of code again while rank zero is still in the previous iteration.

Fixes: #80845

Differential Revision: [D37702557](https://our.internmc.facebook.com/intern/diff/D37702557)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81071
Approved by: https://github.com/msaroufim, https://github.com/ejguan
2022-07-08 18:53:45 +00:00
Vitaly Fedyunin
bcab5257de Expanding DataPipe to support DataFrames (#71931)
Differential Revision: [D37500516](https://our.internmc.facebook.com/intern/diff/D37500516)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71931
Approved by: https://github.com/ejguan
2022-07-08 18:46:10 +00:00
Vitaly Fedyunin
331c0c1803 [DataLoader] Close open in DataPipe streams on best effort basis (#78952)
Adding the ability to:
- Track open StreamWrappers with `StreamWrapper.session_streams`
- Automatically close the parent StreamWrapper (e.g. a torchdata tar is the parent and the extracted file streams are its children)
- Close streams contained in structures that are discarded by filtering

Differential Revision: [D37489935](https://our.internmc.facebook.com/intern/diff/D37489935)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78952
Approved by: https://github.com/ejguan
2022-06-29 20:11:23 +00:00
Kevin Tse
b8e50f512f [DataPipe] Count number of successful yields for IterDataPipe (#79657)
This PR adds an attribute and logic to count the number of successful yields from `IterDataPipe`. This information can be useful to fast-forward a DataPipe (or the entire graph) back to a certain state.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79657
Approved by: https://github.com/VitalyFedyunin
2022-06-28 17:30:33 +00:00
erjia
3ec9d34f21 Fix distributed store to use add for the counter of DL shared seed (#80348)
In order to get the result of `_shared_seed_recv_cnt` properly, switch from `store.get` to `store.add(key, 0)`.

See the comment from distributed team for the reason:
590d3e5774/torch/distributed/distributed_c10d.py (L242-L246)
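
A hedged sketch of the difference, using an in-process `HashStore` just to illustrate the Store API: `add(key, 0)` atomically returns the current counter (creating it as 0 if absent), whereas `get` can fail or block when the key has not been set yet.

```python
import torch.distributed as dist

store = dist.HashStore()  # in-process Store, standing in for the DataLoader's store
store.add("_shared_seed_recv_cnt", 1)          # a rank acknowledges the shared seed
count = store.add("_shared_seed_recv_cnt", 0)  # read the counter without modifying it
print(count)  # -> 1
```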
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80348
Approved by: https://github.com/VitalyFedyunin, https://github.com/NivekT
2022-06-27 21:59:17 +00:00
Erjia Guan
3d218e1c87 Raise warning for unpickable local function (#547) (#80232)
Summary:
X-link: https://github.com/pytorch/data/pull/547

Fixes https://github.com/pytorch/data/issues/538
- Improve the validation function to raise a warning about an unpicklable function when either a lambda or a local function is provided to a DataPipe.
- The inner function of a `functools.partial` object is also extracted for validation.
- Mimic the behavior of the `pickle` module for a local lambda function: it only raises an error for the local function rather than for the lambda, so we raise a warning about the local function, not the lambda.
```py
>>> import pickle
>>> def fn():
...     lf = lambda x: x
...     pickle.dumps(lf)
>>> fn()
AttributeError: Can't pickle local object 'fn.<locals>.<lambda>'
```

This Diff also fixes the Error introduced by https://github.com/pytorch/pytorch/pull/79344

Test Plan:
CI on PyTorch and TorchData
Manually validated the tests from TorchVision

Differential Revision: D37417556

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80232
Approved by: https://github.com/NivekT
2022-06-27 21:47:09 +00:00
PyTorch MergeBot
fcdaf35114 Revert "Add validation for mapper function in datapipes with input_col (#79344)"
This reverts commit 787ac4edf8.

Reverted https://github.com/pytorch/pytorch/pull/79344 on behalf of https://github.com/ejguan due to This PR breaks multiple use cases and the CI from TorchVision becomes red
2022-06-24 17:17:33 +00:00
PyTorch MergeBot
79ba65c0f2 Revert "Raise warning for unpickable local function (#80140)"
This reverts commit 4b75b7d3c1.

Reverted https://github.com/pytorch/pytorch/pull/80140 on behalf of https://github.com/ejguan due to It will break the CI for TorchData
2022-06-24 14:49:06 +00:00
erjia
4b75b7d3c1 Raise warning for unpickable local function (#80140)
Fixes https://github.com/pytorch/data/issues/538

- Improve the validation function to raise a warning about an unpicklable function when either a lambda or a local function is provided to a `DataPipe`.
- The inner function of a `functools.partial` object is also extracted for validation.
- Mimic the behavior of the `pickle` module for a local lambda function: it only raises an error for the local function rather than for the `lambda`, so we raise a warning about the local function, not the lambda.
```py
>>> import pickle
>>> def fn():
...     lf = lambda x: x
...     pickle.dumps(lf)
>>> fn()
AttributeError: Can't pickle local object 'fn.<locals>.<lambda>'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80140
Approved by: https://github.com/VitalyFedyunin, https://github.com/NivekT
2022-06-24 13:50:51 +00:00
Robert
787ac4edf8 Add validation for mapper function in datapipes with input_col (#79344)
As linked in https://github.com/pytorch/data/issues/362
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79344
Approved by: https://github.com/ejguan, https://github.com/NivekT
2022-06-23 18:49:35 +00:00
erjia
ccccd0efec [DataLoader] Share seed via Distributed Store to get rid of CUDA dependency (#79829)
Fixes #79828

In a distributed environment, before this PR, DataLoader would create a Tensor holding the shared seed on RANK 0 and send that Tensor to the other processes. However, when `NCCL` is used as the distributed backend, the Tensor must be moved to CUDA before being broadcast from RANK 0 to the other RANKs. This caused the issue where DataLoader didn't move the Tensor to CUDA before sharing it via `NCCL`.

After offline discussion with @mrshenli, we think the distributed Store is a better solution, as the shared seed is just an integer value. This way, we can get rid of the dependency on NCCL and CUDA when sharing info between distributed processes for DataLoader.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79829
Approved by: https://github.com/VitalyFedyunin, https://github.com/NivekT
2022-06-20 19:18:35 +00:00
Kevin Tse
e8ed16f3c0 [DataPipe] Enable profiler record context in __next__ branch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79757

Approved by: https://github.com/ejguan
2022-06-17 16:52:07 +00:00
Kevin Tse
25ca006707 [DataPipe] Refactor _hook_iterator for readability
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79656

Approved by: https://github.com/ejguan
2022-06-17 16:52:07 +00:00
Robert
3064982fb8 Support percentages in random_split (#78877)
Fixes #78510

This PR adds support for using fractions with `random_split`. This should be completely backwards-compatible, as fractional-style splitting is only applied when the sum of the input lengths is lower than 1.0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78877
Approved by: https://github.com/ejguan
2022-06-16 02:00:25 +00:00
Kevin Tse
22c7b1ddb5 [DataPipe] Fix error message coming from singler iterator constraint
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79547

Approved by: https://github.com/ejguan
2022-06-14 21:38:36 +00:00
erjia
04f87f2ab9 [DataLoader] Fix the world_size when distributed sharding MapDataPipe (#79524)
Fixes #79449

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79524
Approved by: https://github.com/NivekT, https://github.com/VitalyFedyunin
2022-06-14 19:03:57 +00:00
PyTorch MergeBot
35eda5f959 [DataPipe] Correcting deprecation version
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79302

Approved by: https://github.com/ejguan
2022-06-10 19:31:29 +00:00
ErjiaGuan
5158a6b41a Foward fix sharding bug for DL (#79124)
This PR solves a bug introduced by #79041

`torch.utils.data.graph_settings.apply_sharding` changes the datapipe in-place and returns `None`

This resolves the error in TorchData. See: https://github.com/pytorch/data/actions/runs/2461030312
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79124
Approved by: https://github.com/VitalyFedyunin
2022-06-08 16:16:58 +00:00
erjia
b3ed65343d Fix sharding strategy for distributed DL (#79041)
1. Change the sharding strategy from sharding by worker first and then by rank, to sharding by rank first and then by worker.
2. Fetch the rank and world size in the main process, for the sake of `spawn`.

For the change 1:
Before this PR, when the dataset could not be evenly divided by `worker_num * world_size`, more data was retrieved by the workers on the first RANKs.
Using the following example:
- dataset size: 100
- world_size: 4
- num_worker: 2

The number of data items retrieved by each rank before this PR:
- Rank 0: 26
- Rank 1: 26
- Rank 2: 24
- Rank 3: 24

The number of data items retrieved by each rank after this PR (reproduced by the sketch below):
- Rank 0: 25
- Rank 1: 25
- Rank 2: 25
- Rank 3: 25
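
A tiny sketch that reproduces the post-PR numbers (illustrative only, not the DataLoader internals): shard by rank first, then by worker within the rank.

```python
def shard(n, world_size, num_workers, rank, worker_id):
    per_rank = range(rank, n, world_size)  # rank-major sharding
    return [idx for i, idx in enumerate(per_rank) if i % num_workers == worker_id]

n, world_size, num_workers = 100, 4, 2
for rank in range(world_size):
    total = sum(len(shard(n, world_size, num_workers, rank, w)) for w in range(num_workers))
    print(f"Rank {rank}: {total}")  # every rank gets 25
```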

For the change 2:
Before this PR, `dist` functions were invoked inside the worker processes. That is fine when the worker processes are forked from the parent process, since all environment variables are inherited and visible to these `dist` functions. However, when the worker processes are spawned, they cannot access these environment variables, so the dataset is not sharded by rank.
After this PR, `_sharding_worker_init_fn` should work for both the `spawn` and `fork` cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79041
Approved by: https://github.com/VitalyFedyunin, https://github.com/NivekT
2022-06-07 20:56:32 +00:00
Kevin Tse
42fac176eb [DataPipe] Add function for deprecation of functional DataPipe names
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78970

Approved by: https://github.com/ejguan
2022-06-07 00:14:47 +00:00
Kevin Tse
c44472c5b1 [DataPipe] Disable profiler for IterDataPipe by default
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78674

Approved by: https://github.com/VitalyFedyunin
2022-06-06 22:12:56 +00:00
Vitaly Fedyunin
6fe6902f97 [DataLoader] Apply sharding settings in dist when num_workers is 0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78950

Approved by: https://github.com/ejguan, https://github.com/NivekT
2022-06-06 20:03:02 +00:00