Commit Graph

70 Commits

Author SHA1 Message Date
Ramil Nugmanov
91eeb77260 StackDataset batched sampling (#110694)
Optimization of loading minibatches

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110694
Approved by: https://github.com/ejguan
2023-10-10 22:05:51 +00:00
Navid Sheik
96a3a7cc82 [pytorch] make IterableDataset of Iterable type (#109645)
Summary: Makes `IterableDataset` of `Iterable` type.

Test Plan: tests next diff in the stack are all green

Differential Revision: D49420146

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109645
Approved by: https://github.com/DanilBaibak, https://github.com/Skylion007
2023-09-25 14:18:15 +00:00
katotaisei
bcda859e34 fix typos (#108006)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108006
Approved by: https://github.com/Skylion007
2023-08-28 19:49:09 +00:00
Nikita Shulga
5837e95d30 [Reland] Update mypy to 1.4.1 (#105227)
This PR re-lands
- [Typing] Fix PEP 484 Violation (#105022)
- Update mypy to 1.4.1 (#91983)

That were reverted due to the conflict with internal source repo.

Mostly fixes for PEP-484 violation (i.e. when default arg is set to None, but type is not annotated as optional)
Plus few real fixes:
  - Add missing `_get_upgraders_entry_map` to `torch/_C/__init__.pyi`
  - Add missing return statement to `torch._export. deserialize_graph`
  - Fix error message in `torch.ao.ns.fx.weight_utils.get_lstm_mod_weights`
  - Add assert it `torch/optim/optimizer.py` that Optional list is not None
TODO (in followup PR):
  - Fix erroneous `isinstance` check in `torch/ao/quantization/_pt2e/qat_utils.py`

Unrelated, to bypass CI failures due to the gcc9 dependency update in Ubuntu-18.04:
- Add hack to squash older libstdc++ from conda environment in favor one from OS to `.ci/docker/install_conda.sh`
- Update bazel cuda builds to focal, as with libstdc++-6.0.32 bazel builds loose the ability to catch exceptions (probably because they link with cupti statically, but I could not found where it is done)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105227
Approved by: https://github.com/atalman, https://github.com/albanD, https://github.com/Skylion007
2023-07-15 20:30:20 +00:00
PyTorch MergeBot
15fd1ea118 Revert "[Reland] Update mypy to 1.4.1 (#105227)"
This reverts commit c9c4f8efc3.

Reverted https://github.com/pytorch/pytorch/pull/105227 on behalf of https://github.com/atalman due to trying to mitigate ci sev #105248 ([comment](https://github.com/pytorch/pytorch/pull/105227#issuecomment-1636510935))
2023-07-14 22:28:35 +00:00
Nikita Shulga
c9c4f8efc3 [Reland] Update mypy to 1.4.1 (#105227)
This PR re-lands
- [Typing] Fix PEP 484 Violation (#105022)
- Update mypy to 1.4.1 (#91983)

That were reverted due to the conflict with internal source repo.

Mostly fixes for PEP-484 violation (i.e. when default arg is set to None, but type is not annotated as optional)
Plus few real fixes:
  - Add missing `_get_upgraders_entry_map` to `torch/_C/__init__.pyi`
  - Add missing return statement to `torch._export. deserialize_graph`
  - Fix error message in `torch.ao.ns.fx.weight_utils.get_lstm_mod_weights`
  - Add assert it `torch/optim/optimizer.py` that Optional list is not None
TODO (in followup PR):
  - Fix erroneous `isinstance` check in `torch/ao/quantization/_pt2e/qat_utils.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105227
Approved by: https://github.com/atalman, https://github.com/albanD, https://github.com/Skylion007
2023-07-14 20:45:12 +00:00
PyTorch MergeBot
3c5a494d7a Revert "Update mypy to 1.4.1 (#91983)"
This reverts commit 634659e262.

Reverted https://github.com/pytorch/pytorch/pull/91983 on behalf of https://github.com/malfet due to It's dependent change was reverted, so reverting this one as well, to keep CI clean ([comment](https://github.com/pytorch/pytorch/pull/91983#issuecomment-1636059709))
2023-07-14 15:59:16 +00:00
Nikita Shulga
634659e262 Update mypy to 1.4.1 (#91983)
Mostly fixes for PEP-484 violation (i.e. when default arg is set to None, but type is not annotated as optional)
Plus few real fixes:
  - Add missing `_get_upgraders_entry_map` to `torch/_C/__init__.pyi`
  - Add missing return statement to `torch._export. deserialize_graph`
  - Fix error message in `torch.ao.ns.fx.weight_utils.get_lstm_mod_weights`
  -
TODO (in followup PR):
  - Fix erroneous `isinstance` check in `torch/ao/quantization/_pt2e/qat_utils.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91983
Approved by: https://github.com/kit1980, https://github.com/ZainRizvi, https://github.com/huydhn, https://github.com/thiagocrepaldi, https://github.com/aaronenyeshi
2023-07-13 16:30:36 +00:00
Ramil Nugmanov
28098cae6b [DataLoader] Adding StackDataset (#101338)
Torch wrapping datasets list has:
`TensorDataset`
`ConcatDataset`
`ChainDataset`

`TensorDataset` is useful for stacking sets of tensors but can't work with objects without `.size()` method.

This PR proposes `StackDataset`, similar to `TensorDataset` but for a general case like `ConcatDataset`.

Possible usage of `StackDataset` is multimodal networks with different input like image+text or for staking non-tensor input and property to predict.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101338
Approved by: https://github.com/ejguan, https://github.com/NivekT
2023-05-18 00:57:12 +00:00
Ramil Nugmanov
a2e81a8004 [DataLoader] __getitems__ added to description of Dataset API and better supported within Subset (#100375)
DataLoader supports batched loading from Mapped Datasets.

This is the fetcher's implementation of auto-detection of batch loading support.

torch.utils.data._utils.fetch._MapDatasetFetcher
```
class _MapDatasetFetcher(_BaseDatasetFetcher):
    def fetch(self, possibly_batched_index):
        if self.auto_collation:
            if hasattr(self.dataset, "__getitems__") and self.dataset.__getitems__:
                data = self.dataset.__getitems__(possibly_batched_index)
            else:
                data = [self.dataset[idx] for idx in possibly_batched_index]
```

Description of Dataset API now shows this feature.

Additionally, Subset dataset now supports `__getitems__` if parent dataset supports it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100375
Approved by: https://github.com/ejguan, https://github.com/NivekT
2023-05-05 15:52:28 +00:00
Chase
2f41bc5465 [DataLoader] Add context to NotImplementedErrors in dataset.py (#100667)
Add helpful context message to `NotImplementedError`'s thrown by Dataset and IterableDataset, reminding users that they must implement `__getitem__`/`__iter__` in subclasses. Currently, users are presented with a bare `NotImplementedError` without describing the remedy.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100667
Approved by: https://github.com/NivekT
2023-05-05 02:16:42 +00:00
Sergii Dymchenko
46faa79e09 Simplify by using yield from in torch/utils/data (#97839)
Also see https://github.com/pytorch/pytorch/pull/97831
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97839
Approved by: https://github.com/NivekT, https://github.com/Skylion007
2023-03-29 04:51:26 +00:00
Xuehai Pan
5b1cedacde [BE] [2/3] Rewrite super() calls in functorch and torch (#94588)
Rewrite Python built-in class `super()` calls. Only non-semantic changes should be applied.

- #94587
- #94588
- #94592

Also, methods with only a `super()` call are removed:

```diff
class MyModule(nn.Module):
-   def __init__(self):
-       super().__init__()
-
    def forward(self, ...):
        ...
```

Some cases that change the semantics should be kept unchanged. E.g.:

f152a79be9/caffe2/python/net_printer.py (L184-L190)

f152a79be9/test/test_jit_fuser_te.py (L2628-L2635)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94588
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-02-10 21:16:33 +00:00
joncrall
ad782ff7df Enable xdoctest runner in CI for real this time (#83816)
Builds on #83317 and enables running the doctests. Just need to figure out what is causing the failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83816
Approved by: https://github.com/ezyang, https://github.com/malfet
2022-12-29 05:32:42 +00:00
joncrall
b136f3f310 More doctest refinements. (#83317)
Follow up to #82797

Now that the doctests themselves are in a better state, we should be able to enable xdoctest on the CI so they stay that way.

@ezyang @vadimkantorov
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83317
Approved by: https://github.com/ezyang
2022-08-22 20:07:26 +00:00
joncrall
4618371da5 Integrate xdoctest - Rebased (#82797)
This is a new version of #15648 based on the latest master branch.

Unlike the previous PR where I fixed a lot of the doctests in addition to integrating xdoctest, I'm going to reduce the scope here. I'm simply going to integrate xdoctest, and then I'm going to mark all of the failing tests as "SKIP". This will let xdoctest run on the dashboards, provide some value, and still let the dashboards pass. I'll leave fixing the doctests themselves to another PR.

In my initial commit, I do the bare minimum to get something running with failing dashboards. The few tests that I marked as skip are causing segfaults. Running xdoctest results in 293 failed, 201 passed tests. The next commits will be to disable those tests. (unfortunately I don't have a tool that will insert the `#xdoctest: +SKIP` directive over every failing test, so I'm going to do this mostly manually.)

Fixes https://github.com/pytorch/pytorch/issues/71105

@ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82797
Approved by: https://github.com/ezyang
2022-08-12 02:08:01 +00:00
Robert
3064982fb8 Support percentages in random_split (#78877)
Fixes #78510

This PR adds support for using fractions with `random_split`. This should be completely backwards-compatible as the fractional-style splitting is only applied when the sum across the input lengths is lower than 1.0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78877
Approved by: https://github.com/ejguan
2022-06-16 02:00:25 +00:00
Erjia Guan
0289ab2cec Fix data-related public API (#368)
Summary:
X-link: https://github.com/pytorch/data/pull/368

This is PR aims to expose the right data-relate API.

There are two more changes made in this PR to convert public api to private api
`check_lambda_fn` -> `_check_lambda_fn`
`deprecation_warning` -> `_deprecation_warning`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76143

Reviewed By: albanD, NivekT

Differential Revision: D35798311

Pulled By: ejguan

fbshipit-source-id: b13fded5c88a533c706702fb2070c918c839dca4
(cherry picked from commit 0b534b829a2e90e1e533951c6d334fdeaa9358b9)
2022-04-21 17:27:05 -07:00
Kevin Tse
eec994fc16 [DataPipe] Separating DataPipes from Dataset into different files (#73396)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73396

Separating DataPipes from Dataset into different files. This makes the code more maintainable and simplifies some of the code generation.

I have also tried to move `datapipe.py` into `torch.utils.data.datapipes`, but that will lead to circular import and rewriting many import statements. Should I put more time and go down that path some more?

Fixes https://github.com/pytorch/data/issues/213

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D34481962

Pulled By: NivekT

fbshipit-source-id: 42fb26fe7fc334636852cfd8719fc807bdaa7912
(cherry picked from commit 81e76a64e297cb5c58caa951c554e49526173936)
2022-03-15 14:46:34 +00:00
Kevin Tse
cd4ecce1bb [DataPipe] Fix issue with DataPipe serialization with dill (#72896)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72896

Fixing the issue described here: https://github.com/pytorch/data/issues/214

There will be a follow-up PR in TorchData as well

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D34258669

Pulled By: NivekT

fbshipit-source-id: 6dd88250ed14ebe779915dc46139be7e012e9d1b
(cherry picked from commit 025b8ed98019e576bfef04c33a3f33ed1a426a66)
2022-02-23 16:31:20 +00:00
Kevin Tse
f5e201e4e9 [DataPipe] Adding usage examples for IterDataPipes (#73033)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73033

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D34313793

Pulled By: NivekT

fbshipit-source-id: 51125be2f79d73d02658b2b1c2691f96be8d4769
(cherry picked from commit 3e3c2df7c6)
2022-02-18 15:12:34 +00:00
Kevin Tse
3e1eff9a0e [DataPipe] Add docstrings for IterDataPipe and MapDataPipe, along with small doc changes for consistency (#72618)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72618

The major changes are in torch/utils/data/dataset.py

Let me know if anything is unclear. I'm open to suggestion.

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D34119492

Pulled By: NivekT

fbshipit-source-id: 358cb6d33d18501f9042431350f872ebaa9b4070
(cherry picked from commit 53b484f60a)
2022-02-10 16:25:36 +00:00
Erjia Guan
0721fc6474 Decouple MapDataPipe from Dataset (#70991)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70991

Test Plan: Imported from OSS

Reviewed By: dagitses

Differential Revision: D33477680

Pulled By: ejguan

fbshipit-source-id: d3e89492e921a96791319f35052a229684ddf7cf
2022-01-07 14:28:41 -08:00
Vitaly Fedyunin
d90012689f [DataPipe] Control shuffle settings from DataLoader2 (#65756)
Summary:
Makes `shuffle` DataPipe sensitive to DataLoader(2) `shuffle` kwarg.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65756

Reviewed By: albanD

Differential Revision: D31344867

Pulled By: VitalyFedyunin

fbshipit-source-id: e0084e0ac193ac784d6298328ca1222745681347
2021-12-14 07:35:26 -08:00
Tim Poulsen
486ae5c733 Dataset & IterableDataset attribute errors prints attribute (#69021)
Summary:
The message is the message from a standard attribute error.
Thought it would be informative when the error is thrown.
Alternatively in python 3.10, one can set the keyword arguments 'name' and 'obj',
reference: https://github.com/python/cpython/blob/3.10/Doc/library/exceptions.rst#concrete-exceptions

Fixes #{?}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69021

Reviewed By: samdow

Differential Revision: D32730362

Pulled By: ejguan

fbshipit-source-id: 7132ba612fa6075aeffb9315ce651828e9a8e0bc
2021-12-01 10:16:31 -08:00
Erjia Guan
4c87aa77d1 [DataPipe] Traverse DataPipe graph excluding primitive and callable (#67783)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67783

Add `getstate_hook` to exclude primitive objects and callable when serialization when `exclude_primitive` is enabled for `traverse`.
For graph traversing, we don't have to handle the lambda and other stuff.
This is used by `OnDiskCacheHolder` to trace the DataPipe Graph.

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D32146697

Pulled By: ejguan

fbshipit-source-id: 03b2ce981bb21066e807f57c167b77b2d0e0ce61
2021-11-15 06:46:31 -08:00
Kevin Tse
0b09d62cf3 [hackathon][DataPipe] adding .pyi file generation for torch.utils.data.datapipes (#67374)
Summary:
Stack from [ghstack](https://github.com/ezyang/ghstack):
* __->__ https://github.com/pytorch/pytorch/issues/67374

This is a work in progress.

Related TorchData issue: https://github.com/pytorch/data/issues/80

cc VitalyFedyunin ejguan NivekT

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67374

Reviewed By: H-Huang

Differential Revision: D32153211

Pulled By: NivekT

fbshipit-source-id: b4c61f191f20fd98ca44bb9e4f972c6d812994a0
2021-11-08 14:43:24 -08:00
Vitaly Fedyunin
ab5e1c69a7 [WIP] Example of DataPipes and DataFrames integration (#60840)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60840

Test Plan: Imported from OSS

Reviewed By: wenleix, ejguan

Differential Revision: D29461080

Pulled By: VitalyFedyunin

fbshipit-source-id: 4909394dcd39e97ee49b699fda542b311b7e0d82
2021-09-13 18:50:15 -07:00
Santiago Castro
69f4401b7b Make datasets in ConcatDataset not need to be sized (#64114)
Summary:
`datasets` needs to be iterable, but also sized because the length is checked. But immediately after it's converted to a list. By changing the order of these 2 lines, it doesn't need to be sized anymore.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64114

Reviewed By: H-Huang

Differential Revision: D30641480

Pulled By: ejguan

fbshipit-source-id: 7e16548c2123afa65b83845f9929271fa07fe1e8
2021-09-01 15:32:50 -07:00
Santiago Castro
6b85c99ce5 Avoid an unnecessary list creation in DataChunk (#64111)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64111

Reviewed By: mruberry

Differential Revision: D30639383

Pulled By: ezyang

fbshipit-source-id: 96b243307413c99a67d55d862a71937e1ef210f4
2021-08-30 19:25:42 -07:00
Erjia Guan
383a33a0eb Make DataChunk support list in-place ops (#63422)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63422

Fixes #63095

Make `DataChunk` delegate to list method. Then it will support in-place operations:
- `sort`
- `reverse`
- `append`
- `extend`
- `random.shuffle`

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D30379027

Pulled By: ejguan

fbshipit-source-id: d176bd0cc8b89b915c7bb184ff243ab1f605616d
2021-08-18 08:48:47 -07:00
Erjia Guan
56d609d93e Replace str by repr for DataChunk (#63184)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63184

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D30288892

Pulled By: ejguan

fbshipit-source-id: 45c88fdd3987e234f2c22ebbbfd8d5044983c34c
2021-08-16 06:41:38 -07:00
Vitaly Fedyunin
d3bdf345cb Introducing DataChunk for DataPipes batching (#62768)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62768

This is part of TorchArrow DF support preparation, separating it to multiple PRs to simplify review process.

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D30149090

Pulled By: VitalyFedyunin

fbshipit-source-id: a36b5ff56e2ac6b06060014d4cd41b487754acb8
2021-08-06 08:38:33 -07:00
Vitaly Fedyunin
171598f0e3 [Refactoring] Fix imports order for torch/utils/data/dataset.py (#61328)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61328

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D29588897

Pulled By: VitalyFedyunin

fbshipit-source-id: 63df653fb471532819c83ebcee4f9dc951500ffb
2021-07-22 08:30:08 -07:00
Simon Seo
b505adbb09 Fix typo in ChainDataset docs (#60336)
Summary:
* chainning -> chaining

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60336

Reviewed By: bdhirsh

Differential Revision: D29265236

Pulled By: anjali411

fbshipit-source-id: 17a9b73af9e094550bd1ee25bc9439fb8d455e2b
2021-06-21 11:47:21 -07:00
Philip Meier
d5988c5eca remove unused type: ignore directives (#60006)
Summary:
During development it is common practice to put `type: ignore` comments on lines that are correct, but `mypy` doesn't recognize this. This often stems from the fact, that the used `mypy` version wasn't able to handle the used pattern.

With every new release `mypy` gets better at handling complex code. In addition to fix all the previously accepted but now failing patterns, we should also revisit all `type: ignore` comments to see if they are still needed or not. Fortunately, we don't need to do it manually: by adding `warn_unused_ignores = True` to the configuration, `mypy` will error out in case it encounters an `type: ignore` that is no longer needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60006

Reviewed By: jbschlosser, malfet

Differential Revision: D29133237

Pulled By: albanD

fbshipit-source-id: 41e82edc5cd5affa7ccedad044b59b94dad4425a
2021-06-18 07:23:31 -07:00
TJ-coding
7c29ca7f2b Fix Subset of a Subset not sliceable issue (#59513)
Summary:
Dataset can be indexed by a list, but a list can not be indexed by a list. This gives error when slicing a Subset initialised with a Subset, instead of a dataset.

Fixed the issue by changing the indices to a Tensor which can be indexed by a list.

Fixes https://github.com/pytorch/pytorch/issues/59512

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59513

Reviewed By: zou3519

Differential Revision: D29196891

Pulled By: ejguan

fbshipit-source-id: ccde6e474fbcbddd2e9c7c107bc8b5de1307cdb9
2021-06-18 07:07:34 -07:00
Erjia Guan
5c7e14d2bc [DataLoader] Switch NotImplementedError to TypeError for len (#59464)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59464

Fixes #59378

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D28944447

Pulled By: ejguan

fbshipit-source-id: 8b3d53a1863b41e578d56f219e452d18d7eae0d8
2021-06-08 07:16:18 -07:00
Marcio Porto
4942fe0290 [DataLoader] Introduce MapMapDataPipe functional datapipe (#58258)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58258

As part of https://github.com/pytorch/pytorch/issues/57031, this PR adds the `MapMapDataPipe` functional datapipe for the `MapDataPipe` class.

Usage:
```
def fn(x):
    return x * 10

dp = CountingDataset(n=10)
dp.map(fn)
```

Reviewed By: ejguan

Differential Revision: D28394510

fbshipit-source-id: 8d71b1f5723dff52385c3ce753944304896af678
2021-05-20 09:00:21 -07:00
Sam Estep
75024e228c Add lint for unqualified type: ignore (#56290)
Summary:
The other half of https://github.com/pytorch/pytorch/issues/56272.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56290

Test Plan:
CI should pass on the tip of this PR, and we know that the lint works because the following CI runs (before this PR was finished) failed:

- https://github.com/pytorch/pytorch/runs/2384511062
- https://github.com/pytorch/pytorch/actions/runs/765036024

Reviewed By: seemethere

Differential Revision: D27867219

Pulled By: samestep

fbshipit-source-id: e648f07b6822867e70833e23ddafe7fb7eaca235
2021-04-21 08:07:23 -07:00
Erjia Guan
44edf8c421 [DataLoader] Typing Enforcement for DataPipe at Compile-time (#54020)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54020

## Feature
- Add `issubtype` to check the type is a subtype of the other type.
- Add `_DataPipeMeta` (mimic Python typing 3.6)
  - Add `type` attribute for each DataPipe
  - Save original `__init__` function for each DataPipe
  - Validate return hint of `__iter__`
  - Replace `__init__` function bases on `type`
    - Fixed type: Put original `__init__` back, if it exists or use a plain `__init__`
    -  Non-fixed type: Add new `__init__` with the functionality to copy `cls.type` for each instance. (Optimized for memory)

No Error for main repo, `torchvision`, `torchaudio` and `torchtext`.

## Future
- Add same thing for `__getitem__`.
- When DataFrame came out, add an another type for DataFrame with column name and type.

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D27327232

Pulled By: ejguan

fbshipit-source-id: fd3a6029c16f5d814b1d7e1b1566fdcd8fd1ad9a
2021-04-02 15:22:27 -07:00
Vitaly Fedyunin
8ecb2d35bc Add ability to override _reduce_ex_ function of DataPipe (#52858)
Summary:
Required for `torchdata` graph functions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52858

Reviewed By: H-Huang

Differential Revision: D26736348

Pulled By: VitalyFedyunin

fbshipit-source-id: 1735e88374090422e6365d07d5b84075e371500c
2021-03-17 17:23:05 -07:00
Erjia Guan
dbbe0a2105 [DataLoader] Introduce deterministic context (#53271)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53271

- [x] Add `set_determinism` context manager
- [x] Add `non_deterministic` decorator for `DataPipe`
  - Raise error at the construction time for non-deterministic DataPipe when `determinism` is set to `True`
 - [ ] Support `non_deterministic` with option
   - When `GreedyJoin` only contains one datapipe, it should still be deterministic.

Note: Test is in the [PR](https://github.com/facebookexternal/torchdata/pull/15). As the main repo doesn't have non-deterministic DataPipe yet.

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D26823023

Pulled By: ejguan

fbshipit-source-id: 51bb92fc3d18d1fc9536c1229363c536ad120876
2021-03-06 07:37:39 -08:00
Erjia Guan
89b1053413 [DataLoader] Move BufferedShuffle from Dataset to DataPipe (#52141)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52141

Remove BufferShuffleDataSet, as it's not being used anywhere within PyTorch (no usage on Github based on a search) and it's not included in the release of PyTorch 1.7.1.

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D26710940

Pulled By: ejguan

fbshipit-source-id: 90023b4bfb105d6aa392753082100f9181ecebd0
2021-03-01 12:54:44 -08:00
Vitaly Fedyunin
9a03e65456 Adding functional way of stacking DataPipes with fixed mypy (#52885)
Summary:
Readding reverted PR with MyPY fixed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52885

Reviewed By: ejguan

Differential Revision: D26676405

Pulled By: VitalyFedyunin

fbshipit-source-id: 020216c5522d21a4994cd896ae778c0b77f6444b
2021-02-25 19:37:35 -08:00
Howard Huang
da732c76c4 Revert D26644079: [pytorch][PR] Adding functional way of stacking DataPipes
Test Plan: revert-hammer

Differential Revision:
D26644079 (7972036bbb)

Original commit changeset: dcf464637b4f

fbshipit-source-id: a12a06d7e7fb3821a0990bbc6305d02721ead82c
2021-02-25 14:30:49 -08:00
Vitaly Fedyunin
7972036bbb Adding functional way of stacking DataPipes (#52507)
Summary:
Allows to use functional API to stack datapipes:
```python
numbers_dp = NumbersDataset(size=10).filter(filter_fn = lambda x: x % 2 == 1).map(fn = lambda x: x * 10)
```

DataPipes have to be decorated with:
```python
functional_datapipe('map')
class MapIterDataPipe(IterDataPipe[T_co]):
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52507

Reviewed By: ailzhang

Differential Revision: D26644079

Pulled By: VitalyFedyunin

fbshipit-source-id: dcf464637b4fcf9ea1eb8e84c2a0cd4dfd58b43d
2021-02-25 11:22:01 -08:00
Samuel Marks
e6779d4357 [*.py] Rename "Arguments:" to "Args:" (#49736)
Summary:
I've written custom parsers and emitters for everything from docstrings to classes and functions. However, I recently came across an issue when I was parsing/generating from the TensorFlow codebase: inconsistent use of `Args:` and `Arguments:` in its docstrings.

```sh
(pytorch#c348fae)$ for name in 'Args:' 'Arguments:'; do
    printf '%-10s %04d\n' "$name" "$(rg -IFtpy --count-matches "$name" | paste -s -d+ -- | bc)"; done
Args:      1095
Arguments: 0336
```

It is easy enough to extend my parsers to support both variants, however it looks like `Arguments:` is wrong anyway, as per:

  - https://google.github.io/styleguide/pyguide.html#doc-function-args @ [`ddccc0f`](https://github.com/google/styleguide/blob/ddccc0f/pyguide.md)

  - https://chromium.googlesource.com/chromiumos/docs/+/master/styleguide/python.md#describing-arguments-in-docstrings @ [`9fc0fc0`](https://chromium.googlesource.com/chromiumos/docs/+/9fc0fc0/styleguide/python.md)

  - https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html @ [`c0ae8e3`](https://github.com/sphinx-contrib/napoleon/blob/c0ae8e3/docs/source/example_google.rst)

Therefore, only `Args:` is valid. This PR replaces them throughout the codebase.

PS: For related PRs, see tensorflow/tensorflow/pull/45420

PPS: The trackbacks automatically appearing below are sending the same changes to other repositories in the [PyTorch](https://github.com/pytorch) organisation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49736

Reviewed By: albanD

Differential Revision: D25710534

Pulled By: soumith

fbshipit-source-id: 61e8ff01abb433e9f78185c2d1d0cbd7c22c1619
2020-12-28 09:34:47 -08:00
Mohamad H. Danesh
d1d6dc2e3c Add more specific error message (#46905)
Summary:
While using `torch.utils.data.TensorDataset`, if sizes of tensors mismatch, there's now a proper error message.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46905

Reviewed By: ngimel

Differential Revision: D24565712

Pulled By: mrshenli

fbshipit-source-id: 98cdf189591c2a7a1b693627cc8464e8f553d9ee
2020-10-30 16:03:44 -07:00
Erjia Guan
96540e918c Add ShuffleDataset with buffer (#45290)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45290

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D24001084

Pulled By: erjia-guan

fbshipit-source-id: d8a7455cf3f18e1f8c1edc53c42c1a99c8573c51
2020-09-30 07:58:15 -07:00