pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-06 12:20:52 +01:00

Author	SHA1	Message	Date
Robert	787ac4edf8	Add validation for mapper function in datapipes with `input_col` (#79344 ) As linked in https://github.com/pytorch/data/issues/362 Pull Request resolved: https://github.com/pytorch/pytorch/pull/79344 Approved by: https://github.com/ejguan, https://github.com/NivekT	2022-06-23 18:49:35 +00:00
Robert Xiu	9fca008809	[DataPipe] Adding functional API for FileLister (#78419 ) Fixes #78263 Follow-up from pytorch/data#387. This adds a functional API `list_files()` to `FileListerDataPipe`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/78419 Approved by: https://github.com/NivekT, https://github.com/ejguan	2022-06-06 17:26:19 +00:00
erjia	9b6cb83b0c	Make ShufflerDataPipe deterministic for persistent DL and distributed DL (#78765 ) Fixes https://github.com/pytorch/data/issues/426 This PR introduces two main changes: - It ensures the `ShufflerDataPipe` would share the same seed across distributed processes. - Users can reset `shuffle` for persistent workers per epoch. Detail: - `shared_seed` is shared across distributed and worker processes. It will seed a `shared_rng` to provide seeds to each `ShufflerDataPipe` in the pipeline - `worker_loop` now accepts a new argument of `shared_seed` to accept this shared seed. - The `shared_seed` is attached to `_ResumeIteration` for resetting seed per epoch for `persistent worker` - I choose not to touch `base_seed` simply for BC issue I used this [script](https://gist.github.com/ejguan/d88f75fa822cb696ab1bc5bc25844f47) to test the result with `world_size=4`. Please check the result in: https://gist.github.com/ejguan/6ee2d2de12ca57f9eb4b97ef5a0e300b You can see there isn't any duplicated/missing element for each epoch. And, with the same seed, the order of data remains the same across epochs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/78765 Approved by: https://github.com/VitalyFedyunin	2022-06-06 17:24:00 +00:00
PyTorch MergeBot	129d9dbb15	Revert "Make ShufflerDataPipe deterministic for persistent DL and distributed DL (#78765 )" This reverts commit `b769a0e18b`. Reverted https://github.com/pytorch/pytorch/pull/78765 on behalf of https://github.com/janeyx99 due to broke lint on trunk	2022-06-06 14:24:51 +00:00
erjia	b769a0e18b	Make ShufflerDataPipe deterministic for persistent DL and distributed DL (#78765 ) Fixes https://github.com/pytorch/data/issues/426 This PR introduces two main changes: - It ensures the `ShufflerDataPipe` would share the same seed across distributed processes. - Users can reset `shuffle` for persistent workers per epoch. Detail: - `shared_seed` is shared across distributed and worker processes. It will seed a `shared_rng` to provide seeds to each `ShufflerDataPipe` in the pipeline - `worker_loop` now accepts a new argument of `shared_seed` to accept this shared seed. - The `shared_seed` is attached to `_ResumeIteration` for resetting seed per epoch for `persistent worker` - I choose not to touch `base_seed` simply for BC issue I used this [script](https://gist.github.com/ejguan/d88f75fa822cb696ab1bc5bc25844f47) to test the result with `world_size=4`. Please check the result in: https://gist.github.com/ejguan/6ee2d2de12ca57f9eb4b97ef5a0e300b You can see there isn't any duplicated/missing element for each epoch. And, with the same seed, the order of data remains the same across epochs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/78765 Approved by: https://github.com/VitalyFedyunin	2022-06-06 13:36:37 +00:00
Kevin Tse	b4a6730ce1	[DataPipe] Refactor 'mux' to have buffer as an instance variable Pull Request resolved: https://github.com/pytorch/pytorch/pull/77775 Approved by: https://github.com/ejguan	2022-05-19 19:55:27 +00:00
erjia	99f6e614e8	Seed `Shuffler` for MP DataLoader without explicit `manual_seed`. (#77855 ) Follow up on https://github.com/pytorch/pytorch/pull/77741 This PR guarantees the `Shuffler` in first iteration with MP DataLoader has the same seed across worker processes when users don't specify the seed. Check newly added tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/77855 Approved by: https://github.com/NivekT	2022-05-19 17:28:26 +00:00
erjia	365ce350cb	Make ShufflerDataPipe deterministic for SP & MP DataLoader (#77741 ) This is the first PR to make DataPipe deterministic. Users should be able to use `torch.manual_seed(seed)` to control the shuffle order for the following cases: - Directly over `DataPipe` - For single-process DataLoader - Multiprocessing DataLoader Unfortunately, for distributed training, users have to run `apply_shuffle_seed` manually to make sure all distributed processes having the same order of shuffle. Pull Request resolved: https://github.com/pytorch/pytorch/pull/77741 Approved by: https://github.com/VitalyFedyunin, https://github.com/NivekT	2022-05-18 23:32:07 +00:00
Ning Li (Seattle)	4d1ead6dff	[DataPipe] Update `mux` data pipe (#76384 ) (#77145 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/76384 OSS issue discussion: https://github.com/pytorch/data/issues/346 This diff updates `mux` and `mux_longest` data pipe. `mux`: Yields one element at a time from each of the input Iterable DataPipes (functional name: ``mux``). As in, one element from the 1st input DataPipe, then one element from the 2nd DataPipe in the next iteration, and so on. It ends when the shortest input DataPipe is exhausted. `mux` example: ``` >>> from torchdata.datapipes.iter import IterableWrapper >>> dp1, dp2, dp3 = IterableWrapper(range(3)), IterableWrapper(range(10, 15)), IterableWrapper(range(20, 25)) >>> list(dp1.mux(dp2, dp3)) [0, 10, 20, 1, 11, 21, 2, 12, 22] ``` Test Plan: buck test mode/opt //caffe2/test:datapipe https://www.internalfb.com/intern/testinfra/testrun/4785074706282345 Differential Revision: D36017945 Pull Request resolved: https://github.com/pytorch/pytorch/pull/77145 Approved by: https://github.com/NivekT, https://github.com/ejguan	2022-05-18 16:23:07 +00:00
Kevin Tse	bbaefdf6b5	[DataPipe] Enforcing single valid iterator for IterDataPipes multiple DataPipes as outputs Pull Request resolved: https://github.com/pytorch/pytorch/pull/75995 Approved by: https://github.com/VitalyFedyunin	2022-05-18 01:31:39 +00:00
Kevin Tse	7c52f204e0	[DataPipe] Enforcing single valid iterator for IterDataPipes without multiple outputs Pull Request resolved: https://github.com/pytorch/pytorch/pull/70479 Approved by: https://github.com/ejguan	2022-05-18 01:31:38 +00:00
Vitaly Fedyunin	edffd595c2	[DataLoader] Adding ability to use dill to pass DataPipes in mutiprocessing Pull Request resolved: https://github.com/pytorch/pytorch/pull/77288 Approved by: https://github.com/ejguan, https://github.com/NivekT	2022-05-15 23:04:03 +00:00
Kevin Tse	a008d19ff7	[DataPipe] Revamp serialization logic of DataPipes Pull Request resolved: https://github.com/pytorch/pytorch/pull/74984 Approved by: https://github.com/ejguan	2022-05-10 16:16:46 +00:00
zengk95	ef63408853	Revert [DataPipe] Update mux data pipe Reverts #76384 this this is breaking tests test_demux_mux_datapipe (__main__.TestIterableDataPipeBasic. See logs: `a997046017` and was red on the PR as well: https://hud.pytorch.org/pytorch/pytorch/pull/76384 Pull Request resolved: https://github.com/pytorch/pytorch/pull/76507 Approved by: https://github.com/kit1980	2022-04-28 00:06:30 +00:00
Ning Li (Seattle)	a997046017	[DataPipe] Update `mux` data pipe (#76384 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/76384 OSS issue discussion: https://github.com/pytorch/data/issues/346 This diff updates `mux` and `mux_longest` data pipe. `mux`: Yields one element at a time from each of the input Iterable DataPipes (functional name: ``mux``). As in, one element from the 1st input DataPipe, then one element from the 2nd DataPipe in the next iteration, and so on. It ends when the shortest input DataPipe is exhausted. `mux` example: ``` >>> from torchdata.datapipes.iter import IterableWrapper >>> dp1, dp2, dp3 = IterableWrapper(range(3)), IterableWrapper(range(10, 15)), IterableWrapper(range(20, 25)) >>> list(dp1.mux(dp2, dp3)) [0, 10, 20, 1, 11, 21, 2, 12, 22] ``` Test Plan: buck test mode/dev //pytorch/data/test:tests -- --exact 'pytorch/data/test:tests - test_mux_longest_iterdatapipe (test_datapipe.TestDataPipe)' https://www.internalfb.com/intern/testinfra/testrun/3096224791148107 Reviewed By: ejguan Differential Revision: D35799965 fbshipit-source-id: 320e71a342ec27e6e9200624aad42f4b99f97c3a (cherry picked from commit 741ed595275df6c05026ed6f0e78d7052328fb7d)	2022-04-27 22:10:42 +00:00
erjia	0ff05b1e97	[DataPipe] Add funtional API docstring and fix typo in test Per title Pull Request resolved: https://github.com/pytorch/pytorch/pull/76272 Approved by: https://github.com/ishaan-mehta, https://github.com/NivekT	2022-04-25 14:16:53 +00:00
Kevin Tse	383f026791	[DataPipe] Enabling graph traversal for MapDataPipe Pull Request resolved: https://github.com/pytorch/pytorch/pull/74851 Approved by: https://github.com/ejguan	2022-04-22 18:06:16 +00:00
erjia	ec591087fb	[DataPipe] Add input_col to filter and add deprecation warning for DataPipe arguments Last patch to align DataPipe API with TorchArrow DataFrame For deprecation warning of DataPipe argument: ``` The argument `drop_empty_batches` of `FilterIterDataPipe()` is deprecated since 1.12 and will be removed in 1.14. See https://github.com/pytorch/data/issues/163 for details. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/76060 Approved by: https://github.com/NivekT	2022-04-22 17:49:39 +00:00
erjia	b8cce8847f	[DataPipe] Add functional API to StreamReader and FileOpener Per title Pull Request resolved: https://github.com/pytorch/pytorch/pull/76233 Approved by: https://github.com/NivekT	2022-04-22 17:49:26 +00:00
erjia	841a7f5187	[DataPipe] apply dill serialization for _Demux and add cache to traverse - Fix _Demux can not be pickled with DILL presented https://github.com/pytorch/pytorch/pull/74958#issuecomment-1084637227 - And add cache to traverse function to prevent infinite recursion for circular reference of DataPipe (Fixes https://github.com/pytorch/data/issues/237) Pull Request resolved: https://github.com/pytorch/pytorch/pull/75034 Approved by: https://github.com/wenleix	2022-04-04 19:45:14 +00:00
Kevin Tse	4c5d532728	[DataPipe] only apply special serialization when dill is installed Pull Request resolved: https://github.com/pytorch/pytorch/pull/74958 Approved by: https://github.com/ejguan	2022-03-30 20:38:05 +00:00
Nicolas Hug	5667c4ea21	Remove default parameter of ShufflerIterDataPipe (#74370 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74370 Closes https://github.com/pytorch/data/issues/298. This PR: - removes the `default` parameter of `ShufflerIterDataPipe` - renames `set_shuffle_setting()` into `set_shuffle()` - let `set_shuffle()` return `self`. Test Plan: Imported from OSS Reviewed By: george-qi Differential Revision: D35073666 Pulled By: NicolasHug fbshipit-source-id: 9847b037e70f44f36eaf4471f2c12fa8ec2ed73c (cherry picked from commit b07ab646f308532886e8daddd57e937a53edb153)	2022-03-28 12:47:24 +00:00
Kevin Tse	eec994fc16	[DataPipe] Separating DataPipes from Dataset into different files (#73396 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73396 Separating DataPipes from Dataset into different files. This makes the code more maintainable and simplifies some of the code generation. I have also tried to move `datapipe.py` into `torch.utils.data.datapipes`, but that will lead to circular import and rewriting many import statements. Should I put more time and go down that path some more? Fixes https://github.com/pytorch/data/issues/213 Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D34481962 Pulled By: NivekT fbshipit-source-id: 42fb26fe7fc334636852cfd8719fc807bdaa7912 (cherry picked from commit 81e76a64e297cb5c58caa951c554e49526173936)	2022-03-15 14:46:34 +00:00
Kevin Tse	8811d217ed	[DataPipe] Slight refactoring IterDataPipe serialization test (#73922 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73922 Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D34732288 Pulled By: NivekT fbshipit-source-id: f31229332fe4eac85cc2085484f6e1b1d802987d (cherry picked from commit ace20054e4f3f9bd9610640755400fbde82650c3)	2022-03-09 15:33:12 +00:00
Kevin Tse	0821154072	[DataPipe] Adding serialization test for all MapDataPipe (#73921 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73921 Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D34732286 Pulled By: NivekT fbshipit-source-id: 893af2fbb83feb1bae226d3205105de5d3836378 (cherry picked from commit f44fd3c5210d0afdbf826e3b7e7fbe2ec216c3b7)	2022-03-09 15:33:12 +00:00
Kevin Tse	f85309e478	[DataPipe] Adding serialization test at different stages of reading for IterDataPipes (#73119 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73119 Test if a DataPipe is serializable after its contents are partially read and completely read. This is especially important for DataPipes with buffers. Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D34354496 Pulled By: NivekT fbshipit-source-id: 36971d68b9ca1de81fb254e9a459b8f54fe0f9ff (cherry picked from commit e8f39a7aa364bd2b19145788f7e67c06f948f81b)	2022-02-23 16:31:21 +00:00
Kevin Tse	cd4ecce1bb	[DataPipe] Fix issue with DataPipe serialization with `dill` (#72896 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72896 Fixing the issue described here: https://github.com/pytorch/data/issues/214 There will be a follow-up PR in TorchData as well Test Plan: Imported from OSS Reviewed By: gchanan Differential Revision: D34258669 Pulled By: NivekT fbshipit-source-id: 6dd88250ed14ebe779915dc46139be7e012e9d1b (cherry picked from commit 025b8ed98019e576bfef04c33a3f33ed1a426a66)	2022-02-23 16:31:20 +00:00
Erjia Guan	6297aa114f	[DataPipe] Extend FileLister to support load multiple directories (#72260 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72260 Test Plan: Imported from OSS Reviewed By: dagitses, NivekT Differential Revision: D33979744 Pulled By: ejguan fbshipit-source-id: 5733d20382642fc2274afd838b33c98150d81e91 (cherry picked from commit `f70537ae76`)	2022-02-04 07:55:00 +00:00
Erjia Guan	7b014cc645	[DataPipe] Disable Typing for DataPipe before branch cut (#72123 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72123 There is a bug to fix the typing system in DataPipe, which would take more than 1 week to fix. I will follow up on it later this month. As branch cut is today, add this PR to disable typing to make sure release works. Test Plan: Imported from OSS Reviewed By: VitalyFedyunin Differential Revision: D33920610 Pulled By: ejguan fbshipit-source-id: febff849ab2272fd3b1c5127a20f27eb82992d9c (cherry picked from commit `ee103e62e7`)	2022-02-02 05:00:41 +00:00
Santiago Castro	5024c1bc7b	Make `get_file_pathnames_from_root` output order deterministic (#70435 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/70103 I used an argument so it can be disabled. I called it `deterministic_order` because `sort` can be confusing, as it's actually sorted but by dir levels. Pull Request resolved: https://github.com/pytorch/pytorch/pull/70435 Reviewed By: albanD Differential Revision: D33899755 Pulled By: ejguan fbshipit-source-id: e8a08f03a49120333b2d27f332cd21a3240a02a9 (cherry picked from commit `4616e43ec3`)	2022-02-01 18:12:23 +00:00
Vitaly Fedyunin	b36b11cbc1	Separating CaptureDataFrame out of DFIterDataPipe (#71776 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71776 Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D33771602 Pulled By: VitalyFedyunin fbshipit-source-id: 59d85bc707a9568f1f0960fc184113a4f422d2df (cherry picked from commit `93522768ef`)	2022-01-26 03:25:02 +00:00
Erjia Guan	bb157dd4eb	Make methods of internal file_obj visible from StreamWrapper (#71653 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71653 Test Plan: Imported from OSS Reviewed By: NivekT Differential Revision: D33718749 Pulled By: ejguan fbshipit-source-id: f3a8244f22ca37049b8678afa0e329b23c957a9d (cherry picked from commit `a4d12ca48e`)	2022-01-25 15:34:24 +00:00
Kevin Tse	13ea2cb330	[DataPipe] Make GroupBy serializable with lambda function (#71497 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71497 Related to https://github.com/pytorch/data/issues/172 cc VitalyFedyunin ejguan NivekT Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D33668749 Pulled By: NivekT fbshipit-source-id: 6506614e9d4389dc645d8985c00fdb3402122d9b (cherry picked from commit `458e76fcb1`)	2022-01-21 16:04:45 +00:00
Kevin Tse	36b4c95e74	[DataPipe] adding serialization test for all core IterDataPipes (#71456 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71456 Related to https://github.com/pytorch/data/issues/172 cc VitalyFedyunin ejguan NivekT Test Plan: Imported from OSS Reviewed By: zou3519 Differential Revision: D33668748 Pulled By: NivekT fbshipit-source-id: ea2085d5ed47533ca49258cc52471373c6ae1847 (cherry picked from commit `d5f6fde1d0`)	2022-01-21 16:04:45 +00:00
Kevin Tse	011fd1d933	[DataPipe] improving DataPipe unit tests (#70215 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70215 A few renaming, formatting, and additional tests to make the unit tests better. cc VitalyFedyunin ejguan NivekT Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D33344610 Pulled By: NivekT fbshipit-source-id: bb36f7452bdc44964c9ce0650c7ae308ba2c5aa5 (cherry picked from commit `0aae20cb27`)	2022-01-20 15:49:53 +00:00
Erjia Guan	fd9e08df5d	Make Demux serializable with lambda function (#71311 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71311 Test Plan: Imported from OSS Reviewed By: NivekT Differential Revision: D33584552 Pulled By: ejguan fbshipit-source-id: 52324faf5547f9f77582ec170ec91ce3114cfc61	2022-01-18 06:47:54 -08:00
Kevin Tse	1e3893ecbb	[DataPipe] Removing deprecated DataPipes (#71161 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71161 Users should import these DataPipes from [TorchData](https://github.com/pytorch/data) if they would like to use them. We will be checking for any downstream library usage before landing this PR. Test Plan: Imported from OSS Reviewed By: mruberry Differential Revision: D33532272 Pulled By: NivekT fbshipit-source-id: 9dbfb21baf2d1183e0aa379049ad8304753e08a1	2022-01-13 07:37:48 -08:00
Kevin Tse	8dcfdf39e7	[DataPipe] Renaming FileLoader to FileOpener with deprecation warning for FileLoader (#70367 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70367 This PR renames the `FileLoaderIterDataPipe` to `FileOpenerIterDataPipe`. For the sake of not breaking many CI tests immediately, it still preserves `FileLoader` as an alias. This will allow downstream libraries/users to migrate their use cases before we fully remove all references to `FileLoader` from PyTorch. Fixes https://github.com/pytorch/data/issues/103. More detailed discussion about this decision is also in the linked issue. cc VitalyFedyunin ejguan NivekT pmeier Nayef211 Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D33301648 Pulled By: NivekT fbshipit-source-id: 59278dcd44e372df0ba2001a4eecbf9792580d0b	2022-01-04 09:14:50 -08:00
Kevin Tse	ad0cd8a76e	[DataPipe] Improve inline doc and testing for CollatorIterDataPipe (#70139 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70139 cc VitalyFedyunin ejguan NivekT Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D33199107 Pulled By: NivekT fbshipit-source-id: f96d77490998ac9bc3da8d4ff1a9caa08e9e7f27	2021-12-20 08:05:21 -08:00
Kevin Tse	3d51c88032	[DataPipe] Unifying API - removing options to have fn_args and fn_kwargs from MapDataPipes (#69561 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69561 Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D32952099 Pulled By: NivekT fbshipit-source-id: 95b725774a9d04d655e2542760726908f33043f4	2021-12-16 18:11:00 -08:00
Kevin Tse	b89c283c80	[DataPipe] Unifying API - removing options to have fn_args and fn_kwargs from IterDataPipes (#69560 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69560 cc VitalyFedyunin ejguan NivekT Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D32952100 Pulled By: NivekT fbshipit-source-id: e0cc31408c7cf3220fe274feed1c7202a1aaae70	2021-12-16 18:09:52 -08:00
Vitaly Fedyunin	d90012689f	[DataPipe] Control shuffle settings from DataLoader2 (#65756 ) Summary: Makes `shuffle` DataPipe sensitive to DataLoader(2) `shuffle` kwarg. Pull Request resolved: https://github.com/pytorch/pytorch/pull/65756 Reviewed By: albanD Differential Revision: D31344867 Pulled By: VitalyFedyunin fbshipit-source-id: e0084e0ac193ac784d6298328ca1222745681347	2021-12-14 07:35:26 -08:00
Kevin Tse	81a60b9813	[DataPipe] Adding output types to DataPipe interface file (#69647 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69647 cc VitalyFedyunin ejguan NivekT Test Plan: Imported from OSS Reviewed By: VitalyFedyunin Differential Revision: D32989067 Pulled By: NivekT fbshipit-source-id: 2c2e71e9e514e0d584affaa0b71b7b0d07a2ddbf	2021-12-10 12:04:45 -08:00
Kevin Tse	357160e68e	[DataPipe] Unifying API - removing nesting_level argument from FilterIterDataPipe (#69391 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69391 As part of the efforts to unify the APIs across different data backends (e.g. TorchData, TorchArrow), we are making changes to different DataPipes' APIs. In this PR, we are removing the input argument `nesting_level` from `FilterIterDataPipe`. cc VitalyFedyunin ejguan NivekT Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D32849462 Pulled By: NivekT fbshipit-source-id: 91cf1dc03dd3d3cbd7a9c6ccbd791ade91355f30	2021-12-07 11:40:46 -08:00
Kevin Tse	4478b14e4c	[DataPipe] Unifying API - removing nesting_level argument from MapperIterDataPipe (#69390 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69390 As part of the efforts to unify the APIs across different data backends (e.g. TorchData, TorchArrow), we are making changes to different DataPipes' APIs. In this PR, we are removing the input argument `nesting_level` from `MapperIterDataPipe`. cc VitalyFedyunin ejguan NivekT Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D32849465 Pulled By: NivekT fbshipit-source-id: 963ce70b84a7658331d126e5ed9fdb12273c8e1f	2021-12-07 11:39:08 -08:00
Kevin Tse	6baaec30cd	[DataPipe] Adding ShufflerMapDataPipe (#68606 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68606 cc VitalyFedyunin ejguan NivekT Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D32813290 Pulled By: NivekT fbshipit-source-id: 8d1ebd5bc776563c23250f76a2efc1d395f1af9c	2021-12-03 11:36:33 -08:00
Kevin Tse	0465f64bb8	[DataPipe] Adding BatcherMapDataPipe (#68197 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68197 cc VitalyFedyunin ejguan NivekT Test Plan: Imported from OSS Reviewed By: albanD Differential Revision: D32440963 Pulled By: NivekT fbshipit-source-id: 277cbe8d735afe341a7c189be20e1d334ecf9d4a	2021-12-02 07:27:17 -08:00
Kevin Tse	61a94495d9	[DataPipe] adding ZipperMapDataPipe (#68032 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68032 Part of #57031 cc VitalyFedyunin ejguan NivekT Test Plan: Imported from OSS Reviewed By: zou3519 Differential Revision: D32263058 Pulled By: NivekT fbshipit-source-id: 13a30ee9d9779284a9fd9bb7222fc41253c6fe3b	2021-11-11 10:36:05 -08:00
Kevin Tse	803e88d418	[DataPipe] Fixing pickling issues with fork and demux (#67930 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67930 Fixes #67848 Test Plan: Imported from OSS Reviewed By: H-Huang Differential Revision: D32222184 Pulled By: NivekT fbshipit-source-id: 48871c45a855d92cd599e21f3b53827dd32c91ef	2021-11-09 07:54:02 -08:00
Jane Xu	39215ddf84	[skip ci] Set test owners for dataloader tests (#66839 ) Summary: Action following https://github.com/pytorch/pytorch/issues/66232 cc SsnL VitalyFedyunin ejguan NivekT Pull Request resolved: https://github.com/pytorch/pytorch/pull/66839 Reviewed By: ejguan Differential Revision: D31761722 Pulled By: janeyx99 fbshipit-source-id: 8315ac03352c11b3215d89856b3cfda6cd78fa0c	2021-10-19 08:31:16 -07:00
Kevin Tse	8ebe1a924d	[DataPipe] moving mux IterDataPipe test to the right location (#66277 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66277 Previously, it is grouped together with tests related to `MapDataPipe`, but it should be with `IterDataPipe`. cc VitalyFedyunin ejguan NivekT Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D31485823 Pulled By: NivekT fbshipit-source-id: d13d8c28cbfc305da0e3033d4109a0f971281a02	2021-10-08 08:32:29 -07:00
Kevin Tse	ed17851642	[DataPipe] adding test for IterableWrapperIterDataPipe (#66276 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66276 cc VitalyFedyunin ejguan NivekT Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D31485824 Pulled By: NivekT fbshipit-source-id: c7b21636e4b17e264bfb5dbea69cd3c477472f0b	2021-10-08 08:32:26 -07:00
Kevin Tse	e808e3d3d6	[DataPipe] adding SequenceWrapperMapDataPipe (#66275 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66275 Once this is added to Core, TorchData's PR will not need a custom class and can use this wrapper instead. cc VitalyFedyunin ejguan NivekT Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D31485822 Pulled By: NivekT fbshipit-source-id: 790de27629c89c0ca7163a8ee5a09ee8b8233340	2021-10-08 08:32:24 -07:00
Erjia Guan	a1216061c1	[DataPipe] Fix deepcopy filehandle for Mapper and in-place modification for IterableWrapper (#65220 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65220 Fixes #65221 - Remove deepcopy from Mapper to support file handles - Convert `IterableWrapper` to deepcopy iterable instance within each iterator to prevent in-place modification (different data per epoch) - Convert `IDP` to `IterableWrapper` in test_datapipe.py - Refine the variable names (prevent using `dp` that is module reference) Test Plan: Imported from OSS Reviewed By: malfet Differential Revision: D31021886 Pulled By: ejguan fbshipit-source-id: 72a9eee66c758e2717d591cd0942892bddedc223	2021-09-21 14:29:40 -07:00
Erjia Guan	cf60d24028	[DataPipe] Unlimited buffer for Forker and Demultiplexer (#64994 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64994 Test Plan: Imported from OSS Reviewed By: jbschlosser Differential Revision: D30934362 Pulled By: ejguan fbshipit-source-id: d3b774d7e28c0b9659e999511e5a68c3929857d4	2021-09-20 09:30:39 -07:00
Kevin Tse	c625f971d3	[DataPipe] Make TarArchiveReader and ZipArchiveReader accepts FileSream with attempt to close and additional warning (#64788 ) Summary: ghstack is not working for the second commit so I'm manually creating this PR for now. Please only look at changes related to the second commit in this PR (there is a PR for the first commit). This PR removes TarArchiveReader's dependency on FileLoader DataPipe, by allowing it to use a IterDataPipe of path names as input rather than a tuple of path name and a stream. It also adds additional tests to ensure that the DataPipe is functioning properly when it is read multiple times or reset half way through reading. The whole stack fixes https://github.com/pytorch/pytorch/issues/64281 - issues related to unclosed buffer stream. Stack: * __->__ https://github.com/pytorch/pytorch/issues/64788 * https://github.com/pytorch/pytorch/issues/64786 cc VitalyFedyunin ejguan Pull Request resolved: https://github.com/pytorch/pytorch/pull/64788 Reviewed By: jbschlosser, ejguan Differential Revision: D30901176 Pulled By: NivekT fbshipit-source-id: 59746a8d0144fc6d3ce0feb2d76445b82e6d414e	2021-09-15 07:34:29 -07:00
Erjia Guan	c65128679b	[DataPipe] Improve Mapper to accept input/output index when apply fn (#64951 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64951 Test Plan: Imported from OSS Reviewed By: VitalyFedyunin Differential Revision: D30910035 Pulled By: ejguan fbshipit-source-id: d687fe10939920a3617a60552fe743e8526438a0	2021-09-14 15:46:42 -07:00
Vitaly Fedyunin	ab5e1c69a7	[WIP] Example of DataPipes and DataFrames integration (#60840 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60840 Test Plan: Imported from OSS Reviewed By: wenleix, ejguan Differential Revision: D29461080 Pulled By: VitalyFedyunin fbshipit-source-id: 4909394dcd39e97ee49b699fda542b311b7e0d82	2021-09-13 18:50:15 -07:00
Kevin Tse	f3f410880a	[DataPipe] Remove ZipArchiveReader's dependency on FileLoader (#64786 ) Summary: Stack from [ghstack](https://github.com/ezyang/ghstack): * https://github.com/pytorch/pytorch/issues/64788 * __->__ https://github.com/pytorch/pytorch/issues/64786 This PR removes ZipArchiveReader's dependency on FileLoader DataPipe, by allowing it to use a IterDataPipe of path names as input rather than a tuple of path name and a stream. It also adds additional tests to ensure that the DataPipe is functioning properly when it is read multiple times or reset half way through reading. The whole stack fixes issues related to unclosed buffer stream (see https://github.com/pytorch/pytorch/issues/64281). cc VitalyFedyunin ejguan Pull Request resolved: https://github.com/pytorch/pytorch/pull/64786 Reviewed By: ngimel Differential Revision: D30870968 Pulled By: NivekT fbshipit-source-id: 64b04d1697b99772f2fa20fc141668e6b8e18c41	2021-09-10 16:49:17 -07:00
Kevin Tse	5060b69d62	[DataPipe] fixing tests related fork() to remove warnings (#64827 ) Summary: There are two warnings produced by `test_fork_datapipe`. This PR addresses the issues raised by those warnings without impacting the test cases. cc VitalyFedyunin ejguan Pull Request resolved: https://github.com/pytorch/pytorch/pull/64827 Reviewed By: ejguan Differential Revision: D30870528 Pulled By: NivekT fbshipit-source-id: 580a001c6fa3ff6f8b04a7e5183e58861938204b	2021-09-10 11:01:42 -07:00
Kevin Tse	4ce9c530d6	[DataPipe] removing filter's inheritance from map (#64404 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64404 This PR remove `filter`'s inheritance from `map`. This allows `filter` to not have a `__len__` function and that behavior is what we would like. cc VitalyFedyunin ejguan Test Plan: Imported from OSS Reviewed By: gchanan Differential Revision: D30713120 Pulled By: NivekT fbshipit-source-id: 4d5d07555297ee2bd4b49842c0d26cdc00638f6c	2021-09-02 13:09:47 -07:00
Kevin Tse	4f43480186	[DataPipe] adding/removing __len__ for different DataPipe (#64398 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64398 cc VitalyFedyunin ejguan Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D30710437 Pulled By: NivekT fbshipit-source-id: 524eda43a2faa0db0c1a662bf9bb4283f0ade83c	2021-09-02 13:08:32 -07:00
Kevin Tse	491bf7cb74	[DataPipe] adding description, __len__, tests for mux() (#64224 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64224 cc VitalyFedyunin ejguan Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D30651551 Pulled By: NivekT fbshipit-source-id: f8af98ba71a592900b992a8077432062ec57bb48	2021-08-31 14:34:28 -07:00
Kevin Tse	0ef8760bf6	[DataPipe] implementing __len__ for fork (no valid length for demux) (#64215 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64215 Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D30648672 Pulled By: NivekT fbshipit-source-id: 4780f2f6a79ae15a4009092475e7d92f96dd09a2	2021-08-31 08:32:31 -07:00
Kevin Tse	0deb7a0bc0	[DataPipe] implementing demux() (#63650 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63650 Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D30493944 Pulled By: NivekT fbshipit-source-id: 0aa06dee8c7fb1744975b8f6a0694b90c11ef80d	2021-08-31 08:32:29 -07:00
Kevin Tse	eee054e6ea	[DataPipe] implementing fork() (#63649 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63649 Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D30493945 Pulled By: NivekT fbshipit-source-id: 40db7d4134facd266d86bc0dc2edf2729c4e5842	2021-08-31 08:32:27 -07:00
Erjia Guan	af85bc5ffd	Replace group_by_key by group_by IterDataPipe (#64220 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64220 Remove `ByKeyGrouperIterDataPipe` due to duplicated functionality. Fix a bug in `GrouperIterDataPipe` using the existing test. Test Plan: Imported from OSS Reviewed By: VitalyFedyunin Differential Revision: D30650542 Pulled By: ejguan fbshipit-source-id: 666b4d28282fb4f49f3ff101b8d08be16a50d836	2021-08-30 18:45:44 -07:00
Erjia Guan	7946f8a9f6	Rename DataPipe to Op-er (#63325 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63325 Rename each DataPipe to an operation name ending with er. Functional API should remain `verb` such as `read_from_tar` , `shuffle`, ... (Discussed in [here](https://github.com/facebookexternal/torchdata/pull/97#discussion_r688553905)) - Batch -> Batcher - Collate -> Collator - Concat -> Concater - GroupByKey - > ByKeyGrouper ? - ListDirFiles -> FileLister - LoadFilesFromDisk -> FileLoader - Map -> Mapper - ReadFilesFromTar -> TarArchiveReader - ReadFilesFromZip -> ZipArchiveReader - ReadLinesFromFile -> LineReader - Shuffle -> Shuffler - ToBytes -> StreamReader - Transforms -> Transformer - Zip -> Zipper Let me know if you have better name for each DataPipe Test Plan: Imported from OSS Reviewed By: mruberry Differential Revision: D30466950 Pulled By: ejguan fbshipit-source-id: 72909dca7b3964ab83b965891f96cc1ecf62d049	2021-08-23 14:36:10 -07:00
Erjia Guan	383a33a0eb	Make DataChunk support list in-place ops (#63422 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63422 Fixes #63095 Make `DataChunk` delegate to list method. Then it will support in-place operations: - `sort` - `reverse` - `append` - `extend` - `random.shuffle` Test Plan: Imported from OSS Reviewed By: ngimel Differential Revision: D30379027 Pulled By: ejguan fbshipit-source-id: d176bd0cc8b89b915c7bb184ff243ab1f605616d	2021-08-18 08:48:47 -07:00
Erjia Guan	d1cbee7b2b	Refactor BucketBatch (#63185 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63185 Test Plan: Imported from OSS Reviewed By: bdhirsh Differential Revision: D30288893 Pulled By: ejguan fbshipit-source-id: b88b792d12a83c99d8ea9e516e3b4c54a82100f6	2021-08-16 06:42:56 -07:00
Erjia Guan	56d609d93e	Replace str by repr for DataChunk (#63184 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63184 Test Plan: Imported from OSS Reviewed By: bdhirsh Differential Revision: D30288892 Pulled By: ejguan fbshipit-source-id: 45c88fdd3987e234f2c22ebbbfd8d5044983c34c	2021-08-16 06:41:38 -07:00
Shen Li	1022443168	Revert D30279364: [codemod][lint][fbcode/c*] Enable BLACK by default Test Plan: revert-hammer Differential Revision: D30279364 (`b004307252`) Original commit changeset: c1ed77dfe43a fbshipit-source-id: eab50857675c51e0088391af06ec0ecb14e2347e	2021-08-12 11:45:01 -07:00
Zsolt Dollenstein	b004307252	[codemod][lint][fbcode/c*] Enable BLACK by default Test Plan: manual inspection & sandcastle Reviewed By: zertosh Differential Revision: D30279364 fbshipit-source-id: c1ed77dfe43a3bde358f92737cd5535ae5d13c9a	2021-08-12 10:58:35 -07:00
Vitaly Fedyunin	d3bdf345cb	Introducing DataChunk for DataPipes batching (#62768 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62768 This is part of TorchArrow DF support preparation, separating it to multiple PRs to simplify review process. Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D30149090 Pulled By: VitalyFedyunin fbshipit-source-id: a36b5ff56e2ac6b06060014d4cd41b487754acb8	2021-08-06 08:38:33 -07:00
Vitaly Fedyunin	4ef640d6f6	Sort imports of test_datapipe.py (#61312 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61312 Sorting according to isort output. Alphabetically ordered one per line imports help merging. Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D29588833 Pulled By: VitalyFedyunin fbshipit-source-id: 4c80c3086132b50894e734ad6c5799d78d689e42	2021-07-12 15:33:20 -07:00
Vitaly Fedyunin	fd13e925ec	Adding backward compatibility for sharding support in old DataLoader (#61237 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61237 Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D29588832 Pulled By: VitalyFedyunin fbshipit-source-id: 3bfa4417f6a04450f656ecf28fc95322d2cf076a	2021-07-12 14:53:45 -07:00
Vitaly Fedyunin	d3cb065b2f	Implement usage of `is_shardable` and `apply_sharding` (#61236 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61236 Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D29588835 Pulled By: VitalyFedyunin fbshipit-source-id: 00c3042f96af498637b2dcf6e3f842c1fc05ddd8	2021-07-12 14:23:20 -07:00
Vitaly Fedyunin	f2857883c4	Add DataPipes Graph Functions (#61235 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61235 Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D29588834 Pulled By: VitalyFedyunin fbshipit-source-id: e0331d6e1fc2a3f8b6211aac83965bcf13165161	2021-07-12 10:28:35 -07:00
Vitaly Fedyunin	99959fe3f5	[DataLoader] Adding demux and mux DataPipe-s (#61234 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61234 * #61234 [WIP] Adding demux and mux DataPipe API examples Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D29588836 Pulled By: VitalyFedyunin fbshipit-source-id: 523d12ea6be7507d706b4c6d8827ec1ac4ccabc3	2021-07-12 10:04:03 -07:00
zilinzhu	c19adfff54	[DataLoader] Introduce ConcatMapDataPipe functional datapipe (#61010 ) Summary: As part of https://github.com/pytorch/pytorch/issues/57031, this PR adds the ConcatMapDataPipe functional datapipe for the MapDataPipe class. We may need to discuss how to treat the datapipes with no valid length. For now, I just use them as if they have infinite length and the `__getitem__` could not go pass them. Thank you for your time on reviewing this~ cc ejguan Pull Request resolved: https://github.com/pytorch/pytorch/pull/61010 Reviewed By: soulitzer Differential Revision: D29587679 Pulled By: ejguan fbshipit-source-id: 5eb97fa727209bec6c534520057c64a78000626e	2021-07-09 09:29:18 -07:00
Vitaly Fedyunin	a652398465	[DataLoader] Rename transform DataPipe to legacy_transform (#60670 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60670 Test Plan: Imported from OSS Reviewed By: mruberry Differential Revision: D29461081 Pulled By: VitalyFedyunin fbshipit-source-id: 57f53a91db9032a6126e86243ddea9149c473060	2021-06-30 09:49:14 -07:00
Kevin Tse	df8a8fbc1b	Improve code and documentation clarity for DataPipes APIs (#60423 ) Summary: Fixes issues that are discussed with ezyang in the comments of PR https://github.com/pytorch/pytorch/issues/59498 Improved code and documentation clarity, and refactored .filter to nesting_level directly Pull Request resolved: https://github.com/pytorch/pytorch/pull/60423 Reviewed By: ezyang Differential Revision: D29281599 Pulled By: NivekT fbshipit-source-id: a9bbaf52f492db0741c00f3ceb4022b08ddb1506	2021-06-22 11:19:08 -07:00
Jiong Gu	a120a12ab4	[Bootcamp][pytorch]Add WebIterDataPipe and ToBytesIterDataPipe to the datapipes. (#59816 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59816 Add two new DataPipes, one for getting web file urls to yield streams and one for getting streams to yield bytes. Test Plan: Add test_web_iterable_datapipe in test/test_datapipes.py. The test initiates a local http server for serving test files. Test below locally ok. 1. create and load 16M localhost file urls (each of size 10 Bytes) 2. create and load a 64GB localhost file in the unit test, for sake of testing time, disabling both stress test and large file test Imported from OSS Reviewed By: VitalyFedyunin Differential Revision: D29051186 fbshipit-source-id: f8e44491e670560bf445af96f94d98230436f396	2021-06-15 11:43:26 -07:00
Erjia Guan	e7ad82eb2f	[DataLoader] Add option to refine type during runtime validation for DP instance (#56066 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56066 Test Plan: Imported from OSS Reviewed By: VitalyFedyunin Differential Revision: D27776646 Pulled By: ejguan fbshipit-source-id: 695ff7775177653d809c5917d938c706281e1298	2021-06-10 14:04:24 -07:00
Kevin Tse	fa030d1213	[DataPipes] Add simple unbatch to DataPipe (#59610 ) Summary: Implements the simple unbatch feature for DataPipe https://github.com/pytorch/pytorch/issues/58148 Pull Request resolved: https://github.com/pytorch/pytorch/pull/59610 Reviewed By: VitalyFedyunin Differential Revision: D28994180 Pulled By: NivekT fbshipit-source-id: 4bafe6e26c4f95a808c489b147369413a196fa1c	2021-06-09 16:53:31 -07:00
Kevin Tse	12b4e8996f	[DataLoader] Add nesting_level argument to map and filter (#59498 ) Summary: This PR implements the .map and .filter APIs for IterDataPipe. [DataPipes] Make .map of DataPipe sensitive to nested_level argument https://github.com/pytorch/pytorch/issues/58145 [DataPipes] Make .filter of DataPipe sensitive to nested_level argument https://github.com/pytorch/pytorch/issues/58147 Pull Request resolved: https://github.com/pytorch/pytorch/pull/59498 Reviewed By: ejguan Differential Revision: D28964280 Pulled By: NivekT fbshipit-source-id: b1ee6cafa3953093ebd7bf30eacc80c3ef7cd190	2021-06-09 07:40:53 -07:00
Erjia Guan	5c7e14d2bc	[DataLoader] Switch NotImplementedError to TypeError for len (#59464 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59464 Fixes #59378 Test Plan: Imported from OSS Reviewed By: jbschlosser Differential Revision: D28944447 Pulled By: ejguan fbshipit-source-id: 8b3d53a1863b41e578d56f219e452d18d7eae0d8	2021-06-08 07:16:18 -07:00
Erjia Guan	1b578c4bf5	[DataLoader] Close byte stream explicitly (#58938 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58938 When run `test_datapipe.py`, python `gc` would report lots of `ResourceWarning`s due to unclosed stream. It's not only annoying, there are two potential problems: - Performance regression because `gc` requires additional memory and computation to track reference - Python `gc` runs periodically so we many encountered an error of too many open files due to OS limitation To reduce the warning: - Explicitly close byte stream - Modify `test_datapipe.py` to use context manager Small fix: - Reorder import in `test_datapipe.py` Further investigation: Can we directly use context manager in `LoadFileFromDisk` and `ReadFileFromTar` to eliminate this Error? - Probably no. It's feasible only if the pipeline is synchronized and without prefetching. When we enable these two features, the scope guard of the context manager doesn't work. - We may need to implement some reference counter attached to these file byte stream to close by itself. Test Plan: Imported from OSS Reviewed By: jbschlosser Differential Revision: D28689862 Pulled By: ejguan fbshipit-source-id: bb2a85defb8a4ab5384db902ef6ad062185c2653	2021-06-08 07:15:08 -07:00
Erjia Guan	0e16087064	[DataLoader] Fix bugs for typing (#58450 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58450 Test Plan: Imported from OSS Reviewed By: VitalyFedyunin Differential Revision: D28507877 Pulled By: ejguan fbshipit-source-id: f4051ff51ce77ef45214f11cba10c8a7e1da4dad	2021-05-24 07:14:40 -07:00
Marcio Porto	4942fe0290	[DataLoader] Introduce MapMapDataPipe functional datapipe (#58258 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58258 As part of https://github.com/pytorch/pytorch/issues/57031, this PR adds the `MapMapDataPipe` functional datapipe for the `MapDataPipe` class. Usage: ``` def fn(x): return x * 10 dp = CountingDataset(n=10) dp.map(fn) ``` Reviewed By: ejguan Differential Revision: D28394510 fbshipit-source-id: 8d71b1f5723dff52385c3ce753944304896af678	2021-05-20 09:00:21 -07:00
Erjia Guan	3b977b3b4d	[DataLoader] Add context manager for runtime type validation (#55936 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55936 Test Plan: Imported from OSS Reviewed By: VitalyFedyunin Differential Revision: D27743476 Pulled By: ejguan fbshipit-source-id: 8f0454ccf3ec37807598056433bff91013fa9bb9	2021-05-12 11:59:16 -07:00
Erjia Guan	5c696443c7	[DataLoader] Modfity construct_time_validation to argument_validation (#55836 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55836 Change construct_time_validation to argument_validation as we should provide users the flexibility to use this decorator over all different functions, which are required with type validation. It can also work as a construct-time validation ```py class ExampleDataPipe(IterDataPipe): argument_validation def __init__(self, dp: IterDataPipe[int]): self.dp = dp ... ``` Notebook is also updated. Test Plan: Imported from OSS Reviewed By: VitalyFedyunin Differential Revision: D27743478 Pulled By: ejguan fbshipit-source-id: 49743152d121028cd7d72d89dc7df5c7c7b94c41	2021-05-12 11:58:05 -07:00
Erjia Guan	b58a7c95aa	[DataLoader] Raise detailed Error for ForwardRef type (#57824 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57824 Implement type check for string type. Re-raise detailed exception at compile time. ```py >>> class InvalidData(Generic[T_co], NamedTuple): # Invalid generic namedtuple in Python typing ... name: str ... data: T_co class DP(IterDataPipe['InvalidData[int]']): ... pass TypeError: InvalidData[int] is not supported by Python typing ``` Add `__type_class__` attribute to class, which optimizes the static checking flow by reducing checking times. ```py >>> class DP1(IterDataPipe[Union[int, str]]): ... pass >>> class DP2(DP1[int]): ... pass >>> list((cls, getattr(cls, '__type_class__', None)) for cls in DP2.__mro__) [(<class '__main__.DP2'>, False), (<class 'abc.DP1[int]'>, True), (<class '__main__.DP1'>, False), (<class 'abc.IterableDataset[typing.Union[int, str]]'>, True), (<class 'torch.utils.data.dataset.IterableDataset'>, False), (<class 'torch.utils.data.dataset.Dataset'>, None), (<class 'typing.Generic'>, None), (<class 'object'>, None)] ``` Among the class of `DP2`'s MRO, only `DP2`, `DP1` will be static checked when `__type_class__` is `False`. `abc.DP1[int]` and `abc.IterableDataset[typing.Union[int, str]]` will be ignored since they are just a class with typing. ## Future When Python 3.6 is deprecated, using TypeAlias rather than TypeMeta can eliminates the usage of `__type_class__` attribute. Test Plan: Imported from OSS Reviewed By: VitalyFedyunin Differential Revision: D28289104 Pulled By: ejguan fbshipit-source-id: 1da97460c8bfc48cea7396033fde484a24caba7c	2021-05-11 13:38:30 -07:00
Erjia Guan	ece15f6902	[DataLoader] Change Decoder signature and add MatHandler (#57391 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57391 Test Plan: Imported from OSS Reviewed By: ngimel Differential Revision: D28151601 Pulled By: ejguan fbshipit-source-id: 34814197d2f068cab0c7ca2330152ad588eb1ef0	2021-05-10 06:29:00 -07:00
Sam Estep	75024e228c	Add lint for unqualified `type: ignore` (#56290 ) Summary: The other half of https://github.com/pytorch/pytorch/issues/56272. Pull Request resolved: https://github.com/pytorch/pytorch/pull/56290 Test Plan: CI should pass on the tip of this PR, and we know that the lint works because the following CI runs (before this PR was finished) failed: - https://github.com/pytorch/pytorch/runs/2384511062 - https://github.com/pytorch/pytorch/actions/runs/765036024 Reviewed By: seemethere Differential Revision: D27867219 Pulled By: samestep fbshipit-source-id: e648f07b6822867e70833e23ddafe7fb7eaca235	2021-04-21 08:07:23 -07:00
Erjia Guan	0b1c3dfae4	[DataLoader] Typing Enforcement for DataPipe at runtime (#54544 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54544 ## Feature - Add `subinstance(data, type)` to check `data` is a subtype instance of the `type` - Add a decorator of `runtime_validation` to validate the returned data from `__iter__` is subtype instance of hint. Test Plan: Imported from OSS Reviewed By: VitalyFedyunin Differential Revision: D27327234 Pulled By: ejguan fbshipit-source-id: fb6a332762b0fe75284bb2b52a13ed171b42558c	2021-04-02 15:22:32 -07:00
Erjia Guan	1535520f08	[DataLoader] Typing Enforcement for DataPipe at construct-time (#54066 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54066 ## Feature - Add a decorator `construct_time_validation` to validate each input datapipe according to the corresponding type hint. Test Plan: Imported from OSS Reviewed By: VitalyFedyunin Differential Revision: D27327236 Pulled By: ejguan fbshipit-source-id: a9d4c6edb5b05090bd5a369eee50a6fb4d7cf957	2021-04-02 15:22:29 -07:00
Erjia Guan	44edf8c421	[DataLoader] Typing Enforcement for DataPipe at Compile-time (#54020 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54020 ## Feature - Add `issubtype` to check the type is a subtype of the other type. - Add `_DataPipeMeta` (mimic Python typing 3.6) - Add `type` attribute for each DataPipe - Save original `__init__` function for each DataPipe - Validate return hint of `__iter__` - Replace `__init__` function bases on `type` - Fixed type: Put original `__init__` back, if it exists or use a plain `__init__` - Non-fixed type: Add new `__init__` with the functionality to copy `cls.type` for each instance. (Optimized for memory) No Error for main repo, `torchvision`, `torchaudio` and `torchtext`. ## Future - Add same thing for `__getitem__`. - When DataFrame came out, add an another type for DataFrame with column name and type. Test Plan: Imported from OSS Reviewed By: VitalyFedyunin Differential Revision: D27327232 Pulled By: ejguan fbshipit-source-id: fd3a6029c16f5d814b1d7e1b1566fdcd8fd1ad9a	2021-04-02 15:22:27 -07:00
Erjia Guan	560e3be587	[DataLoader] Implement issubtype for type hints (#54299 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54299 ## Feature - Check type is a subtype of another type Prerequisite for DataPipe tying system. Test Plan: Imported from OSS Reviewed By: VitalyFedyunin Differential Revision: D27327235 Pulled By: ejguan fbshipit-source-id: 8f50a663a86540677c9e132ac7c5216fdac46f70	2021-04-02 15:20:55 -07:00
Erjia Guan	fff0a3f906	[DataLoader] ZipIterDataPipe (#53554 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53554 Test Plan: Imported from OSS Reviewed By: albanD Differential Revision: D26913406 Pulled By: ejguan fbshipit-source-id: 24604b41d08eb6f7689add152229049a4c65c06e	2021-03-12 08:26:21 -08:00

1 2 3 4

165 Commits