This PR deprecates the `traverse` function and replaces it with `traverse_datapipes`.
While using `DataLoader`, I realized that it raises a `FutureWarning` even though I am not explicitly using `traverse`. What is happening is that `DataLoader` invokes `traverse(dp, only_datapipe=True)`, and the use of that keyword causes the `only_datapipe` warning to be raised.
```
/home/ubuntu/miniconda3/lib/python3.8/site-packages/torch/utils/data/graph.py:102: FutureWarning: `only_datapipe` is deprecated from `traverse` function and will be removed after 1.13.
warnings.warn(msg, FutureWarning)
```
A few things we'd like to do:
1. Deprecate the keyword argument `only_datapipe`
2. Change the default behavior from `only_datapipe=False` to `only_datapipe=True` in the future
3. Do not raise a warning when users are using the function correctly
This creates a paradox: it is impossible for users to change their code to match the future default behavior (i.e. call `traverse(dp)` without `only_datapipe`):
- they cannot do so because the default behavior of `traverse` hasn't changed yet, so they must use `only_datapipe=True`
- if they use `only_datapipe=True`, eventually the kwarg will go away and cause a runtime error; they also get a `FutureWarning` in the present
IIUC, there doesn't seem to be a way to accomplish those 3 goals without replacing the function with a new one that has a different name; hence, this PR. Let me know if there is a better alternative.
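For illustration, a hedged migration sketch (the import path below is an assumption; the new name follows this PR's description):
```python
# Before (raises FutureWarning because of the deprecated kwarg):
#   graph = traverse(dp, only_datapipe=True)
# After this PR (assumed to live next to the old function in torch.utils.data.graph):
from torch.utils.data.graph import traverse_datapipes

graph = traverse_datapipes(dp)  # traverses only DataPipes; no deprecation warning
```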
If this looks right, I will send a follow up PR in `TorchData`.
Differential Revision: [D39832183](https://our.internmc.facebook.com/intern/diff/D39832183)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85667
Approved by: https://github.com/ejguan
The [fastNLP](https://github.com/fastnlp/fastNLP/blob/v0.6.0/fastNLP/core/batch.py#L51) model uses DataSetGetter to fetch data from the dataset. The following code breaks because of https://github.com/pytorch/pytorch/pull/84301:
```python
import os

import torch
from fastNLP.core.batch import DataSetGetter
from fastNLP.io.pipe.qa import CMRC2018BertPipe

input_dir = os.path.join(os.path.dirname(os.path.abspath(__file__)), ".data", "cmrc2018-sim")
data_bundle = CMRC2018BertPipe().process_from_file(paths=input_dir)
data_bundle.rename_field('chars', 'words')
dataset = data_bundle.get_dataset('dev')

dataset = DataSetGetter(dataset, as_numpy=False)
dataiter = torch.utils.data.DataLoader(dataset=dataset)
for batch in dataiter:
    pass  # data-processing...
```
This is because for the `DataSetGetter` class, the following condition holds:
```
# hasattr(dataset_getter, '__getitems__') == True
# dataset_getter.__getitems__ == None
```
This PR adds an additional check to make sure `__getitems__` is only called when it is not None.
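A minimal sketch of the guard in the map-style fetcher's `fetch` (simplified from the code path shown in the traceback below):
```python
def fetch(self, possibly_batched_index):
    if self.auto_collation:
        # Only take the batched path when __getitems__ exists *and* is not None.
        if hasattr(self.dataset, "__getitems__") and self.dataset.__getitems__:
            data = self.dataset.__getitems__(possibly_batched_index)
        else:
            data = [self.dataset[idx] for idx in possibly_batched_index]
    else:
        data = self.dataset[possibly_batched_index]
    return self.collate_fn(data)
```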
This error was found by the torchbench nightly CI, original error stack trace:
```
ERROR: test_fastNLP_Bert_train_cuda (__main__.TestBenchmark)
----------------------------------------------------------------------
components._impl.workers.subprocess_rpc.ChildTraceException: Traceback (most recent call last):
File "/home/circleci/project/components/_impl/workers/subprocess_rpc.py", line 470, in _run_block
exec( # noqa: P204
File "<subprocess-worker>", line 35, in <module>
File "<subprocess-worker>", line 12, in _run_in_worker_f
File "/home/circleci/project/torchbenchmark/util/model.py", line 16, in __call__
obj = type.__call__(cls, *args, **kwargs)
File "/home/circleci/project/torchbenchmark/models/fastNLP_Bert/__init__.py", line 93, in __init__
self.example_inputs = self._prefetch(example_inputs)
File "/home/circleci/project/torchbenchmark/models/fastNLP_Bert/__init__.py", line 133, in _prefetch
for batch_x, batch_y in example_inputs:
File "/home/circleci/miniconda3/lib/python3.8/site-packages/fastNLP/core/batch.py", line 266, in __iter__
for indices, batch_x, batch_y in self.dataiter:
File "/home/circleci/miniconda3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 681, in __next__
data = self._next_data()
File "/home/circleci/miniconda3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 719, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/home/circleci/miniconda3/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 56, in fetch
data = self.dataset.__getitems__(possibly_batched_index)
TypeError: 'NoneType' object is not callable
```
Full error log: https://app.circleci.com/pipelines/github/pytorch/benchmark/5143/workflows/0676f36d-0ab4-42bd-adb4-90e6b0df76d1/jobs/5293
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85099
Approved by: https://github.com/ejguan
This PR requires https://github.com/pytorch/pytorch/pull/83202 to be landed first.
## Changes
- For `apply_shuffle_setting` and `apply_shuffle_seed`, make sure the shuffle setting is applied to every DataPipe that has a `set_shuffle` or `set_seed` method.
- Change the API name from `apply_shuffle_seed` to `apply_random_seed`.
- Fix a bug where `apply_shuffle_seed` only accepted hashable DataPipes. After this PR, the function uses `id` to prevent seeding the same DataPipe multiple times per epoch (see the sketch below).
- Fix another bug in `Shuffler` where `reset` with `_enable=False` would also reset `_seed`.
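A rough sketch of the id-based de-duplication (the helper and variable names are illustrative, not the exact internals):
```python
import torch

def _seed_each_datapipe_once(datapipes, rng: torch.Generator):
    """Seed every DataPipe that exposes set_seed at most once per epoch,
    keyed by id() so unhashable DataPipes work too."""
    seen = set()
    for dp in datapipes:  # e.g. all DataPipes collected from the traversed graph
        if id(dp) in seen:
            continue
        seen.add(id(dp))
        if hasattr(dp, "set_seed"):
            seed = int(torch.randint(0, 2**31, (1,), generator=rng).item())
            dp.set_seed(seed)
```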
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83741
Approved by: https://github.com/NivekT
We sometimes get an exception message like this:
```
This exception is thrown by __iter__ of TarArchiveLoaderIterDataPipe(datapipe=FileOpenerIterDataPipe, length=-1, mode='r:') elif msg not in e.args[0] and single_iterator_msg not in e.args[0]:
TypeError: argument of type 'int' is not iterable
```
The `TypeError` raised by the mishandling of the error message obfuscates the true exception, which will now be shown as:
```
FileNotFoundError: [Errno 2] No such file or directory:
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84676
Approved by: https://github.com/ejguan
Summary: This diff adds a check in the fetcher: if the dataset to be fetched has a `__getitems__` method, it is used to fetch a batch of elements at once, as opposed to one by one. This is beneficial for IO-bound usage.
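For illustration, a hedged sketch of a map-style dataset that opts into the batched hook (the class and its storage are made up for this example):
```python
from torch.utils.data import Dataset

class ListBackedDataset(Dataset):
    """Toy dataset; a real use case would batch the IO (e.g. one read per batch)."""

    def __init__(self, values):
        self.values = list(values)

    def __len__(self):
        return len(self.values)

    def __getitem__(self, idx):
        return self.values[idx]

    def __getitems__(self, indices):
        # Called by the fetcher with the whole batch of indices at once.
        return [self.values[i] for i in indices]
```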
Differential Revision: D39145980
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84301
Approved by: https://github.com/VitalyFedyunin
As @pmeier [points out](https://github.com/pytorch/pytorch/pull/80267#discussion_r958423241), #80267 introduces a bug where an exception is thrown when a built-in function (or a function implemented in C) is used with `.map` because `inspect.signature(fn)` cannot find the function's signature.
This PR skips over a function when its signature cannot be found. I believe this case is rare, and if the `fn` is truly incompatible with the usage of `input_col`/`output_col`, an exception will be raised at run time such that users will be able to examine what is wrong.
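A minimal sketch of the skip, assuming the validation helper simply bails out when no signature is available (the helper name is made up for illustration):
```python
import inspect

def _validate_map_fn_columns(fn, input_col, output_col):
    try:
        sig = inspect.signature(fn)
    except ValueError:
        # Built-ins / C-implemented functions may not expose a signature;
        # skip static validation and let any real mismatch fail at runtime.
        return
    # ... validate input_col / output_col against sig.parameters ...
```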
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84279
Approved by: https://github.com/pmeier, https://github.com/janeyx99
Fixes: https://github.com/pytorch/data/issues/718
This is an alternative PR against https://github.com/pytorch/pytorch/pull/82974
This PR changes the behavior for both types to match that of `IterDataPipe.shuffle`:
- Lazily generate the seed per iteration
- Each iterator gets a new seed
- Convert `MapDataPipe.shuffle` to an `IterDataPipe`
## BC-breaking Note:
This PR changes the return type of `MapDataPipe.shuffle` from a `MapDataPipe` to an `IterDataPipe`.
### 1.12
Output as `MapDataPipe`
```
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
True
>>> isinstance(dp, IterDataPipe)
False
```
### This PR:
Output as `IterDataPipe`
```
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
False
>>> isinstance(dp, IterDataPipe)
True
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83202
Approved by: https://github.com/NivekT
This PR changes the behavior of `IterDataPipe` to always invoke `reset` from the `NotStarted` state. The main reason is that we normally put lazy initialization code into the `reset` function, so even in the `NotStarted` state we should invoke `reset` to initialize those lazy variables. Otherwise, we would have to manually determine in the `__iter__` function whether the state is `NotStarted` or `Iterating` and only manually invoke `reset` for `NotStarted`.
This PR also makes `Shuffler` able to serialize with its `buffer` and `rng_state`.
The following part is removed:
~I am also add `_snapshot_state` into serialization state and during `__setstate__` only change the state to `Restored` if the original state is `Iterating`. Especially, for the case of deserializing/serializing `NotStarted` DataPipe (multiprocessing), we would invoke `set_seed` for `Shuffler`. We need the `DataPipe` remains as `NotStarted` to properly `reset`.~
I am listing all the expected state transitions below; a rough sketch of the corresponding `__iter__` dispatch follows the list:
- Initial state: `NotStarted`
  - `iter` -> Call `reset` and change the state to `Iterating`
  - serialize/deserialize -> Keep the state as `NotStarted` (will `reset` if `iter` is called afterwards)
- Initial state: `Iterating`
  - `iter` -> Call `reset` and keep the state as `Iterating`
  - serialize/deserialize -> Change the state to `Restored`
- Initial state: `Restored`
  - `iter` -> Only change the state to `Iterating`
  - serialize/deserialize -> Not allowed
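A hedged sketch of that dispatch, assuming a `_SnapshotState` enum with `NotStarted`/`Iterating`/`Restored` members (simplified; the real logic wraps the DataPipe's own `__iter__`):
```python
def _dispatch_iter(self):
    # NotStarted and Iterating both go through reset(), so lazy initialization
    # placed in reset() always runs before the first element is produced.
    if self._snapshot_state in (_SnapshotState.NotStarted, _SnapshotState.Iterating):
        self.reset()
        self._snapshot_state = _SnapshotState.Iterating
    elif self._snapshot_state == _SnapshotState.Restored:
        # A restored DataPipe keeps its deserialized buffer/RNG state; no reset.
        self._snapshot_state = _SnapshotState.Iterating
    return self._original_iter()  # stand-in for the DataPipe's original __iter__
```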
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83535
Approved by: https://github.com/NivekT
This is a new version of #15648 based on the latest master branch.
Unlike the previous PR where I fixed a lot of the doctests in addition to integrating xdoctest, I'm going to reduce the scope here. I'm simply going to integrate xdoctest, and then I'm going to mark all of the failing tests as "SKIP". This will let xdoctest run on the dashboards, provide some value, and still let the dashboards pass. I'll leave fixing the doctests themselves to another PR.
In my initial commit, I do the bare minimum to get something running with failing dashboards. The few tests that I marked as skip are causing segfaults. Running xdoctest results in 293 failed, 201 passed tests. The next commits will be to disable those tests. (unfortunately I don't have a tool that will insert the `#xdoctest: +SKIP` directive over every failing test, so I'm going to do this mostly manually.)
Fixes https://github.com/pytorch/pytorch/issues/71105
@ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82797
Approved by: https://github.com/ezyang
Fixes https://github.com/pytorch/data/issues/708
The following code snippet used to fail, now it has been added as a test case:
```python
import torch.utils.data.datapipes as dp

dp1 = dp.map.SequenceWrapper(range(10))
shuffle_dp1 = dp1.shuffle()
dp2 = dp.map.SequenceWrapper(range(10))
shuffle_dp2 = dp2.shuffle()
zip_dp = shuffle_dp1.zip(shuffle_dp2)
list(zip_dp) # This used to fail
```
The issue was that `ShufflerMapDataPipe` raises a `KeyError` when an out-of-bounds index is passed to it, but `zip_dp`'s `__getitem__` only handled `IndexError`. With this change, it handles both.
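A minimal sketch of the fix in the map-style zipper's `__getitem__` (simplified; the error message is illustrative):
```python
def __getitem__(self, index):
    res = []
    for dp in self.datapipes:
        try:
            res.append(dp[index])
        except (IndexError, KeyError):
            # ShufflerMapDataPipe raises KeyError for an unknown index;
            # treat it the same as an out-of-range IndexError.
            raise IndexError(f"Index {index} is out of range for one of the input MapDataPipes")
    return tuple(res)
```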
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82666
Approved by: https://github.com/ejguan
### Description
Across PyTorch's docstrings, both `callable` and `Callable` are used for variable types. `Callable` should be capitalized when we are referring to the `Callable` type, and not the Python `callable()` function.
### Testing
There shouldn't be any testing required.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82487
Approved by: https://github.com/albanD
This mostly completes the "poor man's snapshotting" implementation (named "simple snapshotting"). This is the most basic version of snapshotting but it should work for all DataPipes. I will be adding more efficient implementation for different types of DataPipes in future PRs.
### Implementation
The general idea of the simple snapshot is that we will:
1. Create a new iterator
2. Move that iterator forward by `n_iterations`
3. Save that as the `_fast_forward_iterator` of the DataPipe
4. The next time `iter` is called on the DataPipe, use the `_fast_forward_iterator`
### Usage
As of this implementation, the usage will look something like:
```python
import pickle

import torch
from torch.utils.data import IterDataPipe

rng = torch.Generator()
initial_rng_state = rng.get_state()
datapipe: IterDataPipe = ...
# Some usage of the DataPipe, here maybe yielding the first 5 values
n_iter = 5
it = iter(datapipe)
for _ in range(n_iter):
    next(it)
serialized_graph = pickle.dumps(datapipe)
# The serialized object has most of the information needed for a simple snapshot (except the initial RNG state)
# It can be deserialized at a later point in time or by a different process
deserialized_graph = pickle.loads(serialized_graph)
# I think `DataLoader2` or `ReadingService` should store `initial_rng_state` so that it can be saved by the API that we later use
rng_for_deserialized = torch.Generator()
rng_for_deserialized.set_state(initial_rng_state)
n_iterations = deserialized_graph._number_of_samples_yielded
_simple_snapshot_graph(deserialized_graph, n_iterations, rng=rng_for_deserialized)
# The whole DataPipe graph should have the same state as before serialization, such that:
self.assertEqual(list(it), list(deserialized_graph))  # True
```
### Next Steps
If this looks acceptable, the next step is to modify `DataLoader2`'s prototype ReadingService (the one with queues) to remember things like `initial_rng_state` and to add methods `save_snapshot` (returning the `(serialized graph, initial_rng)` pair) and `restore_snapshot`. This should work for single-worker data loading.
Note that, in the long term, `initial_rng_state` may not be necessary if we are able to directly save/restore the buffer and RNG state of `Shuffler` (that is work in progress). However, `initial_rng_state` and the simple snapshot remain a good fall-back option for edge cases where the buffer can't be stored.
Differential Revision: [D37943406](https://our.internmc.facebook.com/intern/diff/D37943406)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79479
Approved by: https://github.com/ejguan
I went through most of the warnings and exceptions raised in our tests to find these issues.
Changes:
1. In testing, `self.assertEquals` is deprecated; convert to `self.assertEqual` to get rid of the warning
2. Small changes for cleanliness and to get rid of warnings (no actual change to results)
3. Correct the `is_every_instance_exhausted` logic for `_Forker`
4. Catch the `RuntimeError` raised by an invalidated iterator during clean-up
5. Check that the attribute `parent_stream` exists before trying to access it
Differential Revision: [D38020122](https://our.internmc.facebook.com/intern/diff/D38020122)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81833
Approved by: https://github.com/ejguan
Summary:
This diff removes the requirement of the `traverse` function that `DataPipe` be hashable. The `traverse` function now uses the `id` of the `DataPipe` instance, rather than the `DataPipe` itself, as the key for both the cache and the graph.
This requires changing the type of `DataPipeGraph` from `Dict[DataPipe, "DataPipeGraph"]` to `Dict[int, Tuple[DataPipe, "DataPipeGraph"]]`.
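For reference, a hedged sketch of the new graph type (the `DataPipe` alias covering both iter- and map-style pipes is written out here for clarity):
```python
from typing import Dict, Tuple, Union

from torch.utils.data import IterDataPipe, MapDataPipe

DataPipe = Union[IterDataPipe, MapDataPipe]
# Keyed by id(datapipe); the instance is stored alongside its sub-graph so it
# can still be recovered from the graph without requiring hashability.
DataPipeGraph = Dict[int, Tuple[DataPipe, "DataPipeGraph"]]  # type: ignore[misc]
```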
Differential Revision: D37354153
Ref PR in TorchData: https://github.com/pytorch/data/pull/559
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80509
Approved by: https://github.com/VitalyFedyunin
This PR adds an attribute and logic to count the number of successful yields from `IterDataPipe`. This information can be useful to fast-forward a DataPipe (or the entire graph) back to a certain state.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79657
Approved by: https://github.com/VitalyFedyunin
Summary:
X-link: https://github.com/pytorch/data/pull/547
Fixes https://github.com/pytorch/data/issues/538
- Improve the validation function to raise a warning about an unpicklable function when either a lambda or a local function is provided to a DataPipe.
- The inner function of a functools.partial object is extracted for validation as well.
- Mimic the behavior of the pickle module for a lambda defined inside a local function: pickle only raises an error about the local object, not the lambda, so we raise a warning about the local function rather than the lambda.
```py
>>> import pickle
>>> def fn():
... lf = lambda x: x
... pickle.dumps(lf)
>>> fn()
AttributeError: Can't pickle local object 'fn.<locals>.<lambda>'
```
This diff also fixes the error introduced by https://github.com/pytorch/pytorch/pull/79344
Test Plan:
CI on PyTorch and TorchData
Manually validated the tests from TorchVision
Differential Revision: D37417556
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80232
Approved by: https://github.com/NivekT
Fixes https://github.com/pytorch/data/issues/538
- Improve the validation function to raise a warning about an unpicklable function when either a lambda or a local function is provided to a `DataPipe`.
- The inner function of a `functools.partial` object is extracted for validation as well.
- Mimic the behavior of the `pickle` module for a lambda defined inside a local function: pickle only raises an error about the local object, not the `lambda`, so we raise a warning about the local function rather than the lambda.
```py
>>> import pickle
>>> def fn():
... lf = lambda x: x
... pickle.dumps(lf)
>>> fn()
AttributeError: Can't pickle local object 'fn.<locals>.<lambda>'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80140
Approved by: https://github.com/VitalyFedyunin, https://github.com/NivekT
Fixes #79828
In a distributed environment, before this PR, DataLoader would create a Tensor holding the shared seed on RANK 0 and send the Tensor to the other processes. However, when `NCCL` is used as the distributed backend, the Tensor has to be moved to CUDA before being broadcast from RANK 0 to the other RANKs, and the issue is that DataLoader doesn't move the Tensor to CUDA before sharing it via `NCCL`.
After offline discussion with @mrshenli, we think the distributed Store is a better solution as the shared seed is just an integer value. Then, we can get rid of the dependency on NCCL and CUDA when sharing info between distributed processes for DataLoader.
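A hedged sketch of the Store-based exchange (the key name and the use of the default process-group store are assumptions for illustration):
```python
import torch.distributed as dist

def _exchange_shared_seed(shared_seed=None, key="_dl_shared_seed"):
    """Rank 0 publishes the seed as a string; other ranks read it back.
    No tensors are involved, so no NCCL/CUDA dependency."""
    store = dist.distributed_c10d._get_default_store()
    if dist.get_rank() == 0:
        store.set(key, str(shared_seed))
        return shared_seed
    return int(store.get(key))  # get() blocks until rank 0 has set the key
```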
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79829
Approved by: https://github.com/VitalyFedyunin, https://github.com/NivekT
Fixes #78510
This PR adds support for using fractions with `random_split`. This should be completely backwards-compatible, as the fraction-style splitting is only applied when the sum of the input lengths is at most 1.0.
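A hedged usage sketch of the fractional form (this mirrors how fractional lengths are documented for current `random_split`; any remainder rounding is handled internally):
```python
import torch
from torch.utils.data import random_split

# Split 10 samples 70% / 30% by passing fractions instead of absolute lengths.
train, val = random_split(range(10), [0.7, 0.3],
                          generator=torch.Generator().manual_seed(42))
print(len(train), len(val))  # 7 3
```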
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78877
Approved by: https://github.com/ejguan
1. Change the sharding strategy from sharding by worker first and then by rank, to sharding by rank first and then by worker.
2. Fetch the rank and world size in the main process, for the sake of the `spawn` start method.
For change 1:
Before this PR, when the dataset cannot be evenly divided by `worker_num * world_size`, more data is retrieved by the workers on the first RANKs.
Using the following example:
- dataset size: 100
- world_size: 4
- num_worker: 2
The number of data retrieved by each rank before this PR
- Rank 0: 26
- Rank 1: 26
- Rank 2: 24
- Rank 3: 24
The number of data retrieved by each rank after this PR
- Rank 0: 25
- Rank 1: 25
- Rank 2: 25
- Rank 3: 25
For change 2:
Before this PR, `dist` functions were invoked inside the worker processes. That is fine when the worker processes are forked from the parent process: all environment variables are inherited by and visible to these `dist` functions. However, when the worker processes are spawned, they cannot access these environment variables, and the dataset ends up not being sharded by rank.
After this PR, `_sharding_worker_init_fn` should work for both the `spawn` and `fork` cases.
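A hedged sketch of the resulting worker-init logic (variable names are illustrative; rank and world size are assumed to be captured in the main process and bound into the worker init function rather than read from environment variables inside the worker):
```python
import torch.utils.data
import torch.utils.data.graph_settings as graph_settings

def _shard_datapipe_in_worker(world_size, rank, worker_id):
    info = torch.utils.data.get_worker_info()
    datapipe = info.dataset
    total_workers = info.num_workers * world_size
    # Flat shard index laid out so elements are spread across ranks first and
    # then across each rank's workers, which evens out the per-rank counts.
    global_worker_id = worker_id * world_size + rank
    graph_settings.apply_sharding(datapipe, total_workers, global_worker_id)
```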
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79041
Approved by: https://github.com/VitalyFedyunin, https://github.com/NivekT