Fixes #158631
The docstring said `data_source` was a `Dataset`, but `RandomSampler` only needs something that implements `__len__`. This updates the docstring to use `Sized` instead, which matches the actual type used in the constructor.
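For illustration, a minimal sketch (not part of the PR) showing that any object implementing `__len__` works as `data_source`; the `CountOnly` class is hypothetical:
```python
from torch.utils.data import RandomSampler

class CountOnly:
    # not a Dataset: no __getitem__, just __len__, which is all RandomSampler uses
    def __len__(self) -> int:
        return 5

print(list(RandomSampler(CountOnly())))  # e.g. [3, 0, 4, 1, 2]
```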
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158857
Approved by: https://github.com/divyanshk
Builds upon #76951.
Benchmarking code is the same as in #76950.
AMD Ryzen Threadripper PRO 3995WX:
```
  batch_size  drop_last      origin     new  speedup
------------  -----------  --------  ------  ---------
           4  True             0.94  0.5706  64.74%
           4  False          0.9745  0.9468  2.93%
           8  True           0.7423  0.3715  99.82%
           8  False          0.7974  0.5666  40.73%
          64  True           0.5394  0.2085  158.76%
          64  False          0.6083  0.2697  125.51%
         640  True           0.5448  0.1985  174.41%
         640  False          0.7085  0.2308  206.91%
        6400  True           0.5554  0.2028  173.88%
        6400  False          0.7711  0.2109  265.60%
       64000  True            0.556  0.2091  165.82%
       64000  False          0.7803  0.2078  275.58%
```
When `drop_last == True`, it uses `zip` to speed things up.
When `drop_last == False`, it uses `itertools` to speed things up.
`itertools` was the fastest way I could find that deals with the last batch if it is smaller than `batch_size`. I have a pure Python method too, but it is slower when `batch_size` is 4 or 8, so I have committed the `itertools` version for now.
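For illustration, a minimal sketch of the two tricks described above (the helper names are mine, and this is not the exact diff in `sampler.py`):
```python
from itertools import islice

def batches_drop_last(sampler, batch_size):
    # zip over batch_size references to the same iterator yields only full
    # batches and silently drops a final short one (drop_last=True).
    sampler_iter = iter(sampler)
    for batch in zip(*[sampler_iter] * batch_size):
        yield list(batch)

def batches_keep_last(sampler, batch_size):
    # islice pulls up to batch_size items per step, so the final batch may be
    # shorter than batch_size (drop_last=False).
    sampler_iter = iter(sampler)
    while batch := list(islice(sampler_iter, batch_size)):
        yield batch

print(list(batches_drop_last(range(10), 4)))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(list(batches_keep_last(range(10), 4)))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```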
Happy to chat further about this change :-) I understand you may not want to introduce the `itertools` package into [sampler.py](https://github.com/pytorch/pytorch/blob/main/torch/utils/data/sampler.py).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137423
Approved by: https://github.com/Skylion007
This updates ruff to 0.285, which is faster, better, and fixes a bunch of false negatives with regard to f-strings.
I also enabled RUF017, which looks for accidental quadratic list summation. Luckily, there seem to be no instances of it in our codebase, so I'm enabling it so that it stays that way. :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang
In our DDP training workloads, each rank was initializing a `RandomSampler` for a dataset with a length of 3.5 billion items. We noticed that when this sampler was in scope, `gc.collect` calls were taking on the order of seconds to run, which would slow down the entire training iteration. This is because when we call `torch.randperm(n).tolist()`, we create a Python list of 3.5 billion items, which massively slows down the periodic mark-and-sweep garbage collection.
This PR swaps out the `.tolist()` call with a `.numpy()` call and manually calls `.item()` on each element as it is being requested (see the sketch after this list). This has two benefits:
1. The first call to `RandomSampler::__next__` should be about twice as fast, since `.numpy()` does not copy the contents of the original tensor.
2. The runtime of `gc.collect()` calls no longer scales linearly with the size of the dataset passed to `RandomSampler`.
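A hedged sketch of the iteration change described above (the surrounding attribute names are assumptions, not the exact diff):
```python
def __iter__(self):
    n = len(self.data_source)
    # .numpy() shares memory with the permutation tensor instead of building a
    # multi-billion-element Python list that the GC has to traverse; each index
    # is converted to a Python int lazily via .item().
    for idx in torch.randperm(n, generator=self.generator).numpy():
        yield idx.item()
```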
I've attached some `timeit` samples to illustrate the speedups with this PR:
```
Main (no GC):       51.72115747816861
Main (10 GC calls): 83.61965207383037
PR (no GC):         33.06403830461204
PR (10 GC calls):   33.959467427805066
```
Code
```python
from timeit import timeit

baseline_no_gc = """
import torch

n = int(1e9)
steps = n // 100

x = torch.randperm(n).tolist()
x_iter = iter(x)

for i in range(steps):
    next(x_iter)
"""

baseline_gc = """
import torch
import gc

n = int(1e9)
steps = n // 100
gc_every = steps // 10

x = torch.randperm(n).tolist()
x_iter = iter(x)

for i in range(steps):
    next(x_iter)
    if i % gc_every == 0:
        gc.collect()
"""

numpy_no_gc = """
import torch

n = int(1e9)
steps = n // 100

x = torch.randperm(n).numpy()
x_iter = (i.item() for i in x)

for i in range(steps):
    next(x_iter)
"""

numpy_gc = """
import torch
import gc

n = int(1e9)
steps = n // 100
gc_every = steps // 10

x = torch.randperm(n).numpy()
x_iter = (i.item() for i in x)

for i in range(steps):
    next(x_iter)
    if i % gc_every == 0:
        gc.collect()
"""

if __name__ == "__main__":
    print("Main (no GC):      ", timeit(baseline_no_gc, number=1))
    print("Main (10 GC calls):", timeit(baseline_gc, number=1))
    print("PR (no GC):        ", timeit(numpy_no_gc, number=1))
    print("PR (10 GC calls):  ", timeit(numpy_gc, number=1))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103339
Approved by: https://github.com/kit1980
This is a new version of #15648 based on the latest master branch.
Unlike the previous PR where I fixed a lot of the doctests in addition to integrating xdoctest, I'm going to reduce the scope here. I'm simply going to integrate xdoctest, and then I'm going to mark all of the failing tests as "SKIP". This will let xdoctest run on the dashboards, provide some value, and still let the dashboards pass. I'll leave fixing the doctests themselves to another PR.
In my initial commit, I do the bare minimum to get something running with failing dashboards. The few tests that I marked as skip are causing segfaults. Running xdoctest results in 293 failed, 201 passed tests. The next commits will be to disable those tests. (Unfortunately I don't have a tool that will insert the `# xdoctest: +SKIP` directive over every failing test, so I'm going to do this mostly manually.)
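For reference, a hypothetical docstring (not taken from the PR) showing how the directive is applied to a failing doctest:
```python
def frobnicate(x):
    """Scale a tensor (hypothetical example).

    Example:
        >>> # xdoctest: +SKIP
        >>> import torch
        >>> frobnicate(torch.ones(3))  # currently failing, so skipped for now
        tensor([2., 2., 2.])
    """
    return 2 * x
```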
Fixes https://github.com/pytorch/pytorch/issues/71105
@ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82797
Approved by: https://github.com/ezyang
Fixes #78236
An erroneously shaped weights vector will result in the following output:
```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/datarwe/pytorch/torch/utils/data/sampler.py in <module>
    274 WeightedRandomSampler([1,2,3], 10)
--> 275 WeightedRandomSampler([[1,2,3], [4,5,6]], 10)

~/datarwe/pytorch/torch/utils/data/sampler.py in __init__(self, weights, num_samples, replacement, generator)
    192         weights = torch.as_tensor(weights, dtype=torch.double)
    193         if len(weights.shape) != 1:
--> 194             raise ValueError("weights should be a 1d sequence but given "
    195                              "weights have shape {}".format(tuple(weights.shape)))
    196

ValueError: weights should be a 1d sequence but given weights have shape (2, 3)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78585
Approved by: https://github.com/NivekT, https://github.com/ejguan
Summary:
X-link: https://github.com/pytorch/data/pull/368
This PR aims to expose the right data-related APIs.
There are two more changes made in this PR to convert public APIs to private ones:
`check_lambda_fn` -> `_check_lambda_fn`
`deprecation_warning` -> `_deprecation_warning`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76143
Reviewed By: albanD, NivekT
Differential Revision: D35798311
Pulled By: ejguan
fbshipit-source-id: b13fded5c88a533c706702fb2070c918c839dca4
(cherry picked from commit 0b534b829a2e90e1e533951c6d334fdeaa9358b9)
Summary:
More details can be found in the old PR: https://github.com/pytorch/pytorch/pull/53085
ejguan Thanks for your guidance. I tried to reopen this PR following your instructions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63026
Reviewed By: anjali411
Differential Revision: D30224920
Pulled By: ejguan
fbshipit-source-id: 2fa83bd4a2661485e553447fe3e57ce723f2716d
Summary:
In `__iter__` of `RandomSampler`, when `self.replacement` is `False`, the original code always used `self.generator` in `torch.randperm` instead of the generator we set.
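A hedged sketch of the corrected `__iter__` (the surrounding structure is an assumption, not the merged diff):
```python
def __iter__(self):
    n = len(self.data_source)
    if self.generator is None:
        # fall back to a freshly seeded local generator when none was supplied
        generator = torch.Generator()
        generator.manual_seed(int(torch.empty((), dtype=torch.int64).random_().item()))
    else:
        generator = self.generator
    if not self.replacement:
        # the fix: pass the local `generator`, not `self.generator`, to randperm
        yield from torch.randperm(n, generator=generator).tolist()
```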
Fixes https://github.com/pytorch/pytorch/issues/52568
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52956
Reviewed By: mruberry
Differential Revision: D26724303
Pulled By: H-Huang
fbshipit-source-id: 86f2795c76f3548e31181fb077af046078a173cb
Summary:
Fix https://github.com/pytorch/pytorch/issues/32530
I used the `next()` function to generate samples one at a time. To compensate for `replacement=False`, I added a variable called "sample_list" to `RandomSampler` for the random permutation (see the sketch below).
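A heavily hedged sketch of that idea; `sample_list` follows the wording above, and the class is illustrative rather than the merged code:
```python
import torch

class _OneAtATimeRandomSampler:
    def __init__(self, data_source, replacement=False):
        self.data_source = data_source
        self.replacement = replacement
        self.sample_list = None  # lazily filled permutation when replacement=False
        self._pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        n = len(self.data_source)
        if self.replacement:
            # draw a single index on demand
            return int(torch.randint(high=n, size=(1,)).item())
        if self.sample_list is None:
            self.sample_list = torch.randperm(n).tolist()
        if self._pos >= n:
            raise StopIteration
        idx = self.sample_list[self._pos]
        self._pos += 1
        return idx
```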
cc SsnL
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40026
Reviewed By: zhangguanheng66
Differential Revision: D22519869
Pulled By: ezyang
fbshipit-source-id: be65850025864d659a713b3bc461b25d6d0048a2
Summary:
Since the check was added in https://github.com/pytorch/pytorch/pull/6249, one cannot pass an iterable as a sampler to the data loader anymore, which was a very handy feature (e.g., https://github.com/pytorch/pytorch/issues/1337; see the sketch after this list). I think the check should be removed for two reasons:
1. It is too strict. There is no reason that it should not be a general iterable.
2. It is inconsistent. In `DataLoader` (the main place where people use samplers), you can pass a general iterable as `batch_sampler` but not `sampler` due to this check.
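A minimal sketch of the feature in question (using current `torch.utils.data` names as an assumption):
```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(10.0))
indices = [0, 2, 4, 6, 8]  # a plain iterable of indices, not a Sampler subclass

loader = DataLoader(dataset, sampler=indices, batch_size=2)
for (batch,) in loader:
    print(batch)  # tensor([0., 2.]), tensor([4., 6.]), tensor([8.])
```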
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38403
Differential Revision: D21555958
Pulled By: soumith
fbshipit-source-id: c7267bb99a31edd8f2750689205d6edc5dab5cff
Summary:
This is a modified version of https://github.com/pytorch/pytorch/pull/14705 since commit structure for that PR is quite messy.
1. Add `IterableDataset`.
2. So we now have two data loading modes: `Iterable` and `Map`.
   1. `Iterable` if the `dataset` is an instance of `IterableDataset`.
   2. `Map` otherwise.
3. Add better support for non-batch loading (i.e., `batch_size=None` and `batch_sampler=None`). This is useful in doing things like bulk loading.
4. Refactor `DataLoaderIter` into two classes, `_SingleProcessDataLoaderIter` and `_MultiProcessingDataLoaderIter`. Rename some methods to be more generic, e.g., `get_batch` -> `get_data`.
5. Add `torch.utils.data.get_worker_info`, which returns worker information in a worker process (e.g., worker id, dataset obj copy, etc.) and can be used in `IterableDataset.__iter__` and `worker_init_fn` to do per-worker configuration (see the sketch after this list).
6. Add `ChainDataset`, which is the analog of `ConcatDataset` for `IterableDataset`.
7. Import `torch.utils.data` in `torch/__init__.py`.
8. Add data loader examples and documentation.
9. Use `get_worker_info` to detect whether we are in a worker process in `default_collate`.
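A short sketch of the `IterableDataset` / `get_worker_info` pattern this enables (the dataset class and sharding scheme are assumptions for illustration):
```python
import math
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class RangeIterableDataset(IterableDataset):
    """Streams integers in [start, end); each worker yields a disjoint shard."""

    def __init__(self, start, end):
        self.start, self.end = start, end

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            # single-process loading: this process yields everything
            return iter(range(self.start, self.end))
        # in a worker process: split the range by worker id
        per_worker = int(math.ceil((self.end - self.start) / info.num_workers))
        lo = self.start + info.id * per_worker
        hi = min(lo + per_worker, self.end)
        return iter(range(lo, hi))

if __name__ == "__main__":
    loader = DataLoader(RangeIterableDataset(0, 10), num_workers=2)
    print(list(loader))  # ten single-element batches, interleaved across workers
```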
Closes https://github.com/pytorch/pytorch/issues/17909, https://github.com/pytorch/pytorch/issues/18096, https://github.com/pytorch/pytorch/issues/19946, and some of https://github.com/pytorch/pytorch/issues/13023
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19228
Reviewed By: bddppq
Differential Revision: D15058152
fbshipit-source-id: 9e081a901a071d7e4502b88054a34b450ab5ddde