Commit Graph

17 Commits

Author SHA1 Message Date
Jaliya Ekanayake
bb3a2d99ac Jaliyae/chunk buffer fix (#17409)
Summary:
The chunk buffer had a possibility to hang when no data is read and the buffer size is lower than chunk size. We detected this while running with larger dataset and hence the fix. I added a test to mimic the situation and validated that the fix is working. Thank you Xueyun for finding this issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17409

Differential Revision: D14198546

Pulled By: soumith

fbshipit-source-id: b8ca43b0400deaae2ebb6601fdc65b47f32b0554
2019-02-23 08:48:53 -08:00
Jaliya Ekanayake
9477c143c6 C++ Frontend: adding two distributed samples (Random and Sequential) (#16910)
Summary:
Adding two distrbuted samplers, Random and Sequential to the mix. Similar to python counterpart, DistributedSampler introduces a new method `set_epoch(size_t epoch)` which can be use to shuffle data determinstically between distributed processes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16910

Differential Revision: D14130980

Pulled By: soumith

fbshipit-source-id: ec08b7130c01e2fc6dc3693f7ac622a0a6d60f10
2019-02-19 05:40:37 -08:00
Jaliya Ekanayake
bc39cf4d5e Remove chunk count check on the ChunkBuffer (#16868)
Summary:
Previously, the ChunkBuffer depends on the remaining chunk count to signal end of dataloading. This does not work with distributed samplers where each sampler only loads a subset of  chunks. This refactor remove the dependency on the remaining chunk count at the ChunkBuffer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16868

Differential Revision: D14066517

Pulled By: goldsborough

fbshipit-source-id: 293dfe282ceff326dff0876c2f75c2ee4f4463e2
2019-02-13 11:09:42 -08:00
xuzhu
6249442e90 Chunk dataset implementation (#15932)
Summary:
This PR contains the implementation of chunk dataset, with the API proposed in PR https://github.com/pytorch/pytorch/pull/15562

A chunk dataset is derived from StatefulDataset. It utilizes worker threads to prefetches chunk data, splits it into batches and caches them into a queue. When get_batch is called from dataloader, batch data is retrieved from the queue, and data in new chunks will be pushed for later following batches.

Chunk dataset uses two samplers (chunk_sampler and example_sampler) to perform sampling. The chunk_sampler decides which chunk to load, and example_sampler shuffles the examples inside a specific chunk. More detail of this sampling approach can be found here: http://martin.zinkevich.org/publications/nips2010.pdf
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15932

Differential Revision: D13868688

Pulled By: soumith

fbshipit-source-id: a43000c478ca2a3c64cc84b3626d6b8b1ad9a07e
2019-01-29 18:06:01 -08:00
Peter Goldsborough
a4c1aa4bc5 Add the normalize transform to the core library (#15891)
Summary:
Adds the `Normalize` transform to the core C++ frontend library.

ebetica ezyang soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15891

Differential Revision: D13642167

Pulled By: goldsborough

fbshipit-source-id: 573428e626d6106cf2aadf3dc2e2aecb9a85efc3
2019-01-11 19:50:18 -08:00
Peter Goldsborough
ad6799537e Support stateful dataset (#15096)
Summary:
Currently re-implements the dataloader for stateful datasets. Outstanding work:
- Refactor DataLoader and DataLoader2 to have common base classes and only differ in specifi pieces of logic,
- Figure out how to not duplicate the `MapDataset` logic for stateful vs. non-stateful
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15096

Differential Revision: D13522043

Pulled By: goldsborough

fbshipit-source-id: 08e461ca51783047f11facc4d27dfa2e4f1e4c2a
2018-12-24 06:26:40 -08:00
Jaliya Ekanayake
44cb43bcc1 Jaliyae/samplers (#13870)
Summary:
Make Samplers optionally accept new size in their reset() method. This helps dataloader or dataset to reset the sampler for an epoch or a chunk of data with different sizes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13870

Differential Revision: D13240120

Pulled By: soumith

fbshipit-source-id: 19c53f8be13c0fdcf504f0637b0d3e6009a8e599
2018-11-29 07:07:19 -08:00
Peter Goldsborough
f639249d51 Fix dataloader iterator test (#14045)
Summary:
I noticed the test `DataLoaderTest.CanDereferenceIteratorMultipleTimes` doesn't test proper progression of the iterator. I also added a test for using `std::copy`.

Fixes https://github.com/pytorch/pytorch/issues/14276

ebetica ezyang apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14045

Differential Revision: D13092187

Pulled By: goldsborough

fbshipit-source-id: 57698ec00fa7b914b159677a4ab38b6b25c2860b
2018-11-26 17:06:41 -08:00
Peter Goldsborough
fb6535ec70 Add SharedDataset (#13800)
Summary:
This PR adds a `SharedDataset` to the C++ frontend data API, which allows wrapping a shared_ptr to a dataset into a class that conforms to the `Dataset` interface (with `get_batch`). This enables use cases where a custom dataset is (1) thread-safe and (2) expensive to copy. All workers will reference a single instance of this dataset. No additional copies are incurred.

jaliyae apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13800

Differential Revision: D13075610

Pulled By: goldsborough

fbshipit-source-id: 4ffdfd7959d49b042c0e254110085f62a0bfeb6c
2018-11-16 13:07:10 -08:00
Peter Goldsborough
8f4dc192b6 Fix DataLoaderTest.EnforcesOrderingAmongThreadsWhenConfigured (#14038)
Summary:
I think this will be it. So for one, the previous test was bullshit because it was returning the thread id instead of the sample index (which is the thing whose ordering is enforced). Just turning up the number of threads to 10 from 4 made this very obvious. I also think there is a race condition, which may or may not have surfaced, in that there was nothing stopping one worker to get multiple batches, which would screw with the whole ordering logic. I've added a barrier struct such that workers wait for all workers to be in the `get_batch` function before actually doing something.

Fixes https://github.com/pytorch/pytorch/issues/14002

ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14038

Differential Revision: D13088132

Pulled By: goldsborough

fbshipit-source-id: 4bded63756c6a49502ee07ef8709a03073e7e05f
2018-11-15 17:30:41 -08:00
Edward Yang
0478d32cb8 Move AlignOf, SmallVector and ArrayRef to c10.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/13916

Reviewed By: smessmer

Differential Revision: D13046722

fbshipit-source-id: 1583d3170d60e22f0a535cd1fd56bdf928186f5d
2018-11-14 11:13:16 -08:00
Peter Goldsborough
5151d33287 Unflake the ordering enforcement test (#13919)
Summary:
Attempts to unflake the dataloader ordering enforcement test. I think the issue was that the `thread_counter` variable was not atomic. I've made it atomic, and also global just to make it a bit clearer.

Fixes https://github.com/pytorch/pytorch/issues/13634

colesbury SsnL ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13919

Differential Revision: D13051718

Pulled By: goldsborough

fbshipit-source-id: b9f7f6317701a8b861a1d5c6a9b2b17b44782561
2018-11-13 21:05:02 -08:00
Peter Goldsborough
393ad6582d Use torch:: instead of at:: in all C++ APIs (#13523)
Summary:
In TorchScript and C++ extensions we currently advocate a mix of `torch::` and `at::` namespace usage. In the C++ frontend I had instead exported all symbols from `at::` and some from `c10::` into the `torch::` namespace. This is far, far easier for users to understand, and also avoid bugs around creating tensors vs. variables. The same should from now on be true for the TorchScript C++ API (for running and loading models) and all C++ extensions.

Note that since we're just talking about typedefs, this change does not break any existing code.

Once this lands I will update stuff in `pytorch/tutorials` too.

zdevito ezyang gchanan
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13523

Differential Revision: D12942787

Pulled By: goldsborough

fbshipit-source-id: 76058936bd8707b33d9e5bbc2d0705fc3d820763
2018-11-06 14:32:25 -08:00
Peter Goldsborough
8fafa7b6ac Remove size() from BatchDataset and templatize IndexType (#12960)
Summary:
This PR brings to changes to the recently landed C++ Frontend dataloader:

1. Removes the `size()` method from `BatchDataset`. This makes it cleaner to implement unsized ("infinite stream") datasets. The method was not used much beyond initial configuration.
2. Makes the index type of a dataset a template parameter of `BatchDataset` and `Sampler`. This essentially allows custom index types instead of only `vector<size_t>`. This greatly improves flexibility.

See the `InfiniteStreamDataset` and `TestIndex` datasets in the tests for what this enables.

Some additional minor updates and code movements too.

apaszke SsnL
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12960

Differential Revision: D12893342

Pulled By: goldsborough

fbshipit-source-id: ef03ea0f11a93319e81fba7d52a0ef1a125d3108
2018-11-05 17:13:09 -08:00
Peter Goldsborough
c21471c77f Sampler serialization and deserialization (#12999)
Summary:
Implements serialization and deserialization for samplers in the C++ frontend dataloader.

apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12999

Differential Revision: D10859676

Pulled By: goldsborough

fbshipit-source-id: cd132100fd35323e5a3df33e314511750806f48d
2018-10-26 12:20:51 -07:00
Dmytro Dzhulgakov
49046239f2 Change explicit usages of at::optional to c10::optional (#13082)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13082

Follow up of D10511254. For these cases we can move to preferred `optional` without namespace right away.

Reviewed By: ezyang, Yangqing

Differential Revision: D10844117

fbshipit-source-id: 99a59e692fb4b236b299579f937f1536d443d899
2018-10-25 15:17:53 -07:00
Peter Goldsborough
a022fd2d6b Implement DataLoader (#11918)
Summary:
This PR implements a DataLoader API for the C++ frontend.

The components present in this API largely match the Python API. It consists of:
- `Dataset`s: Conceptually a function from a set of indices to a batch of examples;
- `Transform`s: A functional transformation of a dataset. A `Map<D, T>` for Dataset `D` and transform `T` is itself a dataset;
- `Sampler`s: Specify a strategy for generating indices for a new batch;
- A `DataLoader`, with the ability to automatically parallelize fetching of samples across multiple worker threads;

Note that collation functions fall naturally out of the `Map<Dataset, Transform>` abstraction.

Things that are missing right now that maybe should be added:
- Memory pinning for CUDA tensors

The API was designed to be generalizable to almost any kind of dataset, transform or sampling strategy, while providing a convenient API out of the box. To achieve this, it is quite heavily templatized on various possible input types.

There are many parts to this PR! Right now, I would like feedback on:
- Your impression of the general usability of the API;
- Your impression of which parts seem too complex or overthought;
- The implementation of the parallelization aspects of the DataLoader. I've followed the Python implementation in some matters, but also differ in others. I think my implementation is a little cleaner and decouples components slightly better than the Python dataloader.

I haven't added too many comments yet, as this is fresh out of the oven. Let me know if anything is unclear from the code itself.

There also aren't any tests yet. I will write a comprehensive test suite once we agree on the API and implementation.

apaszke ezyang The controller you requested could not be found. pietern
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11918

Reviewed By: ezyang

Differential Revision: D9998881

Pulled By: goldsborough

fbshipit-source-id: 22cf357b63692bea42ddb1cc2abc71dae5030aea
2018-10-22 10:22:41 -07:00