Commit Graph

10 Commits

Author SHA1 Message Date
Luis Galeana
6b0545d764 Implemented logging of inputs per second
Summary: Every time data is put into the logger, it checks whether a second has passed. If so, it reports how many inputs were received during the last second.
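A minimal sketch of the pattern described above (class and method names are hypothetical; the actual logger lives elsewhere in the codebase):

    import time

    class InputRateLogger(object):
        """Counts inputs and reports the rate roughly once per second."""

        def __init__(self):
            self._count = 0
            self._last_report = time.time()

        def put(self, item):
            self._count += 1
            now = time.time()
            if now - self._last_report >= 1.0:
                rate = self._count / (now - self._last_report)
                print("inputs/sec: {:.1f}".format(rate))
                self._count = 0
                self._last_report = now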

Differential Revision: D4527148

fbshipit-source-id: f197eb975ed81111449705e0719d1e56f385fd8d
2017-02-16 12:02:05 -08:00
Renbin Peng
7ca1c0e405 Add two data_loaders and refactor code
Summary:
(1) Add two dataloaders, everstore and squashfs
(2) Refactor code

Differential Revision: D4500365

fbshipit-source-id: f70fb40ca29cdbfb46da5f3f6322f2d953c01903
2017-02-11 02:13:36 -08:00
Aapo Kyrola
849fc7ba68 check that parameter is int
Summary: One trainer passed (10,) as the max_buffer_size parameter, causing the internal queue to grow out of bounds because qsize == (10,) was never true. This adds an assertion on the type of the parameter.
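The fix amounts to validating the parameter up front; a sketch of the kind of assertion added (the surrounding function is hypothetical, only the parameter name comes from the summary):

    def _check_max_buffer_size(max_buffer_size):
        # qsize() == (10,) is never true, so a tuple here makes the
        # internal queue grow without bound; fail fast on a wrong type.
        assert isinstance(max_buffer_size, int), \
            "max_buffer_size must be an int, got {}".format(type(max_buffer_size))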

Reviewed By: prigoyal

Differential Revision: D4527649

fbshipit-source-id: 492a824700b8fc69c484b80773b1f1f5aee39071
2017-02-08 03:04:04 -08:00
Aapo Kyrola
6a03641cde Add num_iters to RunNet()
Summary:
Running RunNet() in a Python loop can be a performance issue if the Python code is doing a lot of other processing, such as data input, because Python's Global Interpreter Lock (GIL) will prevent RunNet() from being called. This can easily be fixed by making RunNet() run multiple iterations on the C++ side. (Another way to accomplish the same thing is to use Caffe2's "execution plans", but that requires more setup.)
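A sketch of the difference (the keyword is taken from the title as num_iters; the exact signature may differ):

    from caffe2.python import workspace

    # Before: each call returns to Python and must reacquire the GIL,
    # so other Python threads (e.g. data input) can delay every iteration.
    for _ in range(1000):
        workspace.RunNet("train_net")

    # After: the loop runs in C++ and one call covers all iterations.
    workspace.RunNet("train_net", num_iters=1000)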

+ fixed timing reporting in my OC workflow
+ improved one error log in data_workers.py

Sorry for piggybacking those small changes, but landing diffs is currently slow...

Reviewed By: rpenggithub

Differential Revision: D4523575

fbshipit-source-id: 039a647576efad5dd9afda74df478ac22b43c103
2017-02-07 14:16:14 -08:00
Aapo Kyrola
50213705d4 Allow specifying max buffer size. Smaller initial size.
Summary:
I recently encountered out-of-memory errors in my OC workflow. This was because the internal queue for buffering image patches was too large. Total memory use was:
  image size = 227 x 227 x 3 x 4 bytes
  total mem = image size x queue size (500) x num gpus x everstore-worker batch (128) > 300 GB

Reducing the queue size to 100 should fix this. The size can also now be specified as a parameter.
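The arithmetic behind the estimate, assuming 8 GPUs (the GPU count is not stated in the summary):

    image_bytes = 227 * 227 * 3 * 4   # one float32 image, ~0.6 MB
    queue_size = 500                  # old internal queue capacity
    num_gpus = 8                      # assumption, not stated above
    batch = 128                       # everstore-worker batch size

    total = image_bytes * queue_size * num_gpus * batch
    print("{:.0f} GB".format(total / 1e9))  # ~317 GB, i.e. "> 300 gigs"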

Reviewed By: rpenggithub

Differential Revision: D4519956

fbshipit-source-id: 781697e620431ce7053534e683047bb6e7257b22
2017-02-06 22:01:56 -08:00
Aapo Kyrola
82f1a8e12d fix code doc for data_workers
Summary: Fix bug in doc as reported by rpenggithub

Reviewed By: rpenggithub

Differential Revision: D4356796

fbshipit-source-id: a35e54247d84ba29ef1b8e8cac0de8a3d30b489e
2016-12-21 09:29:43 -08:00
Aapo Kyrola
35fa9e9c5f a couple small reliability improvements
Summary:
A couple more misc changes:
- allow starting the coordinator multiple times -- this makes data-parallel programming easier
- make the fetcher id a global sequence; before, each GPU had the same ids for its workers
- my flow jobs got stuck when joining the fetcher threads. I think there is actually a memory fencing problem with the is_active boolean, but I am too tired to add proper condition variables there. Instead I just added a timeout to join(), as sketched below. It is needed anyway, since an I/O thread could get blocked.
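A sketch of the last two points (names hypothetical): a process-wide id sequence shared by all GPUs, and a bounded join so a stuck fetcher cannot hang shutdown:

    import itertools

    # Global sequence: every fetcher gets a unique id, regardless of GPU.
    _fetcher_ids = itertools.count()

    def next_fetcher_id():
        return next(_fetcher_ids)

    def stop_fetchers(threads, timeout=5.0):
        # join() with a timeout: even if is_active is never observed as
        # False, or an I/O call blocks, shutdown still returns eventually.
        for t in threads:
            t.join(timeout)
            if t.is_alive():
                print("fetcher thread {} did not stop in time".format(t.name))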

Differential Revision: D4333381

fbshipit-source-id: 88226c8a9c9a5e05d771360a502a2ba21a6b9d76
2016-12-15 21:29:29 -08:00
Aapo Kyrola
2bf18f2b1d add inception and dummy input
Summary:
As requested by Yangqing, added the Inception model (copied from convnet_benchmarks) and a dummy data feed option to the xray trainer, which we use for scalability benchmarking.

+ a couple of minor changes to the data input framework

Reviewed By: Yangqing

Differential Revision: D4327024

fbshipit-source-id: 86911468456fc13a32d5f437a43347380ec66a68
2016-12-15 13:40:22 -08:00
Aapo Kyrola
e80423f341 bug fix to distinguish train/test data
Summary:
We often use the same net for training and testing, but we must distinguish their data. Yesterday's diff forgot to include that distinction (it was in the xray sampler before), and this diff adds it. Basically, one provides a name for the input source given to data_workers, and all the queues and scratch spaces are suffixed with that name to keep them separate.
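A sketch of the suffixing scheme (function and blob names hypothetical):

    def scoped_name(blob_name, source_name):
        # e.g. scoped_name("data_queue", "train") -> "data_queue_train",
        # so train and test inputs never share queues or scratch blobs.
        return "{}_{}".format(blob_name, source_name)

    train_queue = scoped_name("data_queue", "train")
    test_queue = scoped_name("data_queue", "test")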

Also set the Caffe2 queue's size to 4, which was empirically found to be sufficient. It was erroneously defined as a function of the batch size, which does not make sense because each *element* in the queue is a batch; this led to out-of-memory issues on the xray trainer.
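Why the capacity should not scale with the batch size: each slot holds a whole batch, so memory is capacity x batch bytes (numbers below are illustrative):

    batch_size = 128
    image_bytes = 227 * 227 * 3 * 4
    batch_bytes = batch_size * image_bytes  # one *element* of the queue

    old_capacity = batch_size  # buggy: grows quadratically with batch size
    new_capacity = 4           # small constant, empirically sufficient

    print(old_capacity * batch_bytes / 1e9, "GB vs",
          new_capacity * batch_bytes / 1e9, "GB")  # ~10.1 GB vs ~0.3 GB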

Differential Revision: D4329449

fbshipit-source-id: c994da1c8b0935b8eda2402c118d49b76caa7da8
2016-12-15 12:01:31 -08:00
Aapo Kyrola
0b52b3c79d Generalize threaded data input via queues + Everstore input
Summary:
The Xray sampler (originally by ajtulloch) and prigoyal's resnet trainer use variants of threaded data input, in which worker threads put batches into a Python queue that is drained by an enqueuer thread, which dumps those batches into a Caffe2 queue, which is in turn drained by the net's DequeueBlobs operator.

There is a lot of boilerplate, which is also quite complicated.

This diff is an attempt to generalize that shared machinery under a new module, "data_workers" (name could be improved). Basically, you pass it a function that returns chunks of data (usually data + labels).
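A sketch of the pattern being generalized, with plain Python queues standing in for the module's internals; the fetcher shape (worker id and batch size in, (data, labels) out) is an assumption for illustration:

    import queue
    import threading

    import numpy as np

    def dummy_fetcher(worker_id, batch_size):
        # The user-supplied function: returns one chunk of (data, labels).
        data = np.random.rand(batch_size, 3, 227, 227).astype(np.float32)
        labels = np.random.randint(0, 1000, size=batch_size).astype(np.int32)
        return data, labels

    def run_pipeline(fetcher, num_workers=2, batch_size=32, num_batches=8):
        py_queue = queue.Queue(maxsize=4)

        def worker(worker_id):
            for _ in range(num_batches):
                py_queue.put(fetcher(worker_id, batch_size))

        threads = [threading.Thread(target=worker, args=(i,))
                   for i in range(num_workers)]
        for t in threads:
            t.start()

        # Stand-in for the enqueuer thread, which in the real module feeds
        # a Caffe2 queue drained by the net's DequeueBlobs operator.
        for _ in range(num_workers * num_batches):
            data, labels = py_queue.get()

        for t in threads:
            t.join()

    run_pipeline(dummy_fetcher)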

I also created a module 'everstore_data_input', which generalizes everstore-origin data input with a preprocessing function (image augmentation, for example). See how I refactored sampler.py for the usage.
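A sketch of such a preprocessing hook (the function shape is hypothetical):

    import numpy as np

    def augment(image):
        # Example preprocessing: random horizontal flip of an HWC image.
        if np.random.rand() < 0.5:
            image = image[:, ::-1, :]
        return image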

Next we could create a fetcher function for Laser data.

Differential Revision: D4297667

fbshipit-source-id: 8d8a863b177784ae13940730a27dc76cd1dd3dac
2016-12-15 12:01:30 -08:00