Summary:
Add a dont_rebatch parameter to data_workers. It disables re-batching of the fetcher's output into equal-sized chunks; with RNNs this is undesirable, since we may want smaller batches for longer sequence lengths.
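A minimal usage sketch, assuming the existing init_data_input_workers entry point; the net, blob names and fetcher below are placeholders:

from caffe2.python import data_workers

coordinator = data_workers.init_data_input_workers(
    rnn_net, ["data", "label"], fetch_sequence_batch,  # hypothetical net/fetcher
    batch_size=32,
    dont_rebatch=True,  # pass fetcher output through as-is, e.g. smaller
                        # batches for longer sequences
)
coordinator.start()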
For some reason the graceful-shutdown test interfered with other tests, so I removed it.
Reviewed By: jay-mahadeokar
Differential Revision: D4988549
fbshipit-source-id: cbab46d77c948f2e293e79e6eb538dde17d800ee
Summary: As in the title + added scuba logging of the results.
Reviewed By: andrewwdye
Differential Revision: D4974261
fbshipit-source-id: 3e05b97133be95ffe37c8bcafd8a5a6bf3e7da93
Summary:
Free scratch blobs when the data workers exit. Also add a utility function that you can use to reset gradient blobs easily:
from caffe2.python import utils, workspace
grad_blobs = [b for b in workspace.Blobs() if b.endswith("_grad") or b.endswith("_shared")]
utils.ResetBlobs(grad_blobs)
Reviewed By: rpenggithub
Differential Revision: D4955531
fbshipit-source-id: d33b2bb2b5247dd2c4cff51c82b1257c871a4179
Summary: Now you can call coordinator.stop_coordinator("train") to stop the train model's data input and release its memory.
Reviewed By: rpenggithub
Differential Revision: D4955014
fbshipit-source-id: c1bc3ec67337b94aff8ea9b306c3b4158eeef42c
Summary: See http://bugs.python.org/issue6721. The everstore loaders use ProcessPoolExecutor, which is based on forks, and after what was perhaps an update of the numpy library or some unrelated library, subprocesses started getting stuck in np.random.randint(). Also changed logging to prints, since logging is known to have issues with multiprocessing. See https://www.prod.facebook.com/groups/fbpython/permalink/1438647216176641/
Differential Revision: D4633725
fbshipit-source-id: ae948a1827c71a3a2119d6a3248706728984df31
Summary: It is better for the workers to share the Python-side queue, since I saw a case where the workers assigned to one GPU were lagging behind the others. Also reduced logging, as requested by rpenggithub.
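A self-contained sketch of the shared-queue pattern described above; all names and sizes are illustrative, not the actual data_workers internals:

import queue
import threading

shared_queue = queue.Queue(maxsize=8)  # one Python-side queue shared by all fetchers

def fetcher_loop(worker_id):
    # Every fetcher feeds the same queue, so no single GPU's input pipeline
    # lags just because its dedicated workers happen to be slow.
    for i in range(4):
        shared_queue.put("batch-%d-%d" % (worker_id, i))

def enqueuer_loop(gpu_id):
    while True:
        try:
            batch = shared_queue.get(timeout=1.0)
        except queue.Empty:
            return
        print("GPU %d consumed %s" % (gpu_id, batch))

threads = [threading.Thread(target=fetcher_loop, args=(w,)) for w in range(3)]
threads += [threading.Thread(target=enqueuer_loop, args=(g,)) for g in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()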
Differential Revision: D4620487
fbshipit-source-id: 73353f9570b07788c8cd71c9fec9308cd93a44dd
Summary:
Mysterious deadlocks after an epoch has finished have occurred randomly but quite frequently recently for myself, vigneshr and others. Looking at a stack trace of vigneshr's job (P57129798), I noticed a couple of threads were calling BlobsQueue.blockingWrite (or something like that). That call blocks when the Caffe2/C++-side queue is at capacity (we use a capacity of 4 with data workers). So when such a call was in flight just as the script was terminating, the thread did not close, and the whole process did not close either (I am not completely sure why, since the thread is a daemon thread, but this might be a flow-related issue since we run inside a flow container).
This is quite easy to fix: just call CloseBlobsQueue() when terminating the process. I modified coordinator.stop() and wait_for_finish() to return a status code based on whether the joined threads actually closed within the 1.0 sec timeout. This allowed creating a unit test for this issue. Before my change, the unit test failed.
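A minimal sketch of the shutdown call described above; the helper and the queue blob name are hypothetical, data_workers does this internally:

from caffe2.python import core, workspace

def close_blobs_queue(queue_blob_name):
    # Closing the queue unblocks any pending blockingWrite/blockingRead,
    # so the writer threads can exit instead of hanging at capacity.
    workspace.RunOperatorOnce(
        core.CreateOperator("CloseBlobsQueue", [queue_blob_name], [])
    )

# close_blobs_queue("scratch_queue_train")  # hypothetical queue blob name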
Reviewed By: pietern
Differential Revision: D4619638
fbshipit-source-id: d96314ca783977517274fc7aadf8db4ee5636bdf
Summary:
This partly fixes a recurring problem when using everstore data input (or any other data input with multiprocessing): if the main process dies violently, the child processes are not killed. One cause was the TimeoutGuard(), which called os._exit(1) and thereby prevented any cleanup from happening. I changed it to send a SIGINT signal to the PID and, if the process is still alive after 10 seconds, to call os._exit(1). In my tests, this works well.
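A minimal sketch of the SIGINT-then-_exit pattern described above (the helper name and grace period are illustrative):

import os
import signal
import threading
import time

def request_shutdown(grace_period=10.0):
    # Ask the process to shut down cleanly first, so atexit handlers and
    # multiprocessing cleanup get a chance to run.
    os.kill(os.getpid(), signal.SIGINT)

    def hard_exit():
        time.sleep(grace_period)
        # Still alive after the grace period: force-terminate without cleanup.
        os._exit(1)

    watchdog = threading.Thread(target=hard_exit)
    watchdog.daemon = True
    watchdog.start()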
Did some other cleanup:
- improved logging of inputs/sec in data_workers
- removed redundant atexit() handling as the multiprocessing pool does it itself
Differential Revision: D4602550
fbshipit-source-id: 64d4526a2a3625d163d23f078286e719d56998f4
Summary: Every time data is put into the logger, it checks whether a second has passed. If so, it displays how many inputs were put in during the last second.
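A minimal sketch of this kind of once-per-second throughput logging (the class and method names are made up):

import time

class InputsPerSecondLogger(object):
    def __init__(self):
        self._last_report = time.time()
        self._count = 0

    def log(self, num_inputs):
        self._count += num_inputs
        now = time.time()
        if now - self._last_report >= 1.0:
            print("inputs/sec: %.1f" % (self._count / (now - self._last_report)))
            self._last_report = now
            self._count = 0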
Differential Revision: D4527148
fbshipit-source-id: f197eb975ed81111449705e0719d1e56f385fd8d
Summary: One trainer passed (10,) as the max_buffer_size parameter, causing the internal queue to grow without bound because qsize == (10,) was never true. This adds an assertion on the parameter's type.
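A minimal sketch of the kind of type check added (the exact message and location are assumptions):

def check_max_buffer_size(max_buffer_size):
    # Guards against e.g. passing (10,) instead of 10, which makes the
    # buffer-size comparison never trigger and lets the queue grow unbounded.
    assert isinstance(max_buffer_size, int), \
        "max_buffer_size must be an int, got {!r}".format(max_buffer_size)

check_max_buffer_size(10)       # passes
# check_max_buffer_size((10,))  # would raise AssertionError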
Reviewed By: prigoyal
Differential Revision: D4527649
fbshipit-source-id: 492a824700b8fc69c484b80773b1f1f5aee39071
Summary:
Running RunNet() from Python in a loop can be a performance issue if the Python code is doing a lot of other processing, such as data input, because Python's Global Interpreter Lock (GIL) prevents RunNet() from being called promptly. This is easily fixed by making RunNet() run multiple iterations on the C++ side. (Another way to accomplish the same thing is to use Caffe2's "execution plans", but that requires more setup.)
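A minimal sketch of the pattern, assuming the num_iter argument this diff adds; the net name is a placeholder for a previously created net:

from caffe2.python import workspace

# Running 10 iterations per Python call keeps the training loop on the C++
# side, so data-input threads holding the GIL cannot stall it between steps.
for _ in range(100):
    workspace.RunNet("train_net", num_iter=10)  # "train_net" is a placeholder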
+ fixed timing reporting in my OC workflow
+ improved one error log in data_workers.py
Sorry for piggybacking those small changes, but landing diffs is currently slow...
Reviewed By: rpenggithub
Differential Revision: D4523575
fbshipit-source-id: 039a647576efad5dd9afda74df478ac22b43c103
Summary:
I recently encountered out-of-memory errors in my OC workflow. This was because the internal queue for buffering image patches was too large. Total memory use was:
image size = 227 x 227 x 3 x 4 bytes
total mem = image size x queue size (500) x num GPUs x everstore-worker batch (128) > 300 GB
Reducing the batch size to 100 should fix this. It can also now be specified as a parameter.
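A back-of-the-envelope check of the numbers above; the GPU count is not stated in this message, so 8 is an assumed example value:

image_bytes = 227 * 227 * 3 * 4        # float32 image, ~0.6 MB
queue_size = 500
num_gpus = 8                           # assumed example value
worker_batch = 128
total_bytes = image_bytes * queue_size * num_gpus * worker_batch
print(total_bytes / 1e9)               # ~316 GB, consistent with the >300 GB estimate above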
Reviewed By: rpenggithub
Differential Revision: D4519956
fbshipit-source-id: 781697e620431ce7053534e683047bb6e7257b22
Summary:
A couple more misc changes:
- allow starting the coordinator multiple times -- this makes data-parallel programming easier
- make the fetcher id a global sequence; before, each GPU had the same ids for its workers
- my flow jobs got stuck when joining the fetcher threads. I think there is actually a memory-fencing problem with the is_active boolean, but I am too tired to add proper condition variables there. Instead, just add a timeout to join(); it is needed anyway since some I/O thread could get blocked (see the sketch below).
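A minimal sketch of the join-with-timeout shutdown from the last item; the thread setup and timeout value are illustrative:

import threading
import time

def fetcher_loop(stop_event):
    while not stop_event.is_set():
        time.sleep(0.1)  # stand-in for fetching and enqueueing work

stop_event = threading.Event()
fetchers = [threading.Thread(target=fetcher_loop, args=(stop_event,)) for _ in range(2)]
for t in fetchers:
    t.daemon = True
    t.start()

stop_event.set()
for t in fetchers:
    t.join(timeout=1.0)  # don't hang forever on a blocked I/O thread
    if t.is_alive():
        print("fetcher did not stop within the timeout; continuing shutdown")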
Differential Revision: D4333381
fbshipit-source-id: 88226c8a9c9a5e05d771360a502a2ba21a6b9d76
Summary:
As requested by Yangqing, added an Inception model (copied from convnet_benchmarks) and a dummy data feed option to the xray trainer, which we use for scalability benchmarking.
+ a couple of minor changes to the data input framework
Reviewed By: Yangqing
Differential Revision: D4327024
fbshipit-source-id: 86911468456fc13a32d5f437a43347380ec66a68
Summary:
We often use the same net for training and testing, but we must distinguish their data. Yesterday's diff forgot to include that distinction (it was in the xray sampler before), and this diff adds it. Basically, one provides a name for the input source to data_workers, and all the queues and scratch spaces are suffixed with it to keep them separate.
Also set the Caffe2 queue's size to 4, which is empirically found to be sufficient. It was erroneously defined as a function of batch size, which does not make sense since each *element* in the queue is a batch, and this led to out-of-memory issues in the xray trainer.
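A minimal sketch of the naming scheme described above; the nets, fetch functions and keyword names are assumptions and may differ from the actual data_workers API:

from caffe2.python import data_workers

# Two input pipelines sharing a workspace: queues and scratch blobs are
# suffixed with the input source name ("train" vs "test") to keep them apart.
train_coordinator = data_workers.init_data_input_workers(
    train_net, ["data", "label"], fetch_train_batch,   # hypothetical net/fetcher
    batch_size=32, input_source_name="train")
test_coordinator = data_workers.init_data_input_workers(
    test_net, ["data", "label"], fetch_test_batch,     # hypothetical net/fetcher
    batch_size=32, input_source_name="test")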
Differential Revision: D4329449
fbshipit-source-id: c994da1c8b0935b8eda2402c118d49b76caa7da8
Summary:
The xray sampler (originally by ajtulloch) and prigoyal's resnet trainer use variants of threaded data input in which worker threads put data into a Python queue, an enqueuer thread drains it and dumps the batches into a Caffe2 queue, and the net's DequeueBlobs operator drains that in turn.
There is a lot of boilerplate, which is also quite complicated.
This diff is an attempt to generalize that common machinery into a new module, "data_workers" (the name could be improved). Basically, you pass it a function that is able to return chunks of data (usually data + labels); see the sketch below.
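A minimal sketch of the kind of fetcher function meant here; the exact signature data_workers expects is an assumption:

import numpy as np

def dummy_fetcher(worker_id, batch_size):
    # Return one chunk of data + labels; data_workers batches and enqueues
    # these for the net's DequeueBlobs operator to consume.
    data = np.random.rand(batch_size, 3, 227, 227).astype(np.float32)
    labels = np.random.randint(0, 1000, size=batch_size).astype(np.int32)
    return [data, labels]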
I also created a module 'everstore_data_input', which generalizes everstore-origin data input with a preprocessing function (image augmentation, for example). See how I refactored sampler.py for the usage.
Next we could create a fetcher function for Laser data.
Differential Revision: D4297667
fbshipit-source-id: 8d8a863b177784ae13940730a27dc76cd1dd3dac