Commit Graph

38 Commits

Author SHA1 Message Date
Dmitrii Marin
8bd80a6b74 Fixed log message (#10874)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10874

Fixes the log message "WARNING:data_workers:Warning, data loading lagging behind: name=0", where the size of the queue was reported instead of the source name.
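
A minimal sketch of the corrected call (variable names assumed, not taken verbatim from data_workers.py):

```
import logging

log = logging.getLogger("data_workers")

def warn_lagging(name, qsize):
    # Before the fix the format argument was qsize, which produced "name=0";
    # the warning now reports the input source's name instead.
    log.warning("Warning, data loading lagging behind: name={}".format(name))
```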

Reviewed By: panshen1, Novitial

Differential Revision: D9506606

fbshipit-source-id: 03717cfa9b991afb335ef877378afa3b52fd8f22
2018-09-05 09:55:52 -07:00
bddppq
f94ae3ba1d
Update from facebook (#7696)
* Fix handling of empty batches in SumReduceDimsOp

As titled

* Deferrable async_scheduling finishRun fix

Proper order of finishing run operations in deferrable_async_scheduling net

* Simplify exception handling in async_scheduling

Simplify exception handling: there is no need to busy wait; the thread that processes the
last task can finish the run.

* [C2]worker_coordinator_memorize_worker_ids

As titled. This is related to T28689868, where the number of blobs we want to create is equal to the number of worker ids

* Add unit test for nets with no type set

* Ignore total length argument in symbolic_pad_packed_sequence

1) There was a mistake in the code: total_length was added to the wrong symbolic function (pack_padded_sequence) instead of pad_packed_sequence.
2) There is no need to throw an exception if total_length is given, since it is only used to enable data_parallel training on multiple GPUs and has nothing to do with ONNX export, so just ignore it. https://fburl.com/tk4gciqp
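
An illustrative sketch only (argument names are assumed, not the actual torch.onnx symbolic signature), showing the change from raising to silently ignoring:

```
def pad_packed_sequence_symbolic(g, data, batch_sizes, batch_first,
                                 padding_value, total_length=None):
    # total_length only matters for data_parallel training on multiple GPUs,
    # so it is accepted and intentionally unused during ONNX export.
    pass
```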

* Add support for MKLDNN to async_scheduling

Just add MKLDNN as a possible CPU option to async_scheduling's pool function

* [AuFL][ensemble] support branch output for prediction

This diff supports using predictions from different branches and thus enables model ensembling (not fully independent).

* Fix a bug in add_loss in layer_model_helper

As titled.

* Support lradaption for adam

1) Add an lr adaption operator.
2) Apply it to dense Adam.

* Perf tweaks for async_scheduling

Restore single pool option + remove unnecessary (no-ops) calls

* add quantization to SparseSimdAdagradOp

add a bunch of quantization signatures to SparseSimdAdagradOp, implementations to come next

* [sr] [codemod] Change all SR callsites to use new API

@allow-large-files

This diff refactors all callsites of SR to use the slightly changed API introduced in the diff below. Really what this means is that you need to include the correct header. Also if you were using `ClientFactory::newFactory` you need to not prefix it with `ClientFactory::`.

```
cd ~/fbsource/fbcode
find ./ -type f -exec sed -i -e 's:#include "servicerouter/client/cpp2/ClientFactory.h":#include "servicerouter/client/cpp2/ServiceRouter.h":' -e 's:#include <servicerouter/client/cpp2/ClientFactory.h>:#include <servicerouter/client/cpp2/ServiceRouter.h>:' -e 's/ClientFactory::newFactory(/newFactory(/g' {} \;
```

Also manually fixed spots that couldn't be done automatically (or broke because they depended on transitive includes).

* Back out "Fix handling of empty batches in SumReduceDimsOp"

Original commit changeset: 282da1730cc2. This commit is blocking the
GitHub->fbcode sync, which really needs to get merged ASAP. D7881937, which this
diff depends on, will be reverted in the sync D7990948, which causes this to
break. The sync diff cannot be patched with this reversion because it must be
landed against base revision 5c8c099, and D7881937 must not be included in the
sync diff because it is breaking GPU tests that are not available in sandcastle:
https://ci.pytorch.org/jenkins/job/caffe2-builds/job/py2-cuda8.0-cudnn6-ubuntu16.04-test/3638/console
for one example.

* Add the flow to support operator benchmark

1) generate a model with the operator, 2) upload it to everstore, 3) generate the model spec into a JSON file, 4) start running the benchmark

* [tum][gpu] Connect DPM trainer with flow and unit tests

This diff:
- Fix some small bugs in Yiming's recent changes to the parallelizer, so it suits real use cases.
- Add correct tags to the TUM code, so we can do the data parallel transform.
- Pass extra info at instantiation.
- Add a unit test for using DPM in the TUM model.

After this diff, we can run a simple box, multi-GPU, fully-sync trainer for TUM in an FBLearner workflow, but we may still need to do speed benchmarking.

* w/o normalized lradaption for adam dense only

The previous lr adaption included a normalization step when performing the dot product operation. This is not exactly the same as what is proposed in the paper, so I add normalization as an option: without it, the operator does exactly what the paper proposes; with the option, we add the normalization step.

* [fb] Use SharedPromise in DeferrableAsyncSchedulingNet

This simplifies DeferrableAsyncSchedulingNet by removing a condition
variable, plus small fixes.

* [tum] implement cuda sparseLengthsMean and LengthsMean

as title

* Adding an optional parameter to allow use of protobufs in InferShapesAndTypes function.

Adding an optional parameter to allow use of protobufs in InferShapesAndTypes function.

* Move feature_to_index to FeatureSpec.feature_to_index

Move feature_to_index to FeatureSpec.feature_to_index to avoid overriding other fields.

* [Caffe2] Rename bytes_moved to bytes_written

Just a rename in preparation for supporting bytes_read.

* [c2] fix ReduceFrontSumOp for empty case by setting 0

Otherwise, it may use the results from the last iteration when the batch is empty.

* [Caffe2] [Int8] Improve Intel CPU performance

* [Easy] Improve PrependDim op logging

as titled

* DBFileReader expand db_path using os.path.expanduser(..)

Since there are a lot of use cases where `DBFileReader` reads from a path under the user's home directory, like `~/local/sample.db`, this saves people the trouble of calling `os.path.expanduser(db_path)` themselves.
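
A minimal sketch of the expansion this enables (standard library only; the surrounding DBFileReader wiring is omitted):

```
import os

# Home-relative paths such as "~/local/sample.db" are expanded before use,
# so callers no longer need to call os.path.expanduser themselves.
db_path = "~/local/sample.db"
resolved_path = os.path.expanduser(db_path)  # e.g. "/home/alice/local/sample.db"
print(resolved_path)
```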

* [Caffe2] Add bytes_read to cost structure

We're adding analytical read bytes to cost functions. This extends the structure accordingly for all operators with CostInference defined.
Additionally, some small bug fixes were performed:
1) Cost functions now extract type information of operands instead of assuming float

* Fix sleef on aarch64 for hhvm

@bypass-lint

Rename flag

* Remove duplicated part in caffe2/ideep/operators/conv_op.cc

This is likely a sync error.

* Rename test helper function test_adagrad_sparse_helper to adagrad_sparse_test_helper to avoid confusing pytest
2018-05-19 23:10:48 -07:00
Orion Reblitz-Richardson
1d5780d42c Remove Apache headers from source.
* LICENSE file contains details, so removing from individual source files.
2018-03-27 13:10:18 -07:00
Yongqiang Wang
0e99334efb move print to logger
Summary: further clean up data_workers' messy output

Reviewed By: asaadaldien

Differential Revision: D6217857

fbshipit-source-id: 51cee29a687501d0f965422586fd6cb66a2d516a
2017-11-17 18:03:44 -08:00
Yongqiang Wang
db25f8602f Remove order by clause if it is not needed. Increasing timeout from 10mins to
Reviewed By: asaadaldien

Differential Revision: D6167599

fbshipit-source-id: 3e6bdd55d0aa5b497cc1871f237074b3b9ef6f29
2017-10-31 14:51:39 -07:00
Dmytro Dzhulgakov
2972a6ca02 Revert D6026557: [caffe2][PR] Fix "No handlers could be found for logger"
Summary:
This reverts commit 95c634872ac02be721257169e38c8fead04cd66b

bypass-lint

Differential Revision: D6026557

fbshipit-source-id: 663c28583ce3b01070ff5449115ed7e222f71776
2017-10-12 20:21:52 -07:00
Luke Yeager
75bece6ede Fix "No handlers could be found for logger"
Summary: Closes https://github.com/caffe2/caffe2/pull/1316

Differential Revision: D6026557

Pulled By: Yangqing

fbshipit-source-id: 95c634872ac02be721257169e38c8fead04cd66b
2017-10-10 22:32:13 -07:00
Yangqing Jia
8286ce1e3a Re-license to Apache
Summary: Closes https://github.com/caffe2/caffe2/pull/1260

Differential Revision: D5906739

Pulled By: Yangqing

fbshipit-source-id: e482ba9ba60b5337d9165f28f7ec68d4518a0902
2017-09-28 16:22:00 -07:00
Kevin Wilfong
d072701547 Caffe2: Refactor the core logic from data_workers.py into parallel_workers.py
Summary:
data_workers.py provides a really nice, easy way to run background threads for data input. Unfortunately, it's restrictive: the output of the fetcher function has to be a numpy array.

I pulled that nice core thread management out into parallel_workers and updated the data_workers classes to extend those classes. The main change was refactoring most of the queue handling logic out into QueueManager.

This way parallel_workers can be used to manage background threads without having to use the queue for output.
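
A hypothetical usage sketch of the split described above; the init_workers entry point and the worker function's (worker_id) argument are assumed to mirror the data_workers coordinator, not quoted from parallel_workers.py:

```
from caffe2.python import parallel_workers

# Worker body: arbitrary background work, with no numpy-array/queue contract.
def do_work(worker_id):
    pass

coordinator = parallel_workers.init_workers(do_work)
coordinator.start()
# ... later ...
coordinator.stop()
```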

Reviewed By: akyrola

Differential Revision: D5538626

fbshipit-source-id: f382cc43f800ff90840582a378dc9b86ac05b613
2017-08-07 10:14:08 -07:00
Aapo Kyrola
84b9d267dc add warnings about slow data input
Summary: One of my workflows was stuck because everstore/hive data input was experiencing networking issues (No route to host, etc.), but it was hard to know this was happening because the errors were logged to stdout. Anyway, added simple logging to warn if the data workers' enqueue thread has not received new data for over 10 seconds.
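
A hypothetical sketch of that check (names assumed): record when data last arrived and warn once the gap exceeds 10 seconds.

```
import time

LAG_WARNING_SECS = 10.0

class LagMonitor(object):
    def __init__(self):
        self._last_data_time = time.time()

    def on_data(self):
        # Called whenever the enqueue thread receives a new batch.
        self._last_data_time = time.time()

    def maybe_warn(self, log, name):
        # Called periodically; warns if no data has arrived for 10+ seconds.
        if time.time() - self._last_data_time > LAG_WARNING_SECS:
            log.warning("Data loading lagging behind: name={}".format(name))
```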

Reviewed By: panshen1

Differential Revision: D5522816

fbshipit-source-id: a036c4afdfbbafea130a4251c1ca02c138d19a83
2017-07-28 18:21:42 -07:00
Aapo Kyrola
f44991b398 add timeout argument to DequeueBlobs; use 10 min timeout for data workers
Summary: As title. This helps with (quite common) cases where data input is stuck for one reason or another, and the net execution never proceeds and is stuck forever.

Reviewed By: andrewwdye

Differential Revision: D5409885

fbshipit-source-id: 840261fd5964408f788fc0f50ece0d74193694ac
2017-07-13 18:52:03 -07:00
Thomas Dudziak
5355634dac Dict fixes/improvements and unittest targets for Python 3 in caffe2 core
Summary: As title

Reviewed By: salexspb

Differential Revision: D5316104

fbshipit-source-id: aee43819d817842e5ce6ba3d045a55b1a2491c30
2017-06-29 17:05:41 -07:00
Aapo Kyrola
87275817a4 fix a rare race condition by initializing scratch blobs beforehand
Summary: The data workers test times out randomly (very seldom), and it looks like the reason is that we call FeedBlob in a thread (the enqueue thread); the first time it is called, it calls workspace.CreateBlob(), which is not thread safe. Fix this by initializing the scratch blobs explicitly.

Reviewed By: panshen1

Differential Revision: D5292426

fbshipit-source-id: d7dad68f3ccc636c60bd82b2527f00f20da298b5
2017-06-26 10:18:18 -07:00
Thomas Dudziak
60c78d6160 Fixes range/xrange for Python 3
Summary: As title

Differential Revision: D5151894

fbshipit-source-id: 7badce5d3122e8f2526a7170fbdcf0d0b66e2638
2017-06-07 00:04:26 -07:00
Aapo Kyrola
acb2ad12e5 fix race condition at terminate
Summary:
Looking at one segfault at exit (https://our.intern.facebook.com/intern/chronos/jobinstance/?jobinstanceid=911625597&smc=chronos_gp_admin_client&log_type=stderr&offset=0&pretty_logs=false) and its coredump, the only thing I can see is that a FreeBlob() operator is called concurrently while a cudaMemcpyAsync (on thread 1) is crashing. FreeBlobOp is only called in data_workers' _stop() (via utils.ResetBlobs()), and the only code that could run a cudaMemcpyAsync at that time is the fetcher thread of data_workers that is enqueuing blobs.

Here are the stacks: P57455299

This is clearly a bug since we should only clear the scratch blobs after all threads are terminated, which happens at wait_for_finish().

I am not 100% sure this fixes all the segfaults, but at least this one was most likely caused by this.

Reviewed By: andrewwdye

Differential Revision: D5146278

fbshipit-source-id: ae00796706bfc4fee6823caf6529b62ab20c1cd3
2017-05-30 13:47:10 -07:00
Thomas Dudziak
ec19b4bd7b Import fixes for Python 3
Summary: As title

Differential Revision: D5135990

fbshipit-source-id: 88cb15bb2fb97dd21faf3ea5ddb8d4dbff7fad93
2017-05-26 16:31:50 -07:00
Zhicheng Yan
2002018603 memory_leak_data_worker
Summary: A memory leak happens because new BlobReferences are constantly added to the set _scratch_blobs.

Reviewed By: panshen1

Differential Revision: D5134945

fbshipit-source-id: 3ce4d482153bb89de065f20cd91411178085caad
2017-05-25 19:22:03 -07:00
Aapo Kyrola
74e964ff0d make data_workers restartable
Summary: Add the ability to restart data workers' data input.

Reviewed By: andrewwdye

Differential Revision: D5108666

fbshipit-source-id: f7f71cd6d4d45d007067814a552fc93cbe3eca42
2017-05-23 01:18:44 -07:00
Aapo Kyrola
1a831ce8f2 Add direct enqueuing to enable RNN input, allow specify batch columns
Summary:
Add a parameter dont_rebatch to data_workers. This disables re-batching of the fetcher's input into equal-sized chunks, which is not desired with RNNs, where with longer sequence lengths we might want smaller batches, etc.

For some reason the graceful-shutdown test interfered with other tests, so I removed it.

Reviewed By: jay-mahadeokar

Differential Revision: D4988549

fbshipit-source-id: cbab46d77c948f2e293e79e6eb538dde17d800ee
2017-05-03 14:49:44 -07:00
Aapo Kyrola
2c59f017e6 Port Xray OC workflow to elastic_data_parallel_model
Summary: As in the title + added scuba logging of the results.

Reviewed By: andrewwdye

Differential Revision: D4974261

fbshipit-source-id: 3e05b97133be95ffe37c8bcafd8a5a6bf3e7da93
2017-05-01 00:32:47 -07:00
Aapo Kyrola
6a1ef687f6 Free scratch blobs when data workers exits, add utility function to reset blobs
Summary:
Free scratch blobs when data workers exit. Also add a utility function that you can use to reset gradient blobs easily:

    from caffe2.python import utils
    grad_blobs = [b for b in workspace.Blobs() if b.endswith("_grad") or b.endswith("_shared")]
    utils.ResetBlobs(grad_blobs)

Reviewed By: rpenggithub

Differential Revision: D4955531

fbshipit-source-id: d33b2bb2b5247dd2c4cff51c82b1257c871a4179
2017-04-26 13:40:13 -07:00
Aapo Kyrola
9215afef7d Allow stopping of specific data workers + specify c2 queue size
Summary: Now you can call coordinator.stop_coordinator("train") to stop the train model's data input and release its memory.

Reviewed By: rpenggithub

Differential Revision: D4955014

fbshipit-source-id: c1bc3ec67337b94aff8ea9b306c3b4158eeef42c
2017-04-26 11:18:40 -07:00
Aaron Markham
58f7f2b441 doxygen python block added
Summary: Closes https://github.com/caffe2/caffe2/pull/226

Differential Revision: D4793550

Pulled By: JoelMarcey

fbshipit-source-id: cc33e58186304fa8dcac2ee9115dcc271d785b1e
2017-03-29 06:46:16 -07:00
Jerry Pan
327d3cb2b5 Caffe2: add init method and metric logging to data loader
Summary: Caffe2: add init method and metric logging to data loader

Differential Revision: D4685665

fbshipit-source-id: c4e0a09ab6a90c26c329f731f261cba8af1d6bbd
2017-03-28 08:48:27 -07:00
Aapo Kyrola
aa3156c235 Remove use of logging module and np.random.randint() due to deadlocks with forks
Summary: See http://bugs.python.org/issue6721. Since everstore loaders use ProcessPoolExecutor, which is based on forks, and there was perhaps an update of the numpy library or some unrelated library, we started getting subprocesses stuck in np.random.randint(). Also changed logging to prints, since logging is known to have issues with multiprocessing. See https://www.prod.facebook.com/groups/fbpython/permalink/1438647216176641/

Differential Revision: D4633725

fbshipit-source-id: ae948a1827c71a3a2119d6a3248706728984df31
2017-03-01 03:32:56 -08:00
Aapo Kyrola
7b0126381c Share queue + reduce logging
Summary: It is better for the workers to share the Python-side queue, since I saw a case where the workers assigned to one GPU were lagging behind the others. Also reduced logging, as requested by rpenggithub.

Differential Revision: D4620487

fbshipit-source-id: 73353f9570b07788c8cd71c9fec9308cd93a44dd
2017-02-27 19:38:45 -08:00
Aapo Kyrola
449f8997ab close blobs queues when stopping + test
Summary:
Mysterious deadlocks after an epoch has finished have occurred randomly but quite frequently recently for myself, vigneshr and others. Looking at a stack trace of vigneshr's job (P57129798), I noticed a couple of threads were calling BlobsQueue.blockingWrite (or something like that). That call gets stuck when the caffe2/C++-side queue is at capacity (we use a capacity of 4 with data workers). So when this call was just being made while the script was about to terminate, the thread did not close and the whole process did not close either (I'm not completely sure why, since the thread is a daemon thread, but this might be a flow-related issue since we run inside a flow container).

This is quite easy to fix: just call CloseBlobsQueue() when terminating the process. I modified coordinator.stop() and wait_for_finish() to return a status code based on whether the joined threads actually closed within the 1.0-second timeout. This allowed creating a unit test for this issue. Before my change, the unit test failed.
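
A hypothetical sketch of that shutdown ordering (names assumed, not the actual data_workers code): close the Caffe2-side queue first so any blocked write can return, then join the worker threads with a timeout and report whether they all exited.

```
def stop_workers(close_queue_fn, worker_threads, timeout=1.0):
    close_queue_fn()  # e.g. run a net containing a CloseBlobsQueue op
    all_stopped = True
    for thread in worker_threads:
        thread.join(timeout)
        all_stopped = all_stopped and not thread.is_alive()
    return all_stopped  # status code the new unit test can assert on
```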

Reviewed By: pietern

Differential Revision: D4619638

fbshipit-source-id: d96314ca783977517274fc7aadf8db4ee5636bdf
2017-02-27 10:07:57 -08:00
Aapo Kyrola
0a060dae50 better killing after timeout, cleanup
Summary:
This at least partly fixes a recurrent problem when using everstore data input (or any other data input with multiprocessing): if the main process dies violently, the child processes are not killed. One cause was the TimeoutGuard(), as it called os._exit(1), which prevents any cleanup from happening. I changed it to send a SIGINT signal to the PID and, if the process is still alive after 10 seconds, call os._exit(1). In my tests, this works well.
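
A hypothetical sketch of that shutdown path (not the actual TimeoutGuard code): send SIGINT first so cleanup, including killing multiprocessing children, can run, then hard-exit only if the process survives the grace period.

```
import os
import signal
import time

def exit_with_grace(grace_secs=10):
    os.kill(os.getpid(), signal.SIGINT)
    time.sleep(grace_secs)
    os._exit(1)  # last resort; skips any remaining cleanup
```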

Did some other cleanup:
- improved logging of inputs/sec in data_workers
- removed redundant atexit() handling as the multiprocessing pool does it itself

Differential Revision: D4602550

fbshipit-source-id: 64d4526a2a3625d163d23f078286e719d56998f4
2017-02-23 13:16:19 -08:00
Luis Galeana
6b0545d764 Implemented logging of inputs per second
Summary: Every time data is put into the logger, it checks whether a second has passed; if so, it reports how many inputs were put in during the last second.
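
A hypothetical sketch of that per-second reporting (names assumed):

```
import time

class InputsPerSecondLogger(object):
    def __init__(self):
        self._count = 0
        self._window_start = time.time()

    def put(self, num_inputs):
        # Count the inputs and report the rate once at least a second has passed.
        self._count += num_inputs
        elapsed = time.time() - self._window_start
        if elapsed >= 1.0:
            print("inputs/sec: {:.0f}".format(self._count / elapsed))
            self._count = 0
            self._window_start = time.time()
```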

Differential Revision: D4527148

fbshipit-source-id: f197eb975ed81111449705e0719d1e56f385fd8d
2017-02-16 12:02:05 -08:00
Renbin Peng
7ca1c0e405 Add two data_loaders and refactor code
Summary:
(1) Add two dataloaders, everstore and squashfs
(2) Refactor code

Differential Revision: D4500365

fbshipit-source-id: f70fb40ca29cdbfb46da5f3f6322f2d953c01903
2017-02-11 02:13:36 -08:00
Aapo Kyrola
849fc7ba68 check that parameter is int
Summary: One trainer passed (10,) as the max_buffer_size parameter, causing the internal queue to grow out of bounds because qsize == (10,) was never true. This adds an assertion on the type of the parameter.
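
A hypothetical sketch of the guard (names assumed): a tuple such as (10,) makes the qsize comparison silently misbehave, so reject non-int values up front.

```
def check_max_buffer_size(max_buffer_size):
    assert isinstance(max_buffer_size, int), \
        "max_buffer_size must be an int, got {!r}".format(max_buffer_size)
```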

Reviewed By: prigoyal

Differential Revision: D4527649

fbshipit-source-id: 492a824700b8fc69c484b80773b1f1f5aee39071
2017-02-08 03:04:04 -08:00
Aapo Kyrola
6a03641cde Add num_iters to RunNet()
Summary:
Running RunNet() in Python in a loop can be a performance issue if the Python code is doing a lot of other processing, such as data input, because Python's Global Interpreter Lock (GIL) will prevent RunNet() from being called. This can easily be fixed by making RunNet() run multiple iterations in C++ land. (Another way to accomplish the same thing is to use Caffe2's "execution plans", but that requires more setup.)
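
A short usage sketch, assuming the caffe2.python.workspace API where RunNet accepts an iteration count:

```
from caffe2.python import workspace

# Run 100 iterations inside C++ with a single call instead of looping in
# Python and re-contending for the GIL on every iteration.
workspace.RunNet("train_net", num_iter=100)
```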

+ fixed timing reporting in my OC workflow
+ improved one error log in data_workers.py

Sorry for piggybacking these small changes, but landing diffs is currently slow...

Reviewed By: rpenggithub

Differential Revision: D4523575

fbshipit-source-id: 039a647576efad5dd9afda74df478ac22b43c103
2017-02-07 14:16:14 -08:00
Aapo Kyrola
50213705d4 Allow specifying max buffer size. Smaller initial size.
Summary:
I recently encountered out-of-memory errors in my OC workflow. This was because the internal queue for buffering image patches was too large. Total memory use was:
  image size = 227 x 227 x 3 x 4 bytes
  total mem = image size x queue size (500) x num GPUs x everstore-worker batch (128) > 300 GB

Reducing the buffer size to 100 should fix this; it can also now be specified as a parameter.
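
A quick back-of-the-envelope check of those numbers (the GPU count is assumed to be 8; it is not stated above):

```
image_bytes = 227 * 227 * 3 * 4   # one float32 image, ~0.6 MB
batch_bytes = image_bytes * 128   # everstore-worker batch
queue_bytes = batch_bytes * 500   # per-GPU queue of 500 batches
total_bytes = queue_bytes * 8     # assuming 8 GPUs
print(total_bytes / 1e9)          # ~317 GB, i.e. "> 300 gigs"
```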

Reviewed By: rpenggithub

Differential Revision: D4519956

fbshipit-source-id: 781697e620431ce7053534e683047bb6e7257b22
2017-02-06 22:01:56 -08:00
Aapo Kyrola
82f1a8e12d fix code doc for data_workers
Summary: Fix bug in doc as reported by rpenggithub

Reviewed By: rpenggithub

Differential Revision: D4356796

fbshipit-source-id: a35e54247d84ba29ef1b8e8cac0de8a3d30b489e
2016-12-21 09:29:43 -08:00
Aapo Kyrola
35fa9e9c5f a couple small reliability improvements
Summary:
A couple more misc changes:
- allow starting the coordinator multiple times -- this makes data parallel programming easier
- make the fetcher id a global sequence; before, each GPU had the same ids for its workers
- my flow jobs got stuck when joining the fetcher threads. I think there is actually a memory-fencing problem with the is_active boolean, but I am too tired to add proper condition variables there. Instead, just add a timeout to join(). It is needed anyway, since some I/O thread could get blocked.

Differential Revision: D4333381

fbshipit-source-id: 88226c8a9c9a5e05d771360a502a2ba21a6b9d76
2016-12-15 21:29:29 -08:00
Aapo Kyrola
2bf18f2b1d add inception and dummy input
Summary:
As requested by Yangqing, added the Inception model (copied from convnet_benchmarks) and a dummy data feed option to the xray trainer, which we use for scalability benchmarking.

+ a couple of mini-changes to the data input framework

Reviewed By: Yangqing

Differential Revision: D4327024

fbshipit-source-id: 86911468456fc13a32d5f437a43347380ec66a68
2016-12-15 13:40:22 -08:00
Aapo Kyrola
e80423f341 bug fix to distinguish train/test data
Summary:
We often use the same net for training and testing, but we must distinguish their data. Yesterday's diff forgot to include that distinction (it was in the xray sampler before), and this diff adds it. Basically, one provides a name for the input source to data_workers, and all the queues and scratch spaces are suffixed with it to keep them separate.

Also set the caffe2 queue's size to 4, which was empirically found to be sufficient. It was erroneously defined as a function of batch size, which does not make sense since each *element* in the queue is a batch, and led to out-of-memory issues on the xray trainer.

Differential Revision: D4329449

fbshipit-source-id: c994da1c8b0935b8eda2402c118d49b76caa7da8
2016-12-15 12:01:31 -08:00
Aapo Kyrola
0b52b3c79d Generalize threaded data input via queues + Everstore input
Summary:
The Xray sampler (originally by ajtulloch) and prigoyal's resnet trainer use variants of threaded data input where worker threads put data into a Python queue that is drained by an enqueuer thread, which dumps those batches into a Caffe2 queue, which is in turn drained by the net's DequeueBlobs operator.

There is a lot of boilerplate, which is also quite complicated.

This diff is an attempt to generalize that common machinery under a new module, "data_workers" (the name could be improved). Basically, you pass it a function that returns chunks of data (usually data + labels).

I also created a module 'everstore_data_input', which generalizes everstore-origin data input with a preprocessing function (image augmentation, for example). See how I refactored sampler.py for the usage.

Next we could create a fetcher function for Laser data.
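
A minimal usage sketch based on the description above; the exact init_data_input_workers signature and the fetcher's (worker_id, batch_size) arguments are assumed, not quoted from the module:

```
import numpy as np
from caffe2.python import core, data_workers

# Fetcher: returns one chunk of data + labels as numpy arrays.
def fetch_batch(worker_id, batch_size):
    data = np.random.rand(batch_size, 3, 227, 227).astype(np.float32)
    labels = np.random.randint(0, 1000, size=batch_size).astype(np.int32)
    return [data, labels]

net = core.Net("train_net")
coordinator = data_workers.init_data_input_workers(
    net, ["data", "label"], fetch_batch, batch_size=32)
coordinator.start()
# The net's DequeueBlobs operator now receives batches fed by the worker threads.
```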

Differential Revision: D4297667

fbshipit-source-id: 8d8a863b177784ae13940730a27dc76cd1dd3dac
2016-12-15 12:01:30 -08:00