Commit Graph

19 Commits

Author SHA1 Message Date
Huazhong Ning
47bd606f63 Better visualization for gpu training plan
Summary:
The current GPU training plan has many sub-steps with the same name (e.g., "train/epoch"), which messes up the plan visualization. This diff fixes that.

before: https://our.intern.facebook.com/intern/graphviz?paste=56899036
after: https://our.intern.facebook.com/intern/graphviz?paste=56899704
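
Not part of the original commit message; a minimal sketch of the underlying idea, assuming the fix boils down to disambiguating repeated sub-step names (e.g. by appending a running index) so each step becomes its own node when the plan is rendered as a graph. The helper name is hypothetical.

    from collections import defaultdict

    def disambiguate_step_names(step_names):
        """Append a running index to repeated names, e.g. 'train/epoch' ->
        'train/epoch_0', 'train/epoch_1', so each sub-step maps to its own
        node in the rendered plan instead of collapsing into one."""
        counts = defaultdict(int)
        unique_names = []
        for name in step_names:
            unique_names.append("%s_%d" % (name, counts[name]))
            counts[name] += 1
        return unique_names

    # Repeated 'train/epoch' sub-steps no longer collapse into a single node.
    print(disambiguate_step_names(["train/epoch", "train/epoch", "eval/epoch"]))
    # ['train/epoch_0', 'train/epoch_1', 'eval/epoch_0']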

Reviewed By: xianjiec

Differential Revision: D4343739

fbshipit-source-id: 8dbc01b4f3221999c78cb80a22ec8c11abf81172
2016-12-21 09:29:43 -08:00
Yury Zemlyanskiy
c2d28fb874 RNNs API simplification
Summary:
This is a first step in improving our RNN story. It provides a wrapper around the current RecurrentNetworkOp implementation which infers most of the redundant parameters and makes the API much simpler.

Also, in order to support general step nets, I added an extra argument to RecurrentNetworkOp.

Future work:

1. Infer the sizes and types of step net outputs and internal blobs (scratches)
2. Avoid accessing blobs by name in the C++ part
3. Remove the requirement of a 1:1 input/output correspondence in the step net
4. Make the Python API support networks with operators like Sum on the border of the cell net (currently such networks have an issue where gradient blobs on the border are not explicitly created).
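
Not from the diff itself; a rough sketch of what "inferring the redundant parameters" from a step net could look like, assuming recurrent links are recovered by matching step-net input and output blob names. All names and the naming convention here are hypothetical.

    def infer_recurrent_links(step_net_inputs, step_net_outputs):
        """Hypothetical helper: pair each step-net output with the input it
        feeds on the next timestep, so the caller no longer has to spell
        out the recurrence mapping by hand."""
        links = []
        for out in step_net_outputs:
            # Convention assumed here: 'hidden_t' recurs into 'hidden_t_prev'.
            prev = out + "_prev"
            if prev in step_net_inputs:
                links.append((out, prev))
        return links

    # Example with a single hidden state:
    print(infer_recurrent_links(["input_t", "hidden_t_prev"], ["hidden_t"]))
    # [('hidden_t', 'hidden_t_prev')]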

Differential Revision: D4268503

fbshipit-source-id: f8a66491c2b55daa730caeed7e9f2b3921541b49
2016-12-21 09:29:43 -08:00
Huazhong Ning
70dcba376c using BlobReference for Sum gradients.
Summary:
We create a Sum operator to sum up the gradients. Currently we use plain strings for its input/output blobs, so the code fails if AddAllGradients() runs within a NameScope. To avoid this, use BlobReference instead of string for the blobs.
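
A minimal sketch of the failure mode, assuming the usual caffe2.python scoping behavior: a plain string passed to an operator inside core.NameScope gets the current scope prefix prepended, while a BlobReference keeps the name it was created with. Blob names below are made up for illustration.

    from caffe2.python import core

    net = core.Net("example")
    # Gradient blobs created outside any scope, referenced by plain strings.
    g1, g2 = "fc_w_grad", "fc_w_grad_autosplit_0"

    with core.NameScope("tower_0"):
        # With raw strings, the Sum inputs are rewritten to
        # 'tower_0/fc_w_grad', which was never produced -> the net fails later.
        bad_sum = net.Sum([g1, g2], g1)

        # With BlobReference, the original (unscoped) names are preserved.
        r1, r2 = core.BlobReference(g1), core.BlobReference(g2)
        good_sum = net.Sum([r1, r2], r1)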

Reviewed By: xianjiec

Differential Revision: D4343701

fbshipit-source-id: 2d008916e192d75c6e20f97921331ac4c7b73363
2016-12-18 09:29:22 -08:00
Aapo Kyrola
d38499f727 Optimize BlobIsDefined() + benchmark --> net construction 95 secs to 8.2 secs!
Summary:
I have noticed that constructing the Xray model takes quite a while. To measure this, I wrote a benchmark script that creates a resnet-50 model on 8 gpus. This takes about 95 secs -- which is kind of annoying when you want to quickly debug stuff.

Profiling (using Python's cProfile), I was able to see that most of the time is spent in net.BlobIsDefined(), which does a linear search over external inputs and operator outputs, so it gets slower and slower as the net grows. This can be fully optimized by keeping a separate lookup table of operator inputs and outputs (and external inputs and outputs). It is a bit annoying to keep this separate data structure, but I set up the unit tests to ensure things are done correctly over Clones.

After the optimization, the net construction drops from 95 secs to 8.2 secs!
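
Not the actual diff; a minimal sketch of the optimization described above, assuming the lookup table is simply a set of blob names kept in sync as operators and external inputs are added, so the membership check stops being a linear scan.

    class NetSketch(object):
        """Toy stand-in for core.Net illustrating the lookup-table idea."""

        def __init__(self):
            self._ops = []
            self._external_inputs = []
            self._blob_names = set()   # kept in sync with ops/external inputs

        def add_external_input(self, name):
            self._external_inputs.append(name)
            self._blob_names.add(name)

        def add_op(self, op_type, inputs, outputs):
            self._ops.append((op_type, inputs, outputs))
            self._blob_names.update(outputs)

        def blob_is_defined(self, name):
            # O(1) set lookup instead of scanning every op's outputs and
            # every external input, which made net construction quadratic.
            return name in self._blob_names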

Reviewed By: azzolini

Differential Revision: D4288307

fbshipit-source-id: 0bb82c8bde9d86a2702b298f4aa706cba509346e
2016-12-15 12:01:30 -08:00
Dmytro Dzhulgakov
3125e6a821 Hacky fix for cloned model rewriting
Summary:
Disclaimer: this is really hacky

Continues a fix from D4218902. The root problem is that DPER builds the net incrementally and input_record doesn't support that properly. For now I just manipulate the input record directly. Alisson wants to fix it properly later by allowing set_input_record to accept a superset of the current record.

But it should unblock our experimentation.

I'm curious how it's going to look in dper_example world.

Reviewed By: azzolini

Differential Revision: D4255285

fbshipit-source-id: ff65b6f943d705a9b3399035597e2e8ded2e1ff3
2016-12-05 11:53:26 -08:00
Martin Raison
ea9a0f24bf automatic aggregation of sparse gradients
Summary:
This adds support for automatic aggregation of sparse gradients. We simply concatenate indices and values (no attempt to deduplicate, since this is already done before feeding into the optimizer). This should support various cases (indices and/or values can be generated by one or more gradient ops, or gradient outputs can be directly passed from inputs).

I tried to minimize the code footprint, but I introduced SparseGradGenMeta because GradGenMeta didn't lend itself very well to sparse gradients.
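
A minimal sketch (not the generated graph itself) of the aggregation rule described above, assuming each partial sparse gradient is an (indices, values) pair and duplicate indices are left for the optimizer to handle.

    import numpy as np

    def aggregate_sparse_gradients(partials):
        """Concatenate (indices, values) pairs from several gradient ops into
        one sparse gradient; no deduplication of repeated indices."""
        indices = np.concatenate([idx for idx, _ in partials])
        values = np.concatenate([val for _, val in partials])
        return indices, values

    # Two ops each contribute rows of the same embedding's gradient.
    g1 = (np.array([0, 3]), np.array([[1.0, 1.0], [2.0, 2.0]]))
    g2 = (np.array([3, 7]), np.array([[0.5, 0.5], [4.0, 4.0]]))
    print(aggregate_sparse_gradients([g1, g2]))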

Reviewed By: dzhulgakov

Differential Revision: D4219788

fbshipit-source-id: 1d074664cffd82a8764e4b1473ada6bc46e6c51a
2016-12-05 11:53:26 -08:00
Dmytro Dzhulgakov
119b687994 Allow PythonOp to access the workspace
Summary:
DPER has very strange Python ops that play with the Workspace - they are somewhat similar to LoadOp/SaveOp, so I guess the semantics are fine.

Thus it makes sense to allow Python operators to receive a workspace pointer, similarly to regular Operators.

I didn't figure out a better way to implement the optional argument than just checking the number of args the function receives on the Python side.
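
Not the real PythonOp plumbing; a sketch of the arity check mentioned above, assuming the caller counts the op function's declared parameters and passes the workspace only when a third parameter is present. Function and parameter names are hypothetical.

    import inspect

    def call_python_op(func, inputs, outputs, workspace):
        """Dispatch helper: pass the workspace only if the op function
        declares a slot for it, keeping older two-argument ops working."""
        n_args = len(inspect.signature(func).parameters)
        if n_args >= 3:
            return func(inputs, outputs, workspace)
        return func(inputs, outputs)

    def legacy_op(inputs, outputs):
        outputs[0] = inputs[0]           # never sees the workspace

    def workspace_op(inputs, outputs, ws):
        outputs[0] = inputs[0]           # can also touch blobs in ws directly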

Reviewed By: ajtulloch

Differential Revision: D4242943

fbshipit-source-id: d97d4227815b741c8f884cfe254b06d2b56b5a41
2016-12-05 11:53:26 -08:00
Martin Raison
da72658fa8 sparsehash-based implementation of UniqueOp
Summary:
Faster implementation of UniqueOp using google::dense_hash_map, as suggested by dzhulgakov. I haven't benchmarked it precisely but early measurements with my workflow show a significant speed bump (this operation went from using 20% of overall CPU time down to 7%).

I gated the implementation using the "engine" feature, to avoid adding sparsehash as a dependency to caffe2.
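
Not copied from the diff; a sketch of how the engine gating would typically be selected from Python, assuming the sparsehash-backed kernel is registered under an alternative engine name. The engine string below is a guess for illustration, not taken from the code.

    from caffe2.python import core

    # Default CPU implementation of Unique.
    op_default = core.CreateOperator(
        "Unique", ["indices"], ["unique_indices", "remapping"])

    # Same op, but requesting the dense_hash_map-backed kernel; the engine
    # name "SPARSEHASH" is an assumption for illustration only.
    op_fast = core.CreateOperator(
        "Unique", ["indices"], ["unique_indices", "remapping"],
        engine="SPARSEHASH")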

Reviewed By: dzhulgakov

Differential Revision: D4219768

fbshipit-source-id: 2f142981e772105b42fffa24afb199ef816f8e0c
2016-11-29 15:18:39 -08:00
Yangqing Jia
238ceab825 fbsync. TODO: check if build files need update. 2016-11-15 00:00:46 -08:00
Yangqing Jia
d1e9215184 fbsync 2016-10-07 13:08:53 -07:00
Yangqing Jia
0a09d09431 fbsync 2016-09-08 17:56:14 -07:00
Yangqing Jia
b23e51d467 chunky sync 2016-09-06 15:55:19 -07:00
Yangqing Jia
05512d1e10 sync 2016-08-10 11:02:15 -07:00
Yangqing Jia
c15e45c9bb chunky sync again 2016-08-01 20:58:46 -07:00
Yangqing Jia
bcea409c82 sync 2016-07-28 15:06:43 -07:00
Yangqing Jia
6463eebc7b chunky sync - build scripts to be written 2016-07-21 10:16:42 -07:00
Yangqing Jia
559053d3a8 chunky sync 2016-05-13 14:43:48 -07:00
Yangqing Jia
cf7ca23fc1 make caffe2.python build 2016-03-08 16:48:19 -08:00
Yangqing Jia
9ae880bb6f move pycaffe2 to caffe2.python 2016-03-08 15:45:30 -08:00