Summary:
I have noticed that constructing the Xray model takes quite a while. To measure this, I wrote a benchmark script that creates a ResNet-50 model on 8 GPUs. This takes about 95 secs -- which is kind of annoying when you want to quickly debug stuff.
Profiling with Python's cProfile showed that most of the time is spent in net.BlobIsDefined(), which does a linear search over external inputs and operator outputs, so it gets slower and slower as the net grows. This can be fully optimized by keeping a separate lookup table of operator inputs and outputs (and of external inputs and outputs). It is a bit annoying to maintain this separate data structure, but I set up unit tests to ensure things behave correctly across Clones.
After the optimization, the net construction drops from 95 secs to 8.2 secs!
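To sketch the idea (illustrative pseudocode of the approach, not the actual caffe2 core.Net implementation; all names are made up), BlobIsDefined() goes from scanning every op's outputs to a constant-time set lookup that is kept in sync whenever ops or external inputs are added:

    class ToyNet(object):
        def __init__(self):
            self._ops = []
            self._external_inputs = []
            # Lookup table mirroring external inputs and op outputs,
            # maintained so membership checks stay O(1).
            self._defined_blobs = set()

        def AddExternalInput(self, blob):
            self._external_inputs.append(blob)
            self._defined_blobs.add(blob)

        def AddOp(self, op_outputs):
            self._ops.append(op_outputs)
            self._defined_blobs.update(op_outputs)

        def BlobIsDefined(self, blob):
            # Old behavior: linear scan over external inputs and every
            # op's outputs -- slow for large nets.
            return blob in self._defined_blobs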
Reviewed By: azzolini
Differential Revision: D4288307
fbshipit-source-id: 0bb82c8bde9d86a2702b298f4aa706cba509346e
Summary:
Disclaimer: this is really hacky
Continues a fix from D4218902. The root problem is that DPER builds the net incrementally and input_record doesn't support that properly. For now I just manipulate the input record directly. Alisson wants to fix it properly later by allowing set_input_record to accept a superset of the current record.
But it should unblock our experimentation.
I'm curious how it's going to look in dper_example world.
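For illustration, a small sketch of what "a superset of the current record" means in schema terms (field names are made up, and this is not the DPER code -- the workaround in this diff patches the record in place instead):

    import numpy as np
    from caffe2.python import schema

    # Record the net was originally built with.
    current = schema.Struct(
        ('dense', schema.Scalar((np.float32, (10,)))),
    )
    # Fields that show up later as DPER keeps building the net.
    added = schema.Struct(
        ('sparse_ids', schema.Scalar(np.int64)),
    )
    # The proposed proper fix would let set_input_record accept a superset
    # like this; for now the input record is manipulated directly instead.
    superset = current + added
    assert set(current.field_names()) <= set(superset.field_names())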
Reviewed By: azzolini
Differential Revision: D4255285
fbshipit-source-id: ff65b6f943d705a9b3399035597e2e8ded2e1ff3
Summary:
This adds support for automatic aggregation of sparse gradients. We simply concatenate indices and values (no attempt to deduplicate, since this is already done before feeding into the optimizer). This should support various cases (indices and/or values can be generated by one or more gradient ops, or gradient outputs can be directly passed from inputs).
I tried to minimize the code footprint, but I introduced SparseGradGenMeta because GradGenMeta didn't lend itself well to being used with sparse gradients.
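A toy numpy sketch of what the concatenation amounts to (not the generated gradient graph): two sparse gradient contributions for the same parameter get their indices and values stacked, and duplicate indices are left in place because accumulation is handled before the optimizer consumes the gradient.

    import numpy as np

    # Two sparse gradient contributions for the same parameter:
    # row indices plus the matching value rows.
    idx_a, val_a = np.array([0, 3]), np.array([[1., 1.], [2., 2.]])
    idx_b, val_b = np.array([3, 7]), np.array([[5., 5.], [9., 9.]])

    # Aggregation is plain concatenation -- index 3 appears twice and is
    # not deduplicated here.
    agg_idx = np.concatenate([idx_a, idx_b])          # [0, 3, 3, 7]
    agg_val = np.concatenate([val_a, val_b], axis=0)  # shape (4, 2)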
Reviewed By: dzhulgakov
Differential Revision: D4219788
fbshipit-source-id: 1d074664cffd82a8764e4b1473ada6bc46e6c51a
Summary:
DPER has very strange python ops that play with the Workspace - they are somewhat similar to LoadOp/SaveOp, so I guess the semantics are fine.
Thus it makes sense to allow python operators to receive a workspace pointer, similarly to regular Operators.
I didn't figure out a better way to implement the optional argument than just checking the number of args the function receives on the Python side.
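Roughly, the check looks like this on the Python side (a sketch, not the actual binding code; the helper name is made up): the workspace is passed only when the user's function declares a third parameter.

    import inspect

    def call_python_op(func, inputs, outputs, workspace):
        # Optional argument via arg counting: only hand over the workspace
        # if the user's function has room for it.
        if len(inspect.signature(func).parameters) >= 3:
            return func(inputs, outputs, workspace)
        return func(inputs, outputs)

    def plain_op(inputs, outputs):       # old-style op, unchanged
        outputs[0] = inputs[0] * 2

    def ws_op(inputs, outputs, ws):      # op that wants the workspace,
        outputs[0] = inputs[0] * 2       # e.g. to load/save blobs via ws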
Reviewed By: ajtulloch
Differential Revision: D4242943
fbshipit-source-id: d97d4227815b741c8f884cfe254b06d2b56b5a41
Summary:
Faster implementation of UniqueOp using google::dense_hash_map, as suggested by dzhulgakov. I haven't benchmarked it precisely, but early measurements with my workflow show a significant speed bump (this operation went from using 20% of overall CPU time down to 7%).
I gated the implementation behind the "engine" feature to avoid adding sparsehash as a dependency of caffe2.
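From the Python side, opting into the alternative implementation is just a matter of passing an engine when the op is created (sketch below; the engine string "SPARSE_HASH" is a placeholder -- use whatever name the dense_hash_map variant is actually registered under):

    import numpy as np
    from caffe2.python import core, workspace

    workspace.FeedBlob(
        "ids", np.random.randint(0, 1000, size=100000).astype(np.int64))

    # Default UniqueOp.
    op_default = core.CreateOperator("Unique", ["ids"], ["unique_ids"])

    # Engine-gated variant backed by google::dense_hash_map; the engine
    # name here is a placeholder for the registered engine string.
    op_hashed = core.CreateOperator(
        "Unique", ["ids"], ["unique_ids"], engine="SPARSE_HASH")

    workspace.RunOperatorOnce(op_hashed)
    print(workspace.FetchBlob("unique_ids").shape)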
Reviewed By: dzhulgakov
Differential Revision: D4219768
fbshipit-source-id: 2f142981e772105b42fffa24afb199ef816f8e0c