Summary:
Fix an issue that amyzhang encountered. She was using ConstantFill to create a blob of the same size as another blob. This interrupted the gradient computation flow at the ConstantFill, since the gradient for the input blob was set to None (although it already had another gradient set). The correct solution is to avoid overwriting a gradient assignment with None if the blob already has a gradient, UNLESS that blob is an output of the same op, as with the StopGradient op. (Note that Amy's problem was fixed by instead using a fixed-shape ConstantFill and Add with broadcast=1, which is a better solution anyway.)
Not sure if I explained this well, but see the new unit tests. Before this change, the testAddAndDynamicConstant failed but the testAddAndStaticConstant succeeded.
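The rule described above can be sketched in plain Python. This is an illustration only, not the code in core.py: `record_gradients`, `grad_map`, `op_outputs` and `new_grads` are hypothetical names, and gradients are represented as simple strings.

```python
def record_gradients(grad_map, op_outputs, new_grads):
    """Merge one op's gradient assignments into grad_map (hypothetical helper).

    grad_map:   dict mapping blob name -> gradient blob name, or None
    op_outputs: set of blob names produced by the op being processed
    new_grads:  dict mapping blob name -> gradient blob name or None,
                as reported for the op's inputs
    """
    for blob, grad in new_grads.items():
        already_has_grad = grad_map.get(blob) is not None
        if grad is None and already_has_grad and blob not in op_outputs:
            # Another op already supplied a gradient for this blob:
            # do not overwrite it with None.
            continue
        # Either a real gradient, or a None for a blob that the op itself
        # outputs (the StopGradient-style case), so the assignment stands.
        grad_map[blob] = grad
    return grad_map
```

With this rule, an op that reports None for an input it does not itself produce leaves the earlier gradient in place, while an in-place op that outputs the same blob can still clear it.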
Reviewed By: dzhulgakov
Differential Revision: D4861176
fbshipit-source-id: 3b53621bfaba2e36786a5e4664145038995f6616
Summary:
Quite a large diff to make cuDNN LSTM and our LSTM produce the same results and to provide a Python API for the cuDNN LSTM.
* Added operators RecurrentParamGet and RecurrentParamSet to access weights and biases for the different gates, input/recurrent.
* Removed RecurrentInit as not needed
* recurrent.cudnn_LSTM() returns a special net and mapping that can be used to retrieve the parameters from the LSTM
* recurrent.cudnn_LSTM() can be passed blobs that have the parameters for the individual gate weights and biases
* recurrent.InitFromLSTMParams() can be used to initialize our own LSTM from cuDNN params. This way we can test whether cuDNN and our own LSTM produce the same result.
recurrent_test.py tests the equivalence.
Reviewed By: salexspb
Differential Revision: D4654988
fbshipit-source-id: 6c1547d873cadcf33e03b0e0110248f0a7ab8cb0
Summary:
First, this diff includes a full test of data-parallel LSTM, which confirms it works correctly. To make it work, some changes had to be made:
- cell net/step net external inputs must be namespace scoped
- prevent double-namescoping of cellnet inputs
- make data parallel model understand recurrentnets so the device-mapping works
Reviewed By: salexspb
Differential Revision: D4708840
fbshipit-source-id: 4b0ddc43642d449076a2b6f67ad1c47f84138ff4
Summary: When cloning a recurrent net op, we remap the lengths blobs. But if they don't exist (as with CRF), we should not do that.
Differential Revision: D4702123
fbshipit-source-id: 37a22d11e709011b8b98b2cc3d9f08eb9fda06c4
Summary:
This diff changes how we specify metrics: instead of a reporter that must know in advance all the blobs it should access, we now use a reporter that is connected through schema.
This diff also reports an arbitrary number of learning curves to Flow and provides a really flexible way to specify all the metrics we care about.
TODO: Modify model helper to allow providing intermediate results for reporting
TODO: Add evaluation net (instead of prediction net).
TODO: Move all other places in DPER 2.0 to use these abstractions instead.
TODO: Get rid of LogScoreEstimator in favor of a metric that really suits our needs.
Reviewed By: azzolini, dzhulgakov, kittipatv
Differential Revision: D4577548
fbshipit-source-id: 3515bd41e0f92263ff90ce2f7207abf65d01b1f7
Summary: so that the utils can be used by a wider audience.
Reviewed By: xianjiec
Differential Revision: D4637462
fbshipit-source-id: f0695f430902aef26360efa511069b3755eaf52a
Summary: Fix the check for whether the net is a net_dict.
Reviewed By: kennyhorror
Differential Revision: D4647493
fbshipit-source-id: e0a62fc5847c99c85857c5635b4e39d59c66d5ce
Summary: Add SparseNN workflow for feed. I haven't fully thought about the change needed for ads, as I added a property called 'preproc_output_schema' for LayerModelHelper.
Reviewed By: xianjiec
Differential Revision: D4585796
fbshipit-source-id: 060d08f4beb928e7e7863f2e563f612c358951fb
Summary:
For code in layer model helper and layers, it's intentional not to have a NameScope by default.
This looks like another place that may need a default NameScope.
https://fburl.com/wdwtxp0m
Reviewed By: kennyhorror
Differential Revision: D4606971
fbshipit-source-id: b560bf59d3242e3f9443cd5aeda5c7e2e4e89079
Summary:
Previously we had several limitations for a reporter net:
- needed to be a net, not an execution step
- only one allowed per execution step, with a single interval
Now, "reporter nets" become reporter steps, and multiple of them can be specified with different timeouts.
Reviewed By: dzhulgakov
Differential Revision: D4583686
fbshipit-source-id: ad7266e16f96e7829fd24dcc1f165f39e9db573d
Summary:
Remove the use of `NextName` in layer model helper, so that the same function returning a `model_helper` constructs an identical `Net` when run under the same NameScope.
The `NextScopedBlob` should only take effect when there is a real name conflict; otherwise it returns a ScopedBlobReference.
This is critical for parameter blobs. In the long run, we need to be able to specify parameter blobs more explicitly (kennyhorror is working on this). This solution works in the short term for, e.g., two-tower sparse NN models.
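A minimal sketch of the "rename only on a real conflict" behavior, in plain Python. The function name `next_scoped_name`, the `used_names` set and the `_auto_N` suffix scheme are assumptions made for illustration; they are not the layer model helper's actual implementation.

```python
def next_scoped_name(scope, name, used_names):
    """Return scope + name unchanged unless that exact name is already taken.

    scope:      current name scope prefix, e.g. "tower_a/" (hypothetical)
    name:       requested blob name
    used_names: set of names already handed out; updated in place
    """
    base = scope + name
    if base not in used_names:
        # No conflict: the caller gets the plain scoped name, so building
        # the same model twice under the same scope yields the same names.
        used_names.add(base)
        return base
    # Real conflict: fall back to a numbered suffix (assumed scheme).
    suffix = 1
    while "{}_auto_{}".format(base, suffix) in used_names:
        suffix += 1
    unique = "{}_auto_{}".format(base, suffix)
    used_names.add(unique)
    return unique
```

Because the name depends only on the scope and the requested name when there is no conflict, two towers built by the same function under different scopes get parallel, predictable parameter names.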
Reviewed By: kennyhorror
Differential Revision: D4555423
fbshipit-source-id: 2c4b99a61392e5d51aa878f7346466a8f14be187
Summary:
- NetBuilder now honors its name
- When Nets are created in the context of a NetBuilder, they take NetBuilder's name as prefix
- When a NetBuilder is created in the context of a Task, it takes the Task's name.
- pipe() now tries to find a good name based on its processor's, output or input queue's name.
- RPC tries to find a name from its handler's name.
- Better names in DataStream
- net_printer prints the name of Tasks and Steps
- net_printer optionally factors out common prefixes from blob names.
Differential Revision: D4527578
fbshipit-source-id: 5d3d1237c186e9576313c5aa01cc8800a9051217
Summary: This should not be needed any more since we use pybind. It will help python3 migration.
Reviewed By: salexspb
Differential Revision: D4535490
fbshipit-source-id: a47615f73b5c35b940d21bb2d5d55060fa0850be
Summary: See distributed.py for example of usage
Reviewed By: xianjiec
Differential Revision: D4467723
fbshipit-source-id: c74f71bebaa1751098379838d3da55945aac62bd
Summary:
Using multiple readers for model evaluation. Since it is built by new framework, only NativeLoader is supported.
With 5 readers, the evaluation speed is 124k. The speed for a single evaluator is 32k. There is still room for improvement since the evaluator machine is under-utilized.
(Hive is the bottleneck. Adding more loading threads helps to improve the speed to 240k. More readers can improve it further.)
Reviewed By: azzolini
Differential Revision: D4469393
fbshipit-source-id: b55af5f798faca4c150b2c0663fe5db0f154cb70
Summary:
It's a similar trick to dyndeps. The idea is that it is better to just replicate global state to gang workers, as otherwise it causes a lot of confusion.
In particular, it's useful if one wants to enable detailed logging (--v).
For other operators the user still needs to call GlobalInit explicitly. We should consider doing it for all Flow operators, but I'll leave it for future consideration.
Reviewed By: kennyhorror
Differential Revision: D4460686
fbshipit-source-id: 5836737dd3195f9ad12589fd899a3ff63f173e05
Summary:
Perf bug report: https://www.facebook.com/groups/1405155842844877/permalink/1617904561570003/
Diagnosis:
I've done some digging into this and here's what I've found:
(1) In this use case, the call is disallowed_op_ids = get_op_ids_in_path(ssa, blob_versions, [], inputs), where inputs = ['res4_22_sum'] is the last blob produced by the res4 stage of a ResNet101 model.
(2) get_op_ids_in_path has exponential running time in the number of blocks in the res4 stage of ResNet. This is based on empirical running times. This call should complete in 4.5 days on my devgpu.
(3) I haven't familiarized myself enough with the IR and SSA code in core.py to understand the algorithmic fix yet, but surely there's a more efficient algorithm to compute the same thing.
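For illustration, one common way to avoid this kind of blow-up is to walk backward from the requested blobs while remembering which blobs were already visited, so each op is expanded once. The sketch below is not the fix made in core.py and does not use its SSA or blob-version structures: `op_ids_in_path` is a hypothetical function, ops are plain `(inputs, outputs)` tuples, and blob names are assumed to be unique.

```python
def op_ids_in_path(ops, inputs, outputs):
    """Return sorted ids of ops needed to compute `outputs` from `inputs`.

    ops: list of (input_blobs, output_blobs) tuples; the list index is the
         op id. Blob names are assumed unique (no SSA versions).
    """
    # Map each blob to the op that produces it.
    producer = {}
    for op_id, (_, outs) in enumerate(ops):
        for blob in outs:
            producer[blob] = op_id

    stop = set(inputs)      # traversal does not go past these blobs
    seen_blobs = set()      # each blob is expanded at most once
    result = set()
    stack = list(outputs)
    while stack:
        blob = stack.pop()
        if blob in seen_blobs or blob in stop:
            continue
        seen_blobs.add(blob)
        op_id = producer.get(blob)
        if op_id is None:
            continue        # external input: nothing produces it
        result.add(op_id)
        stack.extend(ops[op_id][0])
    return sorted(result)
```

In a residual network each block's output feeds both the next block and the skip connection, so a traversal without the `seen_blobs` set revisits earlier blocks along every combination of branches; with it, the work is linear in the number of ops and blobs.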
Reviewed By: Yangqing
Differential Revision: D4446278
fbshipit-source-id: 8bd147f92d62b865dc355d5802a53e92d64b6e21
Summary:
This normalizes the sparse gradient, so that the "effective learning rate" of each sparse parameter will NOT be affected by the number of examples in a batch that "use" this sparse parameter.
An experiment shows it helps convergence (about 0.1% better train NE): https://fburl.com/1230747813683956. It's not conclusive yet, and we still need to do more experiments. But this diff adds it as an option and does not change the default behavior, so we can get this in first.
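As a rough illustration of the idea, the sketch below divides each gradient row by the number of times its index occurs in the batch. The exact normalization used by this diff is not spelled out in the summary, so this is one plausible reading; `normalize_sparse_grad` and its list-based inputs are hypothetical.

```python
from collections import Counter


def normalize_sparse_grad(indices, values):
    """Scale each gradient row by 1 / (occurrences of its index in the batch).

    indices: list of parameter row ids, one per gradient row
    values:  list of gradient rows (lists of floats), aligned with indices
    """
    counts = Counter(indices)
    return [
        [v / counts[idx] for v in row]
        for idx, row in zip(indices, values)
    ]
```

After this scaling, the rows belonging to one index sum to the average of their original values, so a parameter used by many examples in the batch does not receive a proportionally larger update.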
Differential Revision: D4367283
fbshipit-source-id: 49ea80dfa9ea776ff4160e220cf6c86593521607
Summary:
This is a first step in improving our RNN story. It provides a wrapper around the current RecurrentNetworkOp implementation which infers most of the redundant parameters and makes the API much simpler.
Also in order to support general step nets I added an extra argument to the RecurrentNetworkOp.
Future work:
1. Infer step net output and internal blob (scratch) sizes and types
2. Avoid accessing blobs by name in the C++ part
3. Remove the requirement for 1:1 correspondence between inputs and outputs in the step net
4. Make the Python API support networks with operators like Sum on the border of the cell net (currently there is an issue with such networks where gradient blobs which are on the side are not explicitly created).
Differential Revision: D4268503
fbshipit-source-id: f8a66491c2b55daa730caeed7e9f2b3921541b49
Summary:
We create a Sum operator to sum up the gradients. Currently we use strings for its input/output blobs,
so the code will fail if AddAllGradients() runs within a NameScope.
To avoid this, just use BlobReference instead of strings for blobs.
Reviewed By: xianjiec
Differential Revision: D4343701
fbshipit-source-id: 2d008916e192d75c6e20f97921331ac4c7b73363
Summary:
I have noticed that constructing the Xray model takes quite a while. To measure this, I wrote a benchmark script that creates a resnet-50 model on 8 GPUs. This takes about 95 secs, which is kind of annoying when you want to quickly debug stuff.
Profiling with Python's cProfile, I was able to see that most of the time is spent in net.BlobIsDefined(), which does a linear search over external inputs and operator outputs. Thus it gets slower and slower with large nets. This can be fully optimized by keeping a separate lookup table of operator inputs and outputs (and external inputs and outputs). It is a bit annoying to keep this separate data structure, but I set up the unit tests to ensure things work correctly across Clones.
After the optimization, the net construction drops from 95 secs to 8.2 secs!
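The idea of the lookup table can be sketched with a toy net class. This is not the Net class from core.py: `TinyNet` and its methods are hypothetical, and only the "defined blobs" side of the bookkeeping is shown.

```python
class TinyNet(object):
    """Toy net that tracks defined blobs in a set for O(1) lookups."""

    def __init__(self):
        self.ops = []            # list of (inputs, outputs) tuples
        self._defined = set()    # external inputs plus all op outputs

    def add_external_input(self, blob):
        self._defined.add(blob)

    def add_op(self, inputs, outputs):
        self.ops.append((list(inputs), list(outputs)))
        # Keep the lookup table in sync with every op that is added.
        self._defined.update(outputs)

    def blob_is_defined(self, blob):
        # Set membership instead of scanning every op's outputs.
        return blob in self._defined

    def clone(self):
        # The lookup table has to be copied too, otherwise the clone
        # would share (or lose) the bookkeeping of the original.
        other = TinyNet()
        other.ops = [(list(i), list(o)) for i, o in self.ops]
        other._defined = set(self._defined)
        return other
```

The cost is the extra state that must stay consistent with the op list, which is why cloning is the case worth covering in unit tests.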
Reviewed By: azzolini
Differential Revision: D4288307
fbshipit-source-id: 0bb82c8bde9d86a2702b298f4aa706cba509346e
Summary:
Disclaimer: this is really hacky
Continues a fix from D4218902. The root problem is that DPER builds the net incrementally and input_record doesn't support that properly. For now I just manipulate the input record directly. Alisson wants to fix it properly later by allowing set_input_record to accept a superset of the current record.
But it should unblock our experimentation.
I'm curious how it's going to look in dper_example world.
Reviewed By: azzolini
Differential Revision: D4255285
fbshipit-source-id: ff65b6f943d705a9b3399035597e2e8ded2e1ff3
Summary:
This adds support for automatic aggregation of sparse gradients. We simply concatenate indices and values (no attempt to deduplicate, since this is already done before feeding into the optimizer). This should support various cases (indices and/or values can be generated by one or more gradient ops, or gradient outputs can be directly passed from inputs).
I tried to minimize the code footprint, but I introduced SparseGradGenMeta because GradGenMeta didn't lend itself very well to be used with sparse gradients.
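The concatenation itself is simple and can be sketched in plain Python. This illustrates the aggregation rule only, not the gradient-generation code or SparseGradGenMeta; `aggregate_sparse_grads` and the list-based representation are assumptions.

```python
def aggregate_sparse_grads(grads):
    """Concatenate several sparse gradients for the same parameter.

    grads: list of (indices, values) pairs, where values[i] is the gradient
           row for parameter row indices[i].
    Duplicate indices are kept as they are; no deduplication is attempted.
    """
    all_indices = []
    all_values = []
    for indices, values in grads:
        if len(indices) != len(values):
            raise ValueError("indices and values must have the same length")
        all_indices.extend(indices)
        all_values.extend(values)
    return all_indices, all_values
```

For example, aggregating `([1, 4], [[0.1], [0.2]])` with `([4], [[0.3]])` gives indices `[1, 4, 4]` and values `[[0.1], [0.2], [0.3]]`, with the repeated index 4 left in place.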
Reviewed By: dzhulgakov
Differential Revision: D4219788
fbshipit-source-id: 1d074664cffd82a8764e4b1473ada6bc46e6c51a
Summary:
DPER has very strange python ops that play with Workspace - they are somewhat similar to LoadOp/SaveOp, so I guess the semantics is fine.
Thus it makes sense to allow python operators to receive workspace pointer similarly to regular Operators.
I didn't figure out a better way to implement the optional argument than just checking the number of args the function receives on the Python side.
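A sketch of the argument-count check, using Python 3's `inspect.signature`. The dispatcher name `call_python_op` and the `(inputs, outputs[, workspace])` calling convention are assumptions for illustration, not the actual python op binding.

```python
import inspect


def call_python_op(func, inputs, outputs, workspace):
    """Call func with or without the workspace, based on its arity."""
    num_params = len(inspect.signature(func).parameters)
    if num_params == 3:
        # The function opted in to receiving the workspace.
        return func(inputs, outputs, workspace)
    if num_params == 2:
        # Plain python op: inputs and outputs only.
        return func(inputs, outputs)
    raise TypeError(
        "python op must take 2 or 3 arguments, got {}".format(num_params)
    )
```

Existing two-argument functions keep working unchanged, and only functions that declare a third parameter are handed the workspace.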
Reviewed By: ajtulloch
Differential Revision: D4242943
fbshipit-source-id: d97d4227815b741c8f884cfe254b06d2b56b5a41
Summary:
Faster implementation of UniqueOp using google::dense_hash_map, as suggested by dzhulgakov. I haven't benchmarked it precisely but early measurements with my workflow show a significant speed bump (this operation went from using 20% of overall CPU time down to 7%).
I gated the implementation using the "engine" feature, to avoid adding sparsehash as a dependency to caffe2.
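The hash-map approach to deduplication can be illustrated in Python, with a dict standing in for google::dense_hash_map. This is a sketch of the general technique, not the operator's C++ code; `unique_with_remapping` is a hypothetical name, and returning a remapping alongside the unique values is an assumption about what a caller would want.

```python
def unique_with_remapping(values):
    """Deduplicate values with a hash map, in a single pass.

    Returns (unique, remapping), where unique lists each distinct value in
    first-seen order and remapping[i] is the position of values[i] in unique.
    """
    position = {}     # value -> index in `unique`
    unique = []
    remapping = []
    for v in values:
        if v not in position:
            position[v] = len(unique)
            unique.append(v)
        remapping.append(position[v])
    return unique, remapping
```

Each element costs one expected constant-time hash lookup, so the whole pass is linear in the input size rather than requiring a sort.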
Reviewed By: dzhulgakov
Differential Revision: D4219768
fbshipit-source-id: 2f142981e772105b42fffa24afb199ef816f8e0c