Summary: This diff enables sparse gradient synchronization between GPUs. The test case is currently more convoluted than it should be; it can be simplified once D4871680 lands.
Reviewed By: dzhulgakov
Differential Revision: D4877087
fbshipit-source-id: 37bbb07051cbaf3a6e3c54b0eead97f3e02337d5
Summary:
This diff enables support of recurrent networks for memonger:
1. Memonger descends into the step-nets and renames the blobs accordingly
2. Memonger tells the gradient op about the renamed blobs by adding a parameter "paramname.renamed=<new name>"
3. RecurrentNetworkGradientOp applies remapping to links and gradient blobs.
I first considered refactoring the whole gradient blob management of the recurrent network, but that looks very hard without a major revision of the code.
Note: I did not enable memonger for neural_mt, since I think the team should do more testing before enabling it.
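A minimal plain-Python sketch of the renaming handshake in steps 2 and 3 above. This is not the memonger or RecurrentNetworkGradientOp code; the helper names and blob names are hypothetical, and only the ".renamed" argument suffix comes from this summary.

```python
# Hypothetical illustration of passing blob renames as "<name>.renamed" args.
RENAMED_SUFFIX = ".renamed"


def make_rename_args(renames):
    """Encode {old_name: new_name} as {"old_name.renamed": new_name}."""
    return {old + RENAMED_SUFFIX: new for old, new in renames.items()}


def parse_rename_args(args):
    """Recover {old_name: new_name} from args, ignoring unrelated keys."""
    return {
        key[: -len(RENAMED_SUFFIX)]: value
        for key, value in args.items()
        if key.endswith(RENAMED_SUFFIX)
    }


def remap_blobs(blob_names, mapping):
    """Replace renamed blobs; names without a mapping are left unchanged."""
    return [mapping.get(name, name) for name in blob_names]


if __name__ == "__main__":
    # Blob names below are made up for the example.
    renames = {"hidden_t": "__m0_shared", "cell_t": "__m1_shared"}
    args = make_rename_args(renames)
    args["timestep"] = "t"  # an unrelated argument that must be ignored
    mapping = parse_rename_args(args)
    print(remap_blobs(["hidden_t", "input_t", "cell_t"], mapping))
```

Encoding the renames as ordinary arguments means the gradient op needs no new interface; it only has to look up each link and gradient blob in the parsed mapping.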
Reviewed By: salexspb
Differential Revision: D4812823
fbshipit-source-id: 1ffdf3cfb4fcd00eec5bb0ece3bf416aa6d3e26b
Summary:
First, this diff includes a full test of data-parallel LSTM, which confirms it works correctly. To make it work, some changes had to be made:
- cell net/step net external inputs must be namespace scoped
- prevent double-namescoping of cellnet inputs
- make data parallel model understand recurrentnets so the device-mapping works
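A rough sketch of what "prevent double-namescoping" means, in plain Python. It assumes scopes are joined with "/" (as in "gpu_0/blob"); the helper and the blob names are hypothetical and not the actual implementation.

```python
# Hypothetical helper: apply a namescope prefix at most once.
def scoped_name(scope, blob):
    """Return blob prefixed with scope, unless it already carries that prefix."""
    if not scope:
        return blob
    prefix = scope if scope.endswith("/") else scope + "/"
    if blob.startswith(prefix):
        return blob  # already scoped; do not prefix again
    return prefix + blob


if __name__ == "__main__":
    once = scoped_name("gpu_0", "lstm/hidden_init")
    twice = scoped_name("gpu_0", once)
    print(once, twice)
```

Making the operation idempotent lets the cell net and the outer net both request scoping for the same external input without producing names like "gpu_0/gpu_0/...".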
Reviewed By: salexspb
Differential Revision: D4708840
fbshipit-source-id: 4b0ddc43642d449076a2b6f67ad1c47f84138ff4
Summary:
Data parallel model did not support sparse operations, nor gradients computed on CPU ops.
Currently sparse operations are done on CPU, so there is no point in "data parallelizing" them. I had to make a few changes to data_parallel_model to support this:
1. Model can have params that are added prior to adding the data parallel part. For example, a lookup table of word vectors would be a parameter that is non-parallel.
2. Thus, when the data parallel model is called, it separates the non-parallel params and avoids working on them. Note: when we add a distributed version, we need to handle them explicitly with AllGather!
This works nicely since Caffe2 automatically adds the backward concat-operator when multiple ops gather from the same blob.
I also added support for data parallel CPU ops, which might be necessary in cases where we don't have a GPU implementation of some ops.
The test in data_parallel_model_test validates the correctness of the code by running the same trainer on different numbers of GPUs and checking that the end result is the same.
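To make the shared lookup table case concrete, here is a plain-Python sketch (not Caffe2 code) of several devices gathering from one non-parallel table, with the backward pass concatenating the per-device sparse gradients. All function names and values are hypothetical, and the SGD update is an assumption added only to show the effect.

```python
# Hypothetical sketch: one shared table, several gathers, concatenated gradient.
def gather(table, indices):
    """Forward pass: look up rows of the shared table."""
    return [table[i] for i in indices]


def gather_gradient(indices, output_grads):
    """Backward pass of gather: a sparse gradient as (indices, rows)."""
    return list(indices), [list(row) for row in output_grads]


def concat_sparse_gradients(sparse_grads):
    """Concatenate the (indices, rows) pairs produced by each gather."""
    all_indices, all_rows = [], []
    for indices, rows in sparse_grads:
        all_indices.extend(indices)
        all_rows.extend(rows)
    return all_indices, all_rows


def apply_sparse_sgd(table, indices, rows, lr):
    """Apply each gradient row in turn; repeated indices accumulate."""
    for i, row in zip(indices, rows):
        table[i] = [w - lr * g for w, g in zip(table[i], row)]


if __name__ == "__main__":
    table = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
    grad_dev0 = gather_gradient([0, 2], [[1.0, 1.0], [1.0, 1.0]])
    grad_dev1 = gather_gradient([2], [[2.0, 2.0]])
    indices, rows = concat_sparse_gradients([grad_dev0, grad_dev1])
    apply_sparse_sgd(table, indices, rows, lr=0.5)
    print(indices, table)
```

Row 2 is looked up on both devices, so it receives both updates, while row 1 is never touched.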
Reviewed By: jhcross
Differential Revision: D4649208
fbshipit-source-id: e3b7ae701ead468dc94c52a976eafec5c9831097
Summary:
As per the discussion in https://www.prod.facebook.com/groups/184236721951559/permalink/354591931582703/, KaimingHe pointed out that scaling the LR is not the same as scaling the loss, since LR scaling also affects the weight decay (which is implemented by modifying the gradient, which is thus not yet correctly 'averaged'). Actually prigoyal tried to convince me earlier that loss scaling is the way to go, but I was not convinced at the time :/.
So this diff removes the LR scaling parameter passed by data_parallel_model and instead passes a loss_scale parameter to the model creation function. Unfortunately, this will break all existing code that uses the data parallel model. That is not entirely a bad thing, since it will bring awareness to this change. I will announce it in the FB groups.
In this diff I modified all my models to work correctly.
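To illustrate the difference, here is a minimal sketch in plain Python (not Caffe2 code). It assumes the per-device gradients are summed across devices and that weight decay is added to the gradient as weight_decay * w after the sum; the function names and numbers are hypothetical.

```python
# Hypothetical comparison of LR scaling and loss scaling with weight decay.
def update_reference(w, device_grads, lr, weight_decay):
    """Intended update: mean data gradient plus the full weight decay term."""
    mean_grad = sum(device_grads) / len(device_grads)
    return w - lr * (mean_grad + weight_decay * w)


def update_lr_scaling(w, device_grads, lr, weight_decay):
    """Sum the gradients, add weight decay, then divide the LR by num devices.

    The division also shrinks the weight decay term by the same factor.
    """
    num_devices = len(device_grads)
    grad = sum(device_grads) + weight_decay * w
    return w - (lr / num_devices) * grad


def update_loss_scaling(w, device_grads, lr, weight_decay):
    """Scale each device's loss (hence gradient) by 1/num devices instead."""
    loss_scale = 1.0 / len(device_grads)
    grad = sum(loss_scale * g for g in device_grads) + weight_decay * w
    return w - lr * grad


if __name__ == "__main__":
    w, grads, lr, wd = 1.0, [0.2, 0.4, 0.6, 0.8], 0.1, 0.01
    print("reference   ", update_reference(w, grads, lr, wd))
    print("loss scaling", update_loss_scaling(w, grads, lr, wd))
    print("LR scaling  ", update_lr_scaling(w, grads, lr, wd))
```

Under these assumptions loss scaling reproduces the reference update, while LR scaling applies only 1/num_devices of the intended weight decay.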
Reviewed By: Yangqing
Differential Revision: D4507002
fbshipit-source-id: 16c7221663282f71a1b754b34de0c8ccd5c2ca90
Summary:
When the data parallel model was refactored, the division of the LR by the number of devices was dropped, so we ended up effectively multiplying gradients by the number of devices. Thus, we need to scale the LR by 1/numgpus.
Created a test to confirm that data_parallel_model produces exactly the same results on different numbers of GPUs, given the same total batch size.
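A small plain-Python sketch of why the 1/numgpus factor is needed. It assumes each device computes the mean gradient over an equal-sized shard of the batch and that the per-device gradients are summed; the scalar least-squares model and all names are hypothetical, not taken from the test.

```python
# Hypothetical check: summed per-device gradients need LR divided by num devices.
def grad_mean_squared_error(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to the scalar w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def single_device_step(w, xs, ys, lr):
    """One SGD step on the whole batch."""
    return w - lr * grad_mean_squared_error(w, xs, ys)


def data_parallel_step(w, xs, ys, lr, num_devices):
    """Split the batch evenly, sum the per-device gradients, scale LR by 1/N."""
    if len(xs) % num_devices != 0:
        raise ValueError("batch size must be divisible by num_devices")
    shard = len(xs) // num_devices
    summed_grad = 0.0
    for d in range(num_devices):
        lo, hi = d * shard, (d + 1) * shard
        summed_grad += grad_mean_squared_error(w, xs[lo:hi], ys[lo:hi])
    return w - (lr / num_devices) * summed_grad


if __name__ == "__main__":
    xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
    print(single_device_step(0.5, xs, ys, 0.01))
    for n in (1, 2, 4):
        print(n, data_parallel_step(0.5, xs, ys, 0.01, n))
```

Without the division, the summed gradient would be num_devices times the full-batch mean gradient, which is the bug described above.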
Reviewed By: prigoyal
Differential Revision: D4248907
fbshipit-source-id: af21ede113e6ac25f12c556de298cb18974548be