pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-06 12:20:52 +01:00

Author	SHA1	Message	Date
Aapo Kyrola	f94f43fd6e	Working sparse gradients for data parallel model Summary: This diff enables sparse gradient synchronization between GPUs. The test case is now a bit too convoluted, but once D4871680 is landed, we can simplify it a bit. Reviewed By: dzhulgakov Differential Revision: D4877087 fbshipit-source-id: 37bbb07051cbaf3a6e3c54b0eead97f3e02337d5	2017-04-13 17:39:23 -07:00
Aapo Kyrola	02f0c1c9d7	make memonger work with RecurrentNetwork(Gradient) Summary: This diff enables support of recurrent networks for memonger: 1. Memonger descends into the step-nets and renames the blobs accordingly 2. Memonger tells the gradient op about the renamed blobs by adding a parameter "paramname.renamed=<new name>" 3. RecurrentNetworkGradientOp applies remapping to links and gradient blobs. I first thought of refactoring the whole gradient blob management of the recurrent network, but that looks to be very hard without a major revision of the code. Note, I did not enable memonger for neural_mt, since I think the team should do more testing before enabling this. Reviewed By: salexspb Differential Revision: D4812823 fbshipit-source-id: 1ffdf3cfb4fcd00eec5bb0ece3bf416aa6d3e26b	2017-04-05 09:48:25 -07:00
Aapo Kyrola	91f468b15c	fixes to make data parallel model work for RecurrentNet + test case Summary: First, this diff includes a full test of data-parallel LSTM, which confirms it works correctly. To make it work, some changes had to be made: - cell net/step net external inputs must be namespace scoped - prevent double-namescoping of cellnet inputs - make data parallel model understand recurrentnets so the device-mapping works Reviewed By: salexspb Differential Revision: D4708840 fbshipit-source-id: 4b0ddc43642d449076a2b6f67ad1c47f84138ff4	2017-03-14 15:48:07 -07:00
Aapo Kyrola	89c08334bb	data_parallel_model support for sparse gradients and CPU ops Summary: Data parallel model did not support sparse operations, nor gradients computed on CPU ops. Currently sparse operations are done on CPU, so there is no point in "data parallelizing" them. I had to make a few changes to data_parallel_model to support this: 1. Model can have params that are added prior to adding the data parallel part. For example, a lookup table of word vectors would be a parameter that is non-parallel. 2. Thus, when data parallel model is called, it will separate the non-parallel params and avoid working on them. Note: when we add a distributed version, we need to explicitly handle them with AllGather! This works nicely since Caffe2 automatically adds the backward concat-operator when multiple ops gather from the same blob. I also added support for data parallel CPU ops, which might be necessary in cases when we don't have a GPU implementation of some ops. Test in data_parallel_model_test validates the correctness of the code by running the same trainer on different numbers of GPUs and checking that the end result is the same. Reviewed By: jhcross Differential Revision: D4649208 fbshipit-source-id: e3b7ae701ead468dc94c52a976eafec5c9831097	2017-03-09 13:48:41 -08:00
Aapo Kyrola	1c7886701e	lr_scale to loss_scale Summary: As per discussion in https://www.prod.facebook.com/groups/184236721951559/permalink/354591931582703/, KaimingHe pointed out that scaling LR is not the same as scaling Loss, since LR scaling will affect the weight decay (which is implemented by modifying the gradient, which thus is not yet correctly 'averaged'). Actually prigoyal tried to convince me earlier that loss scaling is the way to go, but I was then not convinced :/. So this diff removes the LR scaling parameter passed by data_parallel_model and instead passes a loss_scale parameter to the model creation function. Unfortunately, this will break all existing code that uses the data parallel model. But that is not entirely a bad thing, since it will bring awareness to this change. I will announce this in the FB groups. In this diff I modified all my models to work correctly. Reviewed By: Yangqing Differential Revision: D4507002 fbshipit-source-id: 16c7221663282f71a1b754b34de0c8ccd5c2ca90	2017-02-03 07:44:40 -08:00
Aapo Kyrola	3410939459	pass learning rate scaling factor to parameter update builder function Summary: When refactoring data parallel model, the division of LR by number of devices was dropped, and thus we ended up effectively multiplying gradients by the number of devices. Thus, we need to scale the LR by 1/numgpus. Created a test to confirm that data_parallel_model produces exactly the same results on different numbers of GPUs, for a given total batch size. Reviewed By: prigoyal Differential Revision: D4248907 fbshipit-source-id: af21ede113e6ac25f12c556de298cb18974548be	2016-12-05 11:53:26 -08:00
Yangqing Jia	238ceab825	fbsync. TODO: check if build files need update.	2016-11-15 00:00:46 -08:00
Yangqing Jia	d1e9215184	fbsync	2016-10-07 13:08:53 -07:00

8 Commits
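
The point made in commit 1c7886701e (scaling the learning rate is not the same as scaling the loss once weight decay modifies the gradient) can be checked with a little arithmetic. The sketch below is not Caffe2 code: the function names, the single scalar parameter, and the assumption that weight decay is added once to the summed per-device gradient are all illustrative choices made here, not details taken from the commits.

```python
# Illustrative only: plain-Python SGD steps on one scalar weight.
# Assumption: per-device gradients are summed across devices, and
# weight decay is then added once to that summed gradient.

def step_single_device(w, mean_grad, lr, weight_decay):
    """Reference step: one device sees the whole batch."""
    return w - lr * (mean_grad + weight_decay * w)


def step_lr_scaled(w, per_device_grads, lr, weight_decay):
    """Sum raw gradients, add weight decay, then divide LR by device count."""
    n = len(per_device_grads)
    grad = sum(per_device_grads) + weight_decay * w
    # The 1/n factor also shrinks the weight-decay term.
    return w - (lr / n) * grad


def step_loss_scaled(w, per_device_grads, lr, weight_decay):
    """Scale each device's loss (hence its gradient) by 1/n, keep LR as is."""
    n = len(per_device_grads)
    grad = sum(g / n for g in per_device_grads) + weight_decay * w
    return w - lr * grad


if __name__ == "__main__":
    w, lr, wd = 1.0, 0.1, 0.5
    grads = [0.2, 0.4]  # hypothetical per-device gradients
    mean_grad = sum(grads) / len(grads)
    print("single device:", step_single_device(w, mean_grad, lr, wd))
    print("loss scaled:  ", step_loss_scaled(w, grads, lr, wd))
    print("lr scaled:    ", step_lr_scaled(w, grads, lr, wd))
```

Under these assumptions the loss-scaled step matches the single-device step, while the LR-scaled step applies only 1/n of the intended weight decay, which is consistent with the reasoning given in the commit message.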