pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Aapo Kyrola	eddf23ca0f	Handle parameters that are computed but not optimized Summary: prigoyal sharply noticed a bug in the Resnet models: we have not been checkpointing, nor synchronizing between gpus, the moving average and variance computed by the SpatialBN ops. Particularly the first problem is serious, since models starting from a checkpoint would have started from a null-state for SpatialBN. Not synchronizing with the data parallel model is less tragic since each GPU should see very similar data. Thus I propose keeping track of "computed params", i.e. params that are computed from data but not optimized. I don't know if there are other examples, but SpatialBN's moving avg and var definitely are one. - I modified the checkpointing for xray model to store those blobs + also ensure the synchronization of those blobs - I modified data parallel model to broadcast those params from gpu0. I first tried averaging, but hit some NCCL deadlocks ... :( Differential Revision: D4281265 fbshipit-source-id: 933311afeec4b7e9344a13cf2d38aa939c50ac31	2016-12-15 12:01:28 -08:00
Ou Jin	e8b7ec1393	disable local update for sparse features Summary: With parameter server, sparse features are updated on the parameter server. Local update for sparse features is disabled. But that logic was removed in D4144922. This diff adds this logic back in a slightly different way. Previously, in trainer_example, I did that in a hacky way, by simply not adding the sparse weight to model.params. It would still generate grad, but would not add optimization operators. At the same time, it was always registered directly in the sparse_mapping, so the parameter server was aware of this parameter. But with the new change for ParameterInfo, I cannot do it that way anymore, because the param registry and params are bound together in ParameterInfo. For dper, there is an option in dper model helper to disable all of the sparse parameter optimizers. To combine these two together, I directly changed the ModelHelperBase in this diff. It is not quite ideal. It is better to do it in Layer. But to fix the old one, this seems to be a more reasonable place to cover both cases. With this diff, there is no spike anymore. So probably this is the root cause for the convergence issue we have seen in D4144922. It explains why the model can recover, which is because adagrad decays local learning rate and local updates cause less change. Reviewed By: dzhulgakov Differential Revision: D4229684 fbshipit-source-id: da1241d43d7c52cbf13560f9bb83e09897d8d56f	2016-11-29 15:18:38 -08:00
Huazhong Ning	6ebae91d24	multi-task learning: save model and evaluator Summary: This consists of a series of diffs for implementing Multi-task learning. This diff is to 1. save model; 2. support MT learning in evaluator 3. add unittest. model after merging (saved model): https://our.intern.facebook.com/intern/graphviz/?paste=56793140 Reviewed By: xianjiec Differential Revision: D4123316 fbshipit-source-id: 225bf8616962ec08f4f1ef85729c1e94ba7c373a	2016-11-29 15:18:38 -08:00
Aapo Kyrola	b77aa551a4	add missed comma Summary: D4205610 missed a comma, causing unnecessary logspill with WeightedSum op Reviewed By: Yangqing Differential Revision: D4222806 fbshipit-source-id: ff17c20eae7a7168475f39cc227d3e8ab347288f	2016-11-29 15:18:37 -08:00
Aaron Jaech	c41f0d27c4	adding more things to the list of known operators in model_helper Summary: This is so they don't generate spurious warning messages in the logs Reviewed By: dzhulgakov Differential Revision: D4205610 fbshipit-source-id: f764b51565430f4057898ab929372bc7943e0495	2016-11-29 15:18:35 -08:00
Yangqing Jia	589398950f	fbsync at f5a877	2016-11-18 15:41:06 -08:00
Yangqing Jia	238ceab825	fbsync. TODO: check if build files need update.	2016-11-15 00:00:46 -08:00
Yangqing Jia	d1e9215184	fbsync	2016-10-07 13:08:53 -07:00

8 Commits