4 Commits

Author SHA1 Message Date
Aapo Kyrola
d37fffd257 use in-place ReLU to save a lot of memory
Summary: The Torch docs about ResNets, and soumith's comment, mention significant memory savings from in-place ReLU. prigoyal already had this in her code, but I did not. This saves a lot of memory: 9851 MiB -> 7497 MiB.

Reviewed By: prigoyal

Differential Revision: D4346100

fbshipit-source-id: e9c5d5e93787f47487fade668b65b9619bfc9741
2016-12-19 09:29:26 -08:00
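The in-place ReLU idea from d37fffd257 can be illustrated with a minimal, framework-free sketch. This is not the Caffe2 code from the commit; the function names `relu_out_of_place` and `relu_in_place` are hypothetical and it uses only the Python standard library. The point it shows is that the in-place variant overwrites the input buffer instead of allocating a second one, which is safe for ReLU because its gradient can be recovered from the output alone (it is nonzero exactly where the output is positive).

```python
from array import array


def relu_out_of_place(x):
    # Allocates a new buffer for the result; the input buffer stays alive,
    # so the activation effectively costs twice the memory.
    return array(x.typecode, (v if v > 0 else 0 for v in x))


def relu_in_place(x):
    # Overwrites negative entries in the existing buffer; no new allocation.
    for i, v in enumerate(x):
        if v < 0:
            x[i] = 0
    return x


def relu_grad_from_output(y, grad_y):
    # The backward pass needs only the output: pass the gradient through
    # where the output is positive, and zero it elsewhere.
    return array(y.typecode, (g if out > 0 else 0 for out, g in zip(y, grad_y)))


x = array("f", [-1.0, 2.0, -3.0, 4.0])
y = relu_out_of_place(x)   # x is unchanged, y is a second buffer
z = relu_in_place(x)       # z is the same object as x, now clamped at zero
```

The trade-off is that the pre-activation values are gone after the in-place call, so this only works when nothing downstream needs them.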
Aapo Kyrola
eddf23ca0f Handle parameters that are computed but not optimized
Summary:
prigoyal sharply noticed a bug in the ResNet models: we have not been checkpointing, nor synchronizing between GPUs, the moving average and variance computed by the SpatialBN ops. The first problem is particularly serious, since models starting from a checkpoint would have started from a null state for SpatialBN. Not synchronizing with the data parallel model is less tragic since each GPU should see very similar data.

Thus I propose keeping track of "computed params", i.e. params that are computed from data but not optimized. I don't know if there are other examples, but SpatialBN's moving avg and var definitely qualify.

- I modified the checkpointing for the xray model to store those blobs and also to ensure the synchronization of those blobs
- I modified the data parallel model to broadcast those params from gpu0. I first tried averaging, but hit some NCCL deadlocks ... :(

Differential Revision: D4281265

fbshipit-source-id: 933311afeec4b7e9344a13cf2d38aa939c50ac31
2016-12-15 12:01:28 -08:00
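The "computed params" handling described in eddf23ca0f can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the commit: the workspace is modeled as a plain dict, and the `gpu_<n>/<name>` blob naming and the helper names `checkpoint_blobs` and `broadcast_computed_params` are hypothetical. It shows the two changes the message lists: computed params are saved in the checkpoint alongside optimized params, and their values on gpu 0 are copied to every other GPU.

```python
def checkpoint_blobs(params, computed_params):
    # A checkpoint must contain the optimized params AND the computed ones
    # (e.g. batch-norm moving mean/variance), or a restored model would
    # start from empty statistics.
    return sorted(set(params) | set(computed_params))


def broadcast_computed_params(workspace, computed_params, num_gpus):
    # Copy each computed param from gpu 0 to all other GPUs so that every
    # replica holds identical statistics. (Hypothetical blob naming.)
    for name in computed_params:
        src = workspace["gpu_0/" + name]
        for gpu in range(1, num_gpus):
            workspace["gpu_{}/{}".format(gpu, name)] = list(src)


workspace = {
    "gpu_0/conv_w": [1.0],
    "gpu_1/conv_w": [1.0],
    "gpu_0/bn_mean": [0.5, 0.25],
    "gpu_1/bn_mean": [0.0, 0.0],
}
broadcast_computed_params(workspace, ["bn_mean"], num_gpus=2)
to_save = checkpoint_blobs(["conv_w"], ["bn_mean"])
```

Broadcasting from one device is simpler than averaging across devices, at the cost of discarding the statistics gathered on the other GPUs; the commit message argues this is acceptable because each GPU should see very similar data.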
Yangqing Jia
238ceab825 fbsync. TODO: check if build files need update.
2016-11-15 00:00:46 -08:00
Yangqing Jia
d1e9215184 fbsync
2016-10-07 13:08:53 -07:00