Summary:
prigoyal astutely noticed a bug in the Resnet models: we have not been checkpointing, nor synchronizing between GPUs, the moving average and variance computed by the SpatialBN ops. The first problem in particular is serious, since models starting from a checkpoint would have started from a null state for SpatialBN. Not synchronizing within the data parallel model is less tragic, since each GPU should see very similar data.
Thus I propose keeping track of "computed params", i.e. params that are computed from data but not optimized. I don't know if there are other examples, but SpatialBN's moving average and variance definitely qualify.
- I modified the checkpointing for the xray model to store those blobs and also ensure they are synchronized
- I modified data_parallel_model to broadcast those params from gpu_0 (sketched below). I first tried averaging, but hit some NCCL deadlocks ... :(
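A minimal sketch (not the actual diff) of broadcasting such computed params from gpu_0 to the other GPUs through the workspace; the "gpu_<id>/" prefix follows the data_parallel_model naming convention, and the SpatialBN blob names below are hypothetical:

from caffe2.python import core, workspace
from caffe2.proto import caffe2_pb2

def broadcast_computed_params(computed_params, num_gpus):
    for name in computed_params:
        # Read the value accumulated on the first GPU ...
        value = workspace.FetchBlob("gpu_0/" + name)
        # ... and overwrite the corresponding blob on every other GPU.
        for gpu in range(1, num_gpus):
            device_opt = core.DeviceOption(caffe2_pb2.CUDA, gpu)
            workspace.FeedBlob("gpu_{}/{}".format(gpu, name), value,
                               device_option=device_opt)

# e.g. the running statistics of one SpatialBN layer (names hypothetical):
broadcast_computed_params(["conv1_spatbn_rm", "conv1_spatbn_riv"], num_gpus=4)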
Differential Revision: D4281265
fbshipit-source-id: 933311afeec4b7e9344a13cf2d38aa939c50ac31
Summary: Make xray net_type configurable via a command line argument
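A rough sketch of what the flag could look like (argument name and default are assumptions, not the actual diff):

import argparse

parser = argparse.ArgumentParser(description="xray trainer")
parser.add_argument("--net_type", type=str, default="dag",
                    help="Caffe2 net execution type, e.g. 'simple' or 'dag'")
args = parser.parse_args()

# The chosen value is then applied to the net proto, e.g.:
# model.net.Proto().type = args.net_type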
Differential Revision: D4262076
fbshipit-source-id: e2ecb9cd5bee5d6aaebe0ea8d2d4d9b378058cba
Summary: Switching to Pieter-MPI changed the way we set up the network between operators. For synchronizing parameters after a checkpoint load, we run a checkpoint_net that contained operators for creating the common world as well as broadcast operators. Unfortunately this fails when the checkpoint sync is done a second time, because we would have created a duplicate common world. The solution is to separate the common-world op and the broadcast ops into an init net and the actual broadcasting net, and to run the init net only once. This problem did not arise in the Flow version since I did only one checkpoint load per operator (process).
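A sketch of the split (op inputs and arguments are illustrative of the idea, not the exact production setup): the common world is created once by an init net, and the re-runnable sync net only references it:

from caffe2.python import core

num_hosts, my_rank = 2, 0                # placeholder rendezvous info
params_to_sync = ["fc_w", "fc_b"]        # placeholder parameter blob names

# Run-once init net: create the communication context ("common world").
sync_init_net = core.Net("checkpoint_sync_init")
common_world = sync_init_net.CreateCommonWorld(
    ["store_handler"], "checkpoint_cw", size=num_hosts, rank=my_rank)

# Re-runnable broadcast net: only uses the already-created common world.
sync_net = core.Net("checkpoint_sync")
for param in params_to_sync:
    sync_net.Broadcast([common_world, param], param, root=0)

# The trainer runs sync_init_net exactly once; sync_net can then be run after
# every checkpoint load without creating duplicate common worlds.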
Differential Revision: D4251754
fbshipit-source-id: ba030579e651e529e29bbf2d27920075078d8ff9
Summary:
When refactoring data_parallel_model, the division of the LR by the number of devices was dropped, so we ended up effectively multiplying gradients by the number of devices. We therefore need to scale the LR by 1/num_gpus.
Created a test to confirm that data_parallel_model produces exactly the same results with different numbers of GPUs, given the same total batch size.
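To make the intended scaling concrete (numbers are just an illustration):

# Gradients from the GPUs are summed, not averaged, so the base LR must be
# divided by the device count to keep the effective step size unchanged.
base_lr = 0.1
num_gpus = 4
per_device_lr = base_lr / num_gpus  # 0.025 fed to the parameter update

# With a total batch of 256 split as 64 per GPU, summing 4 gradients and
# stepping with base_lr/4 matches a single-GPU step with base_lr on 256.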
Reviewed By: prigoyal
Differential Revision: D4248907
fbshipit-source-id: af21ede113e6ac25f12c556de298cb18974548be
Summary:
This diff introduces a simplified ImageNet trainer that uses data_parallel_model to parallelize training over GPUs and nodes in a synchronous manner. Flow's gang scheduling is used to launch the nodes, and data_parallel_model handles the synchronization among the gang members.
This example also uses the operator-per-epoch model, where each epoch produces a checkpoint that is consumed by the follow-up epoch.
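A schematic sketch of the single-node portion of such a trainer; the builder bodies are placeholders and only the shape of the data_parallel_model call is meaningful here:

from caffe2.python import cnn, data_parallel_model

def build_data_parallel_trainer(num_gpus):
    train_model = cnn.CNNModelHelper(order="NCHW", name="imagenet_trainer")

    def add_image_input(model):
        # Placeholder: reads this node's shard of the training data.
        pass

    def create_resnet(model, loss_scale):
        # Placeholder: builds the forward pass and returns its loss blobs.
        return []

    def add_parameter_update(model):
        # Placeholder: SGD with momentum, LR policy, etc.
        pass

    data_parallel_model.Parallelize_GPU(
        train_model,
        input_builder_fun=add_image_input,
        forward_pass_builder_fun=create_resnet,
        param_update_builder_fun=add_parameter_update,
        devices=range(num_gpus),
    )
    return train_model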
Reviewed By: salexspb
Differential Revision: D4223384
fbshipit-source-id: 8c2c73f4f6b2fdadb98511075ebbd8426c91eadb
Summary: Debugging nets can be tiresome, so it is good if we can do some sanity checks. This adds a sanity check that all non-NCCL and non-Copy operators do not reference blobs that have a different device scope than the operator. This check is only added to the data_parallel_model, so it should be safe. This check would have caught a subtle bug in prigoyal's training pipeline.
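A sketch of the kind of check described above (the real check lives inside data_parallel_model; the "gpu_<id>/" prefix convention, op filtering, and proto field access here are illustrative):

def check_device_scopes(net_proto, num_gpus):
    for op in net_proto.op:
        if op.type.startswith("NCCL") or "Copy" in op.type:
            continue  # cross-device ops are allowed to touch other scopes
        op_gpu = op.device_option.cuda_gpu_id
        for blob in list(op.input) + list(op.output):
            for gpu in range(num_gpus):
                if blob.startswith("gpu_{}/".format(gpu)) and gpu != op_gpu:
                    raise AssertionError(
                        "Op {} on gpu_{} references blob {} from another "
                        "device scope".format(op.type, op_gpu, blob))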
Reviewed By: dzhulgakov
Differential Revision: D4230444
fbshipit-source-id: 3d4a843162134a7a504053d95ff97a552e6b8a6d
Summary:
Remove MPI and use fb.distributed rendezvous and Pieter's new Ops.
One can now pass a 'rendezvous' struct to data_parallel_model to initiate distributed SyncSGD. The provided rendezvous implementation uses the kv-store handler of fb.distributed to disseminate information about the other hosts. We can easily add other rendezvous implementations, such as file-based ones, but that is the topic of another diff.
Removing MPI also allowed simplifying the Xray startup scripts, which are included in this diff.
Once this is accepted, I will work on simple example code so others can use this as well. The Flow implementation will be the topic of next week.
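To illustrate the shape of the rendezvous struct (field names are assumptions modeled on the later trainer examples; the fb.distributed kv-store handler itself is internal and only referenced by blob name here):

store_handler = "store_handler"   # blob holding the kv-store handler, created elsewhere

rendezvous = dict(
    kv_handler=store_handler,     # how the shards discover each other
    shard_id=0,                   # rank of this host
    num_shards=2,                 # total number of hosts in the gang
)

# Passed straight into the parallelization call, e.g.:
# data_parallel_model.Parallelize_GPU(..., rendezvous=rendezvous)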
Differential Revision: D4180012
fbshipit-source-id: 9e74f1fb43eaf7d4bb3e5ac6718d76bef2dfd731