Summary:
This diff introduces a simplified ImageNet trainer that uses data_parallel_model to parallelize training over GPUs and nodes in a synchronous manner. Flow's gang scheduling is used to launch the nodes, and data_parallel_model handles the synchronization among the gang members.
This example also uses the operator-per-epoch model, where each epoch produces a checkpoint that is consumed by the follow-up epoch.
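As a rough illustration (not code from this diff), a trainer built on data_parallel_model is typically assembled along the lines below. The toy builder functions, the two-GPU device list, and the epoch loop are made up for the sketch, and exact builder signatures may differ between Caffe2 versions.

    from caffe2.python import brew, core, data_parallel_model, model_helper, workspace

    def add_input(model):
        # Hypothetical stand-in for the ImageNet reader: random data and labels.
        model.param_init_net.GaussianFill([], "data", shape=[32, 256], mean=0.0, std=1.0)
        model.param_init_net.ConstantFill([], "label", shape=[32], value=1,
                                          dtype=core.DataType.INT32)

    def add_forward(model, loss_scale):
        # Tiny classifier instead of a real convnet.
        fc = brew.fc(model, "data", "fc", dim_in=256, dim_out=10)
        softmax, loss = model.SoftmaxWithLoss([fc, "label"], ["softmax", "loss"])
        return [model.Scale(loss, "loss_scaled", scale=loss_scale)]

    def add_update(model):
        # Plain SGD: param = param + (-lr) * grad via WeightedSum.
        it = brew.iter(model, "iter")
        lr = model.LearningRate(it, "lr", base_lr=-0.1, policy="fixed")
        one = model.param_init_net.ConstantFill([], "one", shape=[1], value=1.0)
        for param in model.GetParams():
            model.WeightedSum([param, one, model.param_to_grad[param], lr], param)

    train_model = model_helper.ModelHelper(name="imagenet_sketch")
    data_parallel_model.Parallelize_GPU(
        train_model,
        input_builder_fun=add_input,
        forward_pass_builder_fun=add_forward,
        param_update_builder_fun=add_update,
        devices=[0, 1],  # GPUs on this host
    )
    workspace.RunNetOnce(train_model.param_init_net)
    workspace.CreateNet(train_model.net)

    # Operator-per-epoch flavor: each Flow operator runs one epoch and writes a
    # checkpoint that the follow-up epoch's operator restores before continuing.
    for epoch in range(10):
        workspace.RunNet(train_model.net.Proto().name, num_iter=100)
        # saving/loading of the per-epoch checkpoint is omitted here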
Reviewed By: salexspb
Differential Revision: D4223384
fbshipit-source-id: 8c2c73f4f6b2fdadb98511075ebbd8426c91eadb
Summary: Debugging nets can be tiresome, so it is good if we can do some sanity checks. This adds a sanity check that no non-NCCL and non-Copy operator references blobs whose device scope differs from the operator's own. The check is only added to data_parallel_model, so it should be safe. It would have caught a subtle bug in prigoyal's training pipeline.
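Conceptually, the check walks every operator in the net and verifies that non-NCCL, non-Copy operators only touch blobs in their own device scope. A simplified, hypothetical sketch follows; the gpu_N/ namescope prefix and the proto field names are assumptions, and the real check inside data_parallel_model may differ.

    from caffe2.proto import caffe2_pb2

    def check_device_scopes(net_proto):
        for op in net_proto.op:
            # NCCL and Copy ops legitimately cross device boundaries; skip them.
            if op.type.startswith("NCCL") or "Copy" in op.type:
                continue
            if op.device_option.device_type != caffe2_pb2.CUDA:
                continue
            scope = "gpu_{}/".format(op.device_option.cuda_gpu_id)
            for blob in list(op.input) + list(op.output):
                assert blob.startswith(scope), \
                    "Op {} references blob '{}' outside device scope '{}'".format(
                        op.type, blob, scope)

    # e.g. check_device_scopes(train_model.net.Proto())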
Reviewed By: dzhulgakov
Differential Revision: D4230444
fbshipit-source-id: 3d4a843162134a7a504053d95ff97a552e6b8a6d
Summary:
Remove MPI and use fb.distributed rendezvous and Pieter's new Ops.
One can now pass a 'rendezvous' struct to data_parallel_model to initiate distributed SyncSGD. The provided rendezvous implementation uses the kv-store handler of fb.distributed to disseminate information about the other hosts. We can easily add other rendezvous implementations, such as a file-based one, but that is a topic for another diff.
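For illustration only, such a rendezvous struct might be assembled as follows. This is a sketch: a file-based store handler stands in for the internal fb.distributed kv-store handler, and the dictionary keys, the engine name, and the FileStoreHandlerCreate op are assumptions rather than what this diff ships.

    from caffe2.python import core, workspace

    # Create a kv-store handler blob; a file-based handler stands in here for
    # the fb.distributed handler used internally (assumed op for the sketch).
    workspace.RunOperatorOnce(core.CreateOperator(
        "FileStoreHandlerCreate", [], ["store_handler"],
        path="/tmp/rendezvous", prefix="run_0"))

    rendezvous = dict(
        kv_handler="store_handler",  # blob holding the store handler
        shard_id=0,                  # this host's rank
        num_shards=2,                # total number of hosts in the gang
        engine="GLOO",               # hypothetical; the diff uses Pieter's new ops
    )

    # Then passed along, e.g.:
    # data_parallel_model.Parallelize_GPU(..., rendezvous=rendezvous)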
Removing MPI also allowed simplifying the Xray startup scripts, which are included in this diff.
Once this is accepted, I will work on simple example code so others can use this as well. A Flow implementation will also be a topic for next week.
Differential Revision: D4180012
fbshipit-source-id: 9e74f1fb43eaf7d4bb3e5ac6718d76bef2dfd731