Commit Graph

56 Commits

Author SHA1 Message Date
Aapo Kyrola
365ca8da1c add sanity check that ops do not cross gpus
Summary: Debugging nets can be tiresome, so it is good if we can do some sanity checks. This adds a sanity check that all non-NCCL and non-Copy operators do not reference blobs that have different device scope than the operator. This check is only added to the data_parallel_model, so it should be safe. This check would had caught a subtle bugin prigoyal's training pipeline.

Reviewed By: dzhulgakov

Differential Revision: D4230444

fbshipit-source-id: 3d4a843162134a7a504053d95ff97a552e6b8a6d
2016-11-29 15:18:38 -08:00
Aapo Kyrola
42279a610c use Pieter-MPI and fb.distributed
Summary:
Remove MPI and use fb.distributed rendezvous and Pieter's new Ops.

One now can pass a 'rendezvous' struct to data_parallel_model to initiate distributed SyncSGD. Provided rendezvoud implementation uses the kv-store handler of fb.distributed to disseminate information about other hosts. We can easily add other rendezvous, such as file-based, but that is topic of another diff.

Removing MPI allowed also simplifiying of Xray startup scripts, which are included in this diff.

When accepted, I will work on a simple example code so others can use this stuff as well. Also Flow implementation will be topic of next week.

Differential Revision: D4180012

fbshipit-source-id: 9e74f1fb43eaf7d4bb3e5ac6718d76bef2dfd731
2016-11-29 15:18:36 -08:00
Yangqing Jia
589398950f fbsync at f5a877 2016-11-18 15:41:06 -08:00
Yangqing Jia
238ceab825 fbsync. TODO: check if build files need update. 2016-11-15 00:00:46 -08:00
Yangqing Jia
44509f9f91 fbsync: mostly lint changes, added mkl files 2016-10-11 22:45:06 -07:00
Yangqing Jia
d1e9215184 fbsync 2016-10-07 13:08:53 -07:00