Summary:
This diff introduces a simplified ImageNet trainer that uses data_parallel_model to parallelize training over GPUs and nodes in a synchronous manner. Flow's gang scheduling is used to launch the nodes, and data_parallel_model handles the synchronization among the gang members.
This example also uses the operator-per-epoch model, where each epoch produces a checkpoint that is consumed by the follow-up epoch.
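As a rough illustration (not code from this diff), a trainer built on data_parallel_model is typically assembled along the lines below. The toy builder functions, the two-GPU device list, and the epoch loop are made up for the sketch, and exact builder signatures may differ between Caffe2 versions.

    from caffe2.python import brew, core, data_parallel_model, model_helper, workspace

    def add_input(model):
        # Hypothetical stand-in for the ImageNet reader: random data and labels.
        model.param_init_net.GaussianFill([], "data", shape=[32, 256], mean=0.0, std=1.0)
        model.param_init_net.ConstantFill([], "label", shape=[32], value=1,
                                          dtype=core.DataType.INT32)

    def add_forward(model, loss_scale):
        # Tiny classifier instead of a real convnet.
        fc = brew.fc(model, "data", "fc", dim_in=256, dim_out=10)
        softmax, loss = model.SoftmaxWithLoss([fc, "label"], ["softmax", "loss"])
        return [model.Scale(loss, "loss_scaled", scale=loss_scale)]

    def add_update(model):
        # Plain SGD: param = param + (-lr) * grad via WeightedSum.
        it = brew.iter(model, "iter")
        lr = model.LearningRate(it, "lr", base_lr=-0.1, policy="fixed")
        one = model.param_init_net.ConstantFill([], "one", shape=[1], value=1.0)
        for param in model.GetParams():
            model.WeightedSum([param, one, model.param_to_grad[param], lr], param)

    train_model = model_helper.ModelHelper(name="imagenet_sketch")
    data_parallel_model.Parallelize_GPU(
        train_model,
        input_builder_fun=add_input,
        forward_pass_builder_fun=add_forward,
        param_update_builder_fun=add_update,
        devices=[0, 1],  # GPUs on this host
    )
    workspace.RunNetOnce(train_model.param_init_net)
    workspace.CreateNet(train_model.net)

    # Operator-per-epoch flavor: each Flow operator runs one epoch and writes a
    # checkpoint that the follow-up epoch's operator restores before continuing.
    for epoch in range(10):
        workspace.RunNet(train_model.net.Proto().name, num_iter=100)
        # saving/loading of the per-epoch checkpoint is omitted here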
Reviewed By: salexspb
Differential Revision: D4223384
fbshipit-source-id: 8c2c73f4f6b2fdadb98511075ebbd8426c91eadb
Summary: Debugging nets can be tiresome, so it is good if we can do some sanity checks. This adds a sanity check that no non-NCCL and non-Copy operator references blobs whose device scope differs from the operator's own. The check is only added to data_parallel_model, so it should be safe. It would have caught a subtle bug in prigoyal's training pipeline.
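Conceptually, the check walks every operator in the net and verifies that non-NCCL, non-Copy operators only touch blobs in their own device scope. A simplified, hypothetical sketch follows; the gpu_N/ namescope prefix and the proto field names are assumptions, and the real check inside data_parallel_model may differ.

    from caffe2.proto import caffe2_pb2

    def check_device_scopes(net_proto):
        for op in net_proto.op:
            # NCCL and Copy ops legitimately cross device boundaries; skip them.
            if op.type.startswith("NCCL") or "Copy" in op.type:
                continue
            if op.device_option.device_type != caffe2_pb2.CUDA:
                continue
            scope = "gpu_{}/".format(op.device_option.cuda_gpu_id)
            for blob in list(op.input) + list(op.output):
                assert blob.startswith(scope), \
                    "Op {} references blob '{}' outside device scope '{}'".format(
                        op.type, blob, scope)

    # e.g. check_device_scopes(train_model.net.Proto())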
Reviewed By: dzhulgakov
Differential Revision: D4230444
fbshipit-source-id: 3d4a843162134a7a504053d95ff97a552e6b8a6d
Summary:
Remove MPI and use fb.distributed rendezvous and Pieter's new Ops.
One can now pass a 'rendezvous' struct to data_parallel_model to initiate distributed SyncSGD. The provided rendezvous implementation uses the kv-store handler of fb.distributed to disseminate information about the other hosts. We can easily add other rendezvous implementations, such as a file-based one, but that is a topic for another diff.
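For illustration only, such a rendezvous struct might be assembled as follows. This is a sketch: a file-based store handler stands in for the internal fb.distributed kv-store handler, and the dictionary keys, the engine name, and the FileStoreHandlerCreate op are assumptions rather than what this diff ships.

    from caffe2.python import core, workspace

    # Create a kv-store handler blob; a file-based handler stands in here for
    # the fb.distributed handler used internally (assumed op for the sketch).
    workspace.RunOperatorOnce(core.CreateOperator(
        "FileStoreHandlerCreate", [], ["store_handler"],
        path="/tmp/rendezvous", prefix="run_0"))

    rendezvous = dict(
        kv_handler="store_handler",  # blob holding the store handler
        shard_id=0,                  # this host's rank
        num_shards=2,                # total number of hosts in the gang
        engine="GLOO",               # hypothetical; the diff uses Pieter's new ops
    )

    # Then passed along, e.g.:
    # data_parallel_model.Parallelize_GPU(..., rendezvous=rendezvous)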
Removing MPI also allowed simplifying the Xray startup scripts, which are included in this diff.
Once this is accepted, I will work on simple example code so others can use this as well. A Flow implementation will also be a topic for next week.
Differential Revision: D4180012
fbshipit-source-id: 9e74f1fb43eaf7d4bb3e5ac6718d76bef2dfd731