Summary:
Make it use Gloo and optionally use Redis for rendezvous (where a
shared filesystem is not available).
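Roughly, the rendezvous wiring looks like the sketch below. The store-handler op names (RedisStoreHandlerCreate / FileStoreHandlerCreate) and the rendezvous dict keys are written from memory and should be checked against the actual trainer; the placeholder settings stand in for command-line flags.

    from caffe2.python import core, workspace

    # Placeholder settings; in the real trainer these come from command-line flags.
    redis_host, redis_port, run_id = "redis.example.com", 6379, "resnet50_run0"
    shared_fs_path = "/mnt/shared/rendezvous"
    shard_id, num_shards = 0, 2

    store_handler = "store_handler"
    if redis_host is not None:
        # Rendezvous through Redis when no shared filesystem is available.
        workspace.RunOperatorOnce(core.CreateOperator(
            "RedisStoreHandlerCreate", [], [store_handler],
            host=redis_host, port=redis_port, prefix=run_id))
    else:
        # Otherwise fall back to a file store on a shared filesystem.
        workspace.RunOperatorOnce(core.CreateOperator(
            "FileStoreHandlerCreate", [], [store_handler],
            path=shared_fs_path))

    rendezvous = dict(
        kv_handler=store_handler,
        shard_id=shard_id,      # this machine's index
        num_shards=num_shards,  # total number of machines
        engine="GLOO",
        exit_nets=None,
    )
    # rendezvous is then handed to data_parallel_model.Parallelize_GPU(...).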
Differential Revision: D4709943
fbshipit-source-id: 59cc7a14316c7b634417ea5161a75fab3c19f2fa
Summary:
TSIA
This change also fixes an undefined attribute error after running 20
iterations of the resnet50 example trainer.
Differential Revision: D4692794
fbshipit-source-id: b98efdfeb078c5ba89d2a86837f3c672e1eade5f
Summary:
As per the discussion in https://www.prod.facebook.com/groups/184236721951559/permalink/354591931582703/, KaimingHe pointed out that scaling the LR is not the same as scaling the loss, since LR scaling also affects the weight decay (which is implemented by modifying the gradient, which is thus not yet correctly 'averaged'). Actually, prigoyal tried to convince me earlier that loss scaling is the way to go, but I was not convinced at the time :/.
So this diff removes the LR scaling parameter passed by data_parallel_model and instead passes a loss_scale parameter to the model creation function (sketched below). Unfortunately, this will break all existing code that uses the data parallel model. But that is not entirely a bad thing, since it will raise awareness of this change. I will announce it in the FB groups.
In this diff I modified all my models to work correctly.
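A sketch of what a model construction function can look like after this change; the toy FC/softmax network and blob names are only illustrative:

    def create_model_ops(model, loss_scale):
        # Toy forward pass: one FC layer plus a softmax loss. In a real trainer
        # this is where the actual network (e.g. ResNet) is built.
        fc = model.FC("data", "fc", dim_in=64, dim_out=10)
        softmax, loss = model.SoftmaxWithLoss([fc, "label"], ["softmax", "loss"])
        # Scale the loss instead of the LR, so that weight decay (which modifies
        # the gradient) is averaged correctly across devices.
        loss = model.Scale(loss, "loss_scaled", scale=loss_scale)
        return [loss]

The function is then passed to data_parallel_model as the forward-pass builder, which calls it once per device with the appropriate loss_scale.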
Reviewed By: Yangqing
Differential Revision: D4507002
fbshipit-source-id: 16c7221663282f71a1b754b34de0c8ccd5c2ca90
Summary:
This diff brings us roughly to par with Torch on ResNet memory usage. At batch size 32, ResNet-50 previously took 7497 MiB; after this change it takes 5010 MiB. This will thus allow us to handle 64 images / GPU, or 256 images / 4 GPUs.
In addition, I added a special argument to DagNet that causes it to run only one thread for the first iteration. This is needed because gradient sharing causes allocations during the first iteration's backward pass, which would otherwise make NCCL deadlock.
Sharing gradient buffers requires inferring which gradients can share memory (i.e., which ones are never in use concurrently). The previous memonger code used a topological sort, but rbgirshick showed that it does not work with tree-like models. I therefore wrote a new optimization algorithm based on DFS (sketched below). It takes about 0.25 s / GPU on ResNet-50, so it is clearly fast enough.
The data_parallel_model module supports this feature natively.
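The new allocation pass lives in memonger; the toy sketch below only illustrates the core idea (take the ops in DFS execution order from the loss, then greedily recycle buffers of gradient blobs whose live ranges have ended). The function and data structures here are hypothetical, not the actual implementation:

    from collections import defaultdict

    def assign_shared_buffers(ops, grad_blobs):
        """ops: list of (inputs, outputs) blob-name tuples, already in DFS
        execution order. grad_blobs: gradient blob names eligible for sharing.
        Returns a mapping from gradient blob to a shared buffer name."""
        # Last op index at which each blob is read; after that it is dead.
        last_use = {}
        for i, (inputs, _outputs) in enumerate(ops):
            for blob in inputs:
                last_use[blob] = i

        # Gradient blobs whose live range ends at a given op index.
        expiring = defaultdict(list)
        for blob, i in last_use.items():
            if blob in grad_blobs:
                expiring[i].append(blob)

        free_buffers = []  # shared buffers whose current blob is dead
        assignment = {}    # gradient blob -> shared buffer name
        next_buffer = 0
        for i, (_inputs, outputs) in enumerate(ops):
            for blob in outputs:
                if blob in grad_blobs and blob not in assignment:
                    if free_buffers:
                        assignment[blob] = free_buffers.pop()
                    else:
                        assignment[blob] = "_shared_%d" % next_buffer
                        next_buffer += 1
            # Buffers read for the last time by this op become reusable from
            # the next op onwards.
            for blob in expiring[i]:
                if blob in assignment:
                    free_buffers.append(assignment[blob])
        return assignment

    # Tiny example: a chain of gradient ops. fc3_grad and fc1_grad are never
    # live at the same time, so they can share one buffer.
    ops = [
        (["loss_grad"], ["fc3_grad"]),
        (["fc3_grad"], ["fc2_grad"]),
        (["fc2_grad"], ["fc1_grad"]),
    ]
    print(assign_shared_buffers(ops, {"fc3_grad", "fc2_grad", "fc1_grad"}))
    # {'fc3_grad': '_shared_0', 'fc2_grad': '_shared_1', 'fc1_grad': '_shared_0'}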
Reviewed By: prigoyal
Differential Revision: D4363209
fbshipit-source-id: 73b11e7610438098bb11bff0af8075ab0cf2c0f1
Summary: For some reason I had been disabling the cudnn exhaustive search heuristic for the xray/resnet trainers. On BigBasin, this gives a 10% perf boost; on BigSur, maybe 5%.
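The heuristic is just a flag on the model helper; a sketch of turning it on when constructing the trainer model (I believe these are the kwargs that CNNModelHelper forwards to the cuDNN conv ops, but double check):

    from caffe2.python import cnn

    train_model = cnn.CNNModelHelper(
        order="NCHW",
        name="resnet50",
        use_cudnn=True,
        cudnn_exhaustive_search=True,        # benchmark conv algos up front
        ws_nbytes_limit=(64 * 1024 * 1024),  # optional cap on cuDNN workspace
    )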
Reviewed By: prigoyal
Differential Revision: D4338654
fbshipit-source-id: 3974dd612f5d4f4dc8b2febccb59664d3f276c3e
Summary:
It gives a significant perf boost to do the parameter update inside MomentumSGD, instead of with a separate WeightedSum op.
To ensure backwards compatibility, I made the fused version a separate op (sketched below).
Also added a unit test.
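A sketch of the new parameter-update step, assuming the fused op is exposed as MomentumSGDUpdate and takes the parameter as an extra input/output (the existing MomentumSGD op stays unchanged for backwards compatibility):

    from caffe2.python import core

    net = core.Net("param_update")
    # One op now both adjusts the gradient with momentum and applies it to the
    # parameter in place, replacing the MomentumSGD + WeightedSum pair.
    net.MomentumSGDUpdate(
        ["param_grad", "param_momentum", "lr", "param"],
        ["param_grad", "param_momentum", "param"],
        momentum=0.9,
        nesterov=1,
    )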
Reviewed By: prigoyal
Differential Revision: D4262446
fbshipit-source-id: 38e7ee6d7677b398658ac7fe9b7a59b569e033f4
Summary:
When data_parallel_model was refactored, the division of the LR by the number of devices was dropped, so we ended up effectively multiplying the gradients by the number of devices. Thus, we need to scale the LR by 1/numgpus.
Created a test to confirm that data_parallel_model produces exactly the same results on different numbers of GPUs, given the same total batch size.
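The scaling argument in a nutshell: with gradients summed (not averaged) across N devices, lr/N gives the same parameter step as averaging. A toy numpy check of that arithmetic (not the actual unit test):

    import numpy as np

    np.random.seed(0)
    num_devices, lr = 4, 0.1
    per_device_grads = [np.random.randn(3) for _ in range(num_devices)]

    step_summed = (lr / num_devices) * np.sum(per_device_grads, axis=0)
    step_averaged = lr * np.mean(per_device_grads, axis=0)
    assert np.allclose(step_summed, step_averaged)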
Reviewed By: prigoyal
Differential Revision: D4248907
fbshipit-source-id: af21ede113e6ac25f12c556de298cb18974548be
Summary: Just noticed that I had duplicate code in the example imagenet trainer. Removed the function.
Differential Revision: D4223070
fbshipit-source-id: 443a9401bf7e425f7a3a13a44c9d0f7e21e72303