pytorch

OSSForks/pytorch

Fork 0

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 00:21:07 +01:00

Commit Graph

Author	SHA1	Message	Date
Viswanath Sivakumar	9775ffc6ae	Fixes to topological sort, canonical blob naming, sharing final blob Summary: Three small changes: Reviewed By: ajtulloch Differential Revision: D4437131 fbshipit-source-id: c849e36e1c4d1dce947076349df863fafe62c66d	2017-01-25 15:14:26 -08:00
Aapo Kyrola	95b3309a87	Gradient Input memory sharing using memonger blob sharing Summary: This diff brings us to roughly par with Torch on ResNet memory usage. On batch size 32, Resnet-50 took 7497MiB, after this 5010 MiB. This will thus allow us to handle 64 images / GPU, or 256 images / 4 GPUs. In addition, I added a special argument to DagNet that causes it to run only one thread for the first iteration. This is needed since there are allocations on the first iteration's backward pass due to gradient sharing, and this will cause NCCL to deadlock. The sharing of gradient buffers requires inferring which gradients can share memory (i.e that they are not used concurrently). Previous memonger code uses topological sort, but rbgirshick showed that it does not work with tree-like models. Thus, I wrote a new optimization algorithm based on DFS. It takes about 0.25 secs / GPU on resnet-50, so is clearly fast enough. Module data_parallel_model supports this feature natively. Reviewed By: prigoyal Differential Revision: D4363209 fbshipit-source-id: 73b11e7610438098bb11bff0af8075ab0cf2c0f1	2017-01-09 19:44:23 -08:00
Yangqing Jia	09bed67e4f	add untracked files	2016-07-21 11:26:41 -07:00

Author

SHA1

Message

Date

Viswanath Sivakumar

9775ffc6ae

Fixes to topological sort, canonical blob naming, sharing final blob

Summary: Three small changes:

Reviewed By: ajtulloch

Differential Revision: D4437131

fbshipit-source-id: c849e36e1c4d1dce947076349df863fafe62c66d

2017-01-25 15:14:26 -08:00

Aapo Kyrola

95b3309a87

Gradient Input memory sharing using memonger blob sharing

Summary:
This diff brings us to roughly par with Torch on ResNet memory usage. On batch size 32, Resnet-50 took 7497MiB, after this 5010 MiB. This will thus allow us to handle 64 images / GPU, or 256 images / 4 GPUs.

In addition, I added a special argument to DagNet that causes it to run only one thread for the first iteration. This is needed since there are allocations on the first iteration's backward pass due to gradient sharing, and this will cause NCCL to deadlock.

The sharing of gradient buffers requires inferring which gradients can share memory (i.e that they are not used concurrently). Previous memonger code uses topological sort, but rbgirshick showed that it does not work with tree-like models. Thus, I wrote a new optimization algorithm based on DFS. It takes about 0.25 secs / GPU on resnet-50, so is clearly fast enough.

Module data_parallel_model supports this feature natively.

Reviewed By: prigoyal

Differential Revision: D4363209

fbshipit-source-id: 73b11e7610438098bb11bff0af8075ab0cf2c0f1

2017-01-09 19:44:23 -08:00

Yangqing Jia

09bed67e4f

add untracked files

2016-07-21 11:26:41 -07:00

3 Commits