pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 00:21:07 +01:00

Author	SHA1	Message	Date
Peizhao Zhang	59f464434d	Used blob sizes for finding assignments in a greedy way. Summary: Used blob sizes for finding assignments in a greedy way. Reviewed By: ajtulloch Differential Revision: D4818159 fbshipit-source-id: 89180a6117ba5be058e1d2f9488b06d618e91917	2017-04-06 12:36:38 -07:00
Peizhao Zhang	a54000dc6a	Added an ordering function to reduce live spans of computed blobs. Summary: Added an ordering function (topological_sort_traversal_longest_path()) to reduce live spans of computed blobs. The idea is to sort the ops by the length of their execution path so that ops on longer paths are ordered first. Tested on a segmentation model with an on-the-fly decoder: memory usage dropped from 21.7MB to 14MB compared with using topological_sort_traversal() as the ordering function (the original size is 33MB with compressed parameters and without considering the conv buffer). It is a general ordering function, so I put it in memonger.py directly. Reviewed By: ajtulloch Differential Revision: D4790135 fbshipit-source-id: e661b45c1640de44ce1a9fdd009a4fba38f8e042	2017-04-06 12:20:39 -07:00
Aapo Kyrola	95b3309a87	Gradient Input memory sharing using memonger blob sharing Summary: This diff brings us roughly on par with Torch on ResNet memory usage. At batch size 32, ResNet-50 took 7497 MiB; after this change it takes 5010 MiB. This will thus allow us to handle 64 images / GPU, or 256 images / 4 GPUs. In addition, I added a special argument to DagNet that causes it to run only one thread for the first iteration. This is needed because there are allocations on the first iteration's backward pass due to gradient sharing, and these would cause NCCL to deadlock. The sharing of gradient buffers requires inferring which gradients can share memory (i.e., that they are not used concurrently). The previous memonger code used topological sort, but rbgirshick showed that it does not work with tree-like models. Thus, I wrote a new optimization algorithm based on DFS. It takes about 0.25 secs / GPU on ResNet-50, so it is clearly fast enough. The data_parallel_model module supports this feature natively. Reviewed By: prigoyal Differential Revision: D4363209 fbshipit-source-id: 73b11e7610438098bb11bff0af8075ab0cf2c0f1	2017-01-09 19:44:23 -08:00
Yangqing Jia	09bed67e4f	add untracked files	2016-07-21 11:26:41 -07:00

4 Commits
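
Commit a54000dc6a describes ordering ops by the length of their execution path so that ops on longer paths come first. The sketch below illustrates one possible reading of that idea and is not the code in memonger.py: the function name `longest_path_order`, the integer op ids, and the `(producer, consumer)` edge list are all assumptions made for this example. It computes, for each op, the longest dependency chain that starts at it, then runs a topological sort that always picks the ready op with the longest remaining chain.

```python
import heapq
from collections import defaultdict


def longest_path_order(num_ops, edges):
    """Topologically order ops 0..num_ops-1, preferring longer chains.

    Hypothetical helper for illustration only; `edges` is a list of
    (producer, consumer) pairs between integer op ids.
    """
    children = defaultdict(list)
    indegree = [0] * num_ops
    for src, dst in edges:
        children[src].append(dst)
        indegree[dst] += 1

    # First pass: plain Kahn traversal to get any valid topological order.
    remaining = list(indegree)
    stack = [op for op in range(num_ops) if remaining[op] == 0]
    topo = []
    while stack:
        op = stack.pop()
        topo.append(op)
        for child in children[op]:
            remaining[child] -= 1
            if remaining[child] == 0:
                stack.append(child)
    if len(topo) != num_ops:
        raise ValueError("graph has a cycle")

    # height[op] = number of edges on the longest path from op to a sink.
    height = [0] * num_ops
    for op in reversed(topo):
        for child in children[op]:
            height[op] = max(height[op], height[child] + 1)

    # Second pass: among ready ops, emit the one with the largest height.
    ready = [(-height[op], op) for op in range(num_ops) if indegree[op] == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        _, op = heapq.heappop(ready)
        order.append(op)
        for child in children[op]:
            indegree[child] -= 1
            if indegree[child] == 0:
                heapq.heappush(ready, (-height[child], child))
    return order


# Op 0 feeds op 4 directly; ops 1 -> 2 -> 3 -> 4 form the longer chain.
print(longest_path_order(5, [(0, 4), (1, 2), (2, 3), (3, 4)]))
# prints [1, 2, 0, 3, 4]
```

In this toy graph the long chain starts first, so the output of op 0 is produced later than it would be in id order and therefore stays live for fewer steps before op 4 consumes it. Whether the real function breaks ties or measures path length the same way is not stated in the commit message.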