Commit Graph

15 Commits

Author SHA1 Message Date
Aapo Kyrola
27e01744b2 Probably fixed memonger
Summary:
This diff fixes various issues with memonger, and works at least with rbgirshick's failure case, Resnet-50, and a new, harder unit test. I will still create a proper resnet50 test.

1) Introduce the concept of "tokens". These are passed down the dependency chains, and a blob can be recycled only if it owns every token currently outstanding. Tokens are added when branching and redeemed once all inputs are satisfied. A bit hard to explain; see the sketch after this list.
2) Fixed various bugs caused by messy code: the free_blobs data structure has a different type depending on whether blob sizes are available. I plan to rewrite this soon.
3) Added a harder unit test that failed before.
4) Added test for resnet50 + memonger
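
A rough, hypothetical sketch of the token bookkeeping from point 1 (all names here are invented for illustration and are not the actual memonger code):

```python
def can_recycle(blob_tokens, outstanding_tokens):
    # A freed blob may be reused only if it owns every token that is
    # currently outstanding, i.e. no parallel branch can still touch it.
    return outstanding_tokens.issubset(blob_tokens)

def on_branch(tokens, next_token_id):
    # Branching introduces a fresh token that travels down the new branch.
    return tokens | {next_token_id}, next_token_id + 1

def on_inputs_satisfied(tokens, redeemed_tokens):
    # Once all inputs of a join point are produced, its branch tokens are
    # redeemed and no longer block recycling.
    return tokens - redeemed_tokens
```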

Reviewed By: asaadaldien

Differential Revision: D5193393

fbshipit-source-id: bc2a714877aa1201c32a5ba8ade862865e455711
2017-06-08 09:19:24 -07:00
Peizhao Zhang
87a12dd355 Caught exception when fetching uninitialized blobs when collecting blob sizes in workspace.
Summary: Caught exception when fetching uninitialized blobs while collecting blob sizes in the workspace. Some output blobs (such as the mask output of Dropout when is_test=1) may be nullptr, so FetchBlob will fail on them.
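
A minimal sketch of the pattern, using a hypothetical helper around workspace.FetchBlob (the actual fix lives inside memonger's blob-size collection):

```python
from caffe2.python import workspace

def collect_blob_sizes_safely(blob_names):
    # Some output blobs (e.g. the Dropout mask when is_test=1) are never
    # materialized, so FetchBlob raises for them; skip instead of failing.
    sizes = {}
    for name in blob_names:
        try:
            blob = workspace.FetchBlob(name)
        except Exception:
            continue  # uninitialized / nullptr blob, ignore it
        if hasattr(blob, "nbytes"):
            sizes[name] = blob.nbytes
    return sizes
```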

Differential Revision: D5198641

fbshipit-source-id: 45ee26c4cb1c25cc48904e9f7d7c007224c97418
2017-06-07 15:35:32 -07:00
Aapo Kyrola
da6b82b810 fix another bug related to in-place ops --> treat in-place ops like any other
Summary:
D5116828 changed how in-place ops were handled in memonger and fixed a crash in NeuralMT. However, it still produced incorrect memongerization: an op with one in-place input/output pair but another non-in-place output was still handled incorrectly, because that other output's branch was not followed properly.

This is fixed by removing the in-place op special handling entirely. It is no longer needed; it was left over from an older version of memonger that used a topological sort of the ops.

Reviewed By: asaadaldien

Differential Revision: D5128142

fbshipit-source-id: b551b0faebdde410e6bd7516958c63cf610cc065
2017-05-24 23:32:03 -07:00
Aapo Kyrola
6c511f64cc fix handling of ops with in-place input/output
Summary: Memonger ignores ops whose input and output are in-place, but it did not work correctly when there were also non-in-place inputs, as with Mul. Simple fix to also look at in-placeness during the traversal.
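
A tiny illustration of the in-placeness check the traversal needs (hypothetical helpers operating on an OperatorDef-like op, not the actual memonger code):

```python
def inplace_outputs(op):
    # Outputs that alias one of the op's inputs, e.g. Mul(["X", "Y"], ["X"])
    # writes X in place while still reading Y.
    return set(op.output) & set(op.input)

def non_inplace_inputs(op):
    # Inputs that are not written in place still have to be traversed
    # normally, which is what the old code missed for ops like Mul.
    return set(op.input) - set(op.output)
```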

Reviewed By: jhcross

Differential Revision: D5116828

fbshipit-source-id: 52817f1221597986cc09cc65d094417c1923d965
2017-05-23 18:23:33 -07:00
Aapo Kyrola
f82a510be6 share forward activation blobs + pass unused free blobs down all branches + use shape inference
Summary:
Added optional support for sharing forward activation blobs as well. Making this change revealed a non-optimal implementation in the blob sharing: when reusing free blobs, we need to prefer blobs that are already shared by many other blobs. Otherwise memory usage can increase as the pool of 'free blobs' grows.

Also, my first version only passed "free blobs" (i.e. blobs in the recycling pool) down the first branch when operators forked. Now the blobs that were not used by the first branch are passed down the second branch, and so on.

Also added support for blob size information in the heuristic. This uses the shape inference mechanism.
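
A hypothetical sketch of the selection heuristic described above (invented names, not the actual implementation): prefer a free blob that already has many sharers, and when sizes are known, break ties by closeness in size.

```python
def pick_free_blob(free_blobs, share_counts, blob_sizes=None, needed_size=None):
    # Choose the recycled blob with the most existing sharers; break ties by
    # picking the one whose known size is closest to the size we need.
    if not free_blobs:
        return None

    def score(blob):
        size_penalty = 0
        if blob_sizes is not None and needed_size is not None:
            size_penalty = abs(blob_sizes.get(blob, 0) - needed_size)
        return (-share_counts.get(blob, 0), size_penalty)

    return min(free_blobs, key=score)
```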

I also had to make some small tweaks:
- use the Sum() operator to match the shapes of blobs whose shapes were otherwise unknown. This is related to the Sum() operator that is added to combine multiple incoming gradient inputs (with _autosplit gradients).
- a couple of random shape inference fixes

This reduces Resnet-50 memory usage at batch size 64 from 9.45 GB to 8.5 GB.
At batch size 32, memory usage is 4330 MiB, down from 4800 MB, compared to Torch's 6856 MiB (thanks prigoyal for checking this for me).

This is unfortunately quite a bunch to review...

Reviewed By: asaadaldien

Differential Revision: D4393909

fbshipit-source-id: 9c7c94125f96512bea80463ebcb63c215ef95ff9
2017-04-25 14:23:25 -07:00
Luke Yeager
b7be2016aa Fix typos in memonger.py
Summary:
Found while browsing the code. Cool stuff in here!
Closes https://github.com/caffe2/caffe2/pull/276

Differential Revision: D4911421

Pulled By: Yangqing

fbshipit-source-id: 3bef10a4001a6b4d4527c054519d69131799a0e2
2017-04-18 20:52:41 -07:00
Aapo Kyrola
3c9dfe4736 dag-compatible forward memonger
Summary: Memonger's inference optimization is very efficient, but it does not work if a multi-threaded DAG net is used. So I added this alternative, which shares code with the gradient memonger and does the blob recycling by traversing the DAG while ensuring that blobs do not cross parallel branches.

Reviewed By: viswanathgs

Differential Revision: D4884303

fbshipit-source-id: dfd0a6ecdb91f4edbb0b743729c92f4cd015602e
2017-04-13 22:08:09 -07:00
Peizhao Zhang
cb3bd0ede8 Added a DP + recursion algorithm for finding optimal blob assignments based on blob sizes.
Summary:
Added a DP + recursion algorithm for finding blob assignments based on blob sizes. This algorithm gives optimal assignments. See comments for details.

The algorithm is not used by default; set algo=memonger.AssignmentAlgorithm.DYNAMIC_PROGRAMMING and provide blob_sizes in optimize_interference() to use it. Blob sizes can be retrieved by running the net once and then calling blob_sizes = memonger.collect_blob_sizes(net). If blob_sizes is not provided, all blob sizes are assumed to be 1; in that case, using algo=memonger.AssignmentAlgorithm.GREEDY may be better.
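
Usage might look roughly like this (net and static_blobs stand in for your own net and its externally used blobs; the exact optimize_interference() signature may differ between versions):

```python
from caffe2.python import memonger, workspace

# Run the net once so every blob is materialized, then measure the blobs.
workspace.RunNetOnce(net)
blob_sizes = memonger.collect_blob_sizes(net)

# Request the size-aware optimal assignment instead of the default.
optimized = memonger.optimize_interference(
    net,
    static_blobs,
    blob_sizes=blob_sizes,
    algo=memonger.AssignmentAlgorithm.DYNAMIC_PROGRAMMING,
)
```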

Testing on the segmentation model, memory usage is reduced by 19% (14.96 MB to 12.08 MB) compared with the greedy algorithm (not counting the conv shared buffer). The algorithm runs in 15 s on this model, which has 55 sharable blobs.

Reviewed By: ajtulloch

Differential Revision: D4818476

fbshipit-source-id: 606936f4cf2715408d60b9a5cf3bcaf1985a0fec
2017-04-07 02:18:08 -07:00
Peizhao Zhang
59f464434d Used blob sizes for finding assignments in a greedy way.
Summary: Used blob sizes for finding assignments in a greedy way.
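
A small, hypothetical sketch of the size-aware greedy idea (not the actual implementation): walk the blobs in execution order and reuse the freed slot whose size is closest to what the new blob needs.

```python
def greedy_assignment(live_ranges, blob_sizes):
    # live_ranges: {blob: (first_use, last_use)}; blob_sizes: {blob: bytes}.
    assignment = {}    # blob -> slot id
    slot_free_at = {}  # slot id -> last op index that still uses the slot
    slot_size = {}     # slot id -> bytes currently reserved for the slot
    for blob in sorted(live_ranges, key=lambda b: live_ranges[b][0]):
        start, end = live_ranges[blob]
        size = blob_sizes.get(blob, 1)
        free = [s for s, t in slot_free_at.items() if t < start]
        if free:
            # Best fit: the free slot whose size is closest to this blob's.
            slot = min(free, key=lambda s: abs(slot_size[s] - size))
            slot_size[slot] = max(slot_size[slot], size)
        else:
            slot = len(slot_size)
            slot_size[slot] = size
        assignment[blob] = slot
        slot_free_at[slot] = end
    return assignment
```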

Reviewed By: ajtulloch

Differential Revision: D4818159

fbshipit-source-id: 89180a6117ba5be058e1d2f9488b06d618e91917
2017-04-06 12:36:38 -07:00
Peizhao Zhang
a54000dc6a Added an ordering function to reduce live spans of computed blobs.
Summary:
Added an ordering function (topological_sort_traversal_longest_path()) to reduce the live spans of computed blobs. The idea is to sort the ops by the length of their execution path so that ops on longer paths run first.
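
A hypothetical sketch of the idea, working on op indices rather than the real NetDef (topological_sort_traversal_longest_path() itself operates on the net):

```python
import functools

def longest_path_ordering(op_indices, successors):
    # successors(i) -> indices of the ops that consume op i's outputs.
    @functools.lru_cache(maxsize=None)
    def chain_len(i):
        # Length of the longest dependency chain starting at op i.
        return 1 + max((chain_len(j) for j in successors(i)), default=0)

    # If op b depends on op a, then chain_len(a) > chain_len(b), so sorting by
    # descending chain length is still a valid topological order while putting
    # ops on long paths first and finishing short side branches late.
    return sorted(op_indices, key=chain_len, reverse=True)
```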

Tested on the segmentation model with an on-the-fly decoder: memory usage drops from 21.7 MB to 14 MB (the original size is 33 MB with compressed parameters and not counting the conv buffer) compared with using topological_sort_traversal() as the ordering function.

It is a general ordering function, so I put it in memonger.py directly.

Reviewed By: ajtulloch

Differential Revision: D4790135

fbshipit-source-id: e661b45c1640de44ce1a9fdd009a4fba38f8e042
2017-04-06 12:20:39 -07:00
Aapo Kyrola
02f0c1c9d7 make memonger work with RecurrentNetwork(Gradient)
Summary:
This diff enables memonger support for recurrent networks:
1. Memonger descends into the step-nets and renames the blobs accordingly
2. Memonger tells the gradient op about the renamed blobs by adding a parameter "paramname.renamed=<new name>" (see the sketch after this list)
3. RecurrentNetworkGradientOp applies remapping to links and gradient blobs.
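
A rough sketch of what step 2 could look like at the protobuf level (record_renamed_blob and the blob names below are made up for illustration):

```python
from caffe2.proto import caffe2_pb2

def record_renamed_blob(grad_op, param_name, new_name):
    # Attach a "<param>.renamed" argument so RecurrentNetworkGradientOp can
    # remap its links and gradient blobs to the memonger-renamed names.
    arg = grad_op.arg.add()
    arg.name = param_name + ".renamed"
    arg.s = new_name.encode("utf-8")
    return grad_op

grad_op = caffe2_pb2.OperatorDef()
grad_op.type = "RecurrentNetworkGradient"
record_renamed_blob(grad_op, "hidden_t_prev", "hidden_t_prev__shared")
```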

I first thought of refactoring the recurrent network's entire gradient blob management, but that looks very hard to do without a major revision of the code.

Note: I did not enable memonger for neural_mt, since I think the team should do more testing before enabling it.

Reviewed By: salexspb

Differential Revision: D4812823

fbshipit-source-id: 1ffdf3cfb4fcd00eec5bb0ece3bf416aa6d3e26b
2017-04-05 09:48:25 -07:00
Aaron Markham
58f7f2b441 doxygen python block added
Summary: Closes https://github.com/caffe2/caffe2/pull/226

Differential Revision: D4793550

Pulled By: JoelMarcey

fbshipit-source-id: cc33e58186304fa8dcac2ee9115dcc271d785b1e
2017-03-29 06:46:16 -07:00
Viswanath Sivakumar
9775ffc6ae Fixes to topological sort, canonical blob naming, sharing final blob
Summary: Three small changes:

Reviewed By: ajtulloch

Differential Revision: D4437131

fbshipit-source-id: c849e36e1c4d1dce947076349df863fafe62c66d
2017-01-25 15:14:26 -08:00
Aapo Kyrola
95b3309a87 Gradient Input memory sharing using memonger blob sharing
Summary:
This diff brings us roughly to par with Torch on ResNet memory usage. At batch size 32, Resnet-50 took 7497 MiB before and 5010 MiB after this change. This will thus allow us to handle 64 images per GPU, or 256 images on 4 GPUs.

In addition, I added a special argument to DagNet that makes it run only one thread for the first iteration. This is needed because gradient sharing causes allocations during the first iteration's backward pass, which would otherwise make NCCL deadlock.

Sharing the gradient buffers requires inferring which gradients can share memory (i.e. that they are not used concurrently). The previous memonger code uses a topological sort, but rbgirshick showed that it does not work with tree-like models, so I wrote a new optimization algorithm based on DFS. It takes about 0.25 s per GPU on Resnet-50, so it is clearly fast enough.
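
A very simplified, hypothetical sketch of the DFS idea, assuming a tree-shaped op graph executed in the same depth-first order (the real algorithm also has to keep parallel branches of a DAG net from reusing each other's buffers):

```python
def dfs_share_buffers(root, children, consumers_left):
    # root: the first op; children(op) -> ops fed by op's output;
    # consumers_left: {blob: number of ops that still need to read it}.
    mapping, free_pool, counter = {}, [], [0]

    def visit(op):
        out = op["output"]
        if free_pool:                        # recycle a buffer if one is free
            mapping[out] = free_pool.pop()
        else:                                # otherwise create a fresh buffer
            mapping[out] = "shared_%d" % counter[0]
            counter[0] += 1
        for blob in op["inputs"]:            # release inputs after last use
            if blob in consumers_left:
                consumers_left[blob] -= 1
                if consumers_left[blob] == 0:
                    free_pool.append(mapping.get(blob, blob))
        for child in children(op):
            visit(child)

    visit(root)
    return mapping
```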

Module data_parallel_model supports this feature natively.

Reviewed By: prigoyal

Differential Revision: D4363209

fbshipit-source-id: 73b11e7610438098bb11bff0af8075ab0cf2c0f1
2017-01-09 19:44:23 -08:00
Yangqing Jia
09bed67e4f add untracked files 2016-07-21 11:26:41 -07:00