Summary: Fix a bug reported by dzhulgakov that occurred when an input blob was used twice in the same op: the blob was released to the recycled-blobs pool twice.
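A minimal sketch of the failure mode, not the actual memonger code. The `release_inputs` helper and the list-based pool are hypothetical; the point is only that deduplicating an op's inputs before returning them to the pool avoids a double release.

```python
def release_inputs(op_inputs, free_pool, releasable):
    """Return each releasable input blob to the pool at most once.

    op_inputs may list the same blob twice (e.g. Mul(x, x)); iterating over
    the raw list without the `seen` check would append that blob twice.
    """
    seen = set()
    for blob in op_inputs:
        if blob in releasable and blob not in seen:
            seen.add(blob)
            free_pool.append(blob)
    return free_pool


# "x" appears twice as an input but must enter the pool only once.
pool = release_inputs(["x", "x", "w"], [], releasable={"x"})
```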
Reviewed By: dzhulgakov, volkhin
Differential Revision: D5414023
fbshipit-source-id: 861bb46fe901023cb9a496401736e6ecb77d5fae
Summary:
To be used with predictor "online": C++ version of memonger for simple nets. Very simple greedy algorithm. Works well at least on Resnet-50 inference graph: only 3 shared blobs are used.
Next I will integrate this with predictor and run canary (separate diff).
Reviewed By: asaadaldien
Differential Revision: D5375392
fbshipit-source-id: d36e419e39a32e568e105657c27fb00c85a2535d
Summary: Memonger crashed if a blob was an input to multiple ops. This fixes that and adds a test.
Reviewed By: asaadaldien
Differential Revision: D5374860
fbshipit-source-id: 1d5044001eacdbe6db43f69727da9297558f5c5c
Summary: Let's try this again. Verify graphs every time memonger is run. Will definitely check the running time, though.
Reviewed By: akyrola
Differential Revision: D5308188
fbshipit-source-id: 512a76c759b670d31c49d1d492dd8ee1eaf3bafd
Summary:
compute_interference_graph() was not able to handle the case when a blob is reused twice for operators supporting in-place parameters. For example, for the following network with operators Mul and Sub
(blob) -> [Mul] -> (blob) -> [Sub] -> (blob)
an incorrect edge was added from [Sub] to [Mul], causing nx.is_directed_acyclic_graph() to fail.
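The sketch below illustrates one way such a spurious edge can arise. It is not the code of compute_interference_graph(); the helper names and the (type, inputs, outputs) op tuples are assumptions made for illustration, and acyclicity is checked with a small Kahn's-algorithm routine rather than networkx.

```python
def naive_edges(ops):
    """Connect every writer of a blob to every reader, ignoring op order."""
    edges = set()
    for i, (_, _, outs_i) in enumerate(ops):
        for j, (_, ins_j, _) in enumerate(ops):
            if i != j and set(outs_i) & set(ins_j):
                edges.add((i, j))
    return edges


def ordered_edges(ops):
    """Connect each reader only to the most recent earlier writer."""
    edges, last_writer = set(), {}
    for j, (_, ins, outs) in enumerate(ops):
        for blob in ins:
            if blob in last_writer:
                edges.add((last_writer[blob], j))
        for blob in outs:
            last_writer[blob] = j
    return edges


def is_acyclic(num_nodes, edges):
    """Kahn's algorithm: the graph is a DAG iff every node can be removed."""
    indegree = [0] * num_nodes
    for _, dst in edges:
        indegree[dst] += 1
    ready = [n for n in range(num_nodes) if indegree[n] == 0]
    removed = 0
    while ready:
        node = ready.pop()
        removed += 1
        for src, dst in edges:
            if src == node:
                indegree[dst] -= 1
                if indegree[dst] == 0:
                    ready.append(dst)
    return removed == num_nodes


# (blob) -> [Mul] -> (blob) -> [Sub] -> (blob), both ops running in place.
ops = [("Mul", ["blob", "w"], ["blob"]), ("Sub", ["blob", "b"], ["blob"])]
```

With the order-blind variant, [Sub] writes the same name that [Mul] reads, so a back edge from op 1 to op 0 appears and the graph is no longer a DAG. Tracking only the latest writer keeps the single edge from [Mul] to [Sub].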
Reviewed By: ajtulloch
Differential Revision: D5271604
fbshipit-source-id: f6095b6f8e1dba556ba223a82c8170be7f744529
Summary: Make verify_graph_equality get called by share_grad_blobs and optimize_inference_for_dag
Reviewed By: akyrola
Differential Revision: D5288993
fbshipit-source-id: b9f105ce00148b2673eed2dd390ab74f82f990ad
Summary:
Since D5193393 introduced a "token" system for memonger that prevents sharing of blobs across parallel branches, we can be more aggressive in blob sharing. Thus, this removes the tracking of 'unused free blobs' and just relies on the token system.
For forward-only resnet50, this reduces the number of shared blobs to 5 (optimal according to akirillov's calculation).
This requires careful testing, so I will not land it soon.
Reviewed By: asaadaldien
Differential Revision: D5208985
fbshipit-source-id: 2e520c4ea2351a2ec327b6c5f2e3af24234d1c9a
Summary:
We want to make sure that a graph optimized by memonger doesn't have any possibility of two threads writing into the same output blob at the same time, when blobs are renamed.
Creates a graph where edges are built such that a parent node's output blob is a child node's input blob, and there is no node in between the parent and child node that writes to the same blob. If two nets generate the same such graph, then the "path" of data is the same.
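The comparison described above can be sketched as follows. This is an illustration under assumptions, not the verify_graph_equality implementation: ops are modeled as hypothetical (inputs, outputs) pairs in execution order, and edges are recorded by op index so that renaming a blob does not change the graph by itself.

```python
def data_path_edges(ops):
    """Edges (parent, child) where child reads a blob last written by parent."""
    edges, last_writer = set(), {}
    for child, (inputs, outputs) in enumerate(ops):
        for blob in inputs:
            if blob in last_writer:
                edges.add((last_writer[blob], child))
        for blob in outputs:
            last_writer[blob] = child
    return edges


def same_data_path(ops_a, ops_b):
    """Two nets are considered equivalent if they yield the same edge set."""
    return len(ops_a) == len(ops_b) and data_path_edges(ops_a) == data_path_edges(ops_b)


# op0 produces "a"; op1 and op2 both read "a"; op3 joins their outputs.
original = [(["x"], ["a"]), (["a"], ["b"]), (["a"], ["c"]), (["b", "c"], ["d"])]
# Safe renaming: "d" reuses "a", which no later op reads.
shared_ok = [(["x"], ["a"]), (["a"], ["b"]), (["a"], ["c"]), (["b", "c"], ["a"])]
# Unsafe renaming: "c" reuses "b" while "b" is still needed by op3.
shared_bad = [(["x"], ["a"]), (["a"], ["b"]), (["a"], ["b"]), (["b", "b"], ["d"])]
```

In the unsafe case op2 overwrites the output of op1, so the edge from op1 to op3 disappears and the two graphs differ.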
Reviewed By: akyrola
Differential Revision: D5210385
fbshipit-source-id: 6317fc4e16289339b50c2dcd86ec8b32d2d544a5
Summary: Also fixed a small bug in ModelHelper constructor
Reviewed By: harouwu
Differential Revision: D5246799
fbshipit-source-id: 3719ca078f0e2b5e463fc93da9c8215f5583bd9a
Summary:
This diff fixes various issues with memonger, and works at least with rbgirshick's failure case, Resnet-50, and a new, harder unit test. I will still create a proper resnet50-test.
1) Introduce concept of "tokens". These are passed down the dependency chains, and a blob can be used for recycling only if it owns all the tokens that are currently in possession. Tokens are added when branching, and tokens are redeemed after all inputs are satisfied. A bit hard to explain.
2) Fixed various bugs due to bad code: the free_blobs data structure is of a different type depending on whether we have blob sizes or not. I plan to rewrite this soon.
3) Added a harder unit test that failed before.
4) Added test for resnet50 + memonger
Reviewed By: asaadaldien
Differential Revision: D5193393
fbshipit-source-id: bc2a714877aa1201c32a5ba8ade862865e455711
Summary:
Failure mode:
```
- 7 passing examples, 0 failing examples, 0 invalid examples
- Typical runtimes: 12-14987 ms
- Stopped because settings.timeout=60
```
After this change:
```
- 5 passing examples, 0 failing examples, 0 invalid examples
- Typical runtimes: 12-15475 ms
- Stopped because settings.max_examples=5
```
Obviously, the `DYNAMIC_PROGRAMMING` tests are the troublemakers. An alternate solution would be to make separate tests for the two assignment algorithms (one fast, one slow).
Closes https://github.com/caffe2/caffe2/pull/676
Differential Revision: D5147363
Pulled By: akyrola
fbshipit-source-id: 85d9f8198e53c10de2a8d6645e2b0eb7953c96e0
Summary:
D5116828 changed how in-place ops were handled in memonger and fixed a crash in NeuralMT. However, it still produced incorrect memongerization, because an op with one in-place input-output but another non-inplace output would still be handled incorrectly, as the other output's branch would not be followed properly.
This is fixed by actually removing the whole in-place op special handling. This actually is not needed anymore, it was leftover from an older version of memonger that used topological sort of the ops.
Reviewed By: asaadaldien
Differential Revision: D5128142
fbshipit-source-id: b551b0faebdde410e6bd7516958c63cf610cc065
Summary: Memonger ignores ops with input and output in-place, but did not work correctly if there were also non-inplace inputs, like with Mul. Simple fix to also look at in-placeness during the traversal.
Reviewed By: jhcross
Differential Revision: D5116828
fbshipit-source-id: 52817f1221597986cc09cc65d094417c1923d965
Summary:
Added optional support for using activation blobs for sharing as well. This change revealed a non-optimal implementation in the blob sharing: when reusing free blobs, we need to prefer those that are already shared by many other blobs. Otherwise the memory usage can increase when the pool of 'free blobs' grows.
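A minimal sketch of that preference, assuming a hypothetical `share_count` mapping from each free blob to the number of blobs already mapped onto it. The actual data structures in memonger may differ.

```python
def pick_free_blob(free_blobs, share_count):
    """Pick the free blob that already backs the most other blobs.

    Returns None when the pool is empty, in which case the caller would
    have to allocate a fresh blob.
    """
    if not free_blobs:
        return None
    return max(free_blobs, key=lambda blob: share_count.get(blob, 0))


# "g2" already backs four blobs, so it is reused before "g1" or "g3".
choice = pick_free_blob(["g1", "g2", "g3"], {"g1": 1, "g2": 4})
```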
Also, my first version only passed "free blobs" (i.e. blobs in the recycling pool) down the first branch when operators forked. Now we pass the blobs that were not used by the first branch down the second branch, and so on.
Also added support for blob size information in the heuristic. This uses the shape inference mechanism.
I had to also do some small tweaks:
- use Sum() operator as a way to match shapes of blobs that had otherwise unknown shapes. This is related to the Sum() operator that is added to combine multiple incoming gradient inputs (with _autosplit gradients).
- a couple of random shape inference fixes
This reduces the Resnet-50 memory usage on 64 batch from 9.45 Gig to 8.5 Gig.
For a 32 batch, the memory usage is 4330 MiB, down from 4800 MB, compared to Torch's 6856MiB (thanks prigoyal for checking this for me).
This is unfortunately quite a bunch to review...
Reviewed By: asaadaldien
Differential Revision: D4393909
fbshipit-source-id: 9c7c94125f96512bea80463ebcb63c215ef95ff9
Summary: Memonger's inference optimization is very efficient, but does not work if a multi-threaded DAG net is used. So I added this alternative that shares code with the gradient memonger and does the blob recycling by traversing the DAG and ensuring that blobs are not shared across parallel branches.
Reviewed By: viswanathgs
Differential Revision: D4884303
fbshipit-source-id: dfd0a6ecdb91f4edbb0b743729c92f4cd015602e
Summary:
Added a DP + recursion algorithm for finding blob assignments based on blob sizes. This algorithm gives optimal assignments. See comments for details.
The algorithm is not used by default. To use it, set algo=memonger.AssignmentAlgorithm.DYNAMIC_PROGRAMMING and provide blob_sizes in optimize_interference(). The blob sizes can be retrieved by running the net once and then calling blob_sizes = memonger.collect_blob_sizes(net). All blob sizes are assumed to be 1 if blob_sizes is not provided. In this case, using algo=memonger.AssignmentAlgorithm.GREEDY may be better.
Testing on the segmentation model, the memory usage is reduced by 19% (14.96MB to 12.08MB) compared with the greedy algorithm (without considering the conv share buffer). The algorithm runs in 15s for the model with 55 sharable blobs.
Reviewed By: ajtulloch
Differential Revision: D4818476
fbshipit-source-id: 606936f4cf2715408d60b9a5cf3bcaf1985a0fec
Summary: Used blob sizes for finding assignments in a greedy way.
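One way a size-aware greedy assignment could look, as a hedged sketch only. The `greedy_assign` function, the (name, first_use, last_use, size) tuples, and the closest-size rule are assumptions for illustration and are not taken from memonger.py.

```python
def greedy_assign(blobs):
    """Assign blobs to shared buffers, preferring the closest-sized free one.

    blobs: list of (name, first_use, last_use, size) in any order.
    Returns ({name: buffer_index}, [buffer sizes]).
    """
    buffers = []  # each entry: [size, last step at which the buffer is busy]
    assignment = {}
    for name, first, last, size in sorted(blobs, key=lambda b: b[1]):
        # A buffer is free once its previous occupant is dead before `first`.
        free = [i for i, (_, busy_until) in enumerate(buffers) if busy_until < first]
        if free:
            best = min(free, key=lambda i: abs(buffers[i][0] - size))
            buffers[best][0] = max(buffers[best][0], size)
            buffers[best][1] = last
        else:
            best = len(buffers)
            buffers.append([size, last])
        assignment[name] = best
    return assignment, [b[0] for b in buffers]


blobs = [
    ("a", 0, 1, 100),
    ("b", 1, 2, 10),
    ("c", 2, 3, 90),
    ("d", 3, 4, 12),
    ("e", 5, 6, 95),
]
assignment, sizes = greedy_assign(blobs)
```

When "e" is placed, both buffers are free. Choosing by size keeps it in the large buffer, whereas a size-blind choice of the small buffer would grow that buffer from 12 to 95.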
Reviewed By: ajtulloch
Differential Revision: D4818159
fbshipit-source-id: 89180a6117ba5be058e1d2f9488b06d618e91917
Summary:
Added an ordering function (topological_sort_traversal_longest_path()) to reduce live spans of computed blobs. The idea is to sort the ops based on the length of the execution path so that ops on longer paths are scheduled first.
Tested on the segmentation model with on-the-fly decoder and reduced memory usage from 21.7MB to 14MB (original size is 33MB with compressed parameters and without considering the conv buffer), compared with using topological_sort_traversal() as the ordering function.
It is a general ordering function so I put it in memonger.py directly.
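A rough sketch of the ordering idea, not the body of topological_sort_traversal_longest_path(). The `longest_path_order` name and the edge-list input are assumptions: among the ops whose inputs are ready, the one with the longest remaining path to a sink is emitted first.

```python
from collections import defaultdict
import heapq


def longest_path_order(num_ops, edges):
    """Topological order that schedules ops on longer remaining paths first.

    edges is a list of (parent, child) op indices forming a DAG.
    """
    children = defaultdict(list)
    indegree = [0] * num_ops
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1

    # depth[n] = number of edges on the longest path from n to any sink.
    depth = {}

    def longest(n):
        if n not in depth:
            depth[n] = 1 + max((longest(c) for c in children[n]), default=-1)
        return depth[n]

    # Max-heap on depth (negated); ties are broken by op index.
    ready = [(-longest(n), n) for n in range(num_ops) if indegree[n] == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        _, node = heapq.heappop(ready)
        order.append(node)
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                heapq.heappush(ready, (-longest(child), child))
    return order


# Op 0 feeds a short branch (1) and a long chain (2 -> 3 -> 4); both join in 5.
edges = [(0, 1), (0, 2), (2, 3), (3, 4), (1, 5), (4, 5)]
order = longest_path_order(6, edges)
```

In this example the long chain starting at op 2 is scheduled before the short branch op 1, even though op 1 has the lower index.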
Reviewed By: ajtulloch
Differential Revision: D4790135
fbshipit-source-id: e661b45c1640de44ce1a9fdd009a4fba38f8e042
Summary:
This diff brings us roughly on par with Torch on ResNet memory usage. On batch size 32, Resnet-50 took 7497 MiB; after this change it takes 5010 MiB. This will thus allow us to handle 64 images / GPU, or 256 images / 4 GPUs.
In addition, I added a special argument to DagNet that causes it to run only one thread for the first iteration. This is needed since there are allocations on the first iteration's backward pass due to gradient sharing, and this will cause NCCL to deadlock.
The sharing of gradient buffers requires inferring which gradients can share memory (i.e. that they are not used concurrently). The previous memonger code used topological sort, but rbgirshick showed that it does not work with tree-like models. Thus, I wrote a new optimization algorithm based on DFS. It takes about 0.25 secs / GPU on resnet-50, so it is clearly fast enough.
Module data_parallel_model supports this feature natively.
Reviewed By: prigoyal
Differential Revision: D4363209
fbshipit-source-id: 73b11e7610438098bb11bff0af8075ab0cf2c0f1