Summary:
compute_interference_graph() was not able to handle the case when a blob is reused twice for operators supporting in-place parameters. For example, for the following network with operators Mul and Sub
(blob) -> [Mul] -> (blob) -> [Sub] -> (blob)
an incorrect edge would be added from [Sub] to [Mul], causing nx.is_directed_acyclic_graph() to fail.
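A minimal sketch of the failure mode, in plain Python rather than the actual memonger code. The `Op` tuple, `build_edges`, and `is_acyclic` are hypothetical stand-ins (the last one for nx.is_directed_acyclic_graph()), and restricting edges to earlier-to-later operators is an assumed fix used only for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for an operator: only the blob names matter here.
Op = namedtuple("Op", ["type", "inputs", "outputs"])


def build_edges(ops, forward_only):
    """Add an edge i -> j when op j reads a blob that op i writes."""
    edges = set()
    for i, parent in enumerate(ops):
        for j, child in enumerate(ops):
            # Assumed fix: never let a later op become the parent of an earlier one.
            if i == j or (forward_only and i > j):
                continue
            if set(parent.outputs) & set(child.inputs):
                edges.add((i, j))
    return edges


def is_acyclic(num_nodes, edges):
    """Kahn's algorithm: the graph is a DAG iff every node can be removed."""
    indegree = [0] * num_nodes
    for _, dst in edges:
        indegree[dst] += 1
    ready = [n for n in range(num_nodes) if indegree[n] == 0]
    removed = 0
    while ready:
        node = ready.pop()
        removed += 1
        for src, dst in edges:
            if src == node:
                indegree[dst] -= 1
                if indegree[dst] == 0:
                    ready.append(dst)
    return removed == num_nodes


# (blob) -> [Mul] -> (blob) -> [Sub] -> (blob), both operators in-place.
ops = [Op("Mul", ["blob"], ["blob"]), Op("Sub", ["blob"], ["blob"])]

naive_edges = build_edges(ops, forward_only=False)  # includes Sub -> Mul
fixed_edges = build_edges(ops, forward_only=True)   # only Mul -> Sub
```

Because both operators read and write the same blob name, matching outputs to inputs by name alone produces edges in both directions, which is the cycle the acyclicity check rejects.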
Reviewed By: ajtulloch
Differential Revision: D5271604
fbshipit-source-id: f6095b6f8e1dba556ba223a82c8170be7f744529
Summary: Make verify_graph_equality get called by share_grad_blobs and optimize_inference_for_dag
Reviewed By: akyrola
Differential Revision: D5288993
fbshipit-source-id: b9f105ce00148b2673eed2dd390ab74f82f990ad
Summary:
kmatzen why did you set the stepsize in ff84e7dea6?
The test is flaky before this change. Solid afterwards.
Closes https://github.com/caffe2/caffe2/pull/841
Differential Revision: D5292112
Pulled By: akyrola
fbshipit-source-id: c84715261194ff047606d4ec659b7f89dac3cbb1
Summary:
/cc akyrola is it possible this test has been broken ever since 5614816fce?
More generally, why do we still have `hypothesis_test.py` at all? In the case of this test, surely one of these files does more than this one old test:
* `operator_test/cudnn_recurrent_test.py`
* `operator_test/recurrent_network_test.py`
* `operator_test/rnn_cell_test.py`
Closes https://github.com/caffe2/caffe2/pull/843
Differential Revision: D5292109
Pulled By: akyrola
fbshipit-source-id: 6df5df6353a9741d1ae1b796adaab98382857527
Summary:
Funnily, the biggest issue when trying to increase the number of trainers from 5 to 20 is not model convergence (it is worse but still converges without tuning); it is the initialization time: it took around 30 min to generate the job.
After this diff, job creation time for the standard 5-7 setup goes from 125s to 8s (a 15x speedup).
Another improvement is that the output of `net_printer.to_string(job)` becomes less complex.
This makes the startup for 20 trainers go to 32s, which is still not ideal.
Next step will be to allow passing num_instances to Node as well. This way we'll be able to create only one reader and one trainer prototype and let the framework take care of the scheduling. For this one we will need to move some DataStream and PS initialization code to C++ first. (c.c. aartibasant)
Reviewed By: dzhulgakov
Differential Revision: D5100788
fbshipit-source-id: 7b76bce108f527a96b2bfe7ed43a22ea8679b682
Summary:
CPU version of the data parallel model. The great thing is that we can now run data_parallel_model_test in Sandcastle (which does not have GPUs).
Pretty simple change, really. I did not change all variable names with "gpu" in them, to reduce risk (and being a bit lazy). Can improve later.
Reviewed By: wesolwsk
Differential Revision: D5277350
fbshipit-source-id: 682e0c5f9f4ce94a8f5bd089905b0f8268bd2210
Summary:
Advantages of cloning the tasks/execution_steps at runtime:
- Less complexity on the python side: no need to clone nets and add prefixes to blob names
- Faster start-up: we had cases of complex plans that took up to 30min to be created.
- Better isolation: each task cloned at runtime has its own child workspace, preventing false sharing of blobs.
- Opens up possibility for dynamic scheduling: Number of threads per task can be increased on the fly, at runtime.
Reviewed By: dzhulgakov
Differential Revision: D5100730
fbshipit-source-id: 71b83193b135da4e6eaf2536d8fc266528e1fdcc
Summary:
- Incorporated dropout layer to the sparseNN training and testing pipeline
- Integrated an advanced model options feature on Flow UI for users to specify dropout rate
- Created an end-to-end unit test to build and run a model with dropout
Reviewed By: chocjy
Differential Revision: D5273478
fbshipit-source-id: f7ae7bf4de1172b6e320f5933eaaebca3fd8749e
Summary:
Given the parameter init_params=False, the weight blob (*_w) and bias blob (*_b) should be suppressed in model.param_init_net. Without this fix, init_params=False doesn't take effect in brew.conv as it does in brew.fc or other ops. This issue is the root cause of #790 [https://github.com/caffe2/caffe2/pull/790].
Closes https://github.com/caffe2/caffe2/pull/824
Reviewed By: harouwu
Differential Revision: D5276676
Pulled By: akyrola
fbshipit-source-id: 8f7088a8e1976658f67e027223e555375b3a2392
Summary:
Since D5193393 introduced a "token" system for memonger that prevents sharing of blobs across parallel branches, we can be more aggressive in blob sharing. Thus, this removes the tracking of 'unused free blobs' and just relies on the token system.
For forward-only resnet50, this reduces the number of shared blobs to 5 (optimal according to akirillov's calculation).
This requires careful testing, so I will not land it soon.
Reviewed By: asaadaldien
Differential Revision: D5208985
fbshipit-source-id: 2e520c4ea2351a2ec327b6c5f2e3af24234d1c9a
Summary: As title. Pretty straightforward. Could actually run each kernel in parallel, but we can optimize later if needed.
Reviewed By: Yangqing
Differential Revision: D5278415
fbshipit-source-id: 29f59afe28f37fc4152ec7eb7cd6c1ab65f2cb8c
Summary:
A few issues:
1. Randomization hurts memoization.
2. Even if we make it non-random, we can get key collisions when loading it back.
3. RNNs use prototxt for the step net, and apparently it's not forward compatible like normal protobuf is.
I am thinking of a better, less invasive solution now.
Reviewed By: jamesr66a
Differential Revision: D5272118
fbshipit-source-id: ab577fad04fbfc632e1fceffa923377a0d3da1be
Summary: Ran into it while working on a dper benchmark. Apparently it works harmlessly even with empty tensors.
Reviewed By: akyrola
Differential Revision: D5273672
fbshipit-source-id: a968ae03a659d6c1a215f12cc35f7ba68448e833
Summary:
Working towards https://github.com/caffe2/caffe2/pull/817.
`E InvalidArgument: Insufficient bytes of entropy to draw requested array. shape=(4, 2, 5, 1, 3, 5, 5, 1), dtype=float32. Can you reduce the size or dimensions of the array? What about using a smaller dtype? If slow test runs and minimisation are acceptable, you could increase settings().buffer_size from 8192 to at least 24576000.`
https://travis-ci.org/caffe2/caffe2/jobs/243867951
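As a quick arithmetic check on the message above (an observation about these particular numbers, not a statement about how hypothesis sizes its buffer internally): the requested shape holds 3000 elements, and the suggested `buffer_size` of 24576000 is exactly 3000 times the current 8192. The helper name `num_elements` is hypothetical.

```python
from functools import reduce
import operator


def num_elements(shape):
    """Total number of array elements for a given shape."""
    return reduce(operator.mul, shape, 1)


# Values copied from the error message.
shape = (4, 2, 5, 1, 3, 5, 5, 1)
current_buffer_size = 8192
suggested_buffer_size = 24576000

count = num_elements(shape)
ratio = suggested_buffer_size // current_buffer_size
```

Shrinking any single dimension reduces the element count proportionally, which is one way to follow the message's advice to reduce the size of the array.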
Closes https://github.com/caffe2/caffe2/pull/828
Differential Revision: D5276723
Pulled By: akyrola
fbshipit-source-id: f7d0e2dd8ef8b6a2354bd4ff7c7446c377c954b4
Summary:
Working towards https://github.com/caffe2/caffe2/pull/817.
`E InvalidArgument: Insufficient bytes of entropy to draw requested array. shape=(20, 12, 22), dtype=float32. Can you reduce the size or dimensions of the array? What about using a smaller dtype? If slow test runs and minimisation are acceptable, you could increase settings().buffer_size from 8192 to at least 43253760.`
https://travis-ci.org/caffe2/caffe2/jobs/243867951
/cc kittipatv
Closes https://github.com/caffe2/caffe2/pull/830
Differential Revision: D5276639
Pulled By: akyrola
fbshipit-source-id: 0c21be25ecd931837dc8b0c2cc17048f531350d1
Summary:
We want to make sure that a graph optimized by memonger doesn't have any possibility of two threads writing into the same output blob at the same time, when blobs are renamed.
This creates a graph whose edges are built such that a parent node's output blob is a child node's input blob, and there is no node in between the parent and child node that writes to the same blob. If two nets generate the same such graph, then the "path" of data is the same.
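The graph construction described above can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not the code added by this diff: ops are modeled as hypothetical `(inputs, outputs)` pairs executed in list order, and `data_path_edges` is a made-up name.

```python
def data_path_edges(ops):
    """Return edges (parent_idx, child_idx) linking each read of a blob
    to the most recent op that wrote it."""
    last_writer = {}
    edges = set()
    for idx, (inputs, outputs) in enumerate(ops):
        for blob in inputs:
            if blob in last_writer:
                edges.add((last_writer[blob], idx))
        for blob in outputs:
            last_writer[blob] = idx
    return edges


original = [
    (["data"], ["a"]),    # op 0
    (["a"], ["b"]),       # op 1
    (["a"], ["c"]),       # op 2
    (["b", "c"], ["d"]),  # op 3
]

# Safe renaming: the final output reuses the name "a", which nothing reads later.
safe = [
    (["data"], ["a"]),
    (["a"], ["b"]),
    (["a"], ["c"]),
    (["b", "c"], ["a"]),
]

# Unsafe renaming: "c" becomes "b", so op 2 overwrites op 1's output.
unsafe = [
    (["data"], ["a"]),
    (["a"], ["b"]),
    (["a"], ["b"]),
    (["b", "b"], ["d"]),
]
```

In this toy case the safe renaming yields the same edge set as the original net, while the unsafe one loses the edge from op 1 to op 3, so the comparison flags it.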
Reviewed By: akyrola
Differential Revision: D5210385
fbshipit-source-id: 6317fc4e16289339b50c2dcd86ec8b32d2d544a5
Summary:
This is a real implementation (not GPUFallbackOp) of the TopKOp for GPU.
There are two algorithm implementations:
- For k <= 512, it maps to a warp-wide min-heap implementation, which requires only a single scan of the input data.
- For k > 512, it maps to a multi-pass radix selection algorithm that I originally wrote in cutorch. I took the recent cutorch code and removed some cutorch-specific things as it made sense.
Also added several utility files that one or the other implementations use, some from the Faiss library and some from the cutorch library.
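The single-scan min-heap idea used for small k can be illustrated on the CPU with Python's `heapq`. This is only a sequential analogue for intuition, not the warp-wide CUDA implementation; the function name `top_k_single_scan` is hypothetical.

```python
import heapq


def top_k_single_scan(values, k):
    """Keep the k largest (value, index) pairs seen so far in a min-heap,
    visiting each input element exactly once."""
    heap = []
    for index, value in enumerate(values):
        if len(heap) < k:
            heapq.heappush(heap, (value, index))
        elif value > heap[0][0]:
            # The heap root is the smallest kept value; replace it.
            heapq.heapreplace(heap, (value, index))
    # Largest value first; ties broken by lower index.
    return sorted(heap, key=lambda pair: (-pair[0], pair[1]))


result = top_k_single_scan([3, 1, 4, 1, 5, 9, 2, 6], k=3)
```

The heap never grows beyond k entries, so each input element costs at most one O(log k) heap update.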
Reviewed By: jamesr66a
Differential Revision: D5248206
fbshipit-source-id: ae5fa3451473264293516c2838f1f40688781cf3
Summary: The old version used one block with 128 threads. Throughput was too low for the NMT use case (calculating squared gradient norms for every parameter), so this increases the throughput. Shaves 7% off CNN model training time per step
Reviewed By: wickedfoo
Differential Revision: D5263748
fbshipit-source-id: adc3bacd11e49ea00c60381d613d993050e899be
Summary:
While this is not intended to be the most performant or
general solution, the test plan shows that in some cases static DAG RNN can
perform better than our own implementation. Hopefully we will get
dynamic RNN DAG execution at least as fast as this one. Then we will
not need this one in production, only for testing.
Still putting it into our benchmark for comparison purposes.
Reviewed By: akyrola
Differential Revision: D5210038
fbshipit-source-id: fa44baf51c455872abd6ec5f5d151cf06e15b1fa
Summary: I accidentally noticed that we were calling the non-CUDNN version of Transpose with attention, and it is super slow. This broke when rnn_cell was changed to use ModelHelper instead of CNNModelHelper in D5062963, but calls to transpose were not "brewed".
Reviewed By: jamesr66a
Differential Revision: D5264248
fbshipit-source-id: b61494ae210f34597245f1195d20547f5b5cd8b5
Summary: Don't want to assert since it can be useful to sometimes create models that are not run (for example, unit tests).
Reviewed By: pietern
Differential Revision: D5258905
fbshipit-source-id: f1beee0605bfef235ed0f23f7e78259109720254
Summary: This makes it easier to gather top-K by group of rows. This is useful when we want to pick the top-K from a batch of fixed-length sessions. Let `N` be the number of sessions and `M` the number of examples in a session. We would have a batch of `N * M` rows. We can reshape the score blob to `N x M` and use it as input to `TopK` to select the top scores for each session. However, without the new output, it would be inconvenient to gather the rows corresponding to the top scores, because the existing indices are relative to each row, in the `[0, M - 1]` range. The new output can be used directly as input to `Gather`.
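A minimal sketch of the idea in plain Python, assuming the new output holds indices into the flattened `N * M` batch. The function name `top_k_with_flat_indices` and the sample data are hypothetical, and this is not the operator's implementation.

```python
def top_k_with_flat_indices(scores, k):
    """scores: N rows of M floats.
    Returns (row-relative indices, indices into the flattened N * M batch)."""
    num_cols = len(scores[0])
    row_indices, flat_indices = [], []
    for row, values in enumerate(scores):
        # Columns of this row, best score first, truncated to k.
        order = sorted(range(num_cols), key=lambda col: -values[col])[:k]
        row_indices.append(order)
        flat_indices.extend(row * num_cols + col for col in order)
    return row_indices, flat_indices


# N = 2 sessions, M = 3 examples per session.
scores = [[0.1, 0.9, 0.5], [0.7, 0.2, 0.3]]
examples = ["s0e0", "s0e1", "s0e2", "s1e0", "s1e1", "s1e2"]  # N * M rows

row_indices, flat_indices = top_k_with_flat_indices(scores, k=2)
gathered = [examples[i] for i in flat_indices]  # plays the role of Gather
```

The row-relative indices alone would need the `row * M` offset added before they could address the original batch, which is the step the flattened indices make unnecessary.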
Reviewed By: chocjy
Differential Revision: D5171459
fbshipit-source-id: 69f7b41456c3f9670650ae07afc8fef8328485e9
Summary:
The global StatRegistry doesn't get reset when the workspace is reset.
```
> self.assertTrue(len(workspace.FetchBlob('k3')) == 2)
E AssertionError: False is not true
```
https://travis-ci.org/lukeyeager/caffe2/jobs/240162665
/cc azzolini
NOTE: this error doesn't show up if you just run `stats_ops_test.py` directly. It shows up when you run other tests in the same session before this test:
```
pytest -v caffe2/python/
```
Closes https://github.com/caffe2/caffe2/pull/788
Differential Revision: D5259232
Pulled By: salexspb
fbshipit-source-id: 3c72633af6bb61c4fda62195298b1e9574b4cbef
Summary: Upgrades this file to use brew instead of CNNModelHelper.
Reviewed By: harouwu
Differential Revision: D5252089
fbshipit-source-id: 6df4350717c1d42bc4bcc63d255cd422f085ee05
Summary: Implementation of the SliceOp for CUDA
Reviewed By: akyrola
Differential Revision: D5254287
fbshipit-source-id: 0a1660e1aa161fd088a2d8f886e019c05a1919a2
Summary:
```
File "/data/caffe2/install/caffe2/python/hypothesis_test.py", line 1911, in test_batch_to_space
(w + 2 * pad) / block_size).astype(np.float32)
File "mtrand.pyx", line 1404, in mtrand.RandomState.randn (numpy/random/mtrand/mtrand.c:19843)
File "mtrand.pyx", line 1534, in mtrand.RandomState.standard_normal (numpy/random/mtrand/mtrand.c:20368)
File "mtrand.pyx", line 167, in mtrand.cont0_array (numpy/random/mtrand/mtrand.c:6127)
TypeError: 'float' object cannot be interpreted as an index
```
```
File "/data/caffe2/install/caffe2/python/operator_test/tile_op_test.py", line 101, in tile_ref
tiled_data = np.tile(X, tuple(dims))
File "/data/caffe2/venv/local/lib/python2.7/site-packages/numpy/lib/shape_base.py", line 881, in tile
return c.reshape(shape_out)
TypeError: only integer scalar arrays can be converted to a scalar index
```
I also tested to make sure this still works with 0.11.
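Both tracebacks come down to a float reaching a place where an integer index is expected. The sketch below reproduces that distinction with the standard library only. The values of `w`, `pad`, and `block_size` are hypothetical, and integer division is assumed here as one plausible fix, since the tracebacks do not show how the tests were changed.

```python
import operator

# Hypothetical values, chosen only to make the arithmetic concrete.
w, pad, block_size = 4, 1, 2

true_div = (w + 2 * pad) / block_size    # 3.0 under true division: a float
floor_div = (w + 2 * pad) // block_size  # 3: an int


def usable_as_index(value):
    """True if the value supports the integer index protocol (__index__)."""
    try:
        operator.index(value)
        return True
    except TypeError:
        return False
```

A float such as `3.0` fails this check even though it is numerically a whole number, so a shape argument has to be converted to an integer explicitly.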
Closes https://github.com/caffe2/caffe2/pull/787
Differential Revision: D5248087
Pulled By: salexspb
fbshipit-source-id: eff69482a8eabb8ace330003fa326c832b53865f
Summary: Replace the deprecated CNNModelHelper in python/workspace_test.py with ModelHelper.
Reviewed By: harouwu
Differential Revision: D5251778
fbshipit-source-id: d634f1c76e41a95b0247ebf5d5a48aef6f8e232e
Summary:
This diff deprecates `CNNModelHelper` in the `AlexNet()` function. More diffs will be coming to deprecate the helper in other functions.
Depends on D5241738
Reviewed By: harouwu
Differential Revision: D5247004
fbshipit-source-id: eec5c5ef916a85de8289cb92d2174a6a4b8075bf
Summary: Hard-to-debug problems arise when a gradient creator fails because the forward op itself is incorrect. Add schema checking before calling the creator. Also clarify the error messages.
Reviewed By: Yangqing
Differential Revision: D5256016
fbshipit-source-id: 78550f7e2ce5b88e26b69fdae4be0eece52edfea
Summary:
The current version of schema.py has a Metadata class with three fields. The default for it is set to
four Nones. This is just changing that to three Nones so that the number of default values matches the number
of actual fields.
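One way to keep the defaults from drifting out of sync with the fields is to derive their count from the field list. This is a sketch of that idiom with a hypothetical `Meta` tuple and made-up field names, not the code in schema.py.

```python
from collections import namedtuple

# Hypothetical three-field record; the field names are placeholders.
Meta = namedtuple("Meta", ["limit", "expected", "specs"])

# One None per field, computed from the field list instead of hard-coded.
Meta.__new__.__defaults__ = (None,) * len(Meta._fields)

empty = Meta()
partial = Meta(limit=10)
```

With this form, adding or removing a field changes the number of defaults automatically.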
Reviewed By: kennyhorror
Differential Revision: D5250463
fbshipit-source-id: 42e5650d270f5f63662614d8445b4819ed370dec