Summary:
This diff adds dependency-aware concurrent/parallel execution of operators in stepnets. For CPU, we use multi-threaded execution. For CUDA, we use multiple streams and CUDA events for parallelism and dependency tracking.
Much of the diff is about computing the dependency graph, which was quite tricky because we also need to avoid write races between operators running in different timesteps in parallel. Also, recurrent blobs "change name" when passing over a timestep (the "_prev" suffix), so that needs to be handled as well.
This diff also restores the link-ops that I unlanded earlier.
The performance gain of this diff is very good for CPU (same perf as with static_dag, even better on forward-only). On CUDA, the gains are modest, at least with the sizes I was testing with.
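For illustration, here is a minimal sketch of the kind of dependency computation involved. This is not the diff's actual code: the op representation and the canonicalize() helper standing in for the "_prev" renaming are illustrative, and the CUDA stream/event side is not shown.

```python
# An op must wait for the last writer of any blob it touches, and a
# writer must also wait for earlier readers, so that operators from
# different timesteps running in parallel cannot race on writes.
def build_dependencies(ops):
    def canonicalize(blob):
        # "h_prev" in one timestep aliases "h" from the previous one;
        # map both to one name so the race is visible to the graph.
        return blob[:-5] if blob.endswith("_prev") else blob

    last_writer = {}  # blob -> index of the op that last wrote it
    readers = {}      # blob -> ops that read it since that write
    deps = [set() for _ in ops]
    for i, op in enumerate(ops):
        for blob in map(canonicalize, op["inputs"]):
            if blob in last_writer:
                deps[i].add(last_writer[blob])   # read-after-write
            readers.setdefault(blob, set()).add(i)
        for blob in map(canonicalize, op["outputs"]):
            if blob in last_writer:
                deps[i].add(last_writer[blob])   # write-after-write
            deps[i] |= readers.pop(blob, set())  # write-after-read
            last_writer[blob] = i
        deps[i].discard(i)
    return deps
```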
Reviewed By: salexspb
Differential Revision: D5001637
fbshipit-source-id: 3d0a71593d73a9ff22f4c1a5c9abf2a4a0c633c8
Summary:
The hive reader checkpoints are broken because of D5582328.
This breaks our offline simulator test as well.
This is a temporary fix that disables the checkpoints for readers.
Reviewed By: azzolini
Differential Revision: D5637719
fbshipit-source-id: 4f31ae534cb7e981fcacbb721cbb2420249fad91
Summary:
After this, we should have tests going back to all green.
Closes https://github.com/caffe2/caffe2/pull/1058
Reviewed By: harouwu
Differential Revision: D5637495
Pulled By: Yangqing
fbshipit-source-id: ac3ab5a27bc56e3bb08fa81aa8ed186cb7e8832b
Summary:
Adds a benchmark comparing two methods used to generate positional embeddings,
table-based and sinusoid (as in the Transformer paper).
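A hedged numpy sketch of the two methods under comparison (names and shapes are illustrative, not the benchmark's actual code):

```python
import numpy as np

def table_embeddings(max_len, dim):
    # Table-based: a learned lookup table, here just randomly filled.
    return np.random.randn(max_len, dim).astype(np.float32)

def sinusoid_embeddings(max_len, dim):
    # Sinusoid, per the Transformer paper (arXiv:1706.03762):
    #   PE(pos, 2i)   = sin(pos / 10000^(2i/dim))
    #   PE(pos, 2i+1) = cos(pos / 10000^(2i/dim))
    pos = np.arange(max_len, dtype=np.float32)[:, None]
    i = np.arange(dim)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / np.float32(dim))
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))
```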
Reviewed By: jamesr66a
Differential Revision: D5625633
fbshipit-source-id: faee2d20ea0c3d9c41479c5114fa010ac49fab24
Summary:
Here is my example:
For a static RNN, the timestep blob is created as part of param_init_net. Previously, DPM assumed it was a CUDA blob by default, so it participated in broadcasting, causing the Copy on line 798 to fail. No device mapping is correct for this blob.
Reviewed By: akyrola
Differential Revision: D5631716
fbshipit-source-id: 28c3eb17ecc3080c95c41d69a60bf7262d3907d4
Summary:
Memonger had a subtle bug which caused it to recycle the "splitinfo" outputs of Concat/Split. That is bad since they live on the CPU device, and it would cause them to be reallocated. This caused a big slowdown with Kaiming's trainer.
The bug was that we checked for gradients by looking for "_grad" anywhere in the name, although we should only allow it as a suffix. Admittedly, string checking is not elegant in the first place, but that is how Caffe2 works now.
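The gist of the fix, sketched (the blob name below is an illustrative example, not taken from the diff):

```python
name = "concat_grad_split_info"  # a splitinfo blob, not a gradient

is_grad_old = "_grad" in name          # True: wrongly recycled
is_grad_new = name.endswith("_grad")   # False: left alone
```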
Reviewed By: asaadaldien
Differential Revision: D5627251
fbshipit-source-id: c12be2323109bf81c3725d8884c7ef024e010bd5
Summary: Use the new SequenceMask op to mask out invalid positions in the attention mechanism rather than using PackSegments and UnpackSegments. This should help us on several fronts, including elision of host<>device copies and using fewer intermediate blobs.
Differential Revision: D5619156
fbshipit-source-id: e59c644236cee02f853d8743f9a938fb10adc73b
Summary:
Implement forward pass for a SequenceMaskOp to replace https://github.com/caffe2/caffe2/blob/master/caffe2/python/attention.py#L54-L72.
This implements two modes: a sequence-length based mode and a matrix triangle mode.
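A minimal numpy sketch of the two modes (parameter names and the fill value are illustrative; see the op's schema for the real interface):

```python
import numpy as np

def sequence_length_mask(x, lengths, fill=-1e9):
    # Mask out positions at or beyond each row's sequence length.
    cols = np.arange(x.shape[1])[None, :]
    return np.where(cols < np.asarray(lengths)[:, None], x, fill)

def upper_triangle_mask(x, fill=-1e9):
    # Matrix triangle mode: keep the lower triangle (e.g. to prevent
    # attention to future positions), mask the strict upper triangle.
    rows = np.arange(x.shape[0])[:, None]
    cols = np.arange(x.shape[1])[None, :]
    return np.where(cols <= rows, x, fill)
```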
Reviewed By: akyrola
Differential Revision: D5615493
fbshipit-source-id: a2ce4a8e655d9b720049010a7856be052c5567eb
Summary:
The LocalSession does not work with multi-node definitions, and the test becomes flaky because of that. The fix is to create a different LocalSession for each Node() and run the nodes sequentially.
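A minimal, self-contained sketch of the fix's shape; the real code uses caffe2.python.session.LocalSession and Node(), so everything below is an illustrative stand-in:

```python
class ToySession:
    def run(self, work):
        work()

node_work = {
    "trainer:0": lambda: print("running trainer:0 tasks"),
    "trainer:1": lambda: print("running trainer:1 tasks"),
}

# One fresh session per node, executed sequentially, instead of a
# single session trying to span the multi-node definition.
for node in sorted(node_work):
    ToySession().run(node_work[node])
```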
Differential Revision: D5617857
fbshipit-source-id: a8079a90291b4c8b5aa6b471c33c06d18e59976c
Summary:
1. Adds one more step in the JobRunner class to upload checkpoints.
2. Adds one function to return the name of the checkpoint given
the name of the node.
Reviewed By: andrewwdye
Differential Revision: D5597130
fbshipit-source-id: 570a55785e6227859e1115326d6cab077f0e7f72
Summary: Added Nesterov momentum as an option for BMUF and corresponding tests
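For reference, one common formulation of the Nesterov momentum step, as a minimal numpy sketch; how it is wired into BMUF's block-level update is not shown, and the names are illustrative:

```python
import numpy as np

def nesterov_step(w, v, grad, lr=0.1, mu=0.9):
    v = mu * v + grad             # momentum buffer
    w = w - lr * (grad + mu * v)  # look-ahead: step uses mu * v_new
    return w, v
```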
Reviewed By: asaadaldien
Differential Revision: D5599888
fbshipit-source-id: 30819c9e689347c8b75daddc7444bea9f54193ae
Summary:
Add support for TensorCore convolution and gemm on Volta hardware.
Currently built on top of #1055
Closes https://github.com/caffe2/caffe2/pull/1056
Differential Revision: D5604068
Pulled By: Yangqing
fbshipit-source-id: 100f67e26ed5fabb1dbb31dcd77f7ecb84de4ee7
Summary: Guard reservoir sampling with a mutex and fix the bug in counting the number of new entries.
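A minimal sketch of guarded reservoir sampling (classic algorithm R; illustrative Python, not the op's C++): the counter of seen entries and the reservoir are updated under one lock, so concurrent writers cannot miscount new entries.

```python
import random
import threading

class Reservoir:
    def __init__(self, capacity):
        self.capacity = capacity
        self.seen = 0
        self.items = []
        self.lock = threading.Lock()

    def add(self, item):
        with self.lock:
            self.seen += 1
            if len(self.items) < self.capacity:
                self.items.append(item)
            else:
                # Keep each seen item with probability capacity/seen.
                j = random.randrange(self.seen)
                if j < self.capacity:
                    self.items[j] = item
```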
Reviewed By: chocjy
Differential Revision: D5503300
fbshipit-source-id: fd6b0bacb71fbab99d6d5df2c72da523fba02847
Summary: Adding the option to dedup by object ID so that more frequent objects are not present more than once in the reservoir
Reviewed By: chocjy
Differential Revision: D5503109
fbshipit-source-id: e36c3ad8eea134d6c10a4c875fceadc0f843c976
Summary: Make the candidate pool less localized
Reviewed By: chocjy
Differential Revision: D5453289
fbshipit-source-id: 848cb7551d7112f6f47f2cf647bb0daca6eff341
Summary: Instead of printing the exception using print(), use traceback.print_exc(). This way you get a stack trace.
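A tiny self-contained example of the difference (the failing function is a stand-in):

```python
import traceback

def risky_operation():
    raise ValueError("boom")  # stand-in for code that may raise

try:
    risky_operation()
except Exception:
    # print(e) would show only the message; print_exc() also shows
    # the stack trace pointing at where the exception was raised.
    traceback.print_exc()
```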
Reviewed By: jay-mahadeokar
Differential Revision: D5604642
fbshipit-source-id: f8cb67e554305cd2fbed384a4a2040fa2b16e7c0
Summary: Make the command-line arguments pertaining to model architecture the same between train.py and translate.py. Also use the s() scoping function for all intermediate blobs in attention.py (this is for compatibility with multi-headed attention).
Differential Revision: D5594312
fbshipit-source-id: cadf51d854b5a9174ec913f32c655be2abf111e5
Summary: In order to control the absolute scale/magnitude of the output of this op, added a tuning parameter: amplitude.
Reviewed By: jamesr66a
Differential Revision: D5596574
fbshipit-source-id: 3b7e316de55cce6fd686da70aa5658ec3e99b070
Summary: GRU differs from LSTM in that it only has hidden states, but no cell states. So reusing the code of _LSTM is problematic: we need to delete the part that creates the cell state, and change many other places that use a hard-coded 4 (hidden_all, hidden, cell_all, cell) into 2 (hidden_all, hidden). Otherwise GRU breaks during the backward pass, when the optimizer tries to apply gradients to each of the parameters, because the cell state is never used, so there are no gradients for the corresponding parameters (i.e., cell_state_w, cell_state_b).
Differential Revision: D5589309
fbshipit-source-id: f5af67dfe0842acd68223f6da3e96a81639e8049
Summary:
The model downloader was broken after the move on S3 to the vanity URL, download.caffe2.ai. Using this as the URL base hits a redirect and results in the script throwing a 403 error. Rather than upgrading to urllib2 or putting in a bunch of code to handle a redirect in urllib, we can just use the non-vanity base URL.
Closes https://github.com/caffe2/caffe2/pull/1020
Reviewed By: Yangqing
Differential Revision: D5568686
Pulled By: aaronmarkham
fbshipit-source-id: d88a6b3e1b7955835fc03b036dc54dec48316e7f
Summary: As promised, a separate diff for the DPM changes I made in experimental code.
Reviewed By: pietern
Differential Revision: D5551304
fbshipit-source-id: 9013aeab6c388b1c415ffb2e36fb8dd6b8cf90b0
Summary: This diff implements the CUDA version of the OneHot operator.
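A hedged usage sketch (CPU shown for portability; with a CUDA DeviceOption the new implementation runs on GPU). The two-input layout, indices plus a one-element index-size tensor, is from memory of the op schema; treat the exact input names and dtypes as assumptions.

```python
import numpy as np
from caffe2.python import core, workspace

workspace.FeedBlob("indices", np.array([0, 2, 1], dtype=np.int64))
workspace.FeedBlob("index_size", np.array([3], dtype=np.int64))
workspace.RunOperatorOnce(
    core.CreateOperator("OneHot", ["indices", "index_size"], ["one_hots"]))
print(workspace.FetchBlob("one_hots"))  # expected: a 3x3 one-hot matrix
```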
Reviewed By: bddppq
Differential Revision: D5578543
fbshipit-source-id: 55b70e8ec6ee34b647b9140fecbba31b6968f403
Summary: Add CUDA version of GRU operator
Reviewed By: jamesr66a
Differential Revision: D5571043
fbshipit-source-id: 332aa64fc8a9116cc33382f2b2907080e58c13b3
Summary:
Fix multilayer inference in Caffe2 example seq2seq code. (Rely on LSTMWithAttentionDecoder.apply rather than fixed state indices to determine stepwise decoder output.)
Also assorted updates to bring the code in line with changes elsewhere in the codebase, and added unit tests ensuring that the training and inference networks generate the same loss, which should make these problems much easier to identify in the future.
Reviewed By: jamesr66a
Differential Revision: D5579803
fbshipit-source-id: 6e0f27340d981990ab8d0da58e63793222e7be87
Summary:
It was reverted previously because of a missing schema for the gradient op. Added it back and resending.
Differences between this diff and the previously reverted diff:
1. added a schema for the gradient operator
2. changed line 95 in kmax_pooling_op.h from CAFFE_ENFORCE to CAFFE_ENFORCE_GE
Reviewed By: xianjiec
Differential Revision: D5568867
fbshipit-source-id: 39813b389a5da803967a561249793afdfce00c58
Summary:
In Python 3.x, dictionary values aren't a list and can't be concatenated to a list; this diff fixes that.
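The issue in miniature:

```python
d = {"a": 1, "b": 2}
extras = [3]

# Python 2: d.values() was a list, so list concatenation worked.
# Python 3: dict.values() is a view; wrap it in list() first.
combined = list(d.values()) + extras
```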
Reviewed By: andrewwdye
Differential Revision: D5576724
fbshipit-source-id: c60441857ceceb9c4a71122d2db5e9abad6d3fc2
Summary:
The L1Distance operator used to return a single value denoting the L1 distance over the entire input, instead of one distance per example. This fixes that.
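What the fixed op computes, sketched in numpy with made-up data:

```python
import numpy as np

x = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([[1.5, 2.0], [0.0, 4.0]])

per_row = np.abs(x - y).sum(axis=1)  # [0.5, 3.0]  (new behavior)
single = np.abs(x - y).sum()         # 3.5         (old behavior)
```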
Reviewed By: Yangqing
Differential Revision: D5570385
fbshipit-source-id: fbab0e0c9262ccbdb3af27262b8baacdeb2d0fc9
Summary: New hybrid randomized sparse NN, which allows layers of the sparse NN model to be randomized, semi-random, or learnable.
Reviewed By: chocjy
Differential Revision: D5416489
fbshipit-source-id: eb8640ddf463865097ba054b9f8d63da7403024d
Summary:
To train an image model, we can also use a label embedding vector as supervision, as opposed to using SoftmaxLoss/SigmoidCrossEntropyLoss.
In that case, the label is a dense vector. This diff enables such use cases.
Reviewed By: panshen1
Differential Revision: D5556203
fbshipit-source-id: 52c61495e02fab457dc2d43e3345d7dbd5580ab7
Summary:
data_workers.py provides a really nice, easy way to run background threads for data input. Unfortunately, it's restrictive: the output of the fetcher function has to be a numpy array.
I pulled that core thread management out into parallel_workers and updated the classes in data_workers to extend those classes. The main change was refactoring most of the queue handling logic into a QueueManager.
This way parallel_workers can be used to manage background threads without having to use the queue for output.
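A toy sketch of the separation of concerns (illustrative names, not the real parallel_workers API): generic worker-thread management on one side, with queue-based output handling layered on top only for data input.

```python
import queue
import threading

def start_workers(worker_fn, num_workers=2):
    threads = [threading.Thread(target=worker_fn, daemon=True)
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    return threads

# data_workers-style usage: workers push batches into an output queue.
out = queue.Queue(maxsize=8)
start_workers(lambda: out.put("batch"))

# parallel_workers-style usage: background work with no queue at all.
start_workers(lambda: None)
```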
Reviewed By: akyrola
Differential Revision: D5538626
fbshipit-source-id: f382cc43f800ff90840582a378dc9b86ac05b613
Summary:
Implement dot attention as described in https://arxiv.org/abs/1508.04025.
This saves the computation of weighted encoder outputs in `rnn_cell.py`.
When the encoder and decoder dimensions are different, we apply an FC, which corresponds to the "general" case below Figure 2 of the paper.
Refactored unit tests.
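A numpy sketch of dot attention (Luong et al., arXiv:1508.04025), where score(h_t, h_s) = h_t . h_s, so no per-source-position weighted encoder-output precomputation is needed; the FC projection for mismatched dimensions (the "general" form) is not shown:

```python
import numpy as np

def dot_attention(decoder_h, encoder_outputs):
    # decoder_h: (D,), encoder_outputs: (T, D)
    scores = encoder_outputs @ decoder_h  # (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax over source positions
    return weights @ encoder_outputs      # context vector, shape (D,)
```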
Reviewed By: jhcross
Differential Revision: D5486976
fbshipit-source-id: f9e9aea675b3b072fbe631bc004199b90a9d95cb
Summary:
Caffe2: add a DB that wraps a BlobsQueue, as an adapter for data from a non-DB interface.
This is useful for bridging the gap between DB interface data processing ops (TensorProtosDBInput, ImageInputOp etc.) and data that's coming from arbitrary Python or the pretty intricate Hive reader.
Reviewed By: akyrola
Differential Revision: D5554560
fbshipit-source-id: 01bb0056410f9ade205367d5fefc721f91f5b629
Summary:
The current implementation for s=0 doesn't support the backward pass.
Switching to using the pow op instead as a temporary solution.
Reviewed By: jackielxu
Differential Revision: D5551742
fbshipit-source-id: 33db18325b3166d60933284ca1c4e2f88675c3d3
Summary:
This brings it up to par with how the RedisStoreHandler
works. The store handler configuration does not have to change and
only the run ID parameter changes across runs.
This was inconsistent and came up in https://github.com/caffe2/caffe2/issues/984.
Reviewed By: Yangqing
Differential Revision: D5539299
fbshipit-source-id: 3b5f31c6549b46c24bbd70ebc0bec150eac8b76c
Summary:
This diff makes SparseLengthsSum(Gradient) async. It follows this logic (see the sketch after this list):
1. Add INDICES to the gradient op's inputs so that we can make it async without device-to-host copies.
2. Register the new 3-input op as the gradient for the CPU/GPU versions of SLS.
3. In order not to break old nets (they are mostly on CPU), I still register the old 2-input op, so the op schema will not complain when it encounters old nets that have the SLSGradient op in them.
wickedfoo Sorry, this diff might bring you the extra work of migrating your optimization effort to this new async gradient op. But we think it is worth it. :(
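A numpy sketch of the math, illustrating why the backward pass wants INDICES as an explicit input: with the indices available on the device, the gradient can scatter rows without a device-to-host copy. Illustrative code, not the real op.

```python
import numpy as np

def sparse_lengths_sum(data, indices, lengths):
    # Each output row is the sum of the data rows in its segment.
    out, pos = [], 0
    for n in lengths:
        out.append(data[indices[pos:pos + n]].sum(axis=0))
        pos += n
    return np.stack(out)

def sparse_lengths_sum_grad(grad_out, indices, lengths, data_shape):
    # Scatter each segment's output gradient back to its data rows.
    grad_data = np.zeros(data_shape)
    pos = 0
    for seg, n in enumerate(lengths):
        for idx in indices[pos:pos + n]:
            grad_data[idx] += grad_out[seg]
        pos += n
    return grad_data
```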
Reviewed By: dzhulgakov
Differential Revision: D5423188
fbshipit-source-id: 62494a6c52a507c4a4688d5a9e1a2bc720d5370d
Summary: Added a Caffe2 operator to calculate the sinusoidal position encoding for word embeddings, as described on page 6 of https://arxiv.org/abs/1706.03762.
Reviewed By: jamesr66a
Differential Revision: D5533024
fbshipit-source-id: 1afb35cd7f9d8c71f2635b853e56b2c840f0bc1f