Summary: CUDA version of the AddPadding op. It first executes a prefix sum using CUB to compute the cumulative lengths array. Then it launches a kernel that uses this information to fill the output tensor with the start padding, the end padding, and the actual contents.
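A minimal NumPy sketch of the semantics (not the CUDA kernel itself): an exclusive prefix sum over the lengths gives each sequence's offset, and the output is assembled as start padding, contents, end padding per sequence. The padding shapes here are illustrative assumptions.
```
import numpy as np

def add_padding_ref(data, lengths, start_pad, end_pad):
    # Exclusive prefix sum of lengths -> offset of each sequence in `data`.
    offsets = np.concatenate([[0], np.cumsum(lengths)[:-1]])
    pieces = []
    for off, ln in zip(offsets, lengths):
        pieces.extend([start_pad, data[off:off + ln], end_pad])
    return np.concatenate(pieces, axis=0)

data = np.arange(5, dtype=np.float32).reshape(5, 1)
pad = np.zeros((1, 1), dtype=np.float32)
print(add_padding_ref(data, np.array([2, 3]), pad, pad))
```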
Reviewed By: asaadaldien
Differential Revision: D6391413
fbshipit-source-id: 45b431e5976674729e53cb4752c7753c1d8a69e8
Summary: Cast op cuda can deal with empty batch now.
Reviewed By: azzolini
Differential Revision: D6350138
fbshipit-source-id: 2f3d19f4d42ff34806aa9597690e66f6b4de1a6b
Summary:
Two ops: BatchSparseToDenseOp and DenseToBatchSparseOp
Inverse operations of each other.
Details are described in the op doc.
These ops are used along with flexible topK, where the output is
lengths, indices, and values.
We want to do softmax on the values, but the dimension of each batch is different, so these ops convert the sparse representation to dense and vice versa. The two ops are also gradient ops for each other.
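A hedged NumPy sketch of the round trip described above (the dense width argument and zero default are assumptions for illustration):
```
import numpy as np

def batch_sparse_to_dense(lengths, indices, values, dense_last_dim):
    # One dense row per batch item; positions named by `indices` get `values`.
    dense = np.zeros((len(lengths), dense_last_dim), dtype=values.dtype)
    offset = 0
    for row, ln in enumerate(lengths):
        dense[row, indices[offset:offset + ln]] = values[offset:offset + ln]
        offset += ln
    return dense

def dense_to_batch_sparse(lengths, indices, dense):
    # Inverse: read back only the positions named by (lengths, indices).
    out, offset = [], 0
    for row, ln in enumerate(lengths):
        out.append(dense[row, indices[offset:offset + ln]])
        offset += ln
    return np.concatenate(out)

lengths = np.array([2, 1])
indices = np.array([0, 3, 2])
values = np.array([1.0, 2.0, 3.0], dtype=np.float32)
dense = batch_sparse_to_dense(lengths, indices, values, 4)
assert np.allclose(dense_to_batch_sparse(lengths, indices, dense), values)
```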
Reviewed By: chocjy
Differential Revision: D6288338
fbshipit-source-id: 0ba9e611058b39e46e7414dcc5f39cab29915fa3
Summary:
This is part one: it adds the lambdaNDCG loss, which can be used to heuristically
optimize the NDCG metric.
Differential Revision: D5830650
fbshipit-source-id: 1eb696337c9a77727ad40219c68f6468e2e097a5
Summary:
Summary: Datatypes were being handled badly in the reference check, causing sporadic failures in CI. All batched mat-mul with fp16 data is performed as pseudo-fp16, with all math in fp32. Adjusted the reference implementation to reflect this.
Adjusted the gradient check threshold to the best value I could get to consistently pass.
Closes https://github.com/caffe2/caffe2/pull/1406
Differential Revision: D6324431
Pulled By: pietern
fbshipit-source-id: 83ff2584438a11f7a6db4599a4fb0e75e9e15a3d
Summary: add NegateGradientOp: in the forward pass, this op simply copies the input to the output. In the backward pass, it flips the sign of the gradients.
Reviewed By: dragonxlwang
Differential Revision: D6314456
fbshipit-source-id: 56afd8b131eff9f7e120ab7e4e87461df49649d4
Summary: The topk GPU test was taking too much time, but there are still a variety of codepaths to test (k <= 1024, k > 1024, k == 1, k == n). Reduce the batch sizes and n to reduce time taken by the in-python CPU code equivalent.
Reviewed By: pietern
Differential Revision: D6272628
fbshipit-source-id: b8b8f3601f28bf64f144c73d7c9e915f40c84d70
Summary: The number of elements in the caffe2 blob can be larger than int32. Use size_t to prevent overflow.
Reviewed By: ajtulloch
Differential Revision: D6278363
fbshipit-source-id: 356e294c667a53360d8a65b56a63a39d5ce3384e
Summary:
Will probably rename to adaptive topK to be aligned with the layer name.
The main difference from the top_k op is that K is not fixed as a layer parameter;
instead, this op takes in a blob that contains the K information for each row of the input data (batch mode).
Reviewed By: chocjy
Differential Revision: D6221209
fbshipit-source-id: f7fd575ff8f515d886d93278ad94fd17e8bd6fa5
Summary:
This seems to be faster in a bunch of cases. Prefer to keep it as a
separate op instead of MatMul + Add so it's easy to compare perf on a per-op
basis between this one and the baseline (normal FC).
Reviewed By: akyrola
Differential Revision: D6169187
fbshipit-source-id: 09b96325d44bd181896f396aec88b27314c435b0
Summary: Previously, the boundary checking was happening after the first access for the 8-bit ops.
Reviewed By: Yangqing
Differential Revision: D6206753
fbshipit-source-id: 07ab240cae8c67b3048f03aa79af0b6399b9940b
Summary: Updated brew SpatialBN to use initializers similar to other brew ops such as conv and fc, instead of initializing all of its parameters itself within the brew call.
Reviewed By: asaadaldien
Differential Revision: D5840359
fbshipit-source-id: 9f3d688d4957605eaf7ecd2488bc26bfb1da3f78
Summary:
Implemented a new CUDA class for the SparseAdagrad operator. The param and moment inputs can now be float or float16.
The functions for mixed-precision add/mult/store are defined in a separate header file ("caffe2/core/float16_util.h") for reuse.
Reviewed By: azzolini
Differential Revision: D5880200
fbshipit-source-id: dca227f38629a03a9d771f42efe2c0b673075c4d
Summary: Allow the GEMMs in the FC/FCGradient Op to do FP16 compute instead of FP32 if the appropriate op flag is set.
Reviewed By: asaadaldien
Differential Revision: D5839777
fbshipit-source-id: 8051daedadf72bf56c298c1cf830b019b7019f43
Summary: Given an additional tensor containing the values corresponding to the weighted samples, add a tensor output that contains the values selected by the sampled indexes.
Reviewed By: akyrola
Differential Revision: D6050094
fbshipit-source-id: 1eccc641b99e30d36ae83d49f630b018a53e4147
Summary:
Added two new ops, FP16MomentumSGDUpdate and FP32MomentumSGDUpdate, which perform both the momentum SGD and weight decay updates to a given parameter in a single op -- thus being more efficient.
Also updated the standard momentum SGD test to check that Nesterov momentum works.
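A rough NumPy sketch of the fused update under the usual momentum-SGD-with-weight-decay definitions; the argument names and the exact in-place behavior of the ops are assumptions here.
```
import numpy as np

def momentum_sgd_weight_decay(param, grad, moment, lr, momentum,
                              weight_decay, nesterov=False):
    # Fold weight decay into the gradient, then apply (Nesterov) momentum SGD.
    adjusted = grad + weight_decay * param
    new_moment = momentum * moment + lr * adjusted
    step = lr * adjusted + momentum * new_moment if nesterov else new_moment
    return param - step, new_moment
```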
Reviewed By: asaadaldien
Differential Revision: D5837837
fbshipit-source-id: 5ad487b9c59434491d3a4fcfdeed820db6083f57
Summary: Adding "dtype" parameter for the GivenTensorOp. Also, providing backwards compatibility for the existing code, byt supporting the templating if "dtype" is not provided.
Reviewed By: bddppq
Differential Revision: D6090049
fbshipit-source-id: f5deaa57b49f2280289975f4583aba5bc064a2bc
Summary: CUDA version of weighted sampling operator; minor changes for CPU version
Reviewed By: asaadaldien
Differential Revision: D6106668
fbshipit-source-id: 42d7607bd845a4a39cf5b89d7476904cb5928431
Summary: Before we fix it properly with 'type' argument.
Reviewed By: bddppq
Differential Revision: D6103973
fbshipit-source-id: 8c00a93c373dd0ad0bbfe59944495f6574223ab6
Summary:
Currently, the type inference infers FLOAT as the type for all GivenTensor*Fill operators. However, the inferred type should match the actual operator.
Also, for the `Slice` operator, there is a corner case where type inference fails.
Reviewed By: azzolini
Differential Revision: D6096813
fbshipit-source-id: d65b7c0f42436138cbc49d8a5a62374fa5e927e1
Summary: Allow the application of sequence-length masking to be replicated along one or more minor axes. See task for details.
Reviewed By: jamesr66a
Differential Revision: D6090835
fbshipit-source-id: 9064232aa9b93246c582b6e0bae73be5dbe09e98
Summary:
Op for computing SigmoidCrossEntropyWithLogits with per-label, per-sample weights. Can be used for addressing class or label imbalance.
Doc:
Given three matrices: logits, targets, weights, all of the same shape,
(batch_size, num_classes), computes the weighted sigmoid cross entropy between
logits and targets. Specifically, at each position r,c, this computes
weights[r, c] * crossentropy(sigmoid(logits[r, c]), targets[r, c]), and then
averages over each row.
Returns a tensor of shape (batch_size,) of losses for each example.
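A NumPy sketch that follows the doc above: element-wise weighted sigmoid cross entropy, averaged over each row to give one loss per example (a reference in a numerically stable form, not the op's exact implementation):
```
import numpy as np

def weighted_sigmoid_xent_with_logits(logits, targets, weights):
    # Stable form: max(x, 0) - x*t + log(1 + exp(-|x|))
    per_elem = (np.maximum(logits, 0) - logits * targets
                + np.log1p(np.exp(-np.abs(logits))))
    return np.mean(weights * per_elem, axis=1)  # shape (batch_size,)

logits = np.random.randn(4, 3)
targets = (np.random.rand(4, 3) > 0.5).astype(np.float64)
print(weighted_sigmoid_xent_with_logits(logits, targets, np.ones_like(logits)))
```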
Reviewed By: stephenyan1231
Differential Revision: D5997723
fbshipit-source-id: f3172325f1c98b6f26e1700131ef897b743a72fc
Summary: Turns out cuDNN's tensor transform only supports floats. The previous implementation pretended it would work with ints by casting to floats and indeed passed tests for some reason. But rgirdhar found a case where it returned nonsensical results. So rewire int transposes to use the non-cuDNN version. Had to refactor a bit for that. Also added a test for the case.
Reviewed By: asaadaldien
Differential Revision: D6043284
fbshipit-source-id: cc3b14f9fbbdeff421b01da453a1d3c7c5ffd4ac
Summary:
Input dimensions up to "axis" will be flattened into the outer dimension of the output, and the remaining input dims will form the inner dimension.
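A one-line NumPy equivalent of that flattening rule (axis handling as I read it from the sentence above):
```
import numpy as np

def flatten_to_2d(x, axis):
    # dims [0, axis) collapse into the outer dim, the rest into the inner dim
    outer = int(np.prod(x.shape[:axis]))
    return x.reshape(outer, -1)

print(flatten_to_2d(np.zeros((2, 3, 4, 5)), axis=2).shape)  # (6, 20)
```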
Closes https://github.com/caffe2/caffe2/pull/1330
Reviewed By: dzhulgakov
Differential Revision: D6039560
Pulled By: bddppq
fbshipit-source-id: e92c30b49a9288feeefc4a639522406e97e149e1
Summary:
Optionally return a blob of shape [batch size, max length] that is
false only in locations where the output tensor was padded.
One can separately convert lengths to segment ids and cast, but
this is more convenient, and possibly more efficient.
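A small NumPy sketch of the optional presence mask described above (True where the packed output holds real data, False where it was padded out to the max length):
```
import numpy as np

def presence_mask(lengths):
    max_len = int(max(lengths))
    return np.arange(max_len)[None, :] < np.asarray(lengths)[:, None]

print(presence_mask([2, 3, 1]))
# [[ True  True False]
#  [ True  True  True]
#  [ True False False]]
```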
Differential Revision: D6006073
fbshipit-source-id: af6c4ea31972566e7d059dcd3fdd8afba97a88e9
Summary: Before this diff, RNNOp was using TextFormat for representing steps. This diff changes RNNOp to prefer a NetDef argument instead. To stay backward compatible it still supports TextFormat for existing models, though we can compile RNNs without TextFormat as well.
Reviewed By: salexspb
Differential Revision: D5949330
fbshipit-source-id: 9336a8f5ccf30ad8d8e3a7067b9437e1704b1c9f
Summary:
Input is a matrix tensor. Its first dimension is the batch
size. For each column, bucketize it based on the boundary values and then do
one hot encoding. The `lengths` specifies the number of boundary values for each
column. The final number of buckets is this number plus 1. This would also be
the expanded feature size. `boundaries` specifies all the boundary values.
Note that each bucket is right-inclusive. That is, given boundary values
[b1, b2, b3], the buckets are defined as (-inf, b1], (b1, b2], (b2, b3], (b3, inf).
For example
If data = [[2, 3], [4, 1], [2, 5]], lengths = [2, 3],
and boundaries = [0.1, 2.5, 1, 3.1, 4.5], then
output = [[0, 1, 0, 0, 1, 0, 0], [0, 0, 1, 1, 0, 0, 0], [0, 1, 0, 0, 0, 0, 1]]
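A NumPy sketch reproducing the example above; `side='left'` in searchsorted gives the right-inclusive (b_i, b_{i+1}] buckets described in the doc:
```
import numpy as np

def batch_bucket_one_hot(data, lengths, boundaries):
    # Split the flat boundary list into one sorted array per column.
    splits = np.split(np.asarray(boundaries, dtype=float),
                      np.cumsum(lengths)[:-1])
    rows = []
    for row in np.asarray(data):
        one_hots = []
        for value, bnds in zip(row, splits):
            bucket = np.searchsorted(bnds, value, side='left')
            oh = np.zeros(len(bnds) + 1, dtype=np.int64)
            oh[bucket] = 1
            one_hots.append(oh)
        rows.append(np.concatenate(one_hots))
    return np.stack(rows)

print(batch_bucket_one_hot([[2, 3], [4, 1], [2, 5]], [2, 3],
                           [0.1, 2.5, 1, 3.1, 4.5]))
# matches the expected output above
```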
Reviewed By: xianjiec
Differential Revision: D5976030
fbshipit-source-id: fd746c20b19bcdf5f769451d804c219ad6463f28
Summary: adding an operator with behavior similar to fused GatherRanges and Split.
Reviewed By: kennyhorror
Differential Revision: D5961761
fbshipit-source-id: 616d4668b8901256418004def90d91a0b2041620
Summary:
Added support for batching to SequenceMaskOp.
Let b be the batch dim and k be the axis dim. (We enforce that b < k.) Write the dimensions of the input tensor as [a_1, ..., a_b, ..., a_k, ...]. We first collapse the tensor down to 3D, with dimensions [P, Q, D], where P = a_1 * ... * a_b, Q = a_{b+1} * ... * a_{k-1}, and D = a_k * a_{k+1} * ... * a_n. Then we mask each slice [i, :, :] of this 3D tensor (note that each slice is a Q x D tensor, i.e. 2-dimensional).
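A short NumPy illustration of that collapse (0-based stand-ins for b and k; the real op then applies its sequence/window mask to each [i, :, :] slice):
```
import numpy as np

def collapse_to_3d(x, b, k):
    # P = product of dims up to and including b, D = product of dims from k on.
    P = int(np.prod(x.shape[:b + 1]))
    D = int(np.prod(x.shape[k:]))
    Q = x.size // (P * D)
    return x.reshape(P, Q, D)

x = np.random.randn(2, 3, 4, 5)
print(collapse_to_3d(x, b=0, k=2).shape)  # (2, 3, 20): each slice is Q x D
```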
Reviewed By: jamesr66a
Differential Revision: D5733382
fbshipit-source-id: e7a314d9fe6e6691a75112edbee8ba6e8ea8e396
Summary:
This diff implements deformable convolution operator. The idea behind it is that instead of using a fixed NxM kernel, we associate a set of learnable offsets (dx, dy) with each element of the kernel, and use bilinear interpolation to estimate weights in between the integer indices. For background see paper https://arxiv.org/abs/1703.06211 and mxnet implementation https://github.com/msracver/Deformable-ConvNets/tree/master/rfcn/operator_cxx
To simplify code review of the new files the feature is stacked into 2 diffs. The first diff duplicates the core convolution operator into a separate set of files prefixed with deform_. It also provides documentation on the operator but nothing else. The second diff contains the actual changes that make deformable convolution possible. Therefore, I recommend focusing your code review on the changes between diffs 1 and 2.
Current limitations of the operator:
1. Only CUDA is supported. CPU version is not implemented.
2. Only NCHW layout is supported.
3. Only 2d convolution is supported.
CUDA code is ported from mxnet implementation with minimal changes.
See also inline comments in code for tricky parts.
Reviewed By: akyrola
Differential Revision: D5702983
fbshipit-source-id: 4d1bf2c6c73135e6a70dbe87037b38915f4453f9
Summary: Implementation of ReduceFront/Back/Max/Gradient for CPU and CUDA.
Reviewed By: asaadaldien
Differential Revision: D5905402
fbshipit-source-id: 6967ce41aa95ee5ea7a90065430892e81a6da477
Summary: Implemented version of SparseAdagrad that only keeps track of an average sum of squared gradients term for each row of the parameter tensor, rather than a sum of squared gradients term for each individual parameter.
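A hedged NumPy sketch of the row-wise variant (one averaged squared-gradient accumulator per parameter row; the argument names and epsilon are assumptions):
```
import numpy as np

def rowwise_sparse_adagrad(param, moment, indices, grad, lr, eps=1e-6):
    for i, row in enumerate(indices):
        moment[row] += np.mean(grad[i] ** 2)   # one scalar per row
        param[row] -= lr * grad[i] / (np.sqrt(moment[row]) + eps)
    return param, moment

param = np.zeros((4, 3))
moment = np.zeros(4)
rowwise_sparse_adagrad(param, moment, [1, 3], np.ones((2, 3)), lr=0.1)
```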
Differential Revision: D5881918
fbshipit-source-id: bd96ccf25554b457baaaca9309fc8048adbb37f7
Summary: Equivalent to numpy.sign for CPU and CUDA.
Reviewed By: dzhulgakov
Differential Revision: D5906446
fbshipit-source-id: 389f994bccbb87a62df2c4aaacc327f9a6223cbd
Summary: Can be used to gather outputs of a sharded "Gather", or for the SparseLengthsSumGradient when we need the gradient on values.
Reviewed By: akyrola
Differential Revision: D5800901
fbshipit-source-id: 90835755d6d15be13fb0f538cfade980cf4a1cd2
Summary:
Adding uint8 support to the code generator for high-performance embedding look-up kernels, supporting
Sum, WeightedSum, and Mean reducers. Added a number of unit tests to test these operators.
Performance Results
===================
Performance results are below for the old code, sparse_lengths_sum_benchmark.old.par, which uses the
code in lengths_reducer_rowwise_8bit_ops.h, and our new code, optimized via the code generator,
sparse_lengths_sum_benchmark.new.par. Block size was 128 in all cases.
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.old.par --iteration 10000 --dtype uint8
I0912 02:49:58.773259 2640913 net_simple.cc:162] Time per operator type:
I0912 02:49:58.773264 2640913 net_simple.cc:171] 0.75769 SparseLengthsSum8BitsRowwise
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype uint8
I0912 02:50:33.981832 2642102 net_simple.cc:162] Time per operator type:
I0912 02:50:33.981837 2642102 net_simple.cc:171] 0.233322 SparseLengthsSum8BitsRowwise
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype float16
I0912 02:51:26.748972 2643925 net_simple.cc:162] Time per operator type:
I0912 02:51:26.748977 2643925 net_simple.cc:171] 0.106591 SparseLengthsSum
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype float
I0913 01:39:22.372238 1076874 net_simple.cc:162] Time per operator type:
I0913 01:39:22.372244 1076874 net_simple.cc:171] 0.211041 SparseLengthsSum
Analysis
========
Our optimized generated code is ~3.5x faster than the original code in lengths_reducer_rowwise_8bit_ops.h,
as shown above.
However, our uint8 is about 2x slower than float16 and is on par with float32. There are several reasons for that:
1. uint8 introduces extra instructions to multiply by the bias and add the scaling factors.
2. In addition to the embedding blocks, we are now also reading scale_bias.
For every pair of scale and bias, we bring in an entire cache line of
64 bytes while only using 8 bytes. A 128-wide uint8 input block only occupies 2 cache lines, and hence
reading nearly an entire extra cache line of useless data adds to the bandwidth wastage.
3. In addition, the hardware prefetcher runs past the end of the input block and the scale_bias
cache line, trying to prefetch more useless lines. This effect was characterised in the Appendix section of
https://fb.facebook.com/notes/jason-lu/sparse-adagrad-performance-optimization-in-model-training/10214810437360961/
To get deeper insights into what is going on,
we isolated the SparseLengthsSum and SparseLengthsSum8BitsRowwise code, for float32, float16 and uint8,
into a microbenchmark, where we varied the block size while keeping the table size constant (256MB):
block_size time(uint8) time(float16) time(float32)
64 0.19 0.09 0.17
128 0.12 0.09 0.17
256 0.70 0.09 0.14
1024 0.50 0.06 0.10
The pattern for block size of 64 and 128 is similar to what we observed in sparse_lengths_sum_benchmark.
However, we see that as block_size increases (for a fixed table size),
time to perform embeddings decreases quite drastically. For block_size of 256 and beyond, uint8 starts achieving
speedup over float16. A longer block better amortizes the bandwidth wastage due to scale_bias and the hardware prefetcher
running past the end of the block.
Reviewed By: kennyhorror
Differential Revision: D5870907
fbshipit-source-id: 445321b96f1b5801ef91f296f6063c35673ee11b
Summary:
Two implementation of max pool reducers had different semantics in case of equal indices. It matters less in real cases, but breaks tests. Choosing the behavior of LengthMax over SortedSegmentRangeMax as the former is more widely used.
Also some minor tweaks for the test code.
Reviewed By: Yangqing
Differential Revision: D5870386
fbshipit-source-id: 6488cbd5cacaf595ffc07c44084730dd44b3f9dd
Summary:
Adding uint8 support to the code generator for high-performance embedding look-up kernels, supporting
Sum, WeightedSum, and Mean reducers. Added a number of unit tests to test these operators.
Performance Results
===================
Performance results are below for the old code, sparse_lengths_sum_benchmark.old.par, which uses the
code in lengths_reducer_rowwise_8bit_ops.h, and our new code, optimized via the code generator,
sparse_lengths_sum_benchmark.new.par. Block size was 128 in all cases.
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.old.par --iteration 10000 --dtype uint8
I0912 02:49:58.773259 2640913 net_simple.cc:162] Time per operator type:
I0912 02:49:58.773264 2640913 net_simple.cc:171] 0.75769 SparseLengthsSum8BitsRowwise
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype uint8
I0912 02:50:33.981832 2642102 net_simple.cc:162] Time per operator type:
I0912 02:50:33.981837 2642102 net_simple.cc:171] 0.233322 SparseLengthsSum8BitsRowwise
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype float16
I0912 02:51:26.748972 2643925 net_simple.cc:162] Time per operator type:
I0912 02:51:26.748977 2643925 net_simple.cc:171] 0.106591 SparseLengthsSum
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype float
I0913 01:39:22.372238 1076874 net_simple.cc:162] Time per operator type:
I0913 01:39:22.372244 1076874 net_simple.cc:171] 0.211041 SparseLengthsSum
Analysis
========
Our optimized generated code is ~3.5x faster than the original code in lengths_reducer_rowwise_8bit_ops.h,
as shown above.
However, our uint8 is about 2x slower than float16 and is on par with float32. There are several reasons for that:
1. uint8 introduces extra instructions to multiply by the bias and add the scaling factors.
2. In addition to the embedding blocks, we are now also reading scale_bias.
For every pair of scale and bias, we bring in an entire cache line of
64 bytes while only using 8 bytes. A 128-wide uint8 input block only occupies 2 cache lines, and hence
reading nearly an entire extra cache line of useless data adds to the bandwidth wastage.
3. In addition, the hardware prefetcher runs past the end of the input block and the scale_bias
cache line, trying to prefetch more useless lines. This effect was characterised in the Appendix section of
https://fb.facebook.com/notes/jason-lu/sparse-adagrad-performance-optimization-in-model-training/10214810437360961/
To get deeper insights into what is going on,
we isolated the SparseLengthsSum and SparseLengthsSum8BitsRowwise code, for float32, float16 and uint8,
into a microbenchmark, where we varied the block size while keeping the table size constant (256MB):
block_size time(uint8) time(float16) time(float32)
64 0.19 0.09 0.17
128 0.12 0.09 0.17
256 0.70 0.09 0.14
1024 0.50 0.06 0.10
The pattern for block size of 64 and 128 is similar to what we observed in sparse_lengths_sum_benchmark.
However, we see that as block_size increases (for a fixed table size),
time to perform embeddings decreases quite drastically. For block_size of 256 and beyond, uint8 starts achieving
speedup over float16. A longer block better amortizes the bandwidth wastage due to scale_bias and the hardware prefetcher
running past the end of the block.
Reviewed By: dzhulgakov
Differential Revision: D5824641
fbshipit-source-id: 3a5c020294d84874da78c6943e596423393473d6
Summary: Introduced weights for labels in the multi-label setting. An extra weight blob is introduced and read in the operator in case the label setting is weighted sparse.
Reviewed By: kevinwilfong
Differential Revision: D5812467
fbshipit-source-id: efb209092e1e9effc915b0a753fa0c67b47a4fb6
Summary: PR 1175 caused a build error because gemmBatched was only under a specific #ifdef. Now it is moved outside the #ifdef, and things work.
Reviewed By: asaadaldien
Differential Revision: D5834868
fbshipit-source-id: 072a64c8f4b259ff7504104121766115b46b8aa0
Summary:
Also add the ability to mark an argument as required.
Added a string constant `OpSchema::Arg_IsTest` for `is_test` arg.
If users define the `is_test` argument with `ArgIsTest(...)`, then it automatically becomes a required argument; in the meanwhile, users can still use `Arg("is_test", ...)` to define an optional `is_test` argument.
Reviewed By: akyrola
Differential Revision: D5812391
fbshipit-source-id: eaaba50d027813a8012389edc6c459de23c3c728
Summary: For data parallelism we need the batch size to be a multiple of the number of replicas. In order to do so, with this diff we do Dataset(rec).trim(multiple_of=num_replicas).
Reviewed By: dzhulgakov, harouwu
Differential Revision: D5753861
fbshipit-source-id: c5d728b925707dbd3d1f500a93e67e185c223569
Summary:
Computes a fixed grid or RMAC region coordinates for a given 4D feature tensor
(NCHW) as described in https://arxiv.org/abs/1511.05879. The output is the
`roi` format expected by RoIPoolOp. To compute the actual RMAC itself, the
output of this op should be passed to RoIPoolOp.
Reviewed By: wickedfoo
Differential Revision: D5594994
fbshipit-source-id: 5edac98a18137b53555f9a16354419b424679c99
Summary: The shape inference of distance_op has issues (it only works when inputs are 1D tensors). This diff fixes the shape inference and the unit test.
Reviewed By: kittipatv
Differential Revision: D5788744
fbshipit-source-id: cb1b7facf7b9ccd64b54edca156325eceef50f33
Summary: Filling in the gap in tensor inference
Reviewed By: sunnieshang, akyrola
Differential Revision: D5779550
fbshipit-source-id: 9ec68c9dad566183d7d0fc2819829c2b91430dda
Summary: As title. I wonder why this had not been encountered before. It only affects cases where the states are copied over, though.
Reviewed By: Yangqing
Differential Revision: D5777314
fbshipit-source-id: 8aef435c832e4ead5bb3d3e35bb065c734a2af5f
Summary:
Special executor for RNNs which can exploit parallelism over timesteps. For CPU we use multi-threading, achieving roughly a 3x improvement on 4-layer LSTMs.
With CUDA, perf improvements are more modest, but the structure allows for optimizing it further. For CUDA, we use multiple streams and events if there is parallelism
over timesteps. In my experiments, it was not good to use more than 2 streams, though.
Flag --caffe2_rnn_executor can be used to switch the executor off.
Reviewed By: salexspb
Differential Revision: D5749304
fbshipit-source-id: d6f76b3e16598be5b4e8188aff031671ebafaa4c
Summary:
As described in task T21337239, NormalizeOp currently normalizes over only the last dimension.
In this commit, the following changes have been made:
(1) Added an axis-parameter to NormalizeOp in both the CPU and CUDA context.
(2) Added the same axis parameter to NormalizeGradient in both the CPU and CUDA context
(3) Removed the restriction that the input to the original NormalizeOp operator must be 2-dimensional.
Reviewed By: akyrola
Differential Revision: D5745162
fbshipit-source-id: 69e04f59ac4d954b0062c3b2a53c8ca465a1027b
Summary:
**Description**
Provide DeepText model with the functionality to load a secondary index (pre-trained char-ngram embedding, e.g. FastText) during training/test. Embeddings of out-of-vocabulary words will be computed on-the-fly during training/test by averaging the char-ngram embeddings.
**Approach**
This diff provides two custom operators to accomplish this task – ConditionalOp and IndexCharNgramGetOp. We first use IndexCharNgramGetOp to perform char-ngram index lookup and return a sparse tensor segmented by lengths for each token. The sparse tensor is then used to compute the average embedding provided by the char-ngram index. Finally, we use a ConditionalOp to replace those whose embeddings were not found in the original index during the feature apply stage. Please refer to documentations of the code for more details.
Reviewed By: jamesr66a
Differential Revision: D5666924
fbshipit-source-id: f76605d093154a014d5b9ebf9510de9d79874eee
Summary:
Implementation of a new variant of attention module, which contains a recurrent decoder state with vectors corresponding to each source-side word and strictly increasing values, thus enabling it to model the degree to which source words have been translated.
The approach is a variant of the approaches described in https://arxiv.org/pdf/1601.04811.pdf. We simply include the sum of all previous attention weights for encoder words as a new recurrent state (coverage_t). A new linear transform on encoder_outputs is used to produce coverage_weights, which has the same dimensionality as encoder_outputs, and implicitly models the fertility of source-side words (and putting this extra information strain on the encoder network).
Thus the encoder output, the decoder state, and the coverage weights have the same dimensionality for a given source word, and attention logits are calculated as v * tanh(coverage * coverage_weights + encoder_output + decoder_state).
Note: the entire coverage state for each translation instance is of shape (encoder_length, coverage_units), but the states for the RecurrentNetwork operator, used to train the decoder, must be flat in the data dimension. This state is therefore initialized with shape (encoder_length * coverage_units) [not shown in the open-source library] and reshaped appropriately within the apply_soft_coverage_attention() function.
Differential Revision: D5593617
fbshipit-source-id: 7d0522b5eb0b26f22e8429e4461a459f2f16ed46
Summary: Adding support to use kernels, strides, pads etc. as arguments.
Reviewed By: houseroad
Differential Revision: D5710699
fbshipit-source-id: 8b63af4c4a76cd06b637a376aeb29a34c659be2e
Summary:
_LSTM helper is a legacy piece we had before all the RNNCell awesomeness landed. Now we need to pull it apart and create separate building blocks that people can use for any RNNs.
Please note changes to a test with double scoping. That should go away once we change the RNNCell scoping logic in such a way that each cell adds its own name to the scope for all of its outputs (see another diff: D5613139).
Reviewed By: jhcross
Differential Revision: D5632276
fbshipit-source-id: 1cb568ab995c4c0b3dd1b4bad2d028e34bded9c1
Summary: These were missing and required for some seq2seq models. Unit tested. The previous implementation of ReduceBackMean shape inference was incorrect, so removed it.
Reviewed By: asaadaldien
Differential Revision: D5691262
fbshipit-source-id: 76f868b298440f988635966a410f0232301ca6c4
Summary:
Split the first dimension of a tensor into 2, the first of which is fixed and given in the argument.
This is then used to split a batch into smaller batches and distribute them across workers.
Reviewed By: harouwu
Differential Revision: D5702175
fbshipit-source-id: 02bb93e49bf9db411b516e149c8e647301dd2ca5
Summary:
This adds a fast path for global max pooling with NCHW. Compared to equivalent ReduceBackMean, this is about 3.5x faster.
Based on D5533059.
Reviewed By: akyrola
Differential Revision: D5681122
fbshipit-source-id: 7a4df934044c7dd01888f095f7dd46654aaf4eae
Summary:
Optimizations for SinusoidPositionEncodingOp to make sinusoid position embeddings
more competitive against table-based embeddings.
- Removed most calls to std::pow
- Replaced division with multiplication with reciprocal
- Reused computation across examples within a batch
Current speedup with batch size of 16, sequence length of 128 and embedding
size of 512 is about 270x (17k embeddings per second -> 4.7M embeddings per
second). The speedup is very dependent on the batch size; at a batch size of 4
this only gets 1.7M embeddings per second.
Profile: https://pxl.cl/8zf0
Annotated DoRunWithType: P57925031
Reviewed By: jamesr66a
Differential Revision: D5634766
fbshipit-source-id: 0f35bb176164ea547c91de242a0205c5d7adf7cf
Summary:
Add more data augmentation to ImageInputOp
1) Inception-style random sized cropping
2) color jittering
3) color lighting
Reviewed By: panshen1
Differential Revision: D5637726
fbshipit-source-id: 45d9cc69eec9f4d48c1607d80ccd89e325961b1a
Summary:
Adding a range operator in the spirit of np.arange. It is an important building block for a lot of manipulation functions.
This accepts parameters with the same meaning in the same order as python's range or np.arange (e.g. `(stop)`, `(start, stop)` or `(start, stop, step)`)
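An assumed usage sketch (the exact op schema is not shown in this summary): feeding (start, stop, step) as input blobs and running the op through the Caffe2 Python workspace, mirroring np.arange.
```
import numpy as np
from caffe2.python import core, workspace

for name, value in [("start", 2.0), ("stop", 10.0), ("step", 2.0)]:
    workspace.FeedBlob(name, np.array(value, dtype=np.float32))
op = core.CreateOperator("Range", ["start", "stop", "step"], ["out"])
workspace.RunOperatorOnce(op)
print(workspace.FetchBlob("out"))  # expected to match np.arange(2, 10, 2)
```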
Differential Revision: D5616861
fbshipit-source-id: 02622b8bd85ebca125cc881c06fae5b54b7c602a
Summary: The new test ensures the 'add_axis' and 'split' arguments work as intended for tensors of various dimensions. Hypothesis should check various edge cases like zeros in 'split_info' and 1D input with axis=0, add_axis=1.
Reviewed By: hoangmit
Differential Revision: D5645778
fbshipit-source-id: 061f9511a082da54e5c1bbe53a0e7096af4b8d1b
Summary: Implement a brew wrapper for the LayerNorm op. This adds the scalar weight and bias terms to the op.
Reviewed By: jmp84
Differential Revision: D5595836
fbshipit-source-id: 467b2e1158b0c454a149d4b26c47719826e98752
Summary:
Forward-only mode had broken at some point. Two things: RNNCell did not pass the parameter to recurrent.py, and recurrent.py was also broken if forward_only=True after the python3 codemod.
Added a test to rnn_cell_test that actually checks the forward-only parameter is passed, to prevent future breakage.
Reviewed By: jmp84
Differential Revision: D5639306
fbshipit-source-id: b1bbc39d59c3f3734b2f40a1c2f3740c733e0bd4
Summary:
As an alternative to sharing embeddings, we want to explore merging the ID_LISTs in the net.
This commit adds an operator to merge many ID_LIST features into a single one.
Differential Revision: D5481523
fbshipit-source-id: 446121122a32de5682d5d75a165370bc8d776d03
Summary: This can be used for local attention to mask elements outside of a window
Reviewed By: jamesr66a
Differential Revision: D5643677
fbshipit-source-id: 92b33866258ccc7307d5bcf08234610aa3fb152d
Summary:
This diff adds dependency-aware concurrent/parallel execution of operators in stepnets. For CPU, we use multi-threaded execution. For CUDA, we use multiple streams and cuda events for parallelism and dependency tracking.
Much of the diff is about computing the dependency graph, which was quite tricky because we also need to avoid write races between operators running in multiple timesteps in parallel. Also, recurrent blobs "change name" when passing over a timestep ("_prev"), so that needs to be handled as well.
This diff also restores the link ops that I unlanded earlier.
The performance gain of this diff is very good for CPU (same perf as with static_dag, even better on forward-only). On CUDA, the gains are modest, at least with the sizes I was testing with.
Reviewed By: salexspb
Differential Revision: D5001637
fbshipit-source-id: 3d0a71593d73a9ff22f4c1a5c9abf2a4a0c633c8
Summary:
Implement forward pass for a SequenceMaskOp to replace https://github.com/caffe2/caffe2/blob/master/caffe2/python/attention.py#L54-L72.
This implements two modes: a sequence-length based mode and a matrix triangle mode.
Reviewed By: akyrola
Differential Revision: D5615493
fbshipit-source-id: a2ce4a8e655d9b720049010a7856be052c5567eb
Summary: In order to control the absolute scale/magnitude of the output of this op, added a tuning parameter: amplitude
Reviewed By: jamesr66a
Differential Revision: D5596574
fbshipit-source-id: 3b7e316de55cce6fd686da70aa5658ec3e99b070
Summary: GRU differs from LSTM in that it only has hidden states but no cell states. So in this case, reusing the code of _LSTM is problematic, as we need to delete the part that creates the cell state and change many other places that use a hard-coded 4 (hidden_all, hidden, cell_all, cell) into 2 (hidden_all, hidden). Otherwise GRU will break during the backward pass, when the optimizer tries to apply gradients to each of the parameters, because the cell state is never used and therefore the corresponding parameters (i.e., cell_state_w, cell_state_b) have no gradients.
Differential Revision: D5589309
fbshipit-source-id: f5af67dfe0842acd68223f6da3e96a81639e8049
Summary: This diff implements CUDA version of OneHot operator.
Reviewed By: bddppq
Differential Revision: D5578543
fbshipit-source-id: 55b70e8ec6ee34b647b9140fecbba31b6968f403
Summary: Add CUDA version of GRU operator
Reviewed By: jamesr66a
Differential Revision: D5571043
fbshipit-source-id: 332aa64fc8a9116cc33382f2b2907080e58c13b3
Summary:
It was reverted previously because of a missing schema for the gradient op. Added it back and resent.
Differences between this diff and the previously reverted diff:
1. added a schema for the gradient operator
2. changed line 95 in kmax_pooling_op.h from CAFFE_ENFORCE to CAFFE_ENFORCE_GE
Reviewed By: xianjiec
Differential Revision: D5568867
fbshipit-source-id: 39813b389a5da803967a561249793afdfce00c58
Summary:
The L1Distance operator used to return a single value denoting the L1 of the entire input, instead of a vector for each input value.
This fixes that.
Reviewed By: Yangqing
Differential Revision: D5570385
fbshipit-source-id: fbab0e0c9262ccbdb3af27262b8baacdeb2d0fc9
Summary:
To train an image model, we can also use a label embedding vector as supervision, as opposed to using SoftmaxLoss/SigmoidCrossEntropyLoss.
In such cases, the label is a dense vector. This diff enables such use cases.
Reviewed By: panshen1
Differential Revision: D5556203
fbshipit-source-id: 52c61495e02fab457dc2d43e3345d7dbd5580ab7
Summary:
Implement dot attention as described in https://arxiv.org/abs/1508.04025
This saves the computation of weighted encoder outputs in `rnn_cell.py`
When the encoder and decoder dimensions are different, we apply an FC, which corresponds to the general case below Figure 2.
Refactored unit tests.
Reviewed By: jhcross
Differential Revision: D5486976
fbshipit-source-id: f9e9aea675b3b072fbe631bc004199b90a9d95cb
Summary:
Caffe2: add a DB that's wrapped around a BlobsQueue as an adapter for data from a non-DB interface.
This is useful for bridging the gap between DB interface data processing ops (TensorProtosDBInput, ImageInputOp etc.) and data that's coming from arbitrary Python or the pretty intricate Hive reader.
Reviewed By: akyrola
Differential Revision: D5554560
fbshipit-source-id: 01bb0056410f9ade205367d5fefc721f91f5b629
Summary:
This diff makes SparseLengthsSum(Gradient) async. It goes through this logic:
1. Adding INDICES to the gradient op's inputs so that we can make it async without device-host copies.
2. Registering the new 3-input op as the gradient for the CPU/GPU versions of SLS.
3. In order to not break old nets (they are mostly on CPU), I still register the old 2-input op, so the op schema will not complain when it encounters old nets that have the SLSGradient op in them.
wickedfoo Sorry, this diff might bring you extra work migrating your optimization effort to this new async gradient op. But we think it is worth it. :(
Reviewed By: dzhulgakov
Differential Revision: D5423188
fbshipit-source-id: 62494a6c52a507c4a4688d5a9e1a2bc720d5370d
Summary: Added caffe2 operator to calculate the sinusoidal position encoding for word embeddings, as described on page 6 in https://arxiv.org/abs/1706.03762.
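A NumPy sketch of the encoding from the referenced paper, PE[pos, 2i] = sin(pos / 10000^(2i/d)) and PE[pos, 2i+1] = cos(pos / 10000^(2i/d)); the op's argument names and output layout are not asserted here, and an even embedding size is assumed:
```
import numpy as np

def sinusoid_position_encoding(seq_len, dim):
    pos = np.arange(seq_len)[:, None]
    i = np.arange(dim // 2)[None, :]
    angle = pos / np.power(10000.0, 2.0 * i / dim)
    enc = np.empty((seq_len, dim))
    enc[:, 0::2] = np.sin(angle)
    enc[:, 1::2] = np.cos(angle)
    return enc

print(sinusoid_position_encoding(4, 8).shape)  # (4, 8)
```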
Reviewed By: jamesr66a
Differential Revision: D5533024
fbshipit-source-id: 1afb35cd7f9d8c71f2635b853e56b2c840f0bc1f
Summary: Implement the LpNorm operator, which calculates the Lp norm of a tensor for regularization (p=1 or 2). Currently, there is only the L1Distance operator, which calculates the L1 distance of two same-shape tensors. We want an op that takes only one input and outputs the L1 loss; we would do the same for the L2 loss. We also plan to implement an l_{p,q} loss, but have not decided which p and q to take.
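A tiny NumPy reference for the semantics described above (whether the op averages or sums is not asserted here; this sketch sums):
```
import numpy as np

def lp_norm(x, p):
    x = np.asarray(x, dtype=np.float64)
    return np.sum(np.abs(x)) if p == 1 else np.sum(x ** 2)

print(lp_norm([[1.0, -2.0], [3.0, 0.5]], p=1))  # 6.5
print(lp_norm([[1.0, -2.0], [3.0, 0.5]], p=2))  # 14.25
```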
Reviewed By: xianjiec
Differential Revision: D5460051
fbshipit-source-id: d67a38fbc94afa52de26d4a53e4d2b7df3c50b6a
Summary:
KaimingHe debugged a slow model and found out that global average pooling was hideously slow, even with cuDNN. It turns out the cuDNN pooling op (especially the backward pass) is not optimized for global pooling.
This adds a fast path for global average pooling with NCHW. This is about 30x faster than cuDNN with 56 x 56 pooling; compared to the equivalent ReduceBackSum, this is about 3x faster.
I will bootcamp the max pooling.
Reviewed By: asaadaldien
Differential Revision: D5533059
fbshipit-source-id: 2d590693d737fa92184603663031d96f6145f304
Summary: This allows users to add an arbitrary number of additional outputs to ImageInputOp. These are populated by reading additional TensorProto values from the TensorProtos from the DBReader, and converting them into Tensors. Similar to labels, only ints and floats are supported, and multiple values are supported.
Reviewed By: panshen1
Differential Revision: D5502019
fbshipit-source-id: 5a8b61b3a8549272a112e8e02cd613d8f9a271ba
Summary: Add tensor inference function for squeeze, refactor a bit
Reviewed By: asaadaldien
Differential Revision: D5518880
fbshipit-source-id: 5b8cb9154f5f777d4be3612a96d7ed76a9068c0c
Summary: The diff adds support for the rank_loss operator to compute loss for multiple sessions (batch).
Reviewed By: kittipatv
Differential Revision: D5515465
fbshipit-source-id: 55a01cd5ad21eaeae82875ad136c392fed0dbb26
Summary:
Optimised SparseLengthsSum (fp32) for now
1) Specialized reducer
2) created a fast routine with prefetches, loop unrolling, block specialization and register tiling
3) added more variety of block sizes to segment_ops_test.py
Reviewed By: Yangqing
Differential Revision: D5392472
fbshipit-source-id: 8ed9baf1b12ec05bd391cabb390024e6bc60a6f6
Summary: to support an operation needed by D5507205
Reviewed By: xianjiec
Differential Revision: D5512522
fbshipit-source-id: a9b3a668c28eff71d1e106dbbb572184df4a7638
Summary:
Use smaller step size for GradientChecks and pass seed to help reproducing the
test from logged inputs.
Reviewed By: Yangqing
Differential Revision: D5505698
fbshipit-source-id: fc308efe72d535695ba628944aee1913ba16b2f1
Summary:
Moved distance_op_test from hypothesis_test to distance_op_test and
refactored
Reviewed By: akyrola, asaadaldien
Differential Revision: D5495104
fbshipit-source-id: 4a90c75eabeb380ae9d150d6258e9b5b0fbfc5ca
Summary: When creating parameters for a modelhelper, we should use create_param instead of using param_init_net and model.params directly. The diff rewrites some of these cases in rnn_cell.py in order to make model._parameter_info and model.params consistent.
Reviewed By: kittipatv
Differential Revision: D5477724
fbshipit-source-id: 28c4aaf8f98d9d89125af6a42ad328008f0079e1
Summary:
Need it for some reference comparison for c2isl.
Also there's an argument that it might be faster on GPU with int32. Doesn't seem to be the case now, but haven't tested with Jeff's changes yet.
Reviewed By: kennyhorror
Differential Revision: D5405482
fbshipit-source-id: dc1a983dce5f06f1111c5634ec475647c94848cc
Summary:
In order to get dimensions right, correctly identify gradients, etc., DropoutCell should call the _prepare_output and _prepare_output_sequence methods of its internal cell for its own such methods.
This bug was identified by NVIDIA intern Syed Tousif Ahmed.
Reviewed By: akyrola
Differential Revision: D5483082
fbshipit-source-id: f6df5b4a0502ed0771056638aab219fb5cc7d964
Summary: TSIA - this makes it a bit easier to benchmark sparse lengths sum.
Reviewed By: dzhulgakov
Differential Revision: D5477844
fbshipit-source-id: 89e25c5e0dbf3538877ba1a9abc75a10abfa2757
Summary:
For RNN attention, we should not include the invalid parts of the encoder output (based on encoder_lengths) in the computation. This diff accomplishes that by forcing logits for those positions to be negative infinity.
Note that this step can be bypassed by passing encoder_lengths=None, which is what we do for beam search, thus incurring no extra overhead for inference.
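A NumPy sketch of the masking step: attention logits for positions at or beyond each sequence's encoder length are forced to -inf so they receive zero weight after the softmax.
```
import numpy as np

def mask_attention_logits(logits, encoder_lengths):
    # logits: [batch, max_encoder_length]
    positions = np.arange(logits.shape[1])[None, :]
    invalid = positions >= np.asarray(encoder_lengths)[:, None]
    return np.where(invalid, -np.inf, logits)

print(mask_attention_logits(np.zeros((2, 4)), [2, 4]))
```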
Reviewed By: jamesr66a
Differential Revision: D5402547
fbshipit-source-id: 1863d6050b5129e4df829c6357f0aa9ded0715dc
Summary:
Added operator RecurrentNetworkBlobFetcherOp that takes as input a scratch workspace name and prefix, and copies over all blobs in the scratch workspace into the global workspace. This essentially extracts all intermediate recurrent network computation for each timestep.
Added a wrapper in recurrent.py - retrieve_step_blobs(net, prefix='rnn') - which, when called after an rnn is run, will return a list of all blobs extracted from the net.
Reviewed By: akyrola
Differential Revision: D5421926
fbshipit-source-id: 0f35b466d77d3c719fb0e32de7dbcafc6c0d5225
Summary: Implemented python logic and tests to create an RNNCell for GRU. Uses the preexisting GRU Unit Op code.
Reviewed By: salexspb
Differential Revision: D5364893
fbshipit-source-id: 2451d7ec8c2eacb8d8c9b7c893bfd21b65fb9d18
Summary:
Just an implementation of the forward pass of the GRU Unit Op, not the full RNNCell.
Functions were created to mimic LSTM implementation as closely as possible.
Backwards pass implementations are defined in GRU_unit_op.{h, cc}
assertGradientChecks call added to gru_cell_test.py
Reviewed By: salexspb
Differential Revision: D5364856
fbshipit-source-id: 09cff4478091827763b40cc331e4e0abf0ec258f
Summary:
Just an implementation of the forward pass of the GRU Unit Op, not the full RNNCell.
Functions were created to mimic LSTM implementation as closely as possible.
Implementation defined in GRU_unit_op.{h, cc}
tests put in gru_cell_test.py, which import rnn_cell_test_util.py for sigmoid, tanh, and _prepare_rnn functions.
Reviewed By: jamesr66a
Differential Revision: D5363697
fbshipit-source-id: f9ba9fe0be01ffc868dd22027be8be4975b84998
Summary:
Moved sigmoid, tanh, and _prepare_lstm (renamed) to a util file.
Also renamed _prepare_lstm to _prepare_rnn since it is being used for setting up both LSTM and GRU models.
The reason for this commit is to allow the creation of the GRU op and testing code without copying and pasting the code for sigmoid, tanh, and setting up an RNN unit op model.
Reviewed By: jamesr66a
Differential Revision: D5363675
fbshipit-source-id: 352bd70378031f1d81606c9267e625c6728b18fd
Summary:
numpy.random.rand generates samples from [0, 1) and therefore the leaky relu test cases weren't testing negative inputs. Tests still pass after the change.
Leaky relu can be used in-place, but the gradient took X rather than Y. Technically, the result is no different as it's just used for a sign test in the gradient, but updated it to take Y to reduce confusion.
Differential Revision: D5390126
fbshipit-source-id: d0c428abbb2797eb33902a7d2a2f59d5e85daaa6
Summary: Added a CUDA implementation of the PiecewiseLinearTransformOp.
Differential Revision: D5378537
fbshipit-source-id: 38857f59f5cc52e16e1ecc97983a0b0b82a46c74
Summary:
# Added the gradients of the operation for both CPU and CUDA kernels.
# Unified variable names across all ops.
# Added reference implementation in numpy.
# The gradient check needs a larger stepsize to succeed, is that normal?
Reviewed By: akyrola
Differential Revision: D5313682
fbshipit-source-id: aceb92649e01c5caeba8774e678f9095502d396c
Summary: Added two operators that can be used to transfer data into the input format of RNN and back.
Reviewed By: kittipatv
Differential Revision: D5329886
fbshipit-source-id: 07eac29416427b08c49989d4eeed50a6f18493a1
Summary:
This bug in the test was exposed by https://github.com/caffe2/caffe2/pull/861 (previously, the test was always using the cuDNN engine, regardless of the value of `engine`). This bug is now blocking https://github.com/caffe2/caffe2/pull/817.
```
____________________ TestConvolution.test_convolution_sync _____________________
...
if use_cudnn and requested_engine != 'CUDNN':
raise ValueError(
> 'When use_cudnn=True, the only engine you can specify is '
E ValueError: When use_cudnn=True, the only engine you can specify is "CUDNN"
```
https://travis-ci.org/caffe2/caffe2/jobs/247605579
Closes https://github.com/caffe2/caffe2/pull/881
Differential Revision: D5332619
Pulled By: akyrola
fbshipit-source-id: 63737768a155359ddbbef1da424fcbb94f86bd4e
Summary: This should make it so we no longer have super hacky DAG chains just to generate vectors of indices that could be specified at model creation time
Reviewed By: akyrola
Differential Revision: D5316707
fbshipit-source-id: 97bb3868b69e0c5a7f465c95f2e16ae0485dcc56
Summary: Implement slice gradient for CPU. Will soon port this over to GPU so NMT can use it
Reviewed By: akyrola
Differential Revision: D5309305
fbshipit-source-id: 8fb5f4e665f236ecce9227c5c0c302f5076b01ad
Summary: Adding a test to check computational integrity of networks constructed with AttentionCell using UnrolledCell.
Reviewed By: salexspb
Differential Revision: D5306915
fbshipit-source-id: 02acfd1011f7d3ee5fac21cc2778c4a486190c43
Summary: softmax_ops_test occasionally fails with gradient checks. Stabilize by setting the numpy random seed. Also reduce some dimensions for the large input test to make it run faster.
Reviewed By: harouwu
Differential Revision: D5292106
fbshipit-source-id: a21eec89e18d30ac7c5609dacf5d413e841841a6
Summary:
kmatzen why did you set the stepsize in ff84e7dea6?
The test is flaky before this change. Solid afterwards.
Closes https://github.com/caffe2/caffe2/pull/841
Differential Revision: D5292112
Pulled By: akyrola
fbshipit-source-id: c84715261194ff047606d4ec659b7f89dac3cbb1
Summary: As title. Pretty straightforward. Could actually run each kernel in parallel, but we can optimize later if needed.
Reviewed By: Yangqing
Differential Revision: D5278415
fbshipit-source-id: 29f59afe28f37fc4152ec7eb7cd6c1ab65f2cb8c
Summary: Ran into it while working on a dper benchmark. Apparently it works harmlessly even with empty tensors.
Reviewed By: akyrola
Differential Revision: D5273672
fbshipit-source-id: a968ae03a659d6c1a215f12cc35f7ba68448e833
Summary:
Working towards https://github.com/caffe2/caffe2/pull/817.
`E InvalidArgument: Insufficient bytes of entropy to draw requested array. shape=(20, 12, 22), dtype=float32. Can you reduce the size or dimensions of the array? What about using a smaller dtype? If slow test runs and minimisation are acceptable, you could increase settings().buffer_size from 8192 to at least 43253760.`
https://travis-ci.org/caffe2/caffe2/jobs/243867951
/cc kittipatv
Closes https://github.com/caffe2/caffe2/pull/830
Differential Revision: D5276639
Pulled By: akyrola
fbshipit-source-id: 0c21be25ecd931837dc8b0c2cc17048f531350d1
Summary:
This is a real implementation (not GPUFallbackOp) of the TopKOp for GPU.
There are two algorithm implementations:
-for k <= 512, it maps to a warp-wide min-heap implementation, which requires only a single scan of the input data.
-for k > 512, it maps to a multi-pass radix selection algorithm that I originally wrote in cutorch. I took the recent cutorch code and removed some cutorch-specific things as it made sense.
Also added several utility files that one or the other implementations use, some from the Faiss library and some from the cutorch library.
Reviewed By: jamesr66a
Differential Revision: D5248206
fbshipit-source-id: ae5fa3451473264293516c2838f1f40688781cf3
Summary: The old version used one block with 128 threads. Throughput was too low for the NMT use case (calculating squared gradient norms for every parameter), so this increases the throughput. Shaves 7% off CNN model training time per step
Reviewed By: wickedfoo
Differential Revision: D5263748
fbshipit-source-id: adc3bacd11e49ea00c60381d613d993050e899be
Summary: This makes it easier to gather top-K by group of rows. This is useful in the situation where we want to pick the top-K from a batch of fixed-length sessions. Let `N` be the number of sessions, and `M` be the number of examples in a session. We would have a batch of `N * M` rows. We can reshape the score blob to `N x M` and use it as input to `TopK` to select the top score for each session. However, without the new output, it would be inconvenient to gather the rows corresponding to the top scores. The indices are in the `[0, K-1)` range. The new output can be used directly as input to `Gather`.
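A NumPy sketch of that use case (names are illustrative): reshape the N * M scores to [N, M], take the per-session top-K, and use the flattened-index output to gather the matching rows from the original batch.
```
import numpy as np

N, M, K = 3, 4, 2
scores = np.random.rand(N * M)
rows = np.arange(N * M).reshape(N * M, 1)      # stand-in for the feature rows

per_session = scores.reshape(N, M)
topk_idx = np.argsort(-per_session, axis=1)[:, :K]   # per-session indices
flat_idx = topk_idx + np.arange(N)[:, None] * M      # the new flattened output
gathered = rows[flat_idx.reshape(-1)]                # rows behind the top scores
print(gathered.shape)  # (N * K, 1)
```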
Reviewed By: chocjy
Differential Revision: D5171459
fbshipit-source-id: 69f7b41456c3f9670650ae07afc8fef8328485e9
Summary:
The global StatRegistry doesn't get reset when the workspace is reset.
```
> self.assertTrue(len(workspace.FetchBlob('k3')) == 2)
E AssertionError: False is not true
```
https://travis-ci.org/lukeyeager/caffe2/jobs/240162665
/cc azzolini
NOTE: this error doesn't show up if you just run `stats_ops_test.py` directly. It shows up when you run other tests in the same session before this test:
```
pytest -v caffe2/python/
```
Closes https://github.com/caffe2/caffe2/pull/788
Differential Revision: D5259232
Pulled By: salexspb
fbshipit-source-id: 3c72633af6bb61c4fda62195298b1e9574b4cbef
Summary: Implementation of the SliceOp for CUDA
Reviewed By: akyrola
Differential Revision: D5254287
fbshipit-source-id: 0a1660e1aa161fd088a2d8f886e019c05a1919a2
Summary:
```
File "/data/caffe2/install/caffe2/python/hypothesis_test.py", line 1911, in test_batch_to_space
(w + 2 * pad) / block_size).astype(np.float32)
File "mtrand.pyx", line 1404, in mtrand.RandomState.randn (numpy/random/mtrand/mtrand.c:19843)
File "mtrand.pyx", line 1534, in mtrand.RandomState.standard_normal (numpy/random/mtrand/mtrand.c:20368)
File "mtrand.pyx", line 167, in mtrand.cont0_array (numpy/random/mtrand/mtrand.c:6127)
TypeError: 'float' object cannot be interpreted as an index
```
```
File "/data/caffe2/install/caffe2/python/operator_test/tile_op_test.py", line 101, in tile_ref
tiled_data = np.tile(X, tuple(dims))
File "/data/caffe2/venv/local/lib/python2.7/site-packages/numpy/lib/shape_base.py", line 881, in tile
return c.reshape(shape_out)
TypeError: only integer scalar arrays can be converted to a scalar index
```
I also tested to make sure this still works with 0.11.
Closes https://github.com/caffe2/caffe2/pull/787
Differential Revision: D5248087
Pulled By: salexspb
fbshipit-source-id: eff69482a8eabb8ace330003fa326c832b53865f
Summary:
We need to support RNNs explicitly in ExtractPredictorNet, because they store sub-nets as strings in special arguments. When the netdef argument arrives, we can generalize this a bit.
Added a test under rnn_cell_test to test that extracting an LSTM predictor net works correctly and sets the device option properly for the step net ops.
Reviewed By: yqwangustc
Differential Revision: D5236334
fbshipit-source-id: cd653427f8c440a14d94195a532d18276f94749a
Summary: added an operator that converts key/value blobs into a blob containing a map pointer, unittest passed.
Differential Revision: D5224449
fbshipit-source-id: 2f60754ed3ba6ed16039c09019117ae3c3646ab2
Summary: added an operator that converts key/value blobs into a blob containing a map pointer, unittest passed.
Differential Revision: D5166513
fbshipit-source-id: 748527c423a163fe55f914c08fff3adfc74a540c
Summary:
Static RNN allows unrolling an RNN into a Caffe2 graph using all existing cell abstractions. In this diff I introduce several new tests that already caught a few bugs in our RecurrentNetworkOp gradient accumulation logic by comparing it to an unrolled version.
Another use case is perf - potentially we can run an unrolled net faster because DAGNet will have access to the whole graph. The same goes for memonger. But this work is not part of this diff.
Reviewed By: akyrola
Differential Revision: D5200943
fbshipit-source-id: 20f16fc1b2ca500d06ccc60c4cec6e81839149dc
Summary:
`brew_test.py` is just plain broken. `core_test.py` doesn't work with pytest. `apmeter_test.py` and `top_k_test.py` don't work for CUDA builds.
Closes https://github.com/caffe2/caffe2/pull/765
Differential Revision: D5211817
Pulled By: Yangqing
fbshipit-source-id: 78ec5af35a3fa870978e4c9590210ade9e3bc5ac
Summary:
Neither dependency is required by the core Python modules.
OpenCV, in particular, is a pain to install (no pip package). Conditionally skipping this test will make TravisCI integration easier.
Closes https://github.com/caffe2/caffe2/pull/739
Differential Revision: D5211799
Pulled By: Yangqing
fbshipit-source-id: c6bdc8a17977f64f34e968fd9ab8c65161d2624d
Summary: Implements an APMeter operator (APMeterOp) to calculate AP for multiclass classification given prediction scores and labels. The op takes a score tensor [nsamples x nclasses] and a label tensor [nsamples x nclasses], and outputs a float tensor of size nclasses as the AP for each class.
Reviewed By: akyrola
Differential Revision: D5082565
fbshipit-source-id: ae7304bc8fc999c361245b9aec38eb9a5f5eef4b
Summary:
Input of topK op: X (dense)
Output of topK op: Value and Indices (sparse representation)
Value will have a gradient in some cases.
We backprop (copy) the gradient from the sparse (d Value) to the dense (d X).
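A NumPy sketch of that backward rule: scatter the gradient of the sparse Values output into a zero-initialized dense gradient for X at the positions recorded in Indices.
```
import numpy as np

def topk_values_grad_to_dense(d_values, indices, x_shape):
    d_x = np.zeros(x_shape, dtype=d_values.dtype)
    rows = np.arange(x_shape[0])[:, None]
    d_x[rows, indices] = d_values           # copy sparse grads into dense d_X
    return d_x

d_values = np.ones((2, 3), dtype=np.float32)
indices = np.array([[0, 2, 4], [1, 3, 5]])
print(topk_values_grad_to_dense(d_values, indices, (2, 6)))
```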
Differential Revision: D5133461
fbshipit-source-id: 7bad55b60e8a22dfe0e51357ce2099d7f752c133
Summary:
It's causing problems inside docker containers:
`InvalidArgument: Insufficient bytes of entropy to draw requested array. shape=(5, 9, 10, 5), dtype=float32. Can you reduce the size or dimensions of the array? What about using a smaller dtype? If slow test runs and minimisation are acceptable, you could increase settings().buffer_size from 8192 to at least 18432000.`
Closes https://github.com/caffe2/caffe2/pull/707
Differential Revision: D5162621
Pulled By: Yangqing
fbshipit-source-id: 55544210961cbc80828dca2cbeba6a5ace8cf8d1
Summary:
This warning becomes an error with https://github.com/numpy/numpy/pull/6271 (`>=0.12.0`).
```
caffe2/python/operator_test/tile_op_test.py::TestTile::test_tilewinput
/opt/caffe2/caffe2/python/operator_test/tile_op_test.py:100: VisibleDeprecationWarning: converting an array with ndim > 0 to an index will result in an error in the future
dims[axis] = tiles
/usr/lib/python2.7/dist-packages/numpy/lib/shape_base.py:873: VisibleDeprecationWarning: converting an array with ndim > 0 to an index will result in an error in the future
return c.reshape(shape_out)
```
Closes https://github.com/caffe2/caffe2/pull/710
Differential Revision: D5160776
Pulled By: Yangqing
fbshipit-source-id: b264e0e389de5817a289db878c15e655f9fa2f09
Summary: If ConstantFill (or another fill op) is used in CUDAContext with input_as_shape, the code crashes, as it expects the shape to be in CUDAContext but accesses the array in host code... We could fix this by copying the values from the CUDA tensor, but it is probably best to enforce that the shape param is in CPU context. This is what this diff does.
Differential Revision: D5152766
fbshipit-source-id: 0629a189bd1d800c0b7c9dbc324b78d279efac0b
Summary: These return views in Python 3 which would not do anything in a lot of usages currently present in Caffe2. This diff simply removes (almost) all usages of these two in Caffe2 and sub projects in favor of comprehensions which are also easier to read/understand
Reviewed By: akyrola
Differential Revision: D5142049
fbshipit-source-id: e800631d2df7d0823fed698cae46c486038007dc
Summary:
I'll let y'all decide how you want to fix this (probably need a persistent curand buffer). Here's a test to verify the fix.
Closes https://github.com/caffe2/caffe2/pull/495
Differential Revision: D5148815
Pulled By: akyrola
fbshipit-source-id: e80dabe65230ddd32340f2d872cd8786ac960bf8
Summary: Refactored SoftmaxWithLoss by removing the code for spatial=1 mode and created a new op SpatialSoftmaxWithLoss that has the spatial mode implemented.
Reviewed By: viswanathgs
Differential Revision: D5104120
fbshipit-source-id: 8ab999e32c916b2a39a670a7b2a3365401535f24
Summary:
To make the optimizer for sparse gradients work with CUDA, we need UnsortedSegmentSum and Mean implemented for CUDA. Unique was already implemented by harouwu.
Pretty straightforward implementations, should be fast enough -- and I don't know a faster way anyway.
Added some tests as well.
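A simple NumPy reference for UnsortedSegmentSum (segment ids need not be sorted; each output row accumulates the input rows mapped to it; Mean would divide by the per-segment counts):
```
import numpy as np

def unsorted_segment_sum(data, segment_ids, num_segments):
    out = np.zeros((num_segments,) + data.shape[1:], dtype=data.dtype)
    np.add.at(out, segment_ids, data)   # unbuffered scatter-add per segment id
    return out

data = np.arange(8, dtype=np.float32).reshape(4, 2)
print(unsorted_segment_sum(data, np.array([1, 0, 1, 0]), 2))
```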
Reviewed By: asaadaldien
Differential Revision: D5124548
fbshipit-source-id: 63ae72f45fc2f07470603f7b2de12f34635dbb3d
Summary:
Implement SizeOp, which returns the number of elements in the input
tensor.
The output is a 1D tensor that contains the number of elements.
Reviewed By: akyrola
Differential Revision: D5101061
fbshipit-source-id: d1c56053b6f3b41c65ac574dd748482775d1ea0d
Summary: The gradient test for the tile op was flaky because I had made the dimensions too large. This caused push-blocking errors. Also I noticed my test_grad_tile was incorrect.
Reviewed By: asaadaldien
Differential Revision: D5126476
fbshipit-source-id: ae9ce5d9041648d7a4535fc88d4013e669bd6f02
Summary: As noted by salexspb, MultiRNNCell had unreliable gradient computation. The problem was that the recurrent gradient and the gradient computed within the backward step net were not being accumulated during the backward pass, but rather written to the same blob, thus overwriting each other. This diff fixes that by artificially introducing an extra blob for the internal output, and then accumulating it into the gradient coming from the recurrent connection.
Reviewed By: salexspb
Differential Revision: D5110059
fbshipit-source-id: 16add50989fe8866361bbc21afce5f214c5292fd
Summary:
I had "optimized" the number of threads / block, but cub::BlockReduce has a static template parameter for the number of threads, and this must match. Probably tests still passed because typically the initial numbers are zeros.
Also added a stronger test.
Thanks ves for the report.
Differential Revision: D5110901
fbshipit-source-id: c1169b1286e204c202b0727448ddb51b4965eacb
Summary:
deprecate CNNModelHelper in python/operator_test dir
BTW I found that there are 2 mkl_speed_test files. I am confused...
Reviewed By: salexspb
Differential Revision: D5094122
fbshipit-source-id: f6526f4de334f2245eb4c1f204a8ec9f23750d78
Summary:
CUDNN dilated convolution was added to V6. This version of CUDNN does not support NHWC for dilated convolution.
Fix conv_test.py so that it does not test CUDNN for dilated convolution in NHWC format.
Closes https://github.com/caffe2/caffe2/pull/598
Reviewed By: akyrola
Differential Revision: D5084835
Pulled By: asaadaldien
fbshipit-source-id: 3c0c5ed02c5d9232fca567e387ab6260d71e5aaf
Summary: I noticed that Sigmoid was taking an inordinate amount of time in our NMT benchmark, so I looked at the implementation and it didn't seem optimal. I replaced the implementation with an Eigen version so that when the Eigen update goes through, we will get proper AVX(2) vectorization.
Differential Revision: D5082464
fbshipit-source-id: aa951f7d730fc05198f7dd04076ec58d471b74c8
Summary: Added L1Distance Operator for CUDA, as well as tests.
Reviewed By: bwasti
Differential Revision: D5071966
fbshipit-source-id: 4c3d862605e9123d955bf091efa67d0731bd816a
Summary:
Migrate the experiments folder to the fb/sparse folder. Keep FunHashOp and SparseFunHashOp because they are now assumed to be default ops in depr. What I did:
# Migrate FunHashOp and SparseFunHashOp and their unit tests to core caffe2, make sure tests pass.
# Migrate other ops in the experiments folder to the fb/sparse folder. Write new TARGETS files for them. Make sure tests pass.
# Make sure all related tests pass.
# Fix the MKL definition btw. Make sure that FC_Sparse is not compiled when there is no MKL support.
Reviewed By: salexspb
Differential Revision: D4952993
fbshipit-source-id: 86c03676ab4e47f04d2d0dd438a4a1c849bbbff0
Summary:
Generalize SpatialBatchNorm CPU Op to compute Spatial batch normalization for
1D, 2D & 3D input tensors.
Reviewed By: dutran
Differential Revision: D5043563
fbshipit-source-id: 7fcb933a628dd47f13aa622f63601a87382f09cd
Summary:
Added several features to the ImageInputOp:
- bounding box (per image as well as default for the operator). For per-image, it
only works in Caffe2 format and is passed as the third tensor in the form
(ymin, xmin, height, width). For the operator, pass bounding_xmin, bounding_ymin,
bounding_width and bounding_height as parameters.
- per-channel mean/std. You can use the usual mean/std to pass a single
value to be used for all channels or also pass mean_per_channel and std_per_channel
to specify different values per channel. Order of channels is BGR.
- A minimum size parameter that can be specified instead of the scale parameter.
The minsize parameter will only scale the image if it is smaller than required.
This differs from scale which will scale up as well as down. You can only specify
one of scale or minsize.
Added a test case to test some of the features
Differential Revision: D4874988
fbshipit-source-id: 437191052a46e9916defe8b100d7cc7864373f61
Summary:
cuDNN versions of dropout and LRN (for native fp16 support), port of Caffe's max pooling algo that uses an explicit mask to store locations (also supports fp16 storage)
Closes https://github.com/caffe2/caffe2/pull/396
Reviewed By: akyrola
Differential Revision: D4990880
Pulled By: asaadaldien
fbshipit-source-id: a716acffb656843e9b31e3e6808bd2d8aa959d03
Summary:
Incorporating a definition of the cell's output and illustrating its usage by adding dropout to all types of cell.
I think that we should try to get rid of aliases in RecurrentNetwork, so that the output of applied_over_sequence is also always (state_1_all, state_2_all, ...). This way we can merge get_output_from_single_step, get_output_from_sequence and get_outputs_with_grads into a single method.
Let me know what you think!
Reviewed By: jhcross
Differential Revision: D4992913
fbshipit-source-id: 737939be336ad145f84e8733cd255d4f7188ef70
Summary:
Specialized implementation of ResizeNearest for width_scale=2 and height_scale=2. This implementation doesn't use divides or calls to std::min, and is unrolled 2x over the width dimension. Also add a correctness test.
About 6x faster.
Reviewed By: ajtulloch
Differential Revision: D4928579
fbshipit-source-id: 5cc92a52bd688690fee907b4333d9c84b666f9c9
Summary: Adding a simple video data layer which allows reading video data from frames and videos, and outputs a 5D tensor. It also allows multiple labels. The current implementation is based on ffmpeg.
Differential Revision: D4801798
fbshipit-source-id: 46448e9c65fb055c2d71855447383a33ade0e444
Summary:
This diff creates a generalized AttentionCell class, which will allow us to construct attention decoders out of arbitrary RNNCell components (with a particular view to using stacked, multi-layer RNNs).
In order to do this, we introduce a new optional input for RNNCell._apply which allows us to provide an additional input that is not processed by prepare_input(). Note that this is an argument only to _apply, not apply, since it is only meant to be used for additional recurrent connections to "embedded" cells, not for standalone RNNs.
Reviewed By: urikz
Differential Revision: D4998465
fbshipit-source-id: 473009ea4917e86e365f9d23aa2f11a46a94fd65
Summary:
Added the possibility to provide 'tiles' and 'axis' as inputs,
as opposed to arguments, for the Tile operator. If provided, the input
values will override the argument values. Now with proper CUDA code.
Differential Revision: D4930347
fbshipit-source-id: b44b032b327c7d7bddfce63abf4e3289d7e74bfb