Commit Graph

636 Commits

Author SHA1 Message Date
Aapo Kyrola
0954775d28 AddPadding CUDA version
Summary: CUDA version of the AddPadding op. It first executes a prefix sum using Cub to compute the cumulative lengths array, then launches a kernel that uses this information to fill the output tensor with the start padding, end padding, and the actual contents.
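For illustration, a minimal Python sketch of exercising the op (a hedged example: it assumes the CPU schema of data and lengths in, padded data and lengths out; the CUDA version is meant to behave the same on a CUDA device):

```python
import numpy as np
from caffe2.python import core, workspace

# Two sequences of lengths 2 and 4, padded with one zero row on each side.
workspace.FeedBlob("data", np.arange(6, dtype=np.float32).reshape(6, 1))
workspace.FeedBlob("lengths", np.array([2, 4], dtype=np.int32))
workspace.RunOperatorOnce(core.CreateOperator(
    "AddPadding", ["data", "lengths"], ["padded", "padded_lengths"],
    padding_width=1))
print(workspace.FetchBlob("padded").shape)    # (10, 1): 6 data rows + 4 pad rows
print(workspace.FetchBlob("padded_lengths"))  # [4 6]
```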

Reviewed By: asaadaldien

Differential Revision: D6391413

fbshipit-source-id: 45b431e5976674729e53cb4752c7753c1d8a69e8
2017-11-22 18:17:21 -08:00
Andrew Tulloch
48415d83c8 Fix instance_norm_test.test_instance_norm_model_helper
Reviewed By: jerryzh168

Differential Revision: D6391749

fbshipit-source-id: ba861d401e358290782db8f360c430e3f3daae96
2017-11-22 15:05:29 -08:00
Yiming Wu
127a55ae49 cast op for empty batch
Summary: The CUDA Cast op can now deal with an empty batch.
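A quick sketch of the empty-batch case this fixes (shapes illustrative; shown on CPU, where the semantics are the same):

```python
import numpy as np
from caffe2.python import core, workspace

# A zero-row blob: the empty-batch case the CUDA kernel must now handle.
workspace.FeedBlob("X", np.zeros((0, 4), dtype=np.float32))
workspace.RunOperatorOnce(
    core.CreateOperator("Cast", ["X"], ["Y"], to=core.DataType.INT32))
print(workspace.FetchBlob("Y").shape)  # (0, 4)
```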

Reviewed By: azzolini

Differential Revision: D6350138

fbshipit-source-id: 2f3d19f4d42ff34806aa9597690e66f6b4de1a6b
2017-11-16 12:20:20 -08:00
Wenyi Huang
d8dfaeeef7 Add batch-based/row-based sparse from/to dense operator
Summary:
Two ops: BatchSparseToDenseOp and DenseToBatchSparseOp,
inverse operations of each other.

Details are described in the op docs.

These ops are used along with flexible topK, whose output is
lengths, indices, and values.
We want to do softmax on the values, but the dimension of each batch row differs, so these ops convert the sparse representation to dense and vice versa. The two ops are also each other's gradient ops.
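For reference, a NumPy sketch of the sparse-to-dense direction (variable names and shapes are illustrative, not the ops' exact signatures):

```python
import numpy as np

# lengths/indices/values layout, as produced by flexible topK:
lengths = np.array([2, 1])            # number of sparse entries per batch row
indices = np.array([0, 3, 2])         # column index of each entry
values = np.array([1.5, 2.5, 3.5])    # value of each entry
dense_dim = 4

dense = np.zeros((len(lengths), dense_dim), dtype=values.dtype)
offset = 0
for row, n in enumerate(lengths):
    dense[row, indices[offset:offset + n]] = values[offset:offset + n]
    offset += n
# dense == [[1.5, 0.0, 0.0, 2.5],
#           [0.0, 0.0, 3.5, 0.0]]
# The dense-to-sparse direction reads back only the listed positions.
```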

Reviewed By: chocjy

Differential Revision: D6288338

fbshipit-source-id: 0ba9e611058b39e46e7414dcc5f39cab29915fa3
2017-11-16 00:59:21 -08:00
Xiaolong Wang
3bde37fbf0 Listwise Ranking -- LambdaNDCG
Summary:
This is part one: it adds the lambdaNDCG loss, which can be used to heuristically
optimize the NDCG metric.

Differential Revision: D5830650

fbshipit-source-id: 1eb696337c9a77727ad40219c68f6468e2e097a5
2017-11-16 00:05:48 -08:00
Simon Layton
1ab3fd1a29 Fix Batched Matmul test accuracy
Summary:
Datatypes were being handled badly in the reference check, causing sporadic failures in CI. All batched mat-mul with fp16 data is performed as pseudo-fp16, with all math in fp32. Adjusted the reference implementation to reflect this.

Adjusted the gradient check threshold to the tightest value that would consistently pass.
Closes https://github.com/caffe2/caffe2/pull/1406

Differential Revision: D6324431

Pulled By: pietern

fbshipit-source-id: 83ff2584438a11f7a6db4599a4fb0e75e9e15a3d
2017-11-14 09:31:18 -08:00
James Reed
8701a2dfa3 Allow negative indices in Concat/Split ops
Summary: Closes https://github.com/caffe2/caffe2/pull/1440

Reviewed By: dzhulgakov

Differential Revision: D6290009

Pulled By: jamesr66a

fbshipit-source-id: 93eaff6103211ff89ed63ecaf4aa96d38e6bed63
2017-11-13 18:32:24 -08:00
Yan Zhu
7b047c161d NegateGradientOp and test
Summary: add NegateGradientOp: in the forward pass, this op simply copies the input to the output; in the backward pass, it flips the sign of the gradients.
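A plain-Python sketch of the two passes (function names are illustrative):

```python
def negate_gradient_forward(x):
    return x.copy()  # forward pass: identity, input copied to output

def negate_gradient_backward(dy):
    return -dy       # backward pass: flip the sign of the incoming gradient
```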

Reviewed By: dragonxlwang

Differential Revision: D6314456

fbshipit-source-id: 56afd8b131eff9f7e120ab7e4e87461df49649d4
2017-11-13 18:05:14 -08:00
Xianjie Chen
c04ec84e1a disable uniform fill large blob
Reviewed By: pietern

Differential Revision: D6299413

fbshipit-source-id: 2ea4a5f1434060c3ab6fd42abd4052bdb10a37cc
2017-11-10 12:10:14 -08:00
Jeff Johnson
0440f3bf93 Reduce caffe2 GPU topk test sizes
Summary: The topk GPU test was taking too much time, but there are still a variety of codepaths to test (k <= 1024, k > 1024, k == 1, k == n). Reduce the batch sizes and n to cut the time taken by the equivalent in-Python CPU code.

Reviewed By: pietern

Differential Revision: D6272628

fbshipit-source-id: b8b8f3601f28bf64f144c73d7c9e915f40c84d70
2017-11-10 07:47:00 -08:00
Xianjie Chen
d1c73eb407 use size_t for rand fill functions in math
Summary: The number of elements in a Caffe2 blob can be larger than int32. Use size_t to prevent overflow.

Reviewed By: ajtulloch

Differential Revision: D6278363

fbshipit-source-id: 356e294c667a53360d8a65b56a63a39d5ce3384e
2017-11-09 18:44:46 -08:00
Wenyi Huang
7cedf80923 add flexible topK op
Summary:
Will probably rename to adaptive topK to be aligned with the layer name.

The main difference from the top_k op is that K is not fixed as a layer parameter;
instead, this op takes in a blob that contains the K for each row of the input data (batch mode).
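A NumPy sketch of the per-row-K semantics, producing the lengths/indices/values layout used by the batch sparse/dense ops (illustrative):

```python
import numpy as np

data = np.array([[0.3, 0.9, 0.1, 0.5],
                 [0.7, 0.2, 0.8, 0.4]])
k_per_row = np.array([1, 2])  # K comes from a blob, one value per row

lengths, indices, values = [], [], []
for row, k in zip(data, k_per_row):
    top = np.argsort(-row)[:k]  # top-k positions of this row
    lengths.append(k)
    indices.extend(top)
    values.extend(row[top])
# lengths == [1, 2], indices == [1, 2, 0], values == [0.9, 0.8, 0.7]
```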

Reviewed By: chocjy

Differential Revision: D6221209

fbshipit-source-id: f7fd575ff8f515d886d93278ad94fd17e8bd6fa5
2017-11-09 16:48:14 -08:00
Junjie Bai
e6fadfa76e Relaxing checks for fp16 in BatchMatMul tests
Reviewed By: pietern

Differential Revision: D6275557

fbshipit-source-id: e336ba9c897b88801f1be1b32029c5af58ec3fc5
2017-11-08 13:42:28 -08:00
Pieter Noordhuis
348e29c49b Don't run CUDA tests for ops without CUDA implementation
Summary: Closes https://github.com/caffe2/caffe2/pull/1434

Reviewed By: houseroad, ilia-cher

Differential Revision: D6272614

Pulled By: pietern

fbshipit-source-id: 7b998b08ec02b03f88a6fd24a949b0d199b2aa37
2017-11-08 10:28:02 -08:00
Xianjie Chen
cbb03b8db8 add modulo operator
Summary: as desc.
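A hedged usage sketch, assuming the op is registered as "Mod" with a "divisor" argument:

```python
import numpy as np
from caffe2.python import core, workspace

workspace.FeedBlob("X", np.array([-3, -1, 0, 2, 7], dtype=np.int32))
workspace.RunOperatorOnce(core.CreateOperator("Mod", ["X"], ["Y"], divisor=3))
print(workspace.FetchBlob("Y"))  # [0 -1 0 2 1] with C-style sign semantics
```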

Reviewed By: chocjy

Differential Revision: D6240026

fbshipit-source-id: fa4dcccebc44b0a713946823b6f56e73d5d6146b
2017-11-06 16:44:16 -08:00
Alexander Sidorov
20feef45bc NNFC operator: an FC with noTrans noTrans options
Summary:
This seems to be faster in a bunch of cases. Prefer to keep it as a
separate op instead of MatMul + Add so it's easy to compare perf on a
per-op basis between this one and the baseline (normal FC).

Reviewed By: akyrola

Differential Revision: D6169187

fbshipit-source-id: 09b96325d44bd181896f396aec88b27314c435b0
2017-11-03 15:08:39 -07:00
Philipp Keller
68ed66a2c5 Faster BatchBoxCox Operator using MKL
Summary: Use MKL VML vsPow() and row-major iteration for a faster BatchBoxCox operator.

Reviewed By: kennyhorror

Differential Revision: D6042052

fbshipit-source-id: 54fc6b9184cb341672183a77730d79a271d09207
2017-11-03 12:04:03 -07:00
Dmytro Dzhulgakov
583bc63c98 Fix boundary checking in 8-bit sparselengthssum ops
Summary: Before, the boundary checking happened after the first access for the 8-bit ops.

Reviewed By: Yangqing

Differential Revision: D6206753

fbshipit-source-id: 07ab240cae8c67b3048f03aa79af0b6399b9940b
2017-11-03 05:19:57 -07:00
Aapo Kyrola
14f95c2782 Updated brew SpatialBN to use initializers
Summary: Updated brew SpatialBN to use initializers, similar to other brew ops such as conv and fc, instead of initializing all of its parameters itself within the brew call.

Reviewed By: asaadaldien

Differential Revision: D5840359

fbshipit-source-id: 9f3d688d4957605eaf7ecd2488bc26bfb1da3f78
2017-11-02 11:25:45 -07:00
Junjie Bai
7c2804ee90 Add support for doing broadcast with single elem dimensions at both ends
Summary: Closes https://github.com/caffe2/caffe2/pull/1413

Reviewed By: jamesr66a

Differential Revision: D6201556

Pulled By: bddppq

fbshipit-source-id: 1d443e895dbb3f5b67a5a0e027977b7807df3de1
2017-11-01 18:33:11 -07:00
Dong Li
3bfabb4d5f support float16 input for operator SparseAdagrad
Summary:
Implemented a new CUDA class for the SparseAdagrad operator. The param and moment inputs can now be float or float16.
The functions for mixed-precision add/mult/store are defined in a separate header file ("caffe2/core/float16_util.h") for reuse.

Reviewed By: azzolini

Differential Revision: D5880200

fbshipit-source-id: dca227f38629a03a9d771f42efe2c0b673075c4d
2017-10-30 19:32:30 -07:00
Aapo Kyrola
669ec0ccba Added FP16 compute support to FC Op
Summary: Allow the GEMMs in the FC/FCGradient Op to do FP16 compute instead of FP32 if the appropriate op flag is set.

Reviewed By: asaadaldien

Differential Revision: D5839777

fbshipit-source-id: 8051daedadf72bf56c298c1cf830b019b7019f43
2017-10-30 17:03:51 -07:00
Junjie Bai
b7a9f51de3 In BatchMatMul, add support for accepting inputs >=2d
Summary: Closes https://github.com/caffe2/caffe2/pull/1399

Differential Revision: D6183083

Pulled By: bddppq

fbshipit-source-id: 5c8f17c2de212fbc39a66c90aa2599b714f5ceb4
2017-10-29 23:38:33 -07:00
Qinqing Zheng
42ffb1ae07 support non-normalized weights
Reviewed By: akyrola

Differential Revision: D6158290

fbshipit-source-id: 4d54e5c0d0f91f23deab18da047df4d209d4c312
2017-10-27 23:18:25 -07:00
Tilak Sharma
7b7dcaf269 Initialize presence tensor if data is empty.
Summary: See https://fb.facebook.com/groups/811605488888068/permalink/1645450575503551.

Differential Revision: D6116836

fbshipit-source-id: 3072643eaf6f134bda7d224af3d5f8339da1f39d
2017-10-27 01:05:42 -07:00
Qing He
0b0d5b2b1d Add tensor output that gives the sampled values
Summary: Given an additional tensor containing the values corresponding to the weighted samples, add a tensor output that contains the values selected by the sampled indices.

Reviewed By: akyrola

Differential Revision: D6050094

fbshipit-source-id: 1eccc641b99e30d36ae83d49f630b018a53e4147
2017-10-26 16:04:57 -07:00
Jiyan Yang
6e33ae79df Add gradient op for WeightedSum op
Reviewed By: dzhulgakov

Differential Revision: D6149163

fbshipit-source-id: 0e8cf400323233d001243bc5cb25a0025115a564
2017-10-26 00:16:51 -07:00
Ahmed Taei
5bb8ed67e3 Compute GLU for an arbitrary axis
Summary: As in title

Differential Revision: D6151804

fbshipit-source-id: bd0fa08be1676ebd1abd9720711c221c61c11ad1
2017-10-25 19:49:55 -07:00
Aapo Kyrola
2e4d8aa530 Added FP16/FP32 MomentumSGD + WeightDecay Update Ops
Summary:
Added two new ops, FP16MomentumSGDUpdate and FP32MomentumSGDUpdate, which perform both the momentum SGD and weight decay updates to a given parameter in a single op, and are thus more efficient.

Also updated the standard momentum SGD test to check that Nesterov momentum works.
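A plain-Python reference for the fused update (a sketch only: the Nesterov variant, exact sign conventions, and fp16 storage details are elided):

```python
def momentum_sgd_weight_decay_update(param, moment, grad, lr, mu, wd):
    grad_wd = grad + wd * param              # weight decay folded into the gradient
    moment_new = mu * moment + lr * grad_wd  # momentum accumulation
    return param - moment_new, moment_new    # single fused parameter update
```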

Reviewed By: asaadaldien

Differential Revision: D5837837

fbshipit-source-id: 5ad487b9c59434491d3a4fcfdeed820db6083f57
2017-10-24 12:28:16 -07:00
Junjie Bai
ed08533a1e Add CUDA version of ScatterAssign
Reviewed By: houseroad

Differential Revision: D6128352

fbshipit-source-id: ea59f4bc723ef929b0f6ed15797df776d8054422
2017-10-24 10:20:03 -07:00
Ahmed Taei
512a8015b8 Gated Linear Unit implementation
Summary: As titled

Differential Revision: D6117600

fbshipit-source-id: 84b0154dc4cf77cc9c9146e9a534c7485989346b
2017-10-23 18:14:57 -07:00
Yarik Markov
c6ef04db04 Add "dtype" parameter for GivenTensorOp
Summary: Adding "dtype" parameter for the GivenTensorOp. Also, providing backwards compatibility for the existing code, byt supporting the templating if "dtype" is not provided.

Reviewed By: bddppq

Differential Revision: D6090049

fbshipit-source-id: f5deaa57b49f2280289975f4583aba5bc064a2bc
2017-10-23 16:06:37 -07:00
Qinqing Zheng
6a4182eead weighted sample op cuda
Summary: CUDA version of weighted sampling operator; minor changes for CPU version

Reviewed By: asaadaldien

Differential Revision: D6106668

fbshipit-source-id: 42d7607bd845a4a39cf5b89d7476904cb5928431
2017-10-21 18:49:59 -07:00
Badri Narayan Bhaskar
25bfffeafe Swish Activation Function
Summary:
Swish: A self-gated activation function.
https://arxiv.org/pdf/1710.05941.pdf

Reviewed By: ajtulloch

Differential Revision: D6100424

fbshipit-source-id: 0103d6d82e9ffb50106c98a8785e62b8808e9af1
2017-10-20 10:37:43 -07:00
Junjie Bai
ee62a595fc ScatterAssign int types
Summary: Closes https://github.com/caffe2/caffe2/pull/1357

Reviewed By: dzhulgakov

Differential Revision: D6107036

Pulled By: bddppq

fbshipit-source-id: 9278dae988c3c0656b4e4fd08bf7ca1e2eec3348
2017-10-19 23:22:54 -07:00
Dmytro Dzhulgakov
623f2bf815 Add GivenTensorInt64Fill on gpu
Summary: Before we fix it properly with 'type' argument.

Reviewed By: bddppq

Differential Revision: D6103973

fbshipit-source-id: 8c00a93c373dd0ad0bbfe59944495f6574223ab6
2017-10-19 18:32:41 -07:00
Hassan Eslami
db6a9d2ae4 Fixes type inference for Slice and GivenTensor*Fill operators
Summary:
Currently, the type inference infers FLOAT as the type for all GivenTensor*Fill operators. However, the inferred type should match the actual operator.

Also, for the `Slice` operator, there is a corner case where type inference fails.

Reviewed By: azzolini

Differential Revision: D6096813

fbshipit-source-id: d65b7c0f42436138cbc49d8a5a62374fa5e927e1
2017-10-19 14:02:21 -07:00
James Cross
96c6212513 repeat sequence mask for data dims
Summary: Allow the application of sequence-length masking to be replicated along one or more minor axes. See task for details.

Reviewed By: jamesr66a

Differential Revision: D6090835

fbshipit-source-id: 9064232aa9b93246c582b6e0bae73be5dbe09e98
2017-10-18 18:08:08 -07:00
Bryan Wu
6ac393a32b WeightedSigmoidCrossEntropyWithLogits
Summary:
Op for computing SigmoidCrossEntropyWithLogits with per-label, per-sample weights. Can be used for addressing class or label imbalance.

Doc:
Given three matrices: logits, targets, weights, all of the same shape,
(batch_size, num_classes), computes the weighted sigmoid cross entropy between
logits and targets. Specifically, at each position r,c, this computes
weights[r, c] * crossentropy(sigmoid(logits[r, c]), targets[r, c]), and then
averages over each row.
Returns a tensor of shape (batch_size,) of losses for each example.
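A NumPy reference matching the doc above, using the numerically stable form of sigmoid cross-entropy (names illustrative):

```python
import numpy as np

def weighted_sigmoid_xent_with_logits(logits, targets, weights):
    # stable per-element BCE: max(x, 0) - x * t + log(1 + exp(-|x|))
    per_elem = (np.maximum(logits, 0) - logits * targets
                + np.log1p(np.exp(-np.abs(logits))))
    return (weights * per_elem).mean(axis=1)  # shape: (batch_size,)
```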

Reviewed By: stephenyan1231

Differential Revision: D5997723

fbshipit-source-id: f3172325f1c98b6f26e1700131ef897b743a72fc
2017-10-16 17:34:38 -07:00
Junjie Bai
1735c5f6c7 Add Filler op for double
Summary: Closes https://github.com/caffe2/caffe2/pull/1344

Reviewed By: dzhulgakov

Differential Revision: D6065137

Pulled By: bddppq

fbshipit-source-id: 1849beeaa4fee8cc056b685664f91daca71764b8
2017-10-16 13:48:15 -07:00
Aapo Kyrola
123cb5dd07 use non-cudnn transpose for int tensors
Summary: Turns out CuDNN's tensor transform only supports floats. The previous implementation pretended it would work with ints by casting to floats, and indeed passed tests for some reason, but rgirdhar found a case where it returned nonsensical results. So rewire int transposes to use the non-CuDNN version. Had to refactor a bit for that. Also added a test for the case.

Reviewed By: asaadaldien

Differential Revision: D6043284

fbshipit-source-id: cc3b14f9fbbdeff421b01da453a1d3c7c5ffd4ac
2017-10-13 14:02:48 -07:00
Junjie Bai
4c3b02f314 Enable Flatten operator to take an arbitrary axis argument
Summary:
Input dimensions up to "axis" will be flattened into the outer dim of the output, and the remaining input dims will form the inner dim.
Closes https://github.com/caffe2/caffe2/pull/1330
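A usage sketch of the new axis argument (shapes illustrative):

```python
import numpy as np
from caffe2.python import core, workspace

# axis=2: dims [0, 2) form the outer dim, the remaining dims the inner dim.
workspace.FeedBlob("X", np.zeros((2, 3, 4, 5), dtype=np.float32))
workspace.RunOperatorOnce(core.CreateOperator("Flatten", ["X"], ["Y"], axis=2))
print(workspace.FetchBlob("Y").shape)  # (6, 20)
```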

Reviewed By: dzhulgakov

Differential Revision: D6039560

Pulled By: bddppq

fbshipit-source-id: e92c30b49a9288feeefc4a639522406e97e149e1
2017-10-13 12:28:22 -07:00
Tilak Sharma
1e8a16224f PackSegments: return value presence.
Summary:
Optionally return a blob of shape [batch size, max length] that is
false only in locations where the output tensor was padded.
One can separately convert lengths to segment ids and cast, but
this is more convenient, and possibly more efficient.
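A NumPy reference for the presence blob's semantics (names illustrative):

```python
import numpy as np

lengths = np.array([2, 3, 1])
max_len = lengths.max()
presence = np.arange(max_len)[None, :] < lengths[:, None]
# [[ True,  True, False],
#  [ True,  True,  True],
#  [ True, False, False]]  -- False exactly where the packed output is padding
```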

Differential Revision: D6006073

fbshipit-source-id: af6c4ea31972566e7d059dcd3fdd8afba97a88e9
2017-10-12 11:17:34 -07:00
Evgeniy Shin
c6f96c1d7b Add GPU support for LengthsTile
Reviewed By: kittipatv

Differential Revision: D5999171

fbshipit-source-id: cd0e305488f05c20d1925745fca0c4b4eef23071
2017-10-12 11:17:34 -07:00
Andrey Malevich
e13f199452 Switch RNNOp to use NetDef argument for step representation.
Summary: Before this diff, RNNOp was using TextFormat for representing steps. This diff changes RNNOp to prefer a NetDef argument instead. To be backward compatible, it still supports TextFormat for existing models, though we can now compile RNNs without TextFormat as well.

Reviewed By: salexspb

Differential Revision: D5949330

fbshipit-source-id: 9336a8f5ccf30ad8d8e3a7067b9437e1704b1c9f
2017-10-10 22:01:51 -07:00
Zhifeng Deng
66b8cb95e9 Add int64 support to sparse_to_dense_mask_op
Summary: [CAFFE2] Add int64 support to sparse_to_dense_mask_op

Reviewed By: ender-wieczorek

Differential Revision: D6022278

fbshipit-source-id: 489b6df4d43a64c743ee278d94929ca50259f7b8
2017-10-10 17:19:44 -07:00
Jerry Zhang
1b892ea295 Enable axis argument for MatmulOp
Summary: att

Reviewed By: ajtulloch

Differential Revision: D5523365

fbshipit-source-id: b7a379c9c4326cd642e7b4768cc590b5e1b94b6d
2017-10-10 16:47:37 -07:00
Jiyan Yang
2c44a9f9cd Add BatchBucketOneHotOp
Summary:
Input is a matrix tensor. Its first dimension is the batch
size. For each column, bucketize it based on the boundary values and then do
one-hot encoding. The `lengths` blob specifies the number of boundary values for each
column. The final number of buckets is this number plus 1; this is also
the expanded feature size. `boundaries` specifies all the boundary values.
Note that each bucket is right-inclusive. That is, given boundary values
[b1, b2, b3], the buckets are defined as (-inf, b1], (b1, b2], (b2, b3], (b3, inf).
For example

If data = [[2, 3], [4, 1], [2, 5]], lengths = [2, 3],
and boundaries = [0.1, 2.5, 1, 3.1, 4.5], then

output = [[0, 1, 0, 0, 1, 0, 0], [0, 0, 1, 1, 0, 0, 0], [0, 1, 0, 0, 0, 0, 1]]

Reviewed By: xianjiec

Differential Revision: D5976030

fbshipit-source-id: fd746c20b19bcdf5f769451d804c219ad6463f28
2017-10-06 13:25:12 -07:00
Qing He
f535700ccc Add weighted_sampling operator to Caffe2
Summary: Add weighted_sampling operator to Caffe2

Reviewed By: akyrola

Differential Revision: D5962199

fbshipit-source-id: ab3f56a1dc7b8eaf4ed4d74af6c6c08dccca5a1e
2017-10-05 20:33:59 -07:00
Wenyi Huang
20b3918ba8 add cuda support for Topk Gradient
Summary: as title

Reviewed By: azzolini

Differential Revision: D5822303

fbshipit-source-id: 3bc88a9071167c41e3fc717a2b31dceee6fee360
2017-10-04 19:31:56 -07:00
Artem Volkhin
a2be56bc34 add GatherRangesToDense operator
Summary: adding an operator with behavior similar to fused GatherRanges and Split.

Reviewed By: kennyhorror

Differential Revision: D5961761

fbshipit-source-id: 616d4668b8901256418004def90d91a0b2041620
2017-10-04 15:18:10 -07:00
Uthsav Chitra
964d740ede adding batch support to SequenceMaskOps
Summary:
Added support for batching to SequenceMaskOp.

Let b be the batch dim and k be the axis dim (we enforce that b < k). Write the dimensions of the input tensor as [a_1, ..., a_b, ..., a_k, ...]. We first collapse the tensor down to 3D, with dimensions [P, Q, D], where P = a_1 * ... * a_b, Q = a_{b+1} * ... * a_{k-1}, and D = a_k * a_{k+1} * ... * a_n. Then we mask each slice [i, :, :] of this 3D tensor (note that each slice is a Q x D matrix).
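A NumPy sketch of the collapse (shapes illustrative):

```python
import numpy as np

# 1-indexed dims a_1..a_n as in the description above.
a = np.zeros((2, 3, 4, 5))          # n = 4 input dims
b, k = 1, 3                         # batch dim and axis dim, b < k
P = int(np.prod(a.shape[:b]))       # a_1 * ... * a_b         = 2
Q = int(np.prod(a.shape[b:k - 1]))  # a_{b+1} * ... * a_{k-1} = 3
D = int(np.prod(a.shape[k - 1:]))   # a_k * ... * a_n         = 20
collapsed = a.reshape(P, Q, D)      # each collapsed[i] is masked as a Q x D matrix
```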

Reviewed By: jamesr66a

Differential Revision: D5733382

fbshipit-source-id: e7a314d9fe6e6691a75112edbee8ba6e8ea8e396
2017-10-04 15:18:09 -07:00
Yangqing Jia
d315c62e72 Kick fbsync
Summary:
fbshipit-source-id: 886ac051235a878b5b0fe294619bb6184d5d24ab

(Note: this ignores all push blocking failures!)

Reviewed By: dzhulgakov

Differential Revision: D5947236

fbshipit-source-id: c3f7d00d5d7faad6366d4c456fffb9387f30b2aa
2017-09-29 16:31:11 -07:00
Yangqing Jia
8286ce1e3a Re-license to Apache
Summary: Closes https://github.com/caffe2/caffe2/pull/1260

Differential Revision: D5906739

Pulled By: Yangqing

fbshipit-source-id: e482ba9ba60b5337d9165f28f7ec68d4518a0902
2017-09-28 16:22:00 -07:00
Andrei Chtcherbatchenko
cb986bb913 Deformable convolution operator in Caffe2
Summary:
This diff implements deformable convolution operator. The idea behind it is that instead of using a fixed NxM kernel, we associate a set of learnable offsets (dx, dy) with each element of the kernel, and use bilinear interpolation to estimate weights in between the integer indices. For background see paper https://arxiv.org/abs/1703.06211 and mxnet implementation https://github.com/msracver/Deformable-ConvNets/tree/master/rfcn/operator_cxx

To simplify code review of the new files, the feature is stacked into two diffs. The first diff duplicates the core convolution operator into a separate set of files prefixed with deform_; it also provides documentation on the operator, but nothing else. The second diff contains the actual changes that make deformable convolution possible. Therefore, I recommend focusing your code review on the changes between diffs 1 and 2.

Current limitations of the operator:
1. Only CUDA is supported. CPU version is not implemented.
2. Only NCHW layout is supported.
3. Only 2d convolution is supported.

CUDA code is ported from mxnet implementation with minimal changes.

See also inline comments in code for tricky parts.

Reviewed By: akyrola

Differential Revision: D5702983

fbshipit-source-id: 4d1bf2c6c73135e6a70dbe87037b38915f4453f9
2017-09-26 16:20:31 -07:00
Aapo Kyrola
8a45b65f96 ReduceFrontMax, ReduceBackMax + gradients, CPU and CUDA
Summary: Implementation of ReduceFront/Back/Max/Gradient for CPU and CUDA.

Reviewed By: asaadaldien

Differential Revision: D5905402

fbshipit-source-id: 6967ce41aa95ee5ea7a90065430892e81a6da477
2017-09-26 15:22:25 -07:00
Frank Jiang
0a5ee1e806 Implemented RowWiseSparseAdagrad operator that only keeps one moment term per embedding
Summary: Implemented a version of SparseAdagrad that only keeps track of an average sum-of-squared-gradients term for each row of the parameter tensor, rather than a sum-of-squared-gradients term for each individual parameter.
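A NumPy reference for the row-wise update (a sketch; Caffe2's exact sign and epsilon conventions may differ):

```python
import numpy as np

def rowwise_sparse_adagrad(param, moment, indices, grad, lr, eps=1e-6):
    for i, row in enumerate(indices):
        moment[row] += np.mean(grad[i] ** 2)  # one scalar moment per row
        param[row] -= lr * grad[i] / (np.sqrt(moment[row]) + eps)
    return param, moment
```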

Differential Revision: D5881918

fbshipit-source-id: bd96ccf25554b457baaaca9309fc8048adbb37f7
2017-09-26 13:34:44 -07:00
Aapo Kyrola
753133f015 SignOp
Summary: Equivalent to numpy.sign for CPU and CUDA.

Reviewed By: dzhulgakov

Differential Revision: D5906446

fbshipit-source-id: 389f994bccbb87a62df2c4aaacc327f9a6223cbd
2017-09-26 09:17:45 -07:00
Junjie Bai
d9b0bcd7a4 Make all existing (except in RoIPool) "is_test" arguments required
Reviewed By: akyrola

Differential Revision: D5830168

fbshipit-source-id: 8634e9cfe308ba0ee90cd8a5c4b09a47b0b5f015
2017-09-25 23:46:12 -07:00
Alisson Gusatti Azzolini
15a7bb3bff GatherByKeyOp (Inverse operation of PartitionOp)
Summary: Can be used to gather outputs of a sharded "Gather", or for the SparseLengthsSumGradient when we need the gradient on values.

Reviewed By: akyrola

Differential Revision: D5800901

fbshipit-source-id: 90835755d6d15be13fb0f538cfade980cf4a1cd2
2017-09-24 22:18:17 -07:00
Misha Smelyanskiy
2cbb4167c1 Adding uint8 support to the code generator for high-performance embedding look-up kernels
Summary:
Adding uint8 support to the code generator for high-performance embedding look-up kernels, supporting the
Sum, WeightedSum, and Mean reducers. Added a number of unit tests for these operators.

Performance Results
===================

Performance results are below for the old code, sparse_lengths_sum_benchmark.old.par, which uses the
code in lengths_reducer_rowwise_8bit_ops.h, and our new code, optimized via the code generator,
sparse_lengths_sum_benchmark.new.par. Block size was 128 in all cases.

[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.old.par --iteration 10000 --dtype uint8
I0912 02:49:58.773259 2640913 net_simple.cc:162] Time per operator type:
I0912 02:49:58.773264 2640913 net_simple.cc:171]         0.75769 SparseLengthsSum8BitsRowwise

[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype uint8
I0912 02:50:33.981832 2642102 net_simple.cc:162] Time per operator type:
I0912 02:50:33.981837 2642102 net_simple.cc:171]        0.233322 SparseLengthsSum8BitsRowwise

[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype float16
I0912 02:51:26.748972 2643925 net_simple.cc:162] Time per operator type:
I0912 02:51:26.748977 2643925 net_simple.cc:171]        0.106591 SparseLengthsSum

[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype float
I0913 01:39:22.372238 1076874 net_simple.cc:162] Time per operator type:
I0913 01:39:22.372244 1076874 net_simple.cc:171]        0.211041 SparseLengthsSum

Analysis
========
Our optimized generated code is ~3.5x faster than the original code in lengths_reducer_rowwise_8bit_ops.h,
as the numbers above show.

However, our uint8 is about 2x slower than float16 and is on par with float32. There are several reasons for that:
1. uint8 introduces extra instructions to apply the scale and bias factors.
2. In addition to the embedding blocks, we are now also reading scale_bias.
   For every pair of scale and bias, we bring in an entire cache line of
   64 bytes while only using 8 bytes. A 128-wide uint8 input block only occupies 2 cache lines, and hence
   reading a nearly entire extra cache line of useless data adds to the bandwidth wastage.
3. In addition, the hardware prefetcher runs past the end of the input block and the scale_bias
   cache line, trying to prefetch more useless lines. This effect was characterized in the Appendix section of
   https://fb.facebook.com/notes/jason-lu/sparse-adagrad-performance-optimization-in-model-training/10214810437360961/

To get deeper insight into what is going on,
we isolated the SparseLengthsSum and SparseLengthsSum8BitsRowwise code, for float32, float16, and uint8,
into a microbenchmark, where we varied the block size while keeping the table size constant (256MB):

block_size  time(uint8) time(float16) time(float32)
64          0.19        0.09          0.17
128         0.12        0.09          0.17
256         0.70        0.09          0.14
1024        0.50        0.06          0.10

The pattern for block sizes of 64 and 128 is similar to what we observed in sparse_lengths_sum_benchmark.
However, we see that as block_size increases (for a fixed table size),
the time to perform the embeddings decreases quite drastically. For block_size of 256 and beyond, uint8 starts achieving a
speedup over float16. Longer blocks better amortize the bandwidth wastage due to scale_bias and the hardware prefetcher
running past the end of the block.

Reviewed By: kennyhorror

Differential Revision: D5870907

fbshipit-source-id: 445321b96f1b5801ef91f296f6063c35673ee11b
2017-09-21 14:50:43 -07:00
Dmytro Dzhulgakov
0fff025973 Consistent behavior of max reduction for segment ops and fix test
Summary:
Two implementations of max pool reducers had different semantics in the case of equal indices. This matters less in real cases, but breaks tests. Choosing the behavior of LengthMax over SortedSegmentRangeMax, as the former is more widely used.

Also some minor tweaks to the test code.

Reviewed By: Yangqing

Differential Revision: D5870386

fbshipit-source-id: 6488cbd5cacaf595ffc07c44084730dd44b3f9dd
2017-09-20 10:59:43 -07:00
Yangqing Jia
06b7a9e0f6 Backed out changeset 3a5c020294d8
Summary:
Broke
  CAFFE2_HYPOTHESIS_PROFILE=debug buck test //caffe2/caffe2/python:lengths_reducer_rowwise_8bit_ops_test

Reviewed By: kennyhorror

Differential Revision: D5867880

fbshipit-source-id: 80c6f23eccb59b74be4a7258b4f193d79f814c3f
2017-09-19 17:54:18 -07:00
Misha Smelyanskiy
b468ffe6d1 Adding uint8 support to the code generator for high-performance embedding look-up kernels
Summary:
Adding uint8 support to the code generator for high-performance embedding look-up kernels, supporting the
Sum, WeightedSum, and Mean reducers. Added a number of unit tests for these operators.

Performance Results
===================

Performance results are below for the old code, sparse_lengths_sum_benchmark.old.par, which uses the
code in lengths_reducer_rowwise_8bit_ops.h, and our new code, optimized via the code generator,
sparse_lengths_sum_benchmark.new.par. Block size was 128 in all cases.

[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.old.par --iteration 10000 --dtype uint8
I0912 02:49:58.773259 2640913 net_simple.cc:162] Time per operator type:
I0912 02:49:58.773264 2640913 net_simple.cc:171]         0.75769 SparseLengthsSum8BitsRowwise

[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype uint8
I0912 02:50:33.981832 2642102 net_simple.cc:162] Time per operator type:
I0912 02:50:33.981837 2642102 net_simple.cc:171]        0.233322 SparseLengthsSum8BitsRowwise

[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype float16
I0912 02:51:26.748972 2643925 net_simple.cc:162] Time per operator type:
I0912 02:51:26.748977 2643925 net_simple.cc:171]        0.106591 SparseLengthsSum

[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype float
I0913 01:39:22.372238 1076874 net_simple.cc:162] Time per operator type:
I0913 01:39:22.372244 1076874 net_simple.cc:171]        0.211041 SparseLengthsSum

Analysis
========
Our optimized generated code is ~3.5x faster than the original code in lengths_reducer_rowwise_8bit_ops.h,
as the numbers above show.

However, our uint8 is about 2x slower than float16 and is on par with float32. There are several reasons for that:
1. uint8 introduces extra instructions to apply the scale and bias factors.
2. In addition to the embedding blocks, we are now also reading scale_bias.
   For every pair of scale and bias, we bring in an entire cache line of
   64 bytes while only using 8 bytes. A 128-wide uint8 input block only occupies 2 cache lines, and hence
   reading a nearly entire extra cache line of useless data adds to the bandwidth wastage.
3. In addition, the hardware prefetcher runs past the end of the input block and the scale_bias
   cache line, trying to prefetch more useless lines. This effect was characterized in the Appendix section of
   https://fb.facebook.com/notes/jason-lu/sparse-adagrad-performance-optimization-in-model-training/10214810437360961/

To get deeper insight into what is going on,
we isolated the SparseLengthsSum and SparseLengthsSum8BitsRowwise code, for float32, float16, and uint8,
into a microbenchmark, where we varied the block size while keeping the table size constant (256MB):

block_size  time(uint8) time(float16) time(float32)
64          0.19        0.09          0.17
128         0.12        0.09          0.17
256         0.70        0.09          0.14
1024        0.50        0.06          0.10

The pattern for block sizes of 64 and 128 is similar to what we observed in sparse_lengths_sum_benchmark.
However, we see that as block_size increases (for a fixed table size),
the time to perform the embeddings decreases quite drastically. For block_size of 256 and beyond, uint8 starts achieving a
speedup over float16. Longer blocks better amortize the bandwidth wastage due to scale_bias and the hardware prefetcher
running past the end of the block.

Reviewed By: dzhulgakov

Differential Revision: D5824641

fbshipit-source-id: 3a5c020294d84874da78c6943e596423393473d6
2017-09-19 10:50:09 -07:00
Xianjie Chen
eccfa1041c fix cuda GatherOp for empty batch
Summary: as title

Differential Revision: D5840432

fbshipit-source-id: 5d9021f152c21d24e91dc0cc3d95443782afc228
2017-09-15 17:40:43 -07:00
Dhruv Mahajan
c3fd31b1a2 weights for labels in image_input_op
Summary: Introduced weights for labels in the multi-label setting. An extra weight blob is introduced and read in the operator when the label setting is weighted sparse.

Reviewed By: kevinwilfong

Differential Revision: D5812467

fbshipit-source-id: efb209092e1e9effc915b0a753fa0c67b47a4fb6
2017-09-15 17:40:42 -07:00
Aapo Kyrola
fb45383ed6 resubmission of PR1175: fp16 BatchMatMul
Summary: PR 1175 caused a build error because gemmBatched was only defined under a specific #ifdef. It is now outside the #ifdef, and things work.

Reviewed By: asaadaldien

Differential Revision: D5834868

fbshipit-source-id: 072a64c8f4b259ff7504104121766115b46b8aa0
2017-09-14 21:46:05 -07:00
Jerry Zhang
0e7bd68536 Allow one output for dropout at inference time
Summary: att

Reviewed By: bddppq

Differential Revision: D5680214

fbshipit-source-id: 19e731901cb5c9491100c61baefc4b75e6e8b262
2017-09-14 10:46:41 -07:00
Yangqing Jia
f0d0361609 Revert D5794634: [caffe2][PR] fp16: BatchMatMul
Summary:
This reverts commit 911c462824edec3de529a5a4385a4c437e24bf59

bypass-lint

Differential Revision: D5794634

fbshipit-source-id: 1863b02282329cbee6b10e5870f03051b4bb6c58
2017-09-13 18:46:47 -07:00
Luke Yeager
37af6566e1 fp16: LSTMUnit
Summary:
Was https://github.com/caffe2/caffe2/pull/1151
Closes https://github.com/caffe2/caffe2/pull/1191

Differential Revision: D5825387

Pulled By: akyrola

fbshipit-source-id: edb47c8bd7ffb72e1e587a9c5bfee9347e3d587e
2017-09-13 15:47:03 -07:00
Junjie Bai
90ca470d70 Standardize operator argument "is_test"
Summary:
Also add the ability to mark an argument as required.

Added a string constant `OpSchema::Arg_IsTest` for `is_test` arg.
If users define the `is_test` argument with `ArgIsTest(...)`, it automatically becomes a required argument; meanwhile, users can still use `Arg("is_test", ...)` to define an optional `is_test` argument.

Reviewed By: akyrola

Differential Revision: D5812391

fbshipit-source-id: eaaba50d027813a8012389edc6c459de23c3c728
2017-09-13 14:35:27 -07:00
Luke Yeager
3cfc6f26e7 fp16: BatchMatMul
Summary:
Was https://github.com/caffe2/caffe2/pull/1151
Closes https://github.com/caffe2/caffe2/pull/1175

Reviewed By: Yangqing

Differential Revision: D5794634

Pulled By: akyrola

fbshipit-source-id: 911c462824edec3de529a5a4385a4c437e24bf59
2017-09-13 14:35:25 -07:00
Alisson Gusatti Azzolini
c07ebd2396 TrimDataset to ensure size is a multiple of the number of replicas
Summary: For data parallelism we need the batch size to be a multiple of the number of replicas. To achieve this, with this diff we do Dataset(rec).trim(multiple_of=num_replicas).

Reviewed By: dzhulgakov, harouwu

Differential Revision: D5753861

fbshipit-source-id: c5d728b925707dbd3d1f500a93e67e185c223569
2017-09-13 12:17:21 -07:00
Luke Yeager
361bbb8b43 fp16: SumReduceLike
Summary:
Was https://github.com/caffe2/caffe2/pull/1151
Closes https://github.com/caffe2/caffe2/pull/1183

Differential Revision: D5794704

Pulled By: akyrola

fbshipit-source-id: e4dee46f753e9a8663057c81f23028f6246fba02
2017-09-13 11:46:23 -07:00
Sachin Padmanabhan
a198da5583 Added LengthMax Operator to Caffe2
Summary: Added LengthMax operator to Caffe2.

Reviewed By: dzhulgakov

Differential Revision: D5720124

fbshipit-source-id: 1995fea8e480c9a9f3e054d02801b03c1ce6c51b
2017-09-12 20:01:48 -07:00
Viswanath Sivakumar
583d031754 Operator to compute RoI region coordinates for RMAC
Summary:
Computes a fixed grid of RMAC region coordinates for a given 4D feature tensor
(NCHW), as described in https://arxiv.org/abs/1511.05879. The output is in the
`roi` format expected by RoIPoolOp. To compute the actual RMAC itself, the
output of this op should be passed to RoIPoolOp.

Reviewed By: wickedfoo

Differential Revision: D5594994

fbshipit-source-id: 5edac98a18137b53555f9a16354419b424679c99
2017-09-12 12:47:17 -07:00
Xianjie Chen
be406b1e5f Revert D5639080: Caffe2: Cuda implementation for BatchOneHot operator
Summary:
This reverts commit 8ee280c4bab64c1fdfb7429ee2c9ac8c02933931

bypass-lint

Differential Revision: D5639080

fbshipit-source-id: cf522822b7cb5ba9a238ba7837f0f522e1f49b73
2017-09-12 11:51:14 -07:00
Mayank Rana
1c414426df Caffe2: Cuda implementation for BatchOneHot operator
Summary: Cuda implementation for BatchOneHot operator.

Reviewed By: lvdmaaten

Differential Revision: D5639080

fbshipit-source-id: 8ee280c4bab64c1fdfb7429ee2c9ac8c02933931
2017-09-11 08:24:44 -07:00
Luke Yeager
1cf94854a4 fp16: SequenceMask
Summary:
Was https://github.com/caffe2/caffe2/pull/1151
Closes https://github.com/caffe2/caffe2/pull/1178

Reviewed By: bddppq

Differential Revision: D5794641

Pulled By: akyrola

fbshipit-source-id: c3bd99dde74317280a65af7cc7a36a6a734822f6
2017-09-09 13:02:38 -07:00
Luke Yeager
6cf172c60d fp16: SumSqrElements
Summary:
Was https://github.com/caffe2/caffe2/pull/1151
Closes https://github.com/caffe2/caffe2/pull/1179

Differential Revision: D5794650

Pulled By: akyrola

fbshipit-source-id: 63e7973a88193a3b74ac4ba677df737889cbf0b6
2017-09-08 16:36:51 -07:00
Luke Yeager
1a2b229d47 fp16: add test for FC
Summary:
fp16 and TensorCore support was already added to the op in https://github.com/caffe2/caffe2/pull/1056. This adds a test.
Closes https://github.com/caffe2/caffe2/pull/1182

Differential Revision: D5794698

Pulled By: akyrola

fbshipit-source-id: b0d7ef317dfbb9d712b0b4646b38dc600b8434f1
2017-09-08 10:34:34 -07:00
Yan Shang
b6c9ecac7c Fix shape inference of distance_op
Summary: The shape inference of distance_op has issues (it only works when the inputs are 1D tensors). This diff fixes the shape inference and the unit test.

Reviewed By: kittipatv

Differential Revision: D5788744

fbshipit-source-id: cb1b7facf7b9ccd64b54edca156325eceef50f33
2017-09-07 17:16:46 -07:00
Junjie Bai
176f8f9a19 Make ConvTranspose allow optional bias term
Reviewed By: jerryzh168

Differential Revision: D5755702

fbshipit-source-id: a00487ca376d09b68132162c53797f5af052d114
2017-09-07 17:16:43 -07:00
Kittipat Virochsiri
3251c60804 TensorInferenceFunction for Unique
Summary: Filling in the gap in tensor inference

Reviewed By: sunnieshang, akyrola

Differential Revision: D5779550

fbshipit-source-id: 9ec68c9dad566183d7d0fc2819829c2b91430dda
2017-09-06 15:37:11 -07:00
Aapo Kyrola
ceb13bf3fb Fix cell/hidden init issue, add copy states to test
Summary: As titled. I wonder why this had not been encountered before. It only affects cases where the states are copied over, though.

Reviewed By: Yangqing

Differential Revision: D5777314

fbshipit-source-id: 8aef435c832e4ead5bb3d3e35bb065c734a2af5f
2017-09-06 14:16:17 -07:00
Aapo Kyrola
631971e459 threaded RNN executor for CPU, multi-stream executor CUDA
Summary:
Special executor for RNNs which can exploit parallelism over timesteps. For CPU we use multi-threading, achieving a roughly 3x improvement on 4-layer LSTMs.
With CUDA, perf improvements are more modest, but the structure allows for optimizing further. For CUDA, we use multiple streams and events if there is parallelism
over timesteps. In my experiments, though, it was not good to use more than 2 streams.

Flag --caffe2_rnn_executor can be used to switch the executor off.

Reviewed By: salexspb

Differential Revision: D5749304

fbshipit-source-id: d6f76b3e16598be5b4e8188aff031671ebafaa4c
2017-09-06 12:26:30 -07:00
Dong Li
1104bab796 add axis argument to NormalizeOp and NormalizeGradientOp
Summary:
As described in task T21337239, NormalizeOp currently normalizes over only the last dimension.
In this commit, the following changes have been made:
(1) Added an axis parameter to NormalizeOp in both the CPU and CUDA context.
(2) Added the same axis parameter to NormalizeGradient in both the CPU and CUDA context.
(3) Removed the limitation that the original NormalizeOp operator requires the input dimension to be 2.

Reviewed By: akyrola

Differential Revision: D5745162

fbshipit-source-id: 69e04f59ac4d954b0062c3b2a53c8ca465a1027b
2017-09-05 11:17:32 -07:00
Curtis Huang
c9238671ee Use char-ngram embedding for out-of-vocabulary words
Summary:
**Description**

Provide the DeepText model with the functionality to load a secondary index (a pre-trained char-ngram embedding, e.g. FastText) during training/test. Embeddings of out-of-vocabulary words will be computed on the fly during training/test by averaging the char-ngram embeddings.

**Approach**

This diff provides two custom operators to accomplish this task: ConditionalOp and IndexCharNgramGetOp. We first use IndexCharNgramGetOp to perform a char-ngram index lookup and return a sparse tensor segmented by lengths for each token. The sparse tensor is then used to compute the average embedding provided by the char-ngram index. Finally, we use a ConditionalOp to replace those whose embeddings were not found in the original index during the feature-apply stage. Please refer to the documentation in the code for more details.

Reviewed By: jamesr66a

Differential Revision: D5666924

fbshipit-source-id: f76605d093154a014d5b9ebf9510de9d79874eee
2017-09-01 19:16:49 -07:00
James Cross
53ccbd9a6e soft-coverage attention
Summary:
Implementation of a new variant of attention module, which contains a recurrent decoder state with vectors corresponding to each source-side word and strictly increasing values, thus enabling it to model the degree to which source words have been translated.

The approach is a variant of the approaches described in https://arxiv.org/pdf/1601.04811.pdf. We simply include the sum of all previous attention weights for encoder words as a new recurrent state (coverage_t). A new linear transform on encoder_outputs is used to produce coverage_weights, which has the same dimensionality as encoder_outputs, and implicitly models the fertility of source-side words (and putting this extra information strain on the encoder network).

Thus the encoder output, the decoder state, and the coverage weights have the same dimensionality for a given source word, and attention logits are calculated as v *  tanh(coverage * coverage_weights + encoder_output + decoder_state).

Note: the entire coverage state for each translation instance is of shape (encoder_length, coverage_units), but the states for the RecurrentNetwork operator, used to train the decoder, must be flat in the data dimension. This state is therefore initialized with shape (encoder_length * coverage_units) [not shown in the open-source library] and reshaped appropriately within the apply_soft_coverage_attention() function.

Differential Revision: D5593617

fbshipit-source-id: 7d0522b5eb0b26f22e8429e4461a459f2f16ed46
2017-08-31 21:21:54 -07:00
Jerry Zhang
debceaff02 Support new arguments in ConvTranspose
Summary: Adding support to use kernels, strides, pads etc. as arguments.

Reviewed By: houseroad

Differential Revision: D5710699

fbshipit-source-id: 8b63af4c4a76cd06b637a376aeb29a34c659be2e
2017-08-31 11:17:32 -07:00
Kittipat Virochsiri
4ec26d23a7 TensorInference function for LengthsSum and such
Summary: Adding missing tensor inference function

Reviewed By: kennyhorror

Differential Revision: D5735119

fbshipit-source-id: 1602b5aeec95f13a3c3c6d3e5417af2712a4dfbb
2017-08-31 09:32:48 -07:00
Misha Smelyanskiy
080fab8f6c Code generator for high-performance embedding look-up kernels
Summary:
Code generator for high-performance embedding look-up kernels, supporting the
Sum, WeightedSum, and Mean reducers.
Achieves at least 1.5x speedup on float and over 2x speedup on float16, compared to the existing code.
These are results on Broadwell, using the sparse_lengths_sum_benchmark.par benchmark.

Old
==============
[root@fblearner001.01.ftw1 /home/msmelyan]# numactl -m 0 -C 0 ./sparse_lengths_sum_benchmark.par  --iteration 10000
Preparing lookup table. 2017-08-08 00:10:23.101848
Preparation finished. 2017-08-08 00:10:27.955680
I0808 00:10:27.955732 30700 net.cc:177] Starting benchmark.
I0808 00:10:27.955759 30700 net.cc:178] Running warmup runs.
I0808 00:10:27.956367 30700 net.cc:188] Main runs.
I0808 00:10:31.839035 30700 net.cc:199] Main run finished. Milliseconds per iter: 0.388264. Iters per second: 2575.56
I0808 00:10:35.704169 30700 net.cc:233] Operator #0 (indices, Python) 0.0583264 ms/iter
I0808 00:10:35.704210 30700 net.cc:233] Operator #1 (Y, SparseLengthsSum) 0.327694 ms/iter
I0808 00:10:35.704213 30700 net.cc:237] Time per operator type:
I0808 00:10:35.704217 30700 net.cc:246]        0.327694 SparseLengthsSum
I0808 00:10:35.704221 30700 net.cc:246]       0.0583264 Python
[root@fblearner001.01.ftw1 /home/msmelyan]# numactl -m 0 -C 0 ./sparse_lengths_sum_benchmark.par  --iteration 10000 --dtype float16
Preparing lookup table. 2017-08-08 00:10:59.047159
Preparation finished. 2017-08-08 00:11:05.140565
I0808 00:11:05.140612 31725 net.cc:177] Starting benchmark.
I0808 00:11:05.140635 31725 net.cc:178] Running warmup runs.
I0808 00:11:05.141104 31725 net.cc:188] Main runs.
I0808 00:11:08.371510 31725 net.cc:199] Main run finished. Milliseconds per iter: 0.323039. Iters per second: 3095.6
I0808 00:11:11.671450 31725 net.cc:233] Operator #0 (indices, Python) 0.0609876 ms/iter
I0808 00:11:11.671489 31725 net.cc:233] Operator #1 (Y, SparseLengthsSum) 0.26856 ms/iter
I0808 00:11:11.671494 31725 net.cc:237] Time per operator type:
I0808 00:11:11.671497 31725 net.cc:246]         0.26856 SparseLengthsSum
I0808 00:11:11.671500 31725 net.cc:246]       0.0609876 Python

New (Misha's)
==============
[root@fblearner001.01.ftw1 /home/msmelyan]# numactl -m 0 -C 0 ./sparse_lengths_sum_benchmark.par  --iteration 10000
Preparing lookup table. 2017-08-07 23:44:55.897748
Preparation finished. 2017-08-07 23:45:00.708896
I0807 23:45:00.708945 4178361 net.cc:177] Starting benchmark.
I0807 23:45:00.708971 4178361 net.cc:178] Running warmup runs.
I0807 23:45:00.709444 4178361 net.cc:188] Main runs.
I0807 23:45:03.608551 4178361 net.cc:199] Main run finished. Milliseconds per iter: 0.289909. Iters per second: 3449.36
I0807 23:45:06.536182 4178361 net.cc:233] Operator #0 (indices, Python) 0.0572399 ms/iter
I0807 23:45:06.536224 4178361 net.cc:233] Operator #1 (Y, SparseLengthsSum) 0.23512 ms/iter
I0807 23:45:06.536228 4178361 net.cc:237] Time per operator type:
I0807 23:45:06.536232 4178361 net.cc:246]         0.23512 SparseLengthsSum
I0807 23:45:06.536236 4178361 net.cc:246]       0.0572399 Python
[root@fblearner001.01.ftw1 /home/msmelyan]# numactl -m 0 -C 0 ./sparse_lengths_sum_benchmark.par  --iteration 10000 --dtype float16
Preparing lookup table. 2017-08-07 23:45:17.191579
Preparation finished. 2017-08-07 23:45:23.173668
I0807 23:45:23.173715 4179316 net.cc:177] Starting benchmark.
I0807 23:45:23.173743 4179316 net.cc:178] Running warmup runs.
I0807 23:45:23.174090 4179316 net.cc:188] Main runs.
I0807 23:45:24.939749 4179316 net.cc:199] Main run finished. Milliseconds per iter: 0.176564. Iters per second: 5663.67
I0807 23:45:26.698885 4179316 net.cc:233] Operator #0 (indices, Python) 0.0557303 ms/iter
I0807 23:45:26.698923 4179316 net.cc:233] Operator #1 (Y, SparseLengthsSum) 0.119794 ms/iter
I0807 23:45:26.698927 4179316 net.cc:237] Time per operator type:
I0807 23:45:26.698931 4179316 net.cc:246]        0.119794 SparseLengthsSum
I0807 23:45:26.698935 4179316 net.cc:246]       0.0557303 Python

Reviewed By: salexspb

Differential Revision: D5582172

fbshipit-source-id: d71f5a55580b734a51b8f30852b75f379acfdaf2
2017-08-30 16:22:11 -07:00
Ahmed Taei
5315669bd8 Add ShapeInference for ConcatOp (Fixed)
Reviewed By: akyrola

Differential Revision: D5721442

fbshipit-source-id: 64ed35cb4c40f32a5cca29fe9cd04e18a340db4b
2017-08-29 12:18:03 -07:00
Aapo Kyrola
488abdcd6c slice op shape inference
Summary: As titled + test

Reviewed By: jamesr66a

Differential Revision: D5720637

fbshipit-source-id: eae76e587808139fcf06abc0f8345152979815ec
2017-08-29 11:05:24 -07:00
Aapo Kyrola
7c7603a60e fix FC shape inference
Summary: FC shape inference was broken for non-default axis. Add test.

Reviewed By: asaadaldien

Differential Revision: D5720146

fbshipit-source-id: f36f9cc8477dc61c3b07eeea8ea0702562045c88
2017-08-28 16:08:07 -07:00
Yangqing Jia
9f693b39aa Revert D5711951: [caffe2] Add shape inference for ConcatOp
Summary:
This reverts commit 9173ef0f18af25326ec18e66f6ce29eecfa5ceea

bypass-lint

Differential Revision: D5711951

fbshipit-source-id: 9bbb872eafcbd3c470b782a5ddb2a1c894888101
2017-08-25 23:37:38 -07:00
Ahmed Taei
da418f5744 Add shape inference for ConcatOp
Reviewed By: akyrola

Differential Revision: D5711951

fbshipit-source-id: 9173ef0f18af25326ec18e66f6ce29eecfa5ceea
2017-08-25 18:09:35 -07:00
Jerry Zhang
3c180ba317 Opensourcing channel shuffle
Summary: att

Reviewed By: Yangqing

Differential Revision: D5662540

fbshipit-source-id: 474d7d808841ff8f7ce97b55df836b9d2f4a7629
2017-08-25 16:46:31 -07:00
Alexander Sidorov
7eba614503 RNNCell: Initializers interface, simplify _LSTM helper
Summary:
_LSTM helper is a legacy piece we had before all the RNNCell awesomeness landed. Now we need to pull it apart and create separate building blocks that people can use for any RNNs.

Please note the changes to a test with double scoping. That should go away once we change the RNNCell scoping logic in such a way that each cell adds its own name to the scope for all of its outputs (see another diff: D5613139).

Reviewed By: jhcross

Differential Revision: D5632276

fbshipit-source-id: 1cb568ab995c4c0b3dd1b4bad2d028e34bded9c1
2017-08-25 12:01:24 -07:00
Aapo Kyrola
82360d8cba shape inference for ReduceFront/Back/Sum/Mean, Gather and Dropout
Summary: These were missing and required for some seq2seq models. Unit tested. The previous implementation of ReduceBackMean shape inference was incorrect, so removed it.

Reviewed By: asaadaldien

Differential Revision: D5691262

fbshipit-source-id: 76f868b298440f988635966a410f0232301ca6c4
2017-08-25 11:31:17 -07:00
Alisson Gusatti Azzolini
5e0b28e7bd PrependDimOp
Summary:
Split the first dimension of a tensor into two, the first of which is fixed and given as an argument.
This is used to split a batch into smaller batches and distribute them across workers.
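A usage sketch (shapes illustrative):

```python
import numpy as np
from caffe2.python import core, workspace

# dim_size=2 reshapes (6, 4) into (2, 3, 4).
workspace.FeedBlob("X", np.zeros((6, 4), dtype=np.float32))
workspace.RunOperatorOnce(
    core.CreateOperator("PrependDim", ["X"], ["Y"], dim_size=2))
print(workspace.FetchBlob("Y").shape)  # (2, 3, 4)
```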

Reviewed By: harouwu

Differential Revision: D5702175

fbshipit-source-id: 02bb93e49bf9db411b516e149c8e647301dd2ca5
2017-08-24 18:52:05 -07:00
Jiyan Yang
20c854d43c Make FC op work with empty batch in cuda
Reviewed By: xianjiec

Differential Revision: D5673458

fbshipit-source-id: d1c950c94173843670ae1fae0e15ff61ca7d6761
2017-08-24 18:52:04 -07:00
Jerry Zhang
7f4ceb83e3 Relax dimension constraints for weight matrix in FC
Summary: att

Reviewed By: Yangqing

Differential Revision: D5662265

fbshipit-source-id: 893ee2f92debab06117725beeca3199cba565f1e
2017-08-24 11:16:39 -07:00
Catherine Dong
1955d0797e Added fast path for CUDNN global max pooling
Summary:
This adds a fast path for global max pooling with NCHW. Compared to equivalent ReduceBackMean, this is about 3.5x faster.

Based on D5533059.

Reviewed By: akyrola

Differential Revision: D5681122

fbshipit-source-id: 7a4df934044c7dd01888f095f7dd46654aaf4eae
2017-08-23 16:33:06 -07:00
Alisson Gusatti Azzolini
930acc8e85 CUDA SparseLengthsWeightedSum
Summary: title.

Reviewed By: harouwu

Differential Revision: D5665776

fbshipit-source-id: a8ae1a71a9a21e68172662f38b5f799870b9dcd1
2017-08-22 15:42:02 -07:00
Junjie Bai
5748e7140f Strip Operator Schema in mobile build
Reviewed By: Yangqing

Differential Revision: D5677792

fbshipit-source-id: d29edb26a36b24a46821e13e2d77af0f21571fcd
2017-08-22 13:31:08 -07:00
Douglas Chen
440d979075 Optimizations for Caffe2 SinusoidPositionEncodingOp
Summary:
Optimizations for SinusoidPositionEncodingOp to make sinusoid position embeddings
more competitive against table-based embeddings.
- Removed most calls to std::pow
- Replaced division with multiplication with reciprocal
- Reused computation across examples within a batch

Current speedup with a batch size of 16, sequence length of 128, and embedding
size of 512 is about 270x (17k embeddings per second -> 4.7M embeddings per
second). The speedup is very dependent on the batch size; at a batch size of 4
this only gets 1.7M embeddings per second.
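For reference, a NumPy sketch of the sinusoid embedding being computed (amplitude handling and the exact argument layout here are illustrative, not the op's definition):

```python
import numpy as np

def sinusoid_encoding(seq_len, dim, amplitude=1.0):
    # assumes an even embedding dim
    pos = np.arange(seq_len)[:, None]
    # reciprocal frequencies, computed once per column rather than per element
    inv_freq = 1.0 / (10000.0 ** (np.arange(0, dim, 2) / dim))
    emb = np.zeros((seq_len, dim))
    emb[:, 0::2] = np.sin(pos * inv_freq)
    emb[:, 1::2] = np.cos(pos * inv_freq)
    return amplitude * emb
```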

Profile: https://pxl.cl/8zf0
Annotated DoRunWithType: P57925031

Reviewed By: jamesr66a

Differential Revision: D5634766

fbshipit-source-id: 0f35bb176164ea547c91de242a0205c5d7adf7cf
2017-08-22 00:04:06 -07:00
Zhicheng Yan
0e20a7cb7d ImageInputOp_more_data_augmentation
Summary:
Add more data augmentation to ImageInputOp
1) Inception-style random sized cropping
2) color jittering
3) color lighting

Reviewed By: panshen1

Differential Revision: D5637726

fbshipit-source-id: 45d9cc69eec9f4d48c1607d80ccd89e325961b1a
2017-08-19 14:15:58 -07:00
Eider Moore
d6632a9a05 Adding a range operator similar to np.arange
Summary:
Adding a range operator in the spirit of np.arange. It is an important building block for a lot of manipulation functions.

This accepts parameters with the same meaning and in the same order as Python's range or np.arange (e.g. `(stop)`, `(start, stop)`, or `(start, stop, step)`).
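A hedged usage sketch, assuming the op is registered as "Range" and reads its scalar parameters from input blobs:

```python
import numpy as np
from caffe2.python import core, workspace

for name, val in [("start", 2), ("stop", 10), ("step", 3)]:
    workspace.FeedBlob(name, np.array(val, dtype=np.int32))
workspace.RunOperatorOnce(
    core.CreateOperator("Range", ["start", "stop", "step"], ["out"]))
print(workspace.FetchBlob("out"))  # [2 5 8]
```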

Differential Revision: D5616861

fbshipit-source-id: 02622b8bd85ebca125cc881c06fae5b54b7c602a
2017-08-18 14:45:56 -07:00
Philipp Keller
d617a77433 Add tests for ConcatOp and SplitOp
Summary: The new test ensures the 'add_axis' and 'split' arguments work as intended for tensors of various dimensions. Hypothesis checks various edge cases, like zeroes in 'split_info' and 1D input with axis=0, add_axis=1.

Reviewed By: hoangmit

Differential Revision: D5645778

fbshipit-source-id: 061f9511a082da54e5c1bbe53a0e7096af4b8d1b
2017-08-18 14:02:42 -07:00
Chonglin Sun
5f612d9740 GPU version of BatchGatherOp
Summary: GPU version of BatchGatherOp.

Reviewed By: azzolini

Differential Revision: D5613593

fbshipit-source-id: 0e4a35b84db852ac2718868a02fa90e7c3d8f1f0
2017-08-17 18:31:10 -07:00
James Reed
f388135d3f Layer norm brew wrapper
Summary: Implement a brew wrapper for the LayerNorm op. This adds the scalar weight and bias terms to the op.

Reviewed By: jmp84

Differential Revision: D5595836

fbshipit-source-id: 467b2e1158b0c454a149d4b26c47719826e98752
2017-08-17 11:17:47 -07:00
James Reed
e45e621b0e Implement layer norm gradient GPU
Summary: Implement layer normalization from https://arxiv.org/pdf/1607.06450.pdf

Reviewed By: wickedfoo

Differential Revision: D5594445

fbshipit-source-id: 873643165c958fd5829fa7cf07d5d4b1b8b0ed59
2017-08-17 11:17:46 -07:00
James Reed
8e8e90f595 Implement layer normalization backward CPU
Summary: Implement layer normalization from https://arxiv.org/pdf/1607.06450.pdf

Reviewed By: jmp84

Differential Revision: D5578306

fbshipit-source-id: 94d262f0317b3ee1b504e0110ad5135afe8350ca
2017-08-17 11:17:46 -07:00
James Reed
e16c40eb4f Implement layer normalization op forward GPU
Summary: Implement layer normalization from https://arxiv.org/pdf/1607.06450.pdf

Reviewed By: wickedfoo

Differential Revision: D5552262

fbshipit-source-id: d0cddb0769623a1b3779e2114c19e6ebc57c0f0d
2017-08-17 11:17:45 -07:00
James Reed
474c043be5 Implement layer normalization op forward CPU
Summary: Implement layer normalization from https://arxiv.org/pdf/1607.06450.pdf

Reviewed By: akyrola

Differential Revision: D5543381

fbshipit-source-id: 1102e568439af6a60aad3b87017d5a997fb7dc16
2017-08-17 11:17:44 -07:00
Aapo Kyrola
e89474c496 fix forward_only mode
Summary:
Forward-only mode had broken at some point. Two things: RNNCell did not pass the parameter to recurrent.py, and recurrent.py was broken if forward_only=True after the python3 codemod.

Added a test to rnn_cell_test that actually checks the forward-only parameter is passed, to prevent future breakage.

Reviewed By: jmp84

Differential Revision: D5639306

fbshipit-source-id: b1bbc39d59c3f3734b2f40a1c2f3740c733e0bd4
2017-08-17 10:19:04 -07:00
Jerry Zhang
a63e7314f3 Adding 1d-2d-3d Schemas for Conv and Pool
Summary: Add Conv and Pool operators with dimensions.

Reviewed By: bddppq

Differential Revision: D5588614

fbshipit-source-id: 2552c40dc3ca180a6ab51817d60f0b85b97885d5
2017-08-17 09:45:54 -07:00
Jerry Zhang
4ca5735753 Allow inplace for spatial_bn_op
Summary: att

Reviewed By: Yangqing

Differential Revision: D5644717

fbshipit-source-id: 1a020fe4ca7028056ce7bebddb7bfd1437998530
2017-08-17 09:18:55 -07:00
Badri Narayan Bhaskar
ae2aad9c0d Operator to Merge ID_LIST features
Summary:
As an alternative to sharing embeddings, we want to explore merging the ID_LISTs in the net.

This commit adds an operator to merge many ID_LIST features into a single one.

Differential Revision: D5481523

fbshipit-source-id: 446121122a32de5682d5d75a165370bc8d776d03
2017-08-17 01:16:00 -07:00
Jingfei Du
b3029df1d0 Added window mode for caffe2 sequence operator
Summary: This can be used for local attention to mask elements outside of a window

Reviewed By: jamesr66a

Differential Revision: D5643677

fbshipit-source-id: 92b33866258ccc7307d5bcf08234610aa3fb152d
2017-08-16 21:34:29 -07:00
Kevin Wilfong
1f47a80e88 Caffe2: diagonal fill op
Summary: Caffe2: diagonal fill op

Reviewed By: panshen1

Differential Revision: D4775640

fbshipit-source-id: bb388ffe223e6b153d4cde1fdad6f84a2bb65b0f
2017-08-16 13:05:11 -07:00
Aapo Kyrola
a53192e334 Revert D5001637: [Caffe2][RNN] Threaded dependency-aware RNNExecutor (frontier/diagonal execution).
Summary:
This reverts commit 3d0a71593d73a9ff22f4c1a5c9abf2a4a0c633c8

bypass-lint

Differential Revision: D5001637

fbshipit-source-id: 4d6250ae7e66ea0aa635a68d943d552e5db65b69
2017-08-16 03:21:49 -07:00
Aapo Kyrola
453c60ce28 Threaded dependency-aware RNNExecutor (frontier/diagonal execution).
Summary:
This diff adds dependency-aware concurrent/parallel execution of operators in stepnets. For CPU, we use multi-threaded execution. For CUDA, we use multiple streams and cuda events for parallelism and dependency tracking.

Much of the diff is about computing the dependency graph, which was quite tricky because we also need to avoid write races between operators running in multiple timesteps in parallel. Also, recurrent blobs "change name" when passing over a timestep ("_prev"), so that needs to be handled as well.

This diff also restores the link-ops that I unlanded earlier.

The performance gain of this diff is very good for CPU (same perf as with static_dag, even better on forward-only). On CUDA, the gains are modest, at least with the sizes I was testing with.

Reviewed By: salexspb

Differential Revision: D5001637

fbshipit-source-id: 3d0a71593d73a9ff22f4c1a5c9abf2a4a0c633c8
2017-08-15 23:55:15 -07:00
James Reed
a985355935 Gradient for SequenceMaskOp
Summary: Implement backward pass for a SequenceMaskOp to replace https://github.com/caffe2/caffe2/blob/master/caffe2/python/attention.py#L54-L72.

Reviewed By: akyrola

Differential Revision: D5618373

fbshipit-source-id: b831fa69f51d9468c858961f922564159e12b46f
2017-08-12 14:34:29 -07:00
James Reed
0a828768e9 Implement SequenceMaskOp forward pass
Summary:
Implement forward pass for a SequenceMaskOp to replace https://github.com/caffe2/caffe2/blob/master/caffe2/python/attention.py#L54-L72.

This implements two modes: a sequence-length based mode and a matrix triangle mode.
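
A rough numpy sketch of the two modes, using a large negative fill value and illustrative names rather than the op's actual arguments:

```
import numpy as np

def length_mask_ref(x, lengths, fill=-1e9):
    # Sequence-length mode: x is (batch, seq_len); positions at or beyond
    # each example's length are filled.
    cols = np.arange(x.shape[1])
    return np.where(cols[None, :] < lengths[:, None], x, fill)

def triangle_mask_ref(x, fill=-1e9):
    # Matrix-triangle mode: x is (seq_len, seq_len); mask the upper
    # triangle so position i cannot attend to positions j > i.
    return np.where(np.tril(np.ones(x.shape, dtype=bool)), x, fill)
```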

Reviewed By: akyrola

Differential Revision: D5615493

fbshipit-source-id: a2ce4a8e655d9b720049010a7856be052c5567eb
2017-08-12 14:34:28 -07:00
Jerry Pan
9372ff7a86 Caffe2: support Tensor in BlobsQueueDB
Summary: Caffe2: support Tensor in BlobsQueueDB

Reviewed By: kevinwilfong

Differential Revision: D5589616

fbshipit-source-id: 66aa6092b6403960c4858abd986771b58be94106
2017-08-11 11:21:14 -07:00
Alexander Sidorov
a7be496fe2 Revert D5589309: modify _LSTM into _RNN to adapt GRU
Summary:
This reverts commit f5af67dfe0842acd68223f6da3e96a81639e8049

bypass-lint

Differential Revision: D5589309

fbshipit-source-id: 79b0a3a9455829c3899472a1368ef36dc75f6e14
2017-08-10 16:42:41 -07:00
Christopher Hay
f2dfb40302 Added amplitude argument to SinusoidPositionEncodingOp
Summary: In order to control the absolute scale/magnitude of the output of this op, added a tuning parameter: amplitude

Reviewed By: jamesr66a

Differential Revision: D5596574

fbshipit-source-id: 3b7e316de55cce6fd686da70aa5658ec3e99b070
2017-08-10 15:27:17 -07:00
Kittipat Virochsiri
eb85258beb CreateMapOp
Summary: Add operator to create empty map

Reviewed By: xianjiec

Differential Revision: D5454652

fbshipit-source-id: ecad6cc58572b378962af08cf02063ef546ed58f
2017-08-09 13:32:19 -07:00
Tao Wu
7b86a34610 modify _LSTM into _RNN to adapt GRU
Summary: GRU differs from LSTM in that it only has hidden states but no cell states. Reusing the code of _LSTM is therefore problematic: we need to delete the part that creates the cell state and change the many places that use a hard-coded 4 (hidden_all, hidden, cell_all, cell) to use 2 (hidden_all, hidden). Otherwise GRU breaks during the backward pass, when the optimizer tries to apply a gradient to each of the parameters, because the cell state is never used and so the corresponding parameters (i.e., cell_state_w, cell_state_b) have no gradients.

Differential Revision: D5589309

fbshipit-source-id: f5af67dfe0842acd68223f6da3e96a81639e8049
2017-08-09 13:24:45 -07:00
Andrei Chtcherbatchenko
a2204f0b1e Caffe2: Write CUDA version of OneHot operator
Summary: This diff implements CUDA version of OneHot operator.

Reviewed By: bddppq

Differential Revision: D5578543

fbshipit-source-id: 55b70e8ec6ee34b647b9140fecbba31b6968f403
2017-08-08 18:17:39 -07:00
Jianlong Zhong
152d2ae3a8 Implement CUDA version of GRU operator
Summary: Add CUDA version of GRU operator

Reviewed By: jamesr66a

Differential Revision: D5571043

fbshipit-source-id: 332aa64fc8a9116cc33382f2b2907080e58c13b3
2017-08-08 10:57:40 -07:00
Chonglin Sun
8ad382df3c implement LengthsTopK operator
Summary:
This was reverted previously for lack of a schema for the gradient op; adding it back with the fix. A numpy sketch of the op's semantics follows below.

Differences from the previously reverted diff:
1. Added a schema for the gradient operator.
2. Changed line 95 of kmax_pooling_op.h from CAFFE_ENFORCE to CAFFE_ENFORCE_GE.
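
A minimal numpy sketch of the semantics, assuming short segments are padded (the padding value here is an assumption):

```
import numpy as np

def lengths_top_k_ref(values, lengths, k, pad=0.0):
    # values: flat 1-D scores; lengths: per-segment sizes summing to len(values).
    out, offset = [], 0
    for n in lengths:
        top = np.sort(values[offset:offset + n])[::-1][:k]
        out.append(np.concatenate([top, np.full(k - len(top), pad)]))
        offset += n
    return np.stack(out)  # (num_segments, k)

vals = np.array([0.1, 0.9, 0.5, 0.3, 0.7], dtype=np.float32)
print(lengths_top_k_ref(vals, np.array([3, 2]), k=2))  # [[0.9 0.5] [0.7 0.3]]
```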

Reviewed By: xianjiec

Differential Revision: D5568867

fbshipit-source-id: 39813b389a5da803967a561249793afdfce00c58
2017-08-07 18:19:29 -07:00
Ahmed Taei
8af625ede2 Implement gradients for Col2Im and Im2Col operators
Reviewed By: jay-mahadeokar

Differential Revision: D5576385

fbshipit-source-id: a0ca4f704fd861f7cc67079041b1d0772fc66920
2017-08-07 15:51:30 -07:00
Ben Zhang
42fb87d0b1 L1Distance Row-wise, instead of cumulative
Summary:
The L1Distance operator used to return a single value, the L1 distance of the entire input, instead of one value per input row.

This fixes that.
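
The row-wise behavior, as a numpy reference:

```
import numpy as np

def l1_distance_rowwise(x, y):
    # One L1 distance per row instead of a single scalar for the whole batch.
    return np.abs(x - y).sum(axis=1)

x = np.array([[1.0, 2.0], [0.0, 0.0]])
y = np.array([[0.0, 2.0], [1.0, 1.0]])
print(l1_distance_rowwise(x, y))  # [1. 2.]
```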

Reviewed By: Yangqing

Differential Revision: D5570385

fbshipit-source-id: fbab0e0c9262ccbdb3af27262b8baacdeb2d0fc9
2017-08-07 14:09:25 -07:00
Zhicheng Yan
e7192c3b91 image_input_op_dense_multi_label
Summary:
To train an image model, we can also use a label embedding vector as supervision, as opposed to using SoftmaxLoss/SigmoidCrossEntropyLoss.
In such cases the label is a dense vector. This diff enables such use cases.

Reviewed By: panshen1

Differential Revision: D5556203

fbshipit-source-id: 52c61495e02fab457dc2d43e3345d7dbd5580ab7
2017-08-07 12:38:16 -07:00
Juan Miguel Pino
4d8a8c2e1e Implement dot attention
Summary:
Implement dot attention as described in https://arxiv.org/abs/1508.04025
This saves the computation of weighted encoder outputs in `rnn_cell.py`
When the encoder and decoder dimensions are different, we apply an FC, which corresponds to the general case below Figure 2.
Refactored unit tests.
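
A hedged numpy sketch of dot attention for a single decoder step; `proj` stands in for the FC applied when dimensions differ, and all names are illustrative:

```
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def dot_attention(decoder_h, enc_outputs, proj=None):
    # decoder_h: (d_dec,); enc_outputs: (T, d_enc).
    # When decoder and encoder dims differ, project the decoder state first
    # (the "general" variant below Figure 2 of the paper).
    q = decoder_h if proj is None else proj.dot(decoder_h)   # (d_enc,)
    weights = softmax(enc_outputs.dot(q))                    # (T,)
    context = weights.dot(enc_outputs)                       # (d_enc,)
    return context, weights

ctx, w = dot_attention(np.random.randn(8), np.random.randn(5, 8))
```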

Reviewed By: jhcross

Differential Revision: D5486976

fbshipit-source-id: f9e9aea675b3b072fbe631bc004199b90a9d95cb
2017-08-06 11:50:16 -07:00
Jerry Pan
fac241bcbc Caffe2: add a DB that's wrapped around a BlobsQueue as an adapter for data from non-DB interface
Summary:
Caffe2: add a DB that's wrapped around a BlobsQueue as an adapter for data from non-DB interface.

This is useful for bridging the gap between DB interface data processing ops (TensorProtosDBInput, ImageInputOp etc.) and data that's coming from arbitrary Python or the pretty intricate Hive reader.

Reviewed By: akyrola

Differential Revision: D5554560

fbshipit-source-id: 01bb0056410f9ade205367d5fefc721f91f5b629
2017-08-06 11:50:14 -07:00
Szymon Piechowicz
12f25c8106 Revert D5545533: [pairatt] implement kMaxPooling operator
Summary:
This reverts commit 8378caaac528a71c154067168787ed493bfb0d37

bypass-lint

Differential Revision: D5545533

fbshipit-source-id: a8d9db807f5b22461b21b7589886cf54861e3757
2017-08-04 01:33:29 -07:00
Yiming Wu
8e1ecb1cfd async sparse length sum op
Summary:
This diff makes SparseLengthsSum(Gradient) async. The logic:

1. Add INDICES to the gradient op's inputs so that we can make it async without device-host copies (see the sketch below).
2. Register the new 3-input op as the gradient for the CPU/GPU versions of SLS.
3. In order not to break old nets (they are mostly on CPU), I still register the old 2-input op, so the op schema will not complain when it encounters old nets that have the SLSGradient op in them.

wickedfoo  Sorry, this diff might bring you the extra work of migrating your optimization effort to this new async gradient op, but we think it is worth it. :(
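
As a sketch of the semantics (not the actual kernels), the forward op and a 3-input gradient that needs no device-host copy might look like:

```
import numpy as np

def sparse_lengths_sum(data, indices, lengths):
    # out[i] = sum of data[indices[j]] over the i-th length-segment of indices.
    out, off = [], 0
    for n in lengths:
        out.append(data[indices[off:off + n]].sum(axis=0))
        off += n
    return np.stack(out)

def sparse_lengths_sum_grad(d_out, indices, lengths):
    # With INDICES as an explicit input, the gradient is a pure row scatter:
    # every row j in segment i receives d_out[i].
    grad_rows = np.empty((len(indices), d_out.shape[1]), d_out.dtype)
    off = 0
    for i, n in enumerate(lengths):
        grad_rows[off:off + n] = d_out[i]
        off += n
    return grad_rows, indices  # a sparse (values, indices) gradient
```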

Reviewed By: dzhulgakov

Differential Revision: D5423188

fbshipit-source-id: 62494a6c52a507c4a4688d5a9e1a2bc720d5370d
2017-08-03 03:04:15 -07:00
Christopher Hay
a4e6ca6956 Added Sinusoidal Position Encoding Op
Summary: Added caffe2 operator to calculate the sinusoidal position encoding for word embeddings, as described on page 6 in  https://arxiv.org/abs/1706.03762.
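
A numpy sketch of the textbook encoding with the new amplitude knob; the op's exact sin/cos column layout may differ, and an even embedding size is assumed:

```
import numpy as np

def sinusoid_position_encoding(seq_len, d, amplitude=1.0):
    # Textbook form from arXiv:1706.03762, assuming an even embedding size d.
    pos = np.arange(seq_len, dtype=np.float64)[:, None]   # (seq_len, 1)
    i = np.arange(d // 2, dtype=np.float64)[None, :]      # (1, d/2)
    angle = pos / np.power(10000.0, 2.0 * i / d)
    enc = np.empty((seq_len, d))
    enc[:, 0::2] = np.sin(angle)
    enc[:, 1::2] = np.cos(angle)
    return amplitude * enc                                # scaled output

pe = sinusoid_position_encoding(seq_len=10, d=16, amplitude=0.5)
```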

Reviewed By: jamesr66a

Differential Revision: D5533024

fbshipit-source-id: 1afb35cd7f9d8c71f2635b853e56b2c840f0bc1f
2017-08-03 01:46:46 -07:00
Chonglin Sun
4a8545e3c6 implement kMaxPooling operator
Summary: used by attention model

Differential Revision: D5545533

fbshipit-source-id: 8378caaac528a71c154067168787ed493bfb0d37
2017-08-03 00:48:34 -07:00
Honghao Wei
cb1dd21280 adding operator lp_norm to support calculating l1 norm and l2 norm
Summary: Implement the LpNorm operator, which calculates the Lp norm of a tensor for regularization (p=1 or 2). Currently there is only the L1Distance operator, which calculates the l1 distance of two same-shape tensors. We want an op that takes only one input and outputs the l1 loss, and the same for the l2 loss. We also plan to implement the l_{p,q} loss, but have not decided which p and q to take.
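
A minimal numpy sketch of the intended semantics; whether p=2 returns the sum of squares or its square root is a convention to check against the op's doc (the squared form is assumed here, as is common for regularizers):

```
import numpy as np

def lp_norm_ref(x, p):
    x = x.ravel()
    if p == 1:
        return np.abs(x).sum()   # l1 regularizer: sum of |x_i|
    return (x ** 2).sum()        # p == 2, squared-l2 form (assumed)
```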

Reviewed By: xianjiec

Differential Revision: D5460051

fbshipit-source-id: d67a38fbc94afa52de26d4a53e4d2b7df3c50b6a
2017-08-02 15:09:08 -07:00
Aapo Kyrola
ab42a95b6f fast path for CUDNN global average pooling
Summary:
KaimingHe debugged a slow model and found that global average pooling was hideously slow, even with CUDNN. It turns out the CUDNN pooling op (especially the backward pass) is not optimized for global pooling.

This adds a fast path for global average pooling with NCHW. It is about 30x faster than CUDNN with 56 x 56 pooling; compared to the equivalent ReduceBackSum, it is about 3x faster.

I will bootcamp the max pooling.

Reviewed By: asaadaldien

Differential Revision: D5533059

fbshipit-source-id: 2d590693d737fa92184603663031d96f6145f304
2017-08-02 11:10:10 -07:00
Junjie Bai
0c7ee02c37 Add CUDA implementation of BooleanUnmask and fixed some bugs in the test
Reviewed By: akyrola

Differential Revision: D5405606

fbshipit-source-id: fd755ee2ec3d742597f7f5500f54caa396db4da4
2017-08-01 16:51:40 -07:00
Kevin Wilfong
60cb55461e Caffe2: Support additional outputs in ImageInputOp
Summary: This allows users to add an arbitrary number of additional outputs to ImageInputOp.  These are populated by reading additional TensorProto values from the TensorProtos supplied by the DBReader and converting them into Tensors.  As with labels, only ints and floats are supported, and multiple values are supported.

Reviewed By: panshen1

Differential Revision: D5502019

fbshipit-source-id: 5a8b61b3a8549272a112e8e02cd613d8f9a271ba
2017-08-01 14:36:05 -07:00
Tao Wu
5d304a3b49 add gradient for SparseToDenseMask operator
Summary: add gradient for SparseToDenseMask operator

Reviewed By: kittipatv

Differential Revision: D5320792

fbshipit-source-id: 8ee7f1c87e8270ad6077ed197ce9512524069b59
2017-08-01 13:05:03 -07:00
Aapo Kyrola
e38015756a shape inference for Squeeze
Summary: Add tensor inference function for squeeze, refactor a bit

Reviewed By: asaadaldien

Differential Revision: D5518880

fbshipit-source-id: 5b8cb9154f5f777d4be3612a96d7ed76a9068c0c
2017-07-31 16:04:24 -07:00
Tao Wu
6530db49bc improve pair_wise_loss operator to support multiple sessions
Summary: The diff extends the rank_loss operator to compute loss for multiple sessions (batches).

Reviewed By: kittipatv

Differential Revision: D5515465

fbshipit-source-id: 55a01cd5ad21eaeae82875ad136c392fed0dbb26
2017-07-28 15:12:47 -07:00
Dmytro Dzhulgakov
f2090debb0 Optimized SparseLengthsSum
Summary:
Optimized SparseLengthsSum (fp32 only for now):
1) Specialized the reducer
2) Created a fast routine with prefetches, loop unrolling, block specialization and register tiling
3) Added more block-size variety to segment_ops_test.py

Reviewed By: Yangqing

Differential Revision: D5392472

fbshipit-source-id: 8ed9baf1b12ec05bd391cabb390024e6bc60a6f6
2017-07-28 10:10:25 -07:00
Bangsheng Tang
a41cbdec0e float support for square root divide
Summary: to support an operation needed by D5507205

Reviewed By: xianjiec

Differential Revision: D5512522

fbshipit-source-id: a9b3a668c28eff71d1e106dbbb572184df4a7638
2017-07-27 17:40:40 -07:00
Bangsheng Tang
d8443b8ffa BatchGatherOp
Summary:
1. Added BatchGatherOp and BatchGatherGradientOp (see the sketch below)
2. Unit tests
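
A numpy sketch of the gather semantics, assuming the op gathers along axis 1 with indices shared across the batch:

```
import numpy as np

def batch_gather_ref(data, indices):
    # Gather along axis 1, reusing the same indices for every batch item.
    return data[:, indices]

data = np.arange(12).reshape(2, 6)
print(batch_gather_ref(data, np.array([4, 0])))  # [[4 0] [10 6]]
```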

Reviewed By: xianjiec

Differential Revision: D5443965

fbshipit-source-id: bdcbb7f9f91c55484372a4bdb1727ae6d49e2018
2017-07-27 10:17:42 -07:00
Ahmed Taei
40b783b746 Fix flaky test due to numerical gradient approximation error.
Summary:
Use a smaller step size for GradientChecks and pass a seed to help reproduce the
test from logged inputs.

Reviewed By: Yangqing

Differential Revision: D5505698

fbshipit-source-id: fc308efe72d535695ba628944aee1913ba16b2f1
2017-07-26 18:58:19 -07:00
Wojciech Glogowski
8f8dccd2ed distance_op_test from hypothesis_test refactored
Summary:
Moved the distance op tests from hypothesis_test into distance_op_test and
refactored them.

Reviewed By: akyrola, asaadaldien

Differential Revision: D5495104

fbshipit-source-id: 4a90c75eabeb380ae9d150d6258e9b5b0fbfc5ca
2017-07-26 13:37:08 -07:00
Dmytro Dzhulgakov
cf1ce29631 Fix GPU SparseAdaGrad with empty tensors
Summary: CUDA doesn't like 0-sized grids :)

Reviewed By: Yangqing

Differential Revision: D5495805

fbshipit-source-id: 6819513024978ee6bb70a39b25d23ced06465750
2017-07-25 23:50:54 -07:00
Artem Volkhin
2f5c96a730 Fix Flatten operator for empty tensors
Reviewed By: xianjiec

Differential Revision: D5487475

fbshipit-source-id: f1321e15352b0bbe039312f544a9c2ed78da8732
2017-07-25 17:51:42 -07:00
Tao Wu
5449afa855 use model.create_param instead of using param_init_net directly
Summary: When creating parameters for ModelHelper, we should use create_param instead of using param_init_net and model.params directly. The diff rewrites some of these cases in rnn_cell.py to make model._parameter_info and model.params consistent.

Reviewed By: kittipatv

Differential Revision: D5477724

fbshipit-source-id: 28c4aaf8f98d9d89125af6a42ad328008f0079e1
2017-07-24 21:17:24 -07:00
Dmytro Dzhulgakov
8930c095c1 Add support for int32 indices in SparseLengthSum and friends
Summary:
Need it for some reference comparison for c2isl.

Also there's an argument that it might be faster on GPU with int32. Doesn't seem to be the case now, but haven't tested with Jeff's changes yet.

Reviewed By: kennyhorror

Differential Revision: D5405482

fbshipit-source-id: dc1a983dce5f06f1111c5634ec475647c94848cc
2017-07-24 17:50:00 -07:00
James Cross
0eda7955bd use internal cell for DropoutCell output prep methods
Summary:
In order to get dimensions right, correctly identify gradients, etc., DropoutCell should call the _prepare_output and _prepare_output_sequence methods of its internal cell for its own such methods.

This bug was identified by NVIDIA intern Syed Tousif Ahmed.

Reviewed By: akyrola

Differential Revision: D5483082

fbshipit-source-id: f6df5b4a0502ed0771056638aab219fb5cc7d964
2017-07-24 14:53:11 -07:00
Yangqing Jia
0deee2194f Add a quick SparseLengthsSum benchmark.
Summary: TSIA - this makes it a bit easier to benchmark sparse lengths sum.

Reviewed By: dzhulgakov

Differential Revision: D5477844

fbshipit-source-id: 89e25c5e0dbf3538877ba1a9abc75a10abfa2757
2017-07-24 13:17:47 -07:00
James Cross
99e79a616b attention with encoder_lengths
Summary:
For RNN attention, we should not include the invalid parts of the encoder output (based on encoder_lengths) in the computation. This diff accomplishes that by forcing logits for those positions to be negative infinity.

Note that this step can be bypassed by passing encoder_lengths=None, which is what we do for beam search, thus incurring no extra overhead for inference.
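
A numpy sketch of the masking step (illustrative names; the real op works on Caffe2 blobs):

```
import numpy as np

def mask_attention_logits(logits, encoder_lengths):
    # logits: (batch, enc_seq_len). Invalid positions get -inf so that they
    # receive exactly zero weight after the softmax.
    if encoder_lengths is None:   # e.g. beam search: skip, zero overhead
        return logits
    pos = np.arange(logits.shape[1])
    return np.where(pos[None, :] < encoder_lengths[:, None], logits, -np.inf)
```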

Reviewed By: jamesr66a

Differential Revision: D5402547

fbshipit-source-id: 1863d6050b5129e4df829c6357f0aa9ded0715dc
2017-07-23 10:06:01 -07:00
Junjie Bai
efe2d01a3e Fix some bugs in CPU version of BooleanMask and add GPU version
Reviewed By: akyrola

Differential Revision: D5397208

fbshipit-source-id: 0314cc181e315f3b6cda846292b2e2ea73bb015b
2017-07-21 11:38:49 -07:00
Geet Sethi
2dc8851206 RNN Workspace Blob Extraction
Summary:
Added operator RecurrentNetworkBlobFetcherOp that takes as input a scratch workspace name and prefix, and copies over all blobs in the scratch workspace into the global workspace. This essentially extracts all intermediate recurrent network computation for each timestep.

Added a wrapper in recurrent.py - retrieve_step_blobs(net, prefix='rnn') - which, when called after an rnn is run, will return a list of all blobs extracted from the net.

Reviewed By: akyrola

Differential Revision: D5421926

fbshipit-source-id: 0f35b466d77d3c719fb0e32de7dbcafc6c0d5225
2017-07-17 10:24:18 -07:00
Robert Verkuil
97193478c7 Implemented GRUCell
Summary: Implemented python logic and tests to create an RNNCell for GRU.  Uses the preexisting GRU Unit Op code.

Reviewed By: salexspb

Differential Revision: D5364893

fbshipit-source-id: 2451d7ec8c2eacb8d8c9b7c893bfd21b65fb9d18
2017-07-10 17:52:25 -07:00
Robert Verkuil
2409c2e359 GRUUnit Op Backwards Pass
Summary:
An implementation of the backward pass of the GRU Unit Op, not the full RNNCell.
Functions were created to mimic the LSTM implementation as closely as possible.
Backward pass implementations are defined in GRU_unit_op.{h, cc}.
An assertGradientChecks call was added to gru_cell_test.py.

Reviewed By: salexspb

Differential Revision: D5364856

fbshipit-source-id: 09cff4478091827763b40cc331e4e0abf0ec258f
2017-07-10 17:52:24 -07:00
Robert Verkuil
279f3f095e Implemented Gated Recurrent Unit (GRU) c++ operator forward pass
Summary:
Just an implementation of the forward pass of the GRU Unit Op, not the full RNNCell.
Functions were created to mimic the LSTM implementation as closely as possible.
The implementation is defined in GRU_unit_op.{h, cc}.
Tests are in gru_cell_test.py, which imports rnn_cell_test_util.py for the sigmoid, tanh, and _prepare_rnn functions.
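
For orientation, a textbook GRU unit in numpy; gate ordering and the update convention here are assumptions and may differ from the op's layout:

```
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_unit_ref(h_prev, gx, gh):
    # gx, gh: input- and hidden-side gate pre-activations, each split into
    # (reset, update, candidate) thirds. Conventions vary between papers;
    # this is one textbook variant.
    rx, zx, cx = np.split(gx, 3)
    rh, zh, ch = np.split(gh, 3)
    r = sigmoid(rx + rh)                 # reset gate
    z = sigmoid(zx + zh)                 # update gate
    c = np.tanh(cx + r * ch)             # candidate hidden state
    return z * h_prev + (1.0 - z) * c    # single state; no cell state as in LSTM

h = gru_unit_ref(np.zeros(4), np.random.randn(12), np.random.randn(12))
```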

Reviewed By: jamesr66a

Differential Revision: D5363697

fbshipit-source-id: f9ba9fe0be01ffc868dd22027be8be4975b84998
2017-07-10 17:52:23 -07:00
Robert Verkuil
48bd102b95 Moved sigmoid, tanh, and _prepare_lstm (renamed) to a util file.
Summary:
Moved sigmoid, tanh, and _prepare_lstm (renamed) to a util file.
Also renamed _prepare_lstm to _prepare_rnn since it is used for setting up both LSTM and GRU models.

The reason for this commit is to allow the creation of the GRU op and its testing code without copying and pasting the code for sigmoid, tanh, and setting up an RNN unit op model.

Reviewed By: jamesr66a

Differential Revision: D5363675

fbshipit-source-id: 352bd70378031f1d81606c9267e625c6728b18fd
2017-07-10 17:52:22 -07:00
Kevin Matzen
c096c188c3 minor leaky relu bug fixes
Summary:
numpy.random.rand generates samples from [0, 1), so the leaky relu test cases weren't testing negative inputs.  Tests still pass after the change.

Leaky relu can be used in-place, but the gradient took X rather than Y.  Technically the result is no different, since the input is only used for a sign test in the gradient, but it was updated to take Y to reduce confusion (see the sketch below).
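
A numpy sketch of the sign-test argument:

```
import numpy as np

def leaky_relu_grad(dY, Y, alpha=0.01):
    # Using Y instead of X is safe for the sign test: with alpha > 0,
    # leaky relu preserves sign, so (Y > 0) == (X > 0). This is what
    # makes the in-place variant work.
    return np.where(Y > 0, dY, alpha * dY)

# Test inputs should cover negatives, unlike numpy.random.rand():
X = np.random.uniform(-1.0, 1.0, size=(3, 4))
```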

Differential Revision: D5390126

fbshipit-source-id: d0c428abbb2797eb33902a7d2a2f59d5e85daaa6
2017-07-10 16:04:45 -07:00
Junjie Bai
ff3996acb9 Add NormalizeL1Op for doing L1 normalization along a given axis
Reviewed By: salexspb

Differential Revision: D5380220

fbshipit-source-id: 38fc56a1013c25b0c8b0fc161ca54fea412fb8b2
2017-07-10 10:10:36 -07:00
Bangsheng Tang
5f63f5697a IndexHash
Summary:
1. IndexHashOp
2. Helper class SparseFeatureHash
3. FeatureSpec changes to add desired_hash_size

Reviewed By: kennyhorror

Differential Revision: D5361370

fbshipit-source-id: bf02e3ca12b3654f1d291f77c8af9248b6c4ac55
2017-07-07 23:06:11 -07:00
Geet Sethi
86b6a6e2f8 Added PiecewiseLinearTransform CUDA Op
Summary: Added a CUDA implementation of the PiecewiseLinearTransformOp.

Differential Revision: D5378537

fbshipit-source-id: 38857f59f5cc52e16e1ecc97983a0b0b82a46c74
2017-07-07 15:20:00 -07:00
Clément Godard
cb7f17ab64 added gradients for ResizeNearest (CPU + CUDA) and ref
Summary:
1. Added gradients of the operation for both CPU and CUDA kernels.
2. Unified variable names across all ops.
3. Added a reference implementation in numpy.
4. The gradient check needs a larger stepsize to succeed; is that normal?

Reviewed By: akyrola

Differential Revision: D5313682

fbshipit-source-id: aceb92649e01c5caeba8774e678f9095502d396c
2017-07-07 14:19:42 -07:00
Tao Wu
5aa147f273 added PackRNNSequence and UnpackRNNSequence operators
Summary: Added two operators that can be used to transfer data into the input format of an RNN and back.

Reviewed By: kittipatv

Differential Revision: D5329886

fbshipit-source-id: 07eac29416427b08c49989d4eeed50a6f18493a1
2017-06-30 09:53:31 -07:00
Thomas Dudziak
5355634dac Dict fixes/improvements and unittest targets for Python 3 in caffe2 core
Summary: As title

Reviewed By: salexspb

Differential Revision: D5316104

fbshipit-source-id: aee43819d817842e5ce6ba3d045a55b1a2491c30
2017-06-29 17:05:41 -07:00
Andrew Tulloch
6c67a753c7 Fix test_pair_wise_loss_predictions
Summary: Increase absolute error tolerance.

Reviewed By: tomdz

Differential Revision: D5349604

fbshipit-source-id: 8e04001b0b6a6e83083f341e265ab3c0d2b06918
2017-06-29 12:48:04 -07:00
Luke Yeager
c3b4d277bf Tests: fix test_convolution_sync()
Summary:
This bug in the test was exposed by https://github.com/caffe2/caffe2/pull/861 (previously, the test was always using the cuDNN engine, regardless of the value of `engine`). This bug is now blocking https://github.com/caffe2/caffe2/pull/817.
```
____________________ TestConvolution.test_convolution_sync _____________________
...
            if use_cudnn and requested_engine != 'CUDNN':
                raise ValueError(
>                   'When use_cudnn=True, the only engine you can specify is '
E                   ValueError: When use_cudnn=True, the only engine you can specify is "CUDNN"
```
https://travis-ci.org/caffe2/caffe2/jobs/247605579
Closes https://github.com/caffe2/caffe2/pull/881

Differential Revision: D5332619

Pulled By: akyrola

fbshipit-source-id: 63737768a155359ddbbef1da424fcbb94f86bd4e
2017-06-27 18:07:04 -07:00
James Cross
08cfc72dee Increase threshold for test_unroll_attention
Summary: To 0.000001.

Reviewed By: salexspb

Differential Revision: D5323697

fbshipit-source-id: 5a06c8f5e719b5252e4229704205be37777a8bab
2017-06-27 17:17:32 -07:00
James Reed
07ba98b4b2 Allow specification of SliceOp dimensions via argument rather than via tensor
Summary: This should make it so we no longer have super hacky DAG chains just to generate vectors of indices that could be specified at model creation time

Reviewed By: akyrola

Differential Revision: D5316707

fbshipit-source-id: 97bb3868b69e0c5a7f465c95f2e16ae0485dcc56
2017-06-27 17:17:32 -07:00
Luke Yeager
dfd745a4d1 Conv frontend: checking engine and use_cudnn
Summary:
*Fixes https://github.com/caffe2/caffe2/issues/860*

Raise an exception when the user specifies conflicting values for `engine` and `use_cudnn` in the conv frontend.
Closes https://github.com/caffe2/caffe2/pull/861

Differential Revision: D5329587

Pulled By: akyrola

fbshipit-source-id: 0f1ced9a88c9c6c5a7cb30a070e5bf60129082f0
2017-06-27 09:47:48 -07:00
Luke Yeager
ca2bf16009 Tests: handle missing python-lmdb gracefully
Summary:
Fix issue mentioned here: 875a9850c1 (commitcomment-22773221)

Unblocks https://github.com/caffe2/caffe2/pull/817

/cc tomdz
Closes https://github.com/caffe2/caffe2/pull/871

Differential Revision: D5329573

Pulled By: akyrola

fbshipit-source-id: 855294f76bce82dce6d4bd489244922799848076
2017-06-27 09:47:46 -07:00
Zhicheng Yan
c0445c4426 support_multi_label
Summary: Extend image_input_op to support multi-label binary label vectors

Reviewed By: panshen1

Differential Revision: D5318119

fbshipit-source-id: da6757ed9a562f1ab58e3ae5642b7a70d6d499c1
2017-06-27 08:47:59 -07:00
James Reed
24e30534ea Implement SliceGradientOp for CPU
Summary: Implement slice gradient for CPU. Will soon port this over to GPU so NMT can use it

Reviewed By: akyrola

Differential Revision: D5309305

fbshipit-source-id: 8fb5f4e665f236ecce9227c5c0c302f5076b01ad
2017-06-26 21:18:05 -07:00
Andrew Tulloch
cb5af39c69 Vectorize CPU ClipOp implementation (and add test)
Summary: Noticed this wasn't vectorized, could be handy.

Reviewed By: kennyhorror

Differential Revision: D5308593

fbshipit-source-id: c2b35ece34831f0546f010a1ebe0b89f1a7d9446
2017-06-26 11:33:13 -07:00
Luke Yeager
553e4ec20d Refactor conv_test - no cuDNN+dilation+NHWC
Summary:
Place all the cuDNN version checks in a helper function. Easier to use
in future tests and update for newer versions of cuDNN in one place.

Fixes this error in `test_convolution_gradients`:
```
RuntimeError: [enforce fail at conv_op_cudnn.cc:519] status == CUDNN_STATUS_SUCCESS. 9 vs 0. , Error at: /data/caffe2/caffe2/operators/conv_op_cudnn.cc:519: CUDNN_STATUS_NOT_SUPPORTED Error from operator:
input: "X" input: "w" output: "Y" name: "" type: "Conv" arg { name: "stride" i: 1 } arg { name: "pad" i: 0 } arg { name: "order" s: "NHWC" } arg { name: "dilation" i: 2 } arg { name: "kernel" i: 1 } device_option { device_type: 1 } engine: "CUDNN"
```
Closes https://github.com/caffe2/caffe2/pull/839

Reviewed By: salexspb

Differential Revision: D5292123

Pulled By: akyrola

fbshipit-source-id: 513cc742be73c29ffe24e9e964845a217405a73d
2017-06-26 09:20:07 -07:00
James Cross
29887f556f Unrolled test for AttentionCell
Summary: Adding a test to check computational integrity of networks constructed with AttentionCell using UnrolledCell.

Reviewed By: salexspb

Differential Revision: D5306915

fbshipit-source-id: 02acfd1011f7d3ee5fac21cc2778c4a486190c43
2017-06-25 17:21:24 -07:00
Clément Godard
fd86c51c39 Add ResizeNearest
Summary: Added the CUDA implementation of ResizeNearest (forward pass only)

Reviewed By: wickedfoo

Differential Revision: D5290087

fbshipit-source-id: 4291e65b2b4b6a1a197275d5ed8710f40000b59e
2017-06-23 15:49:42 -07:00
Thomas Dudziak
342de07231 Core unit test fixes for Python 3
Summary: As title

Differential Revision: D5291327

fbshipit-source-id: 7dd9279c53ba55d3422c31973ffcec5705787fdf
2017-06-23 13:22:16 -07:00
Aapo Kyrola
667b8347a2 stabilize softmax_ops_test
Summary: softmax_ops_test occasionally fails with gradient checks. Stabilize by setting the numpy random seed. Also reduce some dimensions for the large input test to make it run faster.

Reviewed By: harouwu

Differential Revision: D5292106

fbshipit-source-id: a21eec89e18d30ac7c5609dacf5d413e841841a6
2017-06-22 13:50:32 -07:00
Ahmed Taei
a531d74dde ELU CUDA implementation
Reviewed By: wickedfoo

Differential Revision: D5290111

fbshipit-source-id: 95bd0b5467fe064f2fe1b21cb8ec31f150b35e3f
2017-06-21 11:47:13 -07:00
Luke Yeager
d46fe736c8 Fix flaky test in dataset_ops_test.py
Summary:
```
while pytest caffe2/python/operator_test/dataset_ops_test.py::TestDatasetOps::test_collect_tensor_ops; do sleep 0.1; done
```
Run this long enough and you'll see an error like this:
```
Sample histogram: [ 92 109  65 103  99 104  99 125 100 104]
...
>       self.assertTrue(all(hist > 0.7 * (num_to_collect / 10)))
E       AssertionError: False is not true
```
I've seen values like 65, 68, 69, 70. Setting the cutoff at 60 instead of 70 seems safe enough.

/cc Yangqing (or whoever authored a56b881c4a).
Closes https://github.com/caffe2/caffe2/pull/840

Differential Revision: D5292120

Pulled By: akyrola

fbshipit-source-id: 2ea4cbb58e206268759bd9d3639e8921623f519c
2017-06-21 05:35:44 -07:00
Luke Yeager
005156f6b4 Fix gradient checking for softplus op
Summary:
kmatzen why did you set the stepsize in ff84e7dea6?

The test is flaky before this change. Solid afterwards.
Closes https://github.com/caffe2/caffe2/pull/841

Differential Revision: D5292112

Pulled By: akyrola

fbshipit-source-id: c84715261194ff047606d4ec659b7f89dac3cbb1
2017-06-21 05:35:43 -07:00
Luke Yeager
5e084a9112 Don't require pydot for Python tests
Summary:
Working towards https://github.com/caffe2/caffe2/pull/817.
```
>       graph = pydot.Dot(name, rankdir=rankdir)
E       AttributeError: 'NoneType' object has no attribute 'Dot'
```
https://travis-ci.org/caffe2/caffe2/jobs/243867951
Closes https://github.com/caffe2/caffe2/pull/827

Differential Revision: D5276691

Pulled By: akyrola

fbshipit-source-id: 047ee869c029002ace29d84c6b56534b7f23f87b
2017-06-19 23:02:00 -07:00
Aapo Kyrola
a5c45e18b5 MaxGradientOp for CUDA + unit test
Summary: As title. Pretty straightforward. Could actually run each kernel in parallel, but we can optimize later if needed.

Reviewed By: Yangqing

Differential Revision: D5278415

fbshipit-source-id: 29f59afe28f37fc4152ec7eb7cd6c1ab65f2cb8c
2017-06-19 22:35:45 -07:00
Dmytro Dzhulgakov
a6fcecaa71 Allow AliasOp to work on empty tensor
Summary: Ran into it while working on a dper benchmark. Apparently it works harmlessly even with empty tensors.

Reviewed By: akyrola

Differential Revision: D5273672

fbshipit-source-id: a968ae03a659d6c1a215f12cc35f7ba68448e833
2017-06-19 15:24:02 -07:00
Wael Abdelghani
4b4022ded7 Make test_lstm_main more stable
Summary: Title

Reviewed By: Yangqing

Differential Revision: D5268569

fbshipit-source-id: f79c38376ef2dd0684fd438668b0762341d982cf
2017-06-19 12:36:29 -07:00
Luke Yeager
932cf9eb92 Fix entropy error coming from utility_ops_test
Summary:
Working towards https://github.com/caffe2/caffe2/pull/817.

`E           InvalidArgument: Insufficient bytes of entropy to draw requested array.  shape=(20, 12, 22), dtype=float32.  Can you reduce the size or dimensions of the array?  What about using a smaller dtype?  If slow test runs and minimisation are acceptable, you  could increase settings().buffer_size from 8192 to at least 43253760.`

https://travis-ci.org/caffe2/caffe2/jobs/243867951

/cc kittipatv
Closes https://github.com/caffe2/caffe2/pull/830

Differential Revision: D5276639

Pulled By: akyrola

fbshipit-source-id: 0c21be25ecd931837dc8b0c2cc17048f531350d1
2017-06-19 12:09:32 -07:00
Jeff Johnson
3f860af050 Implement TopKOp for GPU
Summary:
This is a real implementation (not GPUFallbackOp) of the TopKOp for GPU.

There are two algorithm implementations:

- for k <= 512, it maps to a warp-wide min-heap implementation, which requires only a single scan of the input data.
- for k > 512, it maps to a multi-pass radix selection algorithm that I originally wrote in cutorch. I took the recent cutorch code and removed some cutorch-specific things as it made sense.

Also added several utility files that one or the other implementations use, some from the Faiss library and some from the cutorch library.

Reviewed By: jamesr66a

Differential Revision: D5248206

fbshipit-source-id: ae5fa3451473264293516c2838f1f40688781cf3
2017-06-17 08:47:38 -07:00
James Reed
21dc425e07 Optimize SumSqrElementsOp for CUDA
Summary: The old version used one block with 128 threads. Throughput was too low for the NMT use case (calculating squared gradient norms for every parameter), so this increases the throughput. Shaves 7% off CNN model training time per step

Reviewed By: wickedfoo

Differential Revision: D5263748

fbshipit-source-id: adc3bacd11e49ea00c60381d613d993050e899be
2017-06-16 17:03:38 -07:00
Simon Layton
176a841087 Fixes for CuDNNDropoutOp
Summary: Closes https://github.com/caffe2/caffe2/pull/809

Differential Revision: D5263514

Pulled By: akyrola

fbshipit-source-id: 1f1e5bdb6fa551cb1f9beb3e5d3ad9c0c8813ed0
2017-06-15 22:51:12 -07:00
Kittipat Virochsiri
fc2a8d045c adding flatten indices output to TopK
Summary: This makes it easier to gather the top-K by group of rows. This is useful when we want to pick the top-K from a batch of fixed-length sessions. Let `N` be the number of sessions and `M` the number of examples in a session; we have a batch of `N * M` rows. We can reshape the score blob to `N x M` and use it as input to `TopK` to select the top score for each session. However, without the new output it would be inconvenient to gather the rows corresponding to the top scores, since the indices are in the `[0, K-1)` range. The new output can be used directly as input to `Gather`.
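
A numpy sketch of how the flattened indices line up with `Gather` (names illustrative):

```
import numpy as np

N, M, K = 2, 5, 2
scores = np.random.randn(N * M)                              # N sessions, M rows each
topk_idx = np.argsort(-scores.reshape(N, M), axis=1)[:, :K]  # per-session indices
# Flattened indices address rows of the original (N*M)-row blob directly,
# so they can be fed straight into Gather:
flat_idx = (np.arange(N)[:, None] * M + topk_idx).ravel()
top_rows = scores[flat_idx]
```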

Reviewed By: chocjy

Differential Revision: D5171459

fbshipit-source-id: 69f7b41456c3f9670650ae07afc8fef8328485e9
2017-06-15 15:32:29 -07:00
Luke Yeager
84cc82cf3f Fix stats_ops_test
Summary:
The global StatRegistry doesn't get reset when the workspace is reset.
```
>       self.assertTrue(len(workspace.FetchBlob('k3')) == 2)
E       AssertionError: False is not true
```
https://travis-ci.org/lukeyeager/caffe2/jobs/240162665

/cc azzolini

NOTE: this error doesn't show up if you just run `stats_ops_test.py` directly. It shows up when you run other tests in the same session before this test:
```
pytest -v caffe2/python/
```
Closes https://github.com/caffe2/caffe2/pull/788

Differential Revision: D5259232

Pulled By: salexspb

fbshipit-source-id: 3c72633af6bb61c4fda62195298b1e9574b4cbef
2017-06-15 15:07:57 -07:00
Dmytro Dzhulgakov
e9cba7e69f Option to read from dataset indefinitely.
Summary: Useful for benchmarking

Reviewed By: kdub0

Differential Revision: D5226758

fbshipit-source-id: 6f3e6dd256f2c40ab71e598a7ce47cd06099adff
2017-06-15 15:07:53 -07:00
James Reed
d9d89b191d implement SliceOp for GPU
Summary: Implementation of the SliceOp for CUDA

Reviewed By: akyrola

Differential Revision: D5254287

fbshipit-source-id: 0a1660e1aa161fd088a2d8f886e019c05a1919a2
2017-06-15 14:34:34 -07:00
Luke Yeager
f61e4ca070 Fixes in tests to support numpy >= 0.12
Summary:
```
  File "/data/caffe2/install/caffe2/python/hypothesis_test.py", line 1911, in test_batch_to_space
    (w + 2 * pad) / block_size).astype(np.float32)
  File "mtrand.pyx", line 1404, in mtrand.RandomState.randn (numpy/random/mtrand/mtrand.c:19843)
  File "mtrand.pyx", line 1534, in mtrand.RandomState.standard_normal (numpy/random/mtrand/mtrand.c:20368)
  File "mtrand.pyx", line 167, in mtrand.cont0_array (numpy/random/mtrand/mtrand.c:6127)
TypeError: 'float' object cannot be interpreted as an index
```
```
  File "/data/caffe2/install/caffe2/python/operator_test/tile_op_test.py", line 101, in tile_ref
    tiled_data = np.tile(X, tuple(dims))
  File "/data/caffe2/venv/local/lib/python2.7/site-packages/numpy/lib/shape_base.py", line 881, in tile
    return c.reshape(shape_out)
TypeError: only integer scalar arrays can be converted to a scalar index
```
I also tested to make sure this still works with 0.11.
Closes https://github.com/caffe2/caffe2/pull/787

Differential Revision: D5248087

Pulled By: salexspb

fbshipit-source-id: eff69482a8eabb8ace330003fa326c832b53865f
2017-06-15 14:17:20 -07:00
Aapo Kyrola
7bf4c0e0fb support RNNs in ExtractPredictorNet
Summary:
We need to support RNNs explicitly in ExtractPredictorNet, because they store sub-nets as strings in special arguments. Once netdef-typed arguments arrive, we can generalize this a bit.

Added a test under rnn_cell_test to test that extracting an LSTM predictor net works correctly and sets the device option properly for the step net ops.

Reviewed By: yqwangustc

Differential Revision: D5236334

fbshipit-source-id: cd653427f8c440a14d94195a532d18276f94749a
2017-06-14 22:32:29 -07:00
haracejacob
2ec294a8bb Fix a few typos and grammar in comments
Summary:
Fix a few typos and grammatical errors in comments using language-check, a Python library.
The spell_checker source code is here: https://github.com/17-1-SKKU-OSS/011A/blob/master/spell_checker/spell_checker.py
Here is the text file indicating what should be fixed: https://github.com/17-1-SKKU-OSS/011A/tree/master/spell_checker/fix/caffe2
Closes https://github.com/caffe2/caffe2/pull/719

Differential Revision: D5165118

Pulled By: aaronmarkham

fbshipit-source-id: 7fb8ef7a99d03cd5fd2f9ebdb01b9865e90fc37b
2017-06-14 18:22:39 -07:00
Ahmed Taei
94d42b03fb MaxReduction ops GPU implementation.
Summary:
Move the rowwise-max kernel from Softmax to the math_util library and implement a
colwise-max kernel and the MaxReduction ops.

Reviewed By: akyrola

Differential Revision: D5240329

fbshipit-source-id: a07281a877324de459aace33ff21175a68cfd8f6
2017-06-14 11:02:46 -07:00
Bokai Cao
0f787a01bc map operator (move maptrait def out of class)
Summary: added an operator that converts key/value blobs into a blob containing a map pointer; unit tests pass.

Differential Revision: D5224449

fbshipit-source-id: 2f60754ed3ba6ed16039c09019117ae3c3646ab2
2017-06-12 14:52:04 -07:00
Bokai Cao
e01769ece5 map operator
Summary: added an operator that converts key/value blobs into a blob containing a map pointer; unit tests pass.

Differential Revision: D5166513

fbshipit-source-id: 748527c423a163fe55f914c08fff3adfc74a540c
2017-06-09 15:17:29 -07:00
Alexander Sidorov
df72826ead Static RNN
Summary:
Static RNN allows unrolling an RNN into a Caffe2 graph using all the existing cell abstractions. In this diff I introduce several new tests that have already caught a few bugs in our RecurrentNetworkOp gradient accumulation logic, by comparing it to an unrolled version.

Another use case is performance: potentially we can run an unrolled net faster because DAGNet will have access to the whole graph. The same goes for memonger. But that work is not part of this diff.

Reviewed By: akyrola

Differential Revision: D5200943

fbshipit-source-id: 20f16fc1b2ca500d06ccc60c4cec6e81839149dc
2017-06-08 17:48:48 -07:00
Luke Yeager
52ee7697f4 Fixing broken Python tests
Summary:
`brew_test.py` is just plain broken. `core_test.py` doesn't work with pytest. `apmeter_test.py` and `top_k_test.py` don't work for CUDA builds.
Closes https://github.com/caffe2/caffe2/pull/765

Differential Revision: D5211817

Pulled By: Yangqing

fbshipit-source-id: 78ec5af35a3fa870978e4c9590210ade9e3bc5ac
2017-06-08 13:34:46 -07:00
Luke Yeager
75f1da327d Skip Python tests which require opencv or lmdb
Summary:
Neither dependency is required by the core Python modules.

OpenCV, in particular, is a pain to install (no pip package). Conditionally skipping this test will make TravisCI integration easier.
Closes https://github.com/caffe2/caffe2/pull/739

Differential Revision: D5211799

Pulled By: Yangqing

fbshipit-source-id: c6bdc8a17977f64f34e968fd9ab8c65161d2624d
2017-06-08 13:34:43 -07:00
Ran Xian
4316fb4876 Implement APMeter op
Summary: Implements an APMeter operator (APMeterOp) to calculate AP for multiclass classification given prediction scores and labels. The op takes a score tensor [nsamples x nclasses] and a label tensor [nsamples x nclasses], and outputs a float tensor of size nclasses with the AP for each class.
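
A numpy sketch of per-class AP as commonly defined (the op's exact tie-breaking and normalization should be checked against its implementation):

```
import numpy as np

def average_precision(scores, labels):
    # scores, labels: (nsamples,) for a single class; labels are {0, 1}.
    order = np.argsort(-scores)
    hits = labels[order].astype(np.float64)
    precision_at_k = np.cumsum(hits) / (np.arange(len(hits)) + 1.0)
    return (precision_at_k * hits).sum() / max(hits.sum(), 1.0)

print(average_precision(np.array([0.9, 0.1, 0.8, 0.4]),
                        np.array([1, 0, 0, 1])))  # ~0.833
```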

Reviewed By: akyrola

Differential Revision: D5082565

fbshipit-source-id: ae7304bc8fc999c361245b9aec38eb9a5f5eef4b
2017-06-07 15:03:04 -07:00
Thomas Dudziak
d524d5b481 Fixes zip/izip for Python 3
Summary: As title

Reviewed By: salexspb

Differential Revision: D5154186

fbshipit-source-id: 2ef24557d82ae16d3bdfbc90a4cc96be8e2dc6c3
2017-06-07 00:04:26 -07:00
Thomas Dudziak
60c78d6160 Fixes range/xrange for Python 3
Summary: As title

Differential Revision: D5151894

fbshipit-source-id: 7badce5d3122e8f2526a7170fbdcf0d0b66e2638
2017-06-07 00:04:26 -07:00
Ahmed Taei
4c5d101caf Implement ColwiseMax and RowwiseMax reduction ops.
Differential Revision: D5192949

fbshipit-source-id: e7e877b4bea19dd1be94449d45d2733f4858b8e7
2017-06-06 21:17:29 -07:00
Wenyi Huang
7723129d14 Add gradient for topK op
Summary:
Input of the topK op: X (dense).
Output of the topK op: Value and Indices (a sparse representation).
Value will have a gradient in some cases.

We backprop (copy) the gradient from sparse (dValue) to dense (dX).

Differential Revision: D5133461

fbshipit-source-id: 7bad55b60e8a22dfe0e51357ce2099d7f752c133
2017-06-06 14:20:06 -07:00
Luke Yeager
d8d1cd1064 Test smaller tensors in segment_ops_test
Summary:
It's causing problems inside docker containers:

`InvalidArgument: Insufficient bytes of entropy to draw requested array.  shape=(5, 9, 10, 5), dtype=float32.  Can you reduce the size or dimensions of the array?  What about using a smaller dtype? If slow test runs and minimisation are acceptable, you  could increase settings().buffer_size from 8192 to at least 18432000.`
Closes https://github.com/caffe2/caffe2/pull/707

Differential Revision: D5162621

Pulled By: Yangqing

fbshipit-source-id: 55544210961cbc80828dca2cbeba6a5ace8cf8d1
2017-05-31 20:17:31 -07:00
Luke Yeager
e2cf007dc8 Avoid numpy VisibleDeprecationWarning in test
Summary:
This warning becomes an error with https://github.com/numpy/numpy/pull/6271 (`>=0.12.0`).

```
caffe2/python/operator_test/tile_op_test.py::TestTile::test_tilewinput
  /opt/caffe2/caffe2/python/operator_test/tile_op_test.py:100: VisibleDeprecationWarning: converting an array with ndim > 0 to an index will result in an error in the future
    dims[axis] = tiles
  /usr/lib/python2.7/dist-packages/numpy/lib/shape_base.py:873: VisibleDeprecationWarning: converting an array with ndim > 0 to an index will result in an error in the future
    return c.reshape(shape_out)
```
Closes https://github.com/caffe2/caffe2/pull/710

Differential Revision: D5160776

Pulled By: Yangqing

fbshipit-source-id: b264e0e389de5817a289db878c15e655f9fa2f09
2017-05-31 20:01:30 -07:00
Aapo Kyrola
96d8ae2163 Make fills work with input_shape when run in CUDAContext
Summary: If ConstantFill (or another fill op) is used in CUDAContext with input_as_shape, the code crashes: it expects the shape to be in CUDAContext but accesses the array in host code. We could fix this by copying the values from the CUDA tensor, but it is probably best to enforce that the shape blob is in CPU context, which is what this diff does.
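
A usage sketch with the Python API, assuming illustrative blob names:

```
import numpy as np
from caffe2.python import core, workspace

# The shape tensor is read on the host, so it must live in CPU context
# even when the fill itself is placed on a GPU device.
workspace.FeedBlob("shape", np.array([2, 3], dtype=np.int64))
op = core.CreateOperator(
    "ConstantFill", ["shape"], ["out"], input_as_shape=1, value=1.0)
workspace.RunOperatorOnce(op)
print(workspace.FetchBlob("out").shape)  # (2, 3)
```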

Differential Revision: D5152766

fbshipit-source-id: 0629a189bd1d800c0b7c9dbc324b78d279efac0b
2017-05-30 20:47:16 -07:00
Thomas Dudziak
47e921ba49 Remove map() and filter() in favor of comprehensions
Summary: In Python 3 these return views, which would not do anything in a lot of the usages currently present in Caffe2. This diff removes (almost) all usages of the two in Caffe2 and subprojects in favor of comprehensions, which are also easier to read and understand.

Reviewed By: akyrola

Differential Revision: D5142049

fbshipit-source-id: e800631d2df7d0823fed698cae46c486038007dc
2017-05-30 15:32:58 -07:00
Luke Yeager
0a9684c3b9 Mark in-place GPU dropout as broken, add test
Summary:
I'll let y'all decide how you want to fix this (probably need a persistent curand buffer). Here's a test to verify the fix.
Closes https://github.com/caffe2/caffe2/pull/495

Differential Revision: D5148815

Pulled By: akyrola

fbshipit-source-id: e80dabe65230ddd32340f2d872cd8786ac960bf8
2017-05-30 12:35:22 -07:00
Luke Yeager
a47652379f Fix SparseAdagrad for indices.ndim>1
Summary:
Same fix as https://github.com/caffe2/caffe2/pull/249, but for SparseAdagrad.

Also update the tests for both ops to test this functionality.
Closes https://github.com/caffe2/caffe2/pull/675

Differential Revision: D5148750

Pulled By: akyrola

fbshipit-source-id: d30b722429bc547fd53400c1a29e4ee9e2e6ed18
2017-05-30 12:02:18 -07:00
Luke Yeager
16b240145a Fixing some tests
Summary:
As dzhulgakov said at https://github.com/caffe2/caffe2/pull/227#issuecomment-295084443, it would be nice to avoid this stream of CPU-only test fixes.

The second fix could have been avoided if tests were run on TravisCI. I think the TravisCI infra could be greatly improved if we used ccache like your colleagues at PyTorch: https://github.com/pytorch/pytorch/pull/614. Would you be interested in a PR which does this?
Closes https://github.com/caffe2/caffe2/pull/547

Differential Revision: D5147405

Pulled By: akyrola

fbshipit-source-id: 5e9a4571d364c5f0ed8a5e216c9b6136dd4d10be
2017-05-30 09:16:48 -07:00
Anmol Kalia
7f98dc28cb Refactored spatial softmax
Summary: Refactored SoftmaxWithLoss by removing the code for the spatial=1 mode and creating a new op, SpatialSoftmaxWithLoss, that implements the spatial mode.

Reviewed By: viswanathgs

Differential Revision: D5104120

fbshipit-source-id: 8ab999e32c916b2a39a670a7b2a3365401535f24
2017-05-26 14:50:43 -07:00
Aapo Kyrola
d60a2e3c58 UnsortedSegmentSum/Mean for CUDA
Summary:
To make the optimizer for sparse gradients work with CUDA, we need UnsortedSegmentSum and UnsortedSegmentMean implemented for CUDA. Unique was already implemented by harouwu.

Pretty straightforward implementations; they should be fast enough -- and I don't know a faster way anyway.

Added some tests as well.

Reviewed By: asaadaldien

Differential Revision: D5124548

fbshipit-source-id: 63ae72f45fc2f07470603f7b2de12f34635dbb3d
2017-05-26 09:33:49 -07:00
Mohamed Fawzy
e35a4fe5cc Implement SizeOp as requested in github issue#583
Summary:
Implement SizeOp, which returns the number of elements in the input
tensor.

The output is a 1D tensor containing that count.

Reviewed By: akyrola

Differential Revision: D5101061

fbshipit-source-id: d1c56053b6f3b41c65ac574dd748482775d1ea0d
2017-05-25 11:07:35 -07:00
Aapo Kyrola
f2303ccb77 fix tileop test
Summary: The gradient test for the tile op was flaky because I had made the dimensions too large, which caused push-blocking errors. Also, I noticed my test_grad_tile was incorrect.

Reviewed By: asaadaldien

Differential Revision: D5126476

fbshipit-source-id: ae9ce5d9041648d7a4535fc88d4013e669bd6f02
2017-05-24 18:32:01 -07:00
James Cross
c39f6cf2d0 gradient accumulation fix
Summary: As noted by salexspb, MultiRNNCell had unreliable gradient computation. The problem was that the recurrent gradient and the gradient computed within the backward step net were not being accumulated during the backward pass, but rather written to the same blob, thus overwriting each other. This diff fixes that by artificially introducing an extra blob for the internal output, and then accumulating it into the gradient coming from the recurrent connection.

Reviewed By: salexspb

Differential Revision: D5110059

fbshipit-source-id: 16add50989fe8866361bbc21afce5f214c5292fd
2017-05-24 10:33:32 -07:00
Aapo Kyrola
2b11adb414 TileOp CUDA fix: number of threads must be hard coded
Summary:
I had "optimized" the number of threads / block, but cub::BlockReduce has a static template parameter for the number of threads, and this must match. Probably tests still passed because typically the initial numbers are zeros.

Also added a stronger test.

Thanks ves for the report.

Differential Revision: D5110901

fbshipit-source-id: c1169b1286e204c202b0727448ddb51b4965eacb
2017-05-23 09:32:19 -07:00
Ahmed Taei
09bbd0382c ConvNd cuDNN
Summary: Add ConvND cuDNN implementation.

Reviewed By: akyrola

Differential Revision: D4702205

fbshipit-source-id: 65275bcff3970b0d43ac5c168d38bcd075985979
2017-05-19 15:20:33 -07:00
Yiming Wu
65750349ba deprecate CNNModelHelper in python/operator_test dir
Summary:
deprecate CNNModelHelper in python/operator_test dir

BTW I found that there are 2 mkl_speed_tests. I am confused...

Reviewed By: salexspb

Differential Revision: D5094122

fbshipit-source-id: f6526f4de334f2245eb4c1f204a8ec9f23750d78
2017-05-19 12:17:17 -07:00
Ahmed Taei
32bf7a2c2b Generalize PoolingOp(cuDNN) to compute 2D and 3D pooling.
Reviewed By: akyrola

Differential Revision: D5090689

fbshipit-source-id: f9f11e12adc0ee8db088f3397a8c33aa31eb5deb
2017-05-19 10:19:00 -07:00
Pooya Davoodi
307459eb62 Fix conv_test for CUDNN dilated convolution in NHWC
Summary:
CUDNN dilated convolution was added to V6. This version of CUDNN does not support NHWC for dilated convolution.

Fix conv_test.py so that it does not test CUDNN for dilated convolution in NHWC format.
Closes https://github.com/caffe2/caffe2/pull/598

Reviewed By: akyrola

Differential Revision: D5084835

Pulled By: asaadaldien

fbshipit-source-id: 3c0c5ed02c5d9232fca567e387ab6260d71e5aaf
2017-05-18 10:07:28 -07:00
James Reed
85f1d947dd Vectorize SigmoidOp on CPU
Summary: I noticed that Sigmoid was taking an inordinate amount of time in our NMT benchmark, so I looked at the implementation and it didn't seem optimal. I replaced the implementation with an Eigen version so that when the Eigen update goes through, we will get proper AVX(2) vectorization.

Differential Revision: D5082464

fbshipit-source-id: aa951f7d730fc05198f7dd04076ec58d471b74c8
2017-05-17 20:33:36 -07:00
Ben Zhang
12edbcb154 Implemented L1Distance Operator for CUDA
Summary: Added L1Distance Operator for CUDA, as well as tests.

Reviewed By: bwasti

Differential Revision: D5071966

fbshipit-source-id: 4c3d862605e9123d955bf091efa67d0731bd816a
2017-05-17 17:32:53 -07:00
Pieter Noordhuis
bbd7aee9ab Revert D4952993: [Caffe2] fix mkl_sparse and migrate sparsity experiments
Summary: This reverts commit 86c03676ab4e47f04d2d0dd438a4a1c849bbbff0

Differential Revision: D4952993

fbshipit-source-id: 5c213c48ac44ce6aefccacc6d80534648d3c516a
2017-05-17 14:46:56 -07:00
Yiming Wu
f359d70ae7 fix mkl_sparse and migrate sparsity experiments
Summary:
Migrate the experiments folder to the fb/sparse folder. Keep FunHashOp and SparseFunHashOp because they are now assumed to be default ops in dper. What I did:

1. Migrated FunHashOp and SparseFunHashOp and their unit tests to core caffe2; made sure tests pass.
2. Migrated other ops in the experiments folder to the fb/sparse folder and wrote new TARGETS files for them; made sure tests pass.
3. Made sure all related tests pass.
4. Fixed the MKL definition along the way; made sure that FC_Sparse is not compiled when there is no MKL support.

Reviewed By: salexspb

Differential Revision: D4952993

fbshipit-source-id: 86c03676ab4e47f04d2d0dd438a4a1c849bbbff0
2017-05-16 18:33:51 -07:00
Ben Zhang
93f1d0ca7c L1 Operator
Summary: Adds the L1 Distance operator to distance_op.

Reviewed By: bwasti

Differential Revision: D5007719

fbshipit-source-id: fd547c6645cf5f87305e9ebfd95ed918779c1d2a
2017-05-11 18:03:10 -07:00
Ahmed Taei
8df51a84ac Support 3D&1D SpatialBatchNorm[CPU]
Summary:
Generalize SpatialBatchNorm CPU Op to compute Spatial batch normalization for
1D, 2D & 3D input tensors.

Reviewed By: dutran

Differential Revision: D5043563

fbshipit-source-id: 7fcb933a628dd47f13aa622f63601a87382f09cd
2017-05-11 09:32:54 -07:00
Romain Cledat
e16ea46013 Extended ImageInputOp
Summary:
Added several features to the ImageInputOp:
  - bounding box (per image as well as default for the operator). For per-image, it
    only works in Caffe2 format and is passed as the third tensor in the form
    (ymin, xmin, height, width). For the operator, pass bounding_xmin, bounding_ymin,
    bounding_width and bounding_height as parameters.
  - per-channel mean/std. You can use the usual mean/std to pass a single
    value to be used for all channels or also pass mean_per_channel and std_per_channel
    to specify different values per channel. Order of channels is BGR.
  - A minimum size parameter that can be specified instead of the scale parameter.
    The minsize parameter will only scale the image if it is smaller than required.
    This differs from scale which will scale up as well as down. You can only specify
    one of scale or minsize.

Added a test case to test some of the features

Differential Revision: D4874988

fbshipit-source-id: 437191052a46e9916defe8b100d7cc7864373f61
2017-05-10 17:52:01 -07:00
Yury Zemlyanskiy
3abd0cb623 Add axis argument to SoftmaxWithLoss
Summary: `axis` argument for SoftmaxWithLoss (it doesn't yet work for the spatial case).

Reviewed By: akyrola

Differential Revision: D5025797

fbshipit-source-id: 9e3cf39223af3f2c8bb357f8d9fe952b7349f913
2017-05-09 19:36:00 -07:00
Simon Layton
1d0ba2cfbd New cudnn ops
Summary:
cuDNN versions of dropout and LRN (for native fp16 support), port of Caffe's max pooling algo that uses an explicit mask to store locations (also supports fp16 storage)
Closes https://github.com/caffe2/caffe2/pull/396

Reviewed By: akyrola

Differential Revision: D4990880

Pulled By: asaadaldien

fbshipit-source-id: a716acffb656843e9b31e3e6808bd2d8aa959d03
2017-05-08 16:33:21 -07:00
Yury Zemlyanskiy
11052d03aa RNNCell API change: returns states and outputs
Summary:
Incorporating a definition of a cell's output and illustrating its usage by adding dropout to all types of cells.

I think we should try to get rid of aliases in RecurrentNetwork, so that the output of applied_over_sequence is also always (state_1_all, state_2_all, ...). This way we can merge get_output_from_single_step, get_output_from_sequence and get_outputs_with_grads into a single method.

Let me know what you think!

Reviewed By: jhcross

Differential Revision: D4992913

fbshipit-source-id: 737939be336ad145f84e8733cd255d4f7188ef70
2017-05-08 15:19:48 -07:00
Kevin Matzen
0cb7774445 softplus op
Summary: Added softplus function, f(x) = ln(exp(x) + 1)
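
A numpy sketch, using the usual overflow-safe rewrite:

```
import numpy as np

def softplus(x):
    # f(x) = ln(1 + exp(x)), written in an overflow-safe form.
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

print(softplus(np.array([-100.0, 0.0, 100.0])))  # [0., 0.6931..., 100.]
```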

Reviewed By: akyrola

Differential Revision: D5011057

fbshipit-source-id: 5fddb1568fee625f81ea3a86a85d0f400c3ee278
2017-05-08 10:40:25 -07:00
Jon Morton
ac1c63dda8 Add specialized ResizeNearest implementation for scale=2
Summary:
Specialized implementation of ResizeNearest for width_scale=2 and height_scale=2. This implementation doesn't use divides or calls to std::min, and is unrolled 2x over the width dimension. Also add a correctness test.

About 6x faster.
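
A numpy reference of the scale=2 forward path:

```
import numpy as np

def resize_nearest_2x(x):
    # x: (N, C, H, W) -> (N, C, 2H, 2W). Each source pixel expands into a
    # 2x2 block, so no divides or min() clamping are needed per pixel.
    return x.repeat(2, axis=2).repeat(2, axis=3)

x = np.arange(4, dtype=np.float32).reshape(1, 1, 2, 2)
print(resize_nearest_2x(x)[0, 0])  # [[0 0 1 1] [0 0 1 1] [2 2 3 3] [2 2 3 3]]
```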

Reviewed By: ajtulloch

Differential Revision: D4928579

fbshipit-source-id: 5cc92a52bd688690fee907b4333d9c84b666f9c9
2017-05-07 21:10:11 -07:00
Du Tran
033ab9da1b Adding video data layer for caffe2
Summary: Adding a simple video data layer which allows reading video data from frames or videos and outputs a 5D tensor. It also allows multiple labels. The current implementation is based on ffmpeg.

Differential Revision: D4801798

fbshipit-source-id: 46448e9c65fb055c2d71855447383a33ade0e444
2017-05-05 14:16:38 -07:00
James Cross
5c667ebe4e AttentionCell
Summary:
This diff creates a generalized AttentionCell class, which will allow us to construct attention decoders out of arbitrary RNNCell components (with a particular view to using stacked, multi-layer RNNs).

In order to do this, we introduce a new optional input for RNNCell._apply which allows us to provide an additional input that is not processed by prepare_input(). Note that this is an argument only to _apply, not apply, since it is only meant to be used for additional recurrent connections to "embedded" cells, not for standalone RNNs.

Reviewed By: urikz

Differential Revision: D4998465

fbshipit-source-id: 473009ea4917e86e365f9d23aa2f11a46a94fd65
2017-05-05 12:33:01 -07:00
Romain Cledat
aa5e771042 Added tiles and axis as input parameters to Tile Operator
Summary:
Added the possibility of passing 'tiles' and 'axis' as inputs,
as opposed to arguments, for the Tile operator. If provided, the input
values override the argument values. Now with proper CUDA code.

Differential Revision: D4930347

fbshipit-source-id: b44b032b327c7d7bddfce63abf4e3289d7e74bfb
2017-05-04 23:46:51 -07:00