Summary: CUDA version of the AddPadding op. It first executes a prefix sum using CUB to compute the cumulative lengths array. Then it launches a kernel that uses this information to fill the output tensor with the start padding, the end padding, and the actual contents.
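A minimal NumPy sketch of the semantics (not the CUDA kernel itself): an exclusive prefix sum over the lengths gives each sequence's offset, and the output is assembled as start padding, contents, end padding per sequence. The padding shapes here are illustrative assumptions.
```
import numpy as np

def add_padding_ref(data, lengths, start_pad, end_pad):
    # Exclusive prefix sum of lengths -> offset of each sequence in `data`.
    offsets = np.concatenate([[0], np.cumsum(lengths)[:-1]])
    pieces = []
    for off, ln in zip(offsets, lengths):
        pieces.extend([start_pad, data[off:off + ln], end_pad])
    return np.concatenate(pieces, axis=0)

data = np.arange(5, dtype=np.float32).reshape(5, 1)
pad = np.zeros((1, 1), dtype=np.float32)
print(add_padding_ref(data, np.array([2, 3]), pad, pad))
```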
Reviewed By: asaadaldien
Differential Revision: D6391413
fbshipit-source-id: 45b431e5976674729e53cb4752c7753c1d8a69e8
Summary: Cast op cuda can deal with empty batch now.
Reviewed By: azzolini
Differential Revision: D6350138
fbshipit-source-id: 2f3d19f4d42ff34806aa9597690e66f6b4de1a6b
Summary:
Two ops: BatchSparseToDenseOp and DenseToBatchSparseOp
Inverse operations of each other.
Details are described in the op doc.
These ops are used along with flexible topK, where the output is
lengths, indices, and values.
We want to do softmax on the values, but the dimension of each batch is different, so these ops convert the sparse representation to dense and vice versa. The two ops are also gradient ops for each other.
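A hedged NumPy sketch of the round trip described above (the dense width argument and zero default are assumptions for illustration):
```
import numpy as np

def batch_sparse_to_dense(lengths, indices, values, dense_last_dim):
    # One dense row per batch item; positions named by `indices` get `values`.
    dense = np.zeros((len(lengths), dense_last_dim), dtype=values.dtype)
    offset = 0
    for row, ln in enumerate(lengths):
        dense[row, indices[offset:offset + ln]] = values[offset:offset + ln]
        offset += ln
    return dense

def dense_to_batch_sparse(lengths, indices, dense):
    # Inverse: read back only the positions named by (lengths, indices).
    out, offset = [], 0
    for row, ln in enumerate(lengths):
        out.append(dense[row, indices[offset:offset + ln]])
        offset += ln
    return np.concatenate(out)

lengths = np.array([2, 1])
indices = np.array([0, 3, 2])
values = np.array([1.0, 2.0, 3.0], dtype=np.float32)
dense = batch_sparse_to_dense(lengths, indices, values, 4)
assert np.allclose(dense_to_batch_sparse(lengths, indices, dense), values)
```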
Reviewed By: chocjy
Differential Revision: D6288338
fbshipit-source-id: 0ba9e611058b39e46e7414dcc5f39cab29915fa3
Summary:
This is part one: it adds the lambdaNDCG loss, which can be used to heuristically
optimize the NDCG metric.
Differential Revision: D5830650
fbshipit-source-id: 1eb696337c9a77727ad40219c68f6468e2e097a5
Summary:
Summary: Datatypes were being handled badly in the reference check, causing sporadic failures in CI. All batched mat-mul with fp16 data is performed as pseudo-fp16, with all math in fp32. Adjusted the reference implementation to reflect this.
Adjusted the gradient check threshold to the best value I could get to consistently pass.
Closes https://github.com/caffe2/caffe2/pull/1406
Differential Revision: D6324431
Pulled By: pietern
fbshipit-source-id: 83ff2584438a11f7a6db4599a4fb0e75e9e15a3d
Summary: add NegateGradientOp: in the forward pass, this op simply copies the input to the output. In the backward pass, it flips the sign of the gradients.
Reviewed By: dragonxlwang
Differential Revision: D6314456
fbshipit-source-id: 56afd8b131eff9f7e120ab7e4e87461df49649d4
Summary: The topk GPU test was taking too much time, but there are still a variety of codepaths to test (k <= 1024, k > 1024, k == 1, k == n). Reduce the batch sizes and n to reduce time taken by the in-python CPU code equivalent.
Reviewed By: pietern
Differential Revision: D6272628
fbshipit-source-id: b8b8f3601f28bf64f144c73d7c9e915f40c84d70
Summary: The number of elements in the caffe2 blob can be larger than int32. Use size_t to prevent overflow.
Reviewed By: ajtulloch
Differential Revision: D6278363
fbshipit-source-id: 356e294c667a53360d8a65b56a63a39d5ce3384e
Summary:
Will probably rename to adaptive topK to be aligned with the layer name.
The main difference from the top_k op is that K is not fixed as a layer parameter;
instead, this op takes in a blob that contains the K information for each row of the input data (batch mode).
Reviewed By: chocjy
Differential Revision: D6221209
fbshipit-source-id: f7fd575ff8f515d886d93278ad94fd17e8bd6fa5
Summary:
This seems to be faster in a bunch of cases. Prefer to keep it as a
separate op instead of MatMul + Add so it's easy to compare perf on a per-op
basis between this one and the baseline (normal FC).
Reviewed By: akyrola
Differential Revision: D6169187
fbshipit-source-id: 09b96325d44bd181896f396aec88b27314c435b0
Summary: Previously, the boundary checking was happening after the first access for the 8-bit ops.
Reviewed By: Yangqing
Differential Revision: D6206753
fbshipit-source-id: 07ab240cae8c67b3048f03aa79af0b6399b9940b
Summary: Updated brew SpatialBN to use initializers similar to other brew ops such as conv and fc, instead of initializing all of its parameters itself within the brew call.
Reviewed By: asaadaldien
Differential Revision: D5840359
fbshipit-source-id: 9f3d688d4957605eaf7ecd2488bc26bfb1da3f78
Summary:
Implemented a new CUDA class for the SparseAdagrad operator. The param and moment inputs can now be float or float16.
The functions for mixed-precision add/mult/store are defined in a separate header file ("caffe2/core/float16_util.h") for reuse.
Reviewed By: azzolini
Differential Revision: D5880200
fbshipit-source-id: dca227f38629a03a9d771f42efe2c0b673075c4d
Summary: Allow the GEMMs in the FC/FCGradient Op to do FP16 compute instead of FP32 if the appropriate op flag is set.
Reviewed By: asaadaldien
Differential Revision: D5839777
fbshipit-source-id: 8051daedadf72bf56c298c1cf830b019b7019f43
Summary: Given an additional tensor containing the values corresponding to the weighted samples, add a tensor output that contains the values selected by the sampled indexes.
Reviewed By: akyrola
Differential Revision: D6050094
fbshipit-source-id: 1eccc641b99e30d36ae83d49f630b018a53e4147
Summary:
Added two new ops, FP16MomentumSGDUpdate and FP32MomentumSGDUpdate, which perform both the momentum SGD and weight decay updates to a given parameter in a single op -- thus being more efficient.
Also updated the standard momentum SGD test to check that Nesterov momentum works.
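A rough NumPy sketch of the fused update under the usual momentum-SGD-with-weight-decay definitions; the argument names and the exact in-place behavior of the ops are assumptions here.
```
import numpy as np

def momentum_sgd_weight_decay(param, grad, moment, lr, momentum,
                              weight_decay, nesterov=False):
    # Fold weight decay into the gradient, then apply (Nesterov) momentum SGD.
    adjusted = grad + weight_decay * param
    new_moment = momentum * moment + lr * adjusted
    step = lr * adjusted + momentum * new_moment if nesterov else new_moment
    return param - step, new_moment
```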
Reviewed By: asaadaldien
Differential Revision: D5837837
fbshipit-source-id: 5ad487b9c59434491d3a4fcfdeed820db6083f57
Summary: Adding "dtype" parameter for the GivenTensorOp. Also, providing backwards compatibility for the existing code, byt supporting the templating if "dtype" is not provided.
Reviewed By: bddppq
Differential Revision: D6090049
fbshipit-source-id: f5deaa57b49f2280289975f4583aba5bc064a2bc
Summary: CUDA version of weighted sampling operator; minor changes for CPU version
Reviewed By: asaadaldien
Differential Revision: D6106668
fbshipit-source-id: 42d7607bd845a4a39cf5b89d7476904cb5928431
Summary: Before we fix it properly with 'type' argument.
Reviewed By: bddppq
Differential Revision: D6103973
fbshipit-source-id: 8c00a93c373dd0ad0bbfe59944495f6574223ab6
Summary:
Currently, the type inference infers FLOAT as the type for all GivenTensor*Fill operators. However, the inferred type should match the actual operator.
Also, for the `Slice` operator, there is a corner case where type inference fails.
Reviewed By: azzolini
Differential Revision: D6096813
fbshipit-source-id: d65b7c0f42436138cbc49d8a5a62374fa5e927e1
Summary: Allow the application of sequence-length masking to be replicated along one or more minor axes. See task for details.
Reviewed By: jamesr66a
Differential Revision: D6090835
fbshipit-source-id: 9064232aa9b93246c582b6e0bae73be5dbe09e98
Summary:
Op for computing SigmoidCrossEntropyWithLogits with per-label, per-sample weights. Can be used for addressing class or label imbalance.
Doc:
Given three matrices: logits, targets, weights, all of the same shape,
(batch_size, num_classes), computes the weighted sigmoid cross entropy between
logits and targets. Specifically, at each position r,c, this computes
weights[r, c] * crossentropy(sigmoid(logits[r, c]), targets[r, c]), and then
averages over each row.
Returns a tensor of shape (batch_size,) of losses for each example.
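A NumPy sketch that follows the doc above: element-wise weighted sigmoid cross entropy, averaged over each row to give one loss per example (a reference in a numerically stable form, not the op's exact implementation):
```
import numpy as np

def weighted_sigmoid_xent_with_logits(logits, targets, weights):
    # Stable form: max(x, 0) - x*t + log(1 + exp(-|x|))
    per_elem = (np.maximum(logits, 0) - logits * targets
                + np.log1p(np.exp(-np.abs(logits))))
    return np.mean(weights * per_elem, axis=1)  # shape (batch_size,)

logits = np.random.randn(4, 3)
targets = (np.random.rand(4, 3) > 0.5).astype(np.float64)
print(weighted_sigmoid_xent_with_logits(logits, targets, np.ones_like(logits)))
```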
Reviewed By: stephenyan1231
Differential Revision: D5997723
fbshipit-source-id: f3172325f1c98b6f26e1700131ef897b743a72fc
Summary: Turns out cuDNN's tensor transform only supports floats. The previous implementation pretended it would work with ints by casting to floats and indeed passed tests for some reason. But rgirdhar found a case where it returned nonsensical results. So rewire int transposes to use the non-cuDNN version. Had to refactor a bit for that. Also added a test for the case.
Reviewed By: asaadaldien
Differential Revision: D6043284
fbshipit-source-id: cc3b14f9fbbdeff421b01da453a1d3c7c5ffd4ac
Summary:
Input dimensions up to "axis" will be flattened into the outer dimension of the output, and the remaining input dims will form the inner dimension.
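A one-line NumPy equivalent of that flattening rule (axis handling as I read it from the sentence above):
```
import numpy as np

def flatten_to_2d(x, axis):
    # dims [0, axis) collapse into the outer dim, the rest into the inner dim
    outer = int(np.prod(x.shape[:axis]))
    return x.reshape(outer, -1)

print(flatten_to_2d(np.zeros((2, 3, 4, 5)), axis=2).shape)  # (6, 20)
```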
Closes https://github.com/caffe2/caffe2/pull/1330
Reviewed By: dzhulgakov
Differential Revision: D6039560
Pulled By: bddppq
fbshipit-source-id: e92c30b49a9288feeefc4a639522406e97e149e1
Summary:
Optionally return a blob of shape [batch size, max length] that is
false only in locations where the output tensor was padded.
One can separately convert lengths to segment ids and cast, but
this is more convenient, and possibly more efficient.
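A small NumPy sketch of the optional presence mask described above (True where the packed output holds real data, False where it was padded out to the max length):
```
import numpy as np

def presence_mask(lengths):
    max_len = int(max(lengths))
    return np.arange(max_len)[None, :] < np.asarray(lengths)[:, None]

print(presence_mask([2, 3, 1]))
# [[ True  True False]
#  [ True  True  True]
#  [ True False False]]
```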
Differential Revision: D6006073
fbshipit-source-id: af6c4ea31972566e7d059dcd3fdd8afba97a88e9
Summary: Before this diff, RNNOp was using TextFormat for representing steps. This diff changes RNNOp to prefer a NetDef argument instead. To stay backward compatible it still supports TextFormat for existing models, though we can compile RNNs without TextFormat as well.
Reviewed By: salexspb
Differential Revision: D5949330
fbshipit-source-id: 9336a8f5ccf30ad8d8e3a7067b9437e1704b1c9f
Summary:
Input is a matrix tensor. Its first dimension is the batch
size. For each column, bucketize it based on the boundary values and then do
one hot encoding. The `lengths` specifies the number of boundary values for each
column. The final number of buckets is this number plus 1. This would also be
the expanded feature size. `boundaries` specifies all the boundary values.
Note that each bucket is right-inclusive. That is, given boundary values
[b1, b2, b3], the buckets are defined as (-inf, b1], (b1, b2], (b2, b3], (b3, inf).
For example
If data = [[2, 3], [4, 1], [2, 5]], lengths = [2, 3],
and boundaries = [0.1, 2.5, 1, 3.1, 4.5], then
output = [[0, 1, 0, 0, 1, 0, 0], [0, 0, 1, 1, 0, 0, 0], [0, 1, 0, 0, 0, 0, 1]]
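A NumPy sketch reproducing the example above; `side='left'` in searchsorted gives the right-inclusive (b_i, b_{i+1}] buckets described in the doc:
```
import numpy as np

def batch_bucket_one_hot(data, lengths, boundaries):
    # Split the flat boundary list into one sorted array per column.
    splits = np.split(np.asarray(boundaries, dtype=float),
                      np.cumsum(lengths)[:-1])
    rows = []
    for row in np.asarray(data):
        one_hots = []
        for value, bnds in zip(row, splits):
            bucket = np.searchsorted(bnds, value, side='left')
            oh = np.zeros(len(bnds) + 1, dtype=np.int64)
            oh[bucket] = 1
            one_hots.append(oh)
        rows.append(np.concatenate(one_hots))
    return np.stack(rows)

print(batch_bucket_one_hot([[2, 3], [4, 1], [2, 5]], [2, 3],
                           [0.1, 2.5, 1, 3.1, 4.5]))
# matches the expected output above
```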
Reviewed By: xianjiec
Differential Revision: D5976030
fbshipit-source-id: fd746c20b19bcdf5f769451d804c219ad6463f28
Summary: adding an operator with behavior similar to fused GatherRanges and Split.
Reviewed By: kennyhorror
Differential Revision: D5961761
fbshipit-source-id: 616d4668b8901256418004def90d91a0b2041620
Summary:
Added support for batching to SequenceMaskOp.
Let b be the batch dim and k be the axis dim. (We enforce that b < k.) Write the dimensions of the input tensor as [a_1, ..., a_b, ..., a_k, ...]. We first collapse the tensor down to 3D, with dimensions [P, Q, D], where P = a_1 * ... * a_b, Q = a_{b+1} * ... * a_{k-1}, and D = a_k * a_{k+1} * ... * a_n. Then we mask each slice [i, :, :] of this 3D tensor (note that each slice is a Q x D tensor, i.e. 2-dimensional).
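A short NumPy illustration of that collapse (0-based stand-ins for b and k; the real op then applies its sequence/window mask to each [i, :, :] slice):
```
import numpy as np

def collapse_to_3d(x, b, k):
    # P = product of dims up to and including b, D = product of dims from k on.
    P = int(np.prod(x.shape[:b + 1]))
    D = int(np.prod(x.shape[k:]))
    Q = x.size // (P * D)
    return x.reshape(P, Q, D)

x = np.random.randn(2, 3, 4, 5)
print(collapse_to_3d(x, b=0, k=2).shape)  # (2, 3, 20): each slice is Q x D
```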
Reviewed By: jamesr66a
Differential Revision: D5733382
fbshipit-source-id: e7a314d9fe6e6691a75112edbee8ba6e8ea8e396
Summary:
This diff implements deformable convolution operator. The idea behind it is that instead of using a fixed NxM kernel, we associate a set of learnable offsets (dx, dy) with each element of the kernel, and use bilinear interpolation to estimate weights in between the integer indices. For background see paper https://arxiv.org/abs/1703.06211 and mxnet implementation https://github.com/msracver/Deformable-ConvNets/tree/master/rfcn/operator_cxx
To simplify code review of the new files the feature is stacked into 2 diffs. The first diff duplicates the core convolution operator into a separate set of files prefixed with deform_. It also provides documentation on the operator but nothing else. The second diff contains the actual changes that make deformable convolution possible. Therefore, I recommend focusing your code review on the changes between diffs 1 and 2.
Current limitations of the operator:
1. Only CUDA is supported. CPU version is not implemented.
2. Only NCHW layout is supported.
3. Only 2d convolution is supported.
CUDA code is ported from mxnet implementation with minimal changes.
See also inline comments in code for tricky parts.
Reviewed By: akyrola
Differential Revision: D5702983
fbshipit-source-id: 4d1bf2c6c73135e6a70dbe87037b38915f4453f9
Summary: Implementation of ReduceFront/Back/Max/Gradient for CPU and CUDA.
Reviewed By: asaadaldien
Differential Revision: D5905402
fbshipit-source-id: 6967ce41aa95ee5ea7a90065430892e81a6da477
Summary: Implemented version of SparseAdagrad that only keeps track of an average sum of squared gradients term for each row of the parameter tensor, rather than a sum of squared gradients term for each individual parameter.
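A hedged NumPy sketch of the row-wise variant (one averaged squared-gradient accumulator per parameter row; the argument names and epsilon are assumptions):
```
import numpy as np

def rowwise_sparse_adagrad(param, moment, indices, grad, lr, eps=1e-6):
    for i, row in enumerate(indices):
        moment[row] += np.mean(grad[i] ** 2)   # one scalar per row
        param[row] -= lr * grad[i] / (np.sqrt(moment[row]) + eps)
    return param, moment

param = np.zeros((4, 3))
moment = np.zeros(4)
rowwise_sparse_adagrad(param, moment, [1, 3], np.ones((2, 3)), lr=0.1)
```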
Differential Revision: D5881918
fbshipit-source-id: bd96ccf25554b457baaaca9309fc8048adbb37f7
Summary: Equivalent to numpy.sign for CPU and CUDA.
Reviewed By: dzhulgakov
Differential Revision: D5906446
fbshipit-source-id: 389f994bccbb87a62df2c4aaacc327f9a6223cbd
Summary: Can be used to gather outputs of a sharded "Gather", or for the SparseLengthsSumGradient when we need the gradient on values.
Reviewed By: akyrola
Differential Revision: D5800901
fbshipit-source-id: 90835755d6d15be13fb0f538cfade980cf4a1cd2
Summary:
Adding uint8 support to the code generator for high-performance embedding look-up kernels, supporting
Sum, WeightedSum, and Mean reducers. Added a number of unit tests to test these operators.
Performance Results
===================
Performance results are below for the old code, sparse_lengths_sum_benchmark.old.par, which uses the
code in lengths_reducer_rowwise_8bit_ops.h, and our new code, optimized via the code generator,
sparse_lengths_sum_benchmark.new.par. Block size was 128 in all cases.
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.old.par --iteration 10000 --dtype uint8
I0912 02:49:58.773259 2640913 net_simple.cc:162] Time per operator type:
I0912 02:49:58.773264 2640913 net_simple.cc:171] 0.75769 SparseLengthsSum8BitsRowwise
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype uint8
I0912 02:50:33.981832 2642102 net_simple.cc:162] Time per operator type:
I0912 02:50:33.981837 2642102 net_simple.cc:171] 0.233322 SparseLengthsSum8BitsRowwise
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype float16
I0912 02:51:26.748972 2643925 net_simple.cc:162] Time per operator type:
I0912 02:51:26.748977 2643925 net_simple.cc:171] 0.106591 SparseLengthsSum
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype float
I0913 01:39:22.372238 1076874 net_simple.cc:162] Time per operator type:
I0913 01:39:22.372244 1076874 net_simple.cc:171] 0.211041 SparseLengthsSum
Analysis
========
Our optimized generated code is ~3.5x faster than the original code in lengths_reducer_rowwise_8bit_ops.h,
as shown above.
However, our uint8 is about 2x slower than float16 and is on par with float32. There are several reasons for that:
1. uint8 introduces extra instructions to multiply by the bias and add the scaling factors.
2. In addition to the embedding blocks, we are now also reading scale_bias.
For every pair of scale and bias, we bring in an entire cache line of
64 bytes while only using 8 bytes. A 128-wide uint8 input block only occupies 2 cache lines, and hence
reading nearly an entire extra cache line of useless data adds to the bandwidth wastage.
3. In addition, the hardware prefetcher runs past the end of the input block and the scale_bias
cache line, trying to prefetch more useless lines. This effect was characterised in the Appendix section of
https://fb.facebook.com/notes/jason-lu/sparse-adagrad-performance-optimization-in-model-training/10214810437360961/
To get deeper insights into what is going on,
we isolated the SparseLengthsSum and SparseLengthsSum8BitsRowwise code, for float32, float16 and uint8,
into a microbenchmark, where we varied the block size while keeping the table size constant (256MB):
block_size time(uint8) time(float16) time(float32)
64 0.19 0.09 0.17
128 0.12 0.09 0.17
256 0.70 0.09 0.14
1024 0.50 0.06 0.10
The pattern for block size of 64 and 128 is similar to what we observed in sparse_lengths_sum_benchmark.
However, we see that as block_size increases (for a fixed table size),
time to perform embeddings decreases quite drastically. For block_size of 256 and beyond, uint8 starts achieving
speedup over float16. A longer block better amortizes the bandwidth wastage due to scale_bias and the hardware prefetcher
running past the end of the block.
Reviewed By: kennyhorror
Differential Revision: D5870907
fbshipit-source-id: 445321b96f1b5801ef91f296f6063c35673ee11b
Summary:
Two implementation of max pool reducers had different semantics in case of equal indices. It matters less in real cases, but breaks tests. Choosing the behavior of LengthMax over SortedSegmentRangeMax as the former is more widely used.
Also some minor tweaks for the test code.
Reviewed By: Yangqing
Differential Revision: D5870386
fbshipit-source-id: 6488cbd5cacaf595ffc07c44084730dd44b3f9dd
Summary:
Adding uint8 support to the code generator for high-performance embedding look-up kernels, supporting
Sum, WeightedSum, and Mean reducers. Added a number of unit tests to test these operators.
Performance Results
===================
Performance results are below for the old code, sparse_lengths_sum_benchmark.old.par, which uses the
code in lengths_reducer_rowwise_8bit_ops.h, and our new code, optimized via the code generator,
sparse_lengths_sum_benchmark.new.par. Block size was 128 in all cases.
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.old.par --iteration 10000 --dtype uint8
I0912 02:49:58.773259 2640913 net_simple.cc:162] Time per operator type:
I0912 02:49:58.773264 2640913 net_simple.cc:171] 0.75769 SparseLengthsSum8BitsRowwise
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype uint8
I0912 02:50:33.981832 2642102 net_simple.cc:162] Time per operator type:
I0912 02:50:33.981837 2642102 net_simple.cc:171] 0.233322 SparseLengthsSum8BitsRowwise
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype float16
I0912 02:51:26.748972 2643925 net_simple.cc:162] Time per operator type:
I0912 02:51:26.748977 2643925 net_simple.cc:171] 0.106591 SparseLengthsSum
[root@fblearner001.01.ftw1 /home/msmelyan]# ./sparse_lengths_sum_benchmark.new.par --iteration 10000 --dtype float
I0913 01:39:22.372238 1076874 net_simple.cc:162] Time per operator type:
I0913 01:39:22.372244 1076874 net_simple.cc:171] 0.211041 SparseLengthsSum
Analysis
========
Our optimized generated code is ~3.5x faster than the original code in lengths_reducer_rowwise_8bit_ops.h,
as shown above.
However, our uint8 is about 2x slower than float16 and is on par with float32. There are several reasons for that:
1. uint8 introduces extra instructions to multiply by the bias and add the scaling factors.
2. In addition to the embedding blocks, we are now also reading scale_bias.
For every pair of scale and bias, we bring in an entire cache line of
64 bytes while only using 8 bytes. A 128-wide uint8 input block only occupies 2 cache lines, and hence
reading nearly an entire extra cache line of useless data adds to the bandwidth wastage.
3. In addition, the hardware prefetcher runs past the end of the input block and the scale_bias
cache line, trying to prefetch more useless lines. This effect was characterised in the Appendix section of
https://fb.facebook.com/notes/jason-lu/sparse-adagrad-performance-optimization-in-model-training/10214810437360961/
To get deeper insights into what is going on,
we isolated the SparseLengthsSum and SparseLengthsSum8BitsRowwise code, for float32, float16 and uint8,
into a microbenchmark, where we varied the block size while keeping the table size constant (256MB):
block_size time(uint8) time(float16) time(float32)
64 0.19 0.09 0.17
128 0.12 0.09 0.17
256 0.70 0.09 0.14
1024 0.50 0.06 0.10
The pattern for block size of 64 and 128 is similar to what we observed in sparse_lengths_sum_benchmark.
However, we see that as block_size increases (for a fixed table size),
time to perform embeddings decreases quite drastically. For block_size of 256 and beyond, uint8 starts achieving
speedup over float16. A longer block better amortizes the bandwidth wastage due to scale_bias and the hardware prefetcher
running past the end of the block.
Reviewed By: dzhulgakov
Differential Revision: D5824641
fbshipit-source-id: 3a5c020294d84874da78c6943e596423393473d6
Summary: Introduced weights for labels in the multi-label setting. An extra weight blob is introduced and read in the operator in case the label setting is weighted sparse.
Reviewed By: kevinwilfong
Differential Revision: D5812467
fbshipit-source-id: efb209092e1e9effc915b0a753fa0c67b47a4fb6
Summary: PR 1175 caused a build error because gemmBatched was only under a specific #ifdef. Now it is moved outside the #ifdef, and things work.
Reviewed By: asaadaldien
Differential Revision: D5834868
fbshipit-source-id: 072a64c8f4b259ff7504104121766115b46b8aa0
Summary:
Also add the ability to mark an argument as required.
Added a string constant `OpSchema::Arg_IsTest` for `is_test` arg.
If users define the `is_test` argument with `ArgIsTest(...)`, then it automatically becomes a required argument; in the meanwhile, users can still use `Arg("is_test", ...)` to define an optional `is_test` argument.
Reviewed By: akyrola
Differential Revision: D5812391
fbshipit-source-id: eaaba50d027813a8012389edc6c459de23c3c728
Summary: For data parallelism we need the batch size to be a multiple of the number of replicas. In order to do so, with this diff we do Dataset(rec).trim(multiple_of=num_replicas).
Reviewed By: dzhulgakov, harouwu
Differential Revision: D5753861
fbshipit-source-id: c5d728b925707dbd3d1f500a93e67e185c223569
Summary:
Computes a fixed grid or RMAC region coordinates for a given 4D feature tensor
(NCHW) as described in https://arxiv.org/abs/1511.05879. The output is the
`roi` format expected by RoIPoolOp. To compute the actual RMAC itself, the
output of this op should be passed to RoIPoolOp.
Reviewed By: wickedfoo
Differential Revision: D5594994
fbshipit-source-id: 5edac98a18137b53555f9a16354419b424679c99
Summary: The shape inference of distance_op has issues (it only works when inputs are 1D tensors). This diff fixes the shape inference and the unit test.
Reviewed By: kittipatv
Differential Revision: D5788744
fbshipit-source-id: cb1b7facf7b9ccd64b54edca156325eceef50f33
Summary: Filling in the gap in tensor inference
Reviewed By: sunnieshang, akyrola
Differential Revision: D5779550
fbshipit-source-id: 9ec68c9dad566183d7d0fc2819829c2b91430dda
Summary: As title. I wonder why this had not been encountered before. It only affects cases where the states are copied over, though.
Reviewed By: Yangqing
Differential Revision: D5777314
fbshipit-source-id: 8aef435c832e4ead5bb3d3e35bb065c734a2af5f
Summary:
Special executor for RNNs which can exploit parallelism over timesteps. For CPU we use multi-threading, achieving roughly a 3x improvement on 4-layer LSTMs.
With CUDA, perf improvements are more modest, but the structure allows for optimizing it further. For CUDA, we use multiple streams and events if there is parallelism
over timesteps. In my experiments, it was not good to use more than 2 streams, though.
Flag --caffe2_rnn_executor can be used to switch the executor off.
Reviewed By: salexspb
Differential Revision: D5749304
fbshipit-source-id: d6f76b3e16598be5b4e8188aff031671ebafaa4c
Summary:
As described in task T21337239, NormalizeOp currently normalizes over only the last dimension.
In this commit, the following changes have been made:
(1) Added an axis-parameter to NormalizeOp in both the CPU and CUDA context.
(2) Added the same axis parameter to NormalizeGradient in both the CPU and CUDA context
(3) Removed the restriction that the input to the original NormalizeOp operator must be 2-dimensional.
Reviewed By: akyrola
Differential Revision: D5745162
fbshipit-source-id: 69e04f59ac4d954b0062c3b2a53c8ca465a1027b
Summary:
**Description**
Provide DeepText model with the functionality to load a secondary index (pre-trained char-ngram embedding, e.g. FastText) during training/test. Embeddings of out-of-vocabulary words will be computed on-the-fly during training/test by averaging the char-ngram embeddings.
**Approach**
This diff provides two custom operators to accomplish this task – ConditionalOp and IndexCharNgramGetOp. We first use IndexCharNgramGetOp to perform char-ngram index lookup and return a sparse tensor segmented by lengths for each token. The sparse tensor is then used to compute the average embedding provided by the char-ngram index. Finally, we use a ConditionalOp to replace those whose embeddings were not found in the original index during the feature apply stage. Please refer to documentations of the code for more details.
Reviewed By: jamesr66a
Differential Revision: D5666924
fbshipit-source-id: f76605d093154a014d5b9ebf9510de9d79874eee
Summary:
Implementation of a new variant of attention module, which contains a recurrent decoder state with vectors corresponding to each source-side word and strictly increasing values, thus enabling it to model the degree to which source words have been translated.
The approach is a variant of the approaches described in https://arxiv.org/pdf/1601.04811.pdf. We simply include the sum of all previous attention weights for encoder words as a new recurrent state (coverage_t). A new linear transform on encoder_outputs is used to produce coverage_weights, which has the same dimensionality as encoder_outputs, and implicitly models the fertility of source-side words (and putting this extra information strain on the encoder network).
Thus the encoder output, the decoder state, and the coverage weights have the same dimensionality for a given source word, and attention logits are calculated as v * tanh(coverage * coverage_weights + encoder_output + decoder_state).
Note: the entire coverage state for each translation instance is of shape (encoder_length, coverage_units), but the states for the RecurrentNetwork operator, used to train the decoder, must be flat in the data dimension. This state is therefore initialized with shape (encoder_length * coverage_units) [not shown in the open-source library] and reshaped appropriately within the apply_soft_coverage_attention() function.
Differential Revision: D5593617
fbshipit-source-id: 7d0522b5eb0b26f22e8429e4461a459f2f16ed46
Summary: Adding support to use kernels, strides, pads etc. as arguments.
Reviewed By: houseroad
Differential Revision: D5710699
fbshipit-source-id: 8b63af4c4a76cd06b637a376aeb29a34c659be2e
Summary:
_LSTM helper is a legacy piece we had before all the RNNCell awesomeness landed. Now we need to pull it apart and create separate building blocks that people can use for any RNNs.
Please note changes to a test with double scoping. That should go away once we change the RNNCell scoping logic in such a way that each cell adds its own name to the scope for all of its outputs (see another diff: D5613139).
Reviewed By: jhcross
Differential Revision: D5632276
fbshipit-source-id: 1cb568ab995c4c0b3dd1b4bad2d028e34bded9c1
Summary: These were missing and required for some seq2seq models. Unit tested. The previous implementation of ReduceBackMean shape inference was incorrect, so removed it.
Reviewed By: asaadaldien
Differential Revision: D5691262
fbshipit-source-id: 76f868b298440f988635966a410f0232301ca6c4
Summary:
Split the first dimension of a tensor into 2, the first of which is fixed and given in the argument.
This is then used to split a batch into smaller batches and distribute them across workers.
Reviewed By: harouwu
Differential Revision: D5702175
fbshipit-source-id: 02bb93e49bf9db411b516e149c8e647301dd2ca5
Summary:
This adds a fast path for global max pooling with NCHW. Compared to equivalent ReduceBackMean, this is about 3.5x faster.
Based on D5533059.
Reviewed By: akyrola
Differential Revision: D5681122
fbshipit-source-id: 7a4df934044c7dd01888f095f7dd46654aaf4eae
Summary:
Optimizations for SinusoidPositionEncodingOp to make sinusoid position embeddings
more competitive against table-based embeddings.
- Removed most calls to std::pow
- Replaced division with multiplication with reciprocal
- Reused computation across examples within a batch
Current speedup with batch size of 16, sequence length of 128 and embedding
size of 512 is about 270x (17k embeddings per second -> 4.7M embeddings per
second). The speedup is very dependent on the batch size; at a batch size of 4
this only gets 1.7M embeddings per second.
Profile: https://pxl.cl/8zf0
Annotated DoRunWithType: P57925031
Reviewed By: jamesr66a
Differential Revision: D5634766
fbshipit-source-id: 0f35bb176164ea547c91de242a0205c5d7adf7cf
Summary:
Add more data augmentation to ImageInputOp
1) Inception-style random sized cropping
2) color jittering
3) color lighting
Reviewed By: panshen1
Differential Revision: D5637726
fbshipit-source-id: 45d9cc69eec9f4d48c1607d80ccd89e325961b1a
Summary:
Adding a range operator in the spirit of np.arange. It is an important building block for a lot of manipulation functions.
This accepts parameters with the same meaning in the same order as python's range or np.arange (e.g. `(stop)`, `(start, stop)` or `(start, stop, step)`)
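An assumed usage sketch (the exact op schema is not shown in this summary): feeding (start, stop, step) as input blobs and running the op through the Caffe2 Python workspace, mirroring np.arange.
```
import numpy as np
from caffe2.python import core, workspace

for name, value in [("start", 2.0), ("stop", 10.0), ("step", 2.0)]:
    workspace.FeedBlob(name, np.array(value, dtype=np.float32))
op = core.CreateOperator("Range", ["start", "stop", "step"], ["out"])
workspace.RunOperatorOnce(op)
print(workspace.FetchBlob("out"))  # expected to match np.arange(2, 10, 2)
```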
Differential Revision: D5616861
fbshipit-source-id: 02622b8bd85ebca125cc881c06fae5b54b7c602a
Summary: The new test ensures the 'add_axis' and 'split' arguments work as intended for tensors of various dimensions. Hypothesis should check various edge cases like zeros in 'split_info' and 1D input with axis=0, add_axis=1.
Reviewed By: hoangmit
Differential Revision: D5645778
fbshipit-source-id: 061f9511a082da54e5c1bbe53a0e7096af4b8d1b
Summary: Implement a brew wrapper for the LayerNorm op. This adds the scalar weight and bias terms to the op.
Reviewed By: jmp84
Differential Revision: D5595836
fbshipit-source-id: 467b2e1158b0c454a149d4b26c47719826e98752
Summary:
Forward-only mode had broken at some point. Two things: RNNCell did not pass the parameter to recurrent.py, and recurrent.py was also broken if forward_only=True after the python3 codemod.
Added a test to rnn_cell_test that actually checks the forward-only parameter is passed, to prevent future breakage.
Reviewed By: jmp84
Differential Revision: D5639306
fbshipit-source-id: b1bbc39d59c3f3734b2f40a1c2f3740c733e0bd4
Summary:
As an alternative to sharing embeddings, we want to explore merging the ID_LISTs in the net.
This commit adds an operator to merge many ID_LIST features into a single one.
Differential Revision: D5481523
fbshipit-source-id: 446121122a32de5682d5d75a165370bc8d776d03
Summary: This can be used for local attention to mask elements outside of a window
Reviewed By: jamesr66a
Differential Revision: D5643677
fbshipit-source-id: 92b33866258ccc7307d5bcf08234610aa3fb152d
Summary:
This diff adds dependency-aware concurrent/parallel execution of operators in stepnets. For CPU, we use multi-threaded execution. For CUDA, we use multiple streams and cuda events for parallelism and dependency tracking.
Much of the diff is about computing the dependency graph, which was quite tricky because we also need to avoid write races between operators running in multiple timesteps in parallel. Also, recurrent blobs "change name" when passing over a timestep ("_prev"), so that needs to be handled as well.
This diff also restores the link ops that I unlanded earlier.
The performance gain of this diff is very good for CPU (same perf as with static_dag, even better on forward-only). On CUDA, the gains are modest, at least with the sizes I was testing with.
Reviewed By: salexspb
Differential Revision: D5001637
fbshipit-source-id: 3d0a71593d73a9ff22f4c1a5c9abf2a4a0c633c8
Summary:
Implement forward pass for a SequenceMaskOp to replace https://github.com/caffe2/caffe2/blob/master/caffe2/python/attention.py#L54-L72.
This implements two modes: a sequence-length based mode and a matrix triangle mode.
Reviewed By: akyrola
Differential Revision: D5615493
fbshipit-source-id: a2ce4a8e655d9b720049010a7856be052c5567eb
Summary: In order to control the absolute scale/magnitude of the output of this op, added a tuning parameter: amplitude
Reviewed By: jamesr66a
Differential Revision: D5596574
fbshipit-source-id: 3b7e316de55cce6fd686da70aa5658ec3e99b070
Summary: GRU differs from LSTM in that it only has hidden states but no cell states. So in this case, reusing the code of _LSTM is problematic, as we need to delete the part that creates the cell state and change many other places that use a hard-coded 4 (hidden_all, hidden, cell_all, cell) into 2 (hidden_all, hidden). Otherwise GRU will break during the backward pass, when the optimizer tries to apply gradients to each of the parameters, because the cell state is never used and therefore the corresponding parameters (i.e., cell_state_w, cell_state_b) have no gradients.
Differential Revision: D5589309
fbshipit-source-id: f5af67dfe0842acd68223f6da3e96a81639e8049
Summary: This diff implements CUDA version of OneHot operator.
Reviewed By: bddppq
Differential Revision: D5578543
fbshipit-source-id: 55b70e8ec6ee34b647b9140fecbba31b6968f403
Summary: Add CUDA version of GRU operator
Reviewed By: jamesr66a
Differential Revision: D5571043
fbshipit-source-id: 332aa64fc8a9116cc33382f2b2907080e58c13b3
Summary:
It was reverted previously because of a missing schema for the gradient op. Added it back and resent.
Differences between this diff and the previously reverted diff:
1. added a schema for the gradient operator
2. changed line 95 in kmax_pooling_op.h from CAFFE_ENFORCE to CAFFE_ENFORCE_GE
Reviewed By: xianjiec
Differential Revision: D5568867
fbshipit-source-id: 39813b389a5da803967a561249793afdfce00c58
Summary:
The L1Distance operator used to return a single value denoting the L1 of the entire input, instead of a vector for each input value.
This fixes that.
Reviewed By: Yangqing
Differential Revision: D5570385
fbshipit-source-id: fbab0e0c9262ccbdb3af27262b8baacdeb2d0fc9
Summary:
To train an image model, we can also use a label embedding vector as supervision, as opposed to using SoftmaxLoss/SigmoidCrossEntropyLoss.
In such cases, the label is a dense vector. This diff enables such use cases.
Reviewed By: panshen1
Differential Revision: D5556203
fbshipit-source-id: 52c61495e02fab457dc2d43e3345d7dbd5580ab7
Summary:
Implement dot attention as described in https://arxiv.org/abs/1508.04025
This saves the computation of weighted encoder outputs in `rnn_cell.py`
When the encoder and decoder dimensions are different, we apply an FC, which corresponds to the general case below Figure 2.
Refactored unit tests.
Reviewed By: jhcross
Differential Revision: D5486976
fbshipit-source-id: f9e9aea675b3b072fbe631bc004199b90a9d95cb
Summary:
Caffe2: add a DB that's wrapped around a BlobsQueue as an adapter for data from a non-DB interface.
This is useful for bridging the gap between DB interface data processing ops (TensorProtosDBInput, ImageInputOp etc.) and data that's coming from arbitrary Python or the pretty intricate Hive reader.
Reviewed By: akyrola
Differential Revision: D5554560
fbshipit-source-id: 01bb0056410f9ade205367d5fefc721f91f5b629
Summary:
This diff makes SparseLengthsSum(Gradient) async. It goes through this logic:
1. Adding INDICES to the gradient op's inputs so that we can make it async without device-host copies.
2. Registering the new 3-input op as the gradient for the CPU/GPU versions of SLS.
3. In order to not break old nets (they are mostly on CPU), I still register the old 2-input op, so the op schema will not complain when it encounters old nets that have the SLSGradient op in them.
wickedfoo Sorry, this diff might bring you extra work migrating your optimization effort to this new async gradient op. But we think it is worth it. :(
Reviewed By: dzhulgakov
Differential Revision: D5423188
fbshipit-source-id: 62494a6c52a507c4a4688d5a9e1a2bc720d5370d
Summary: Added caffe2 operator to calculate the sinusoidal position encoding for word embeddings, as described on page 6 in https://arxiv.org/abs/1706.03762.
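A NumPy sketch of the encoding from the referenced paper, PE[pos, 2i] = sin(pos / 10000^(2i/d)) and PE[pos, 2i+1] = cos(pos / 10000^(2i/d)); the op's argument names and output layout are not asserted here, and an even embedding size is assumed:
```
import numpy as np

def sinusoid_position_encoding(seq_len, dim):
    pos = np.arange(seq_len)[:, None]
    i = np.arange(dim // 2)[None, :]
    angle = pos / np.power(10000.0, 2.0 * i / dim)
    enc = np.empty((seq_len, dim))
    enc[:, 0::2] = np.sin(angle)
    enc[:, 1::2] = np.cos(angle)
    return enc

print(sinusoid_position_encoding(4, 8).shape)  # (4, 8)
```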
Reviewed By: jamesr66a
Differential Revision: D5533024
fbshipit-source-id: 1afb35cd7f9d8c71f2635b853e56b2c840f0bc1f
Summary: Implement the LpNorm operator, which calculates the Lp norm of a tensor for regularization (p=1 or 2). Currently, there is only the L1Distance operator, which calculates the L1 distance of two same-shape tensors. We want an op that takes only one input and outputs the L1 loss; we would do the same for the L2 loss. We also plan to implement an l_{p,q} loss, but have not decided which p and q to take.
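A tiny NumPy reference for the semantics described above (whether the op averages or sums is not asserted here; this sketch sums):
```
import numpy as np

def lp_norm(x, p):
    x = np.asarray(x, dtype=np.float64)
    return np.sum(np.abs(x)) if p == 1 else np.sum(x ** 2)

print(lp_norm([[1.0, -2.0], [3.0, 0.5]], p=1))  # 6.5
print(lp_norm([[1.0, -2.0], [3.0, 0.5]], p=2))  # 14.25
```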
Reviewed By: xianjiec
Differential Revision: D5460051
fbshipit-source-id: d67a38fbc94afa52de26d4a53e4d2b7df3c50b6a
Summary:
KaimingHe debugged a slow model and found out that global average pooling was hideously slow, even with cuDNN. It turns out the cuDNN pooling op (especially the backward pass) is not optimized for global pooling.
This adds a fast path for global average pooling with NCHW. This is about 30x faster than cuDNN with 56 x 56 pooling; compared to the equivalent ReduceBackSum, this is about 3x faster.
I will bootcamp the max pooling.
Reviewed By: asaadaldien
Differential Revision: D5533059
fbshipit-source-id: 2d590693d737fa92184603663031d96f6145f304
Summary: This allows users to add an arbitrary number of additional outputs to ImageInputOp. These are populated by reading additional TensorProto values from the TensorProtos from the DBReader, and converting them into Tensors. Similar to labels, only ints and floats are supported, and multiple values are supported.
Reviewed By: panshen1
Differential Revision: D5502019
fbshipit-source-id: 5a8b61b3a8549272a112e8e02cd613d8f9a271ba
Summary: Add tensor inference function for squeeze, refactor a bit
Reviewed By: asaadaldien
Differential Revision: D5518880
fbshipit-source-id: 5b8cb9154f5f777d4be3612a96d7ed76a9068c0c
Summary: The diff adds support for the rank_loss operator to compute loss for multiple sessions (batch).
Reviewed By: kittipatv
Differential Revision: D5515465
fbshipit-source-id: 55a01cd5ad21eaeae82875ad136c392fed0dbb26
Summary:
Optimised SparseLengthsSum (fp32) for now
1) Specialized reducer
2) created a fast routine with prefetches, loop unrolling, block specialization and register tiling
3) added more variety of block sizes to segment_ops_test.py
Reviewed By: Yangqing
Differential Revision: D5392472
fbshipit-source-id: 8ed9baf1b12ec05bd391cabb390024e6bc60a6f6
Summary: to support an operation needed by D5507205
Reviewed By: xianjiec
Differential Revision: D5512522
fbshipit-source-id: a9b3a668c28eff71d1e106dbbb572184df4a7638
Summary:
Use smaller step size for GradientChecks and pass seed to help reproducing the
test from logged inputs.
Reviewed By: Yangqing
Differential Revision: D5505698
fbshipit-source-id: fc308efe72d535695ba628944aee1913ba16b2f1
Summary:
Moved distance_op_test from hypothesis_test to distance_op_test and
refactored
Reviewed By: akyrola, asaadaldien
Differential Revision: D5495104
fbshipit-source-id: 4a90c75eabeb380ae9d150d6258e9b5b0fbfc5ca
Summary: When creating parameters for a modelhelper, we should use create_param instead of using param_init_net and model.params directly. The diff rewrites some of these cases in rnn_cell.py in order to make model._parameter_info and model.params consistent.
Reviewed By: kittipatv
Differential Revision: D5477724
fbshipit-source-id: 28c4aaf8f98d9d89125af6a42ad328008f0079e1
Summary:
Need it for some reference comparison for c2isl.
Also there's an argument that it might be faster on GPU with int32. Doesn't seem to be the case now, but haven't tested with Jeff's changes yet.
Reviewed By: kennyhorror
Differential Revision: D5405482
fbshipit-source-id: dc1a983dce5f06f1111c5634ec475647c94848cc
Summary:
In order to get dimensions right, correctly identify gradients, etc., DropoutCell should call the _prepare_output and _prepare_output_sequence methods of its internal cell for its own such methods.
This bug was identified by NVIDIA intern Syed Tousif Ahmed.
Reviewed By: akyrola
Differential Revision: D5483082
fbshipit-source-id: f6df5b4a0502ed0771056638aab219fb5cc7d964
Summary: TSIA - this makes it a bit easier to benchmark sparse lengths sum.
Reviewed By: dzhulgakov
Differential Revision: D5477844
fbshipit-source-id: 89e25c5e0dbf3538877ba1a9abc75a10abfa2757
Summary:
For RNN attention, we should not include the invalid parts of the encoder output (based on encoder_lengths) in the computation. This diff accomplishes that by forcing logits for those positions to be negative infinity.
Note that this step can be bypassed by passing encoder_lengths=None, which is what we do for beam search, thus incurring no extra overhead for inference.
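A NumPy sketch of the masking step: attention logits for positions at or beyond each sequence's encoder length are forced to -inf so they receive zero weight after the softmax.
```
import numpy as np

def mask_attention_logits(logits, encoder_lengths):
    # logits: [batch, max_encoder_length]
    positions = np.arange(logits.shape[1])[None, :]
    invalid = positions >= np.asarray(encoder_lengths)[:, None]
    return np.where(invalid, -np.inf, logits)

print(mask_attention_logits(np.zeros((2, 4)), [2, 4]))
```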
Reviewed By: jamesr66a
Differential Revision: D5402547
fbshipit-source-id: 1863d6050b5129e4df829c6357f0aa9ded0715dc
Summary:
Added operator RecurrentNetworkBlobFetcherOp that takes as input a scratch workspace name and prefix, and copies over all blobs in the scratch workspace into the global workspace. This essentially extracts all intermediate recurrent network computation for each timestep.
Added a wrapper in recurrent.py - retrieve_step_blobs(net, prefix='rnn') - which, when called after an rnn is run, will return a list of all blobs extracted from the net.
Reviewed By: akyrola
Differential Revision: D5421926
fbshipit-source-id: 0f35b466d77d3c719fb0e32de7dbcafc6c0d5225
Summary: Implemented python logic and tests to create an RNNCell for GRU. Uses the preexisting GRU Unit Op code.
Reviewed By: salexspb
Differential Revision: D5364893
fbshipit-source-id: 2451d7ec8c2eacb8d8c9b7c893bfd21b65fb9d18
Summary:
Just an implementation of the forward pass of the GRU Unit Op, not the full RNNCell.
Functions were created to mimic LSTM implementation as closely as possible.
Backwards pass implementations are defined in GRU_unit_op.{h, cc}
assertGradientChecks call added to gru_cell_test.py
Reviewed By: salexspb
Differential Revision: D5364856
fbshipit-source-id: 09cff4478091827763b40cc331e4e0abf0ec258f
Summary:
Just an implementation of the forward pass of the GRU Unit Op, not the full RNNCell.
Functions were created to mimic LSTM implementation as closely as possible.
Implementation defined in GRU_unit_op.{h, cc}
tests put in gru_cell_test.py, which import rnn_cell_test_util.py for sigmoid, tanh, and _prepare_rnn functions.
Reviewed By: jamesr66a
Differential Revision: D5363697
fbshipit-source-id: f9ba9fe0be01ffc868dd22027be8be4975b84998
Summary:
Moved sigmoid, tanh, and _prepare_lstm (renamed) to a util file.
Also renamed _prepare_lstm to _prepare_rnn since it is being used for setting up both LSTM and GRU models.
The reason for this commit is to allow the creation of the GRU op and testing code without copying and pasting the code for sigmoid, tanh, and setting up an RNN unit op model.
Reviewed By: jamesr66a
Differential Revision: D5363675
fbshipit-source-id: 352bd70378031f1d81606c9267e625c6728b18fd
Summary:
numpy.random.rand generates samples from [0, 1) and therefore the leaky relu test cases weren't testing negative inputs. Tests still pass after the change.
Leaky relu can be used in-place, but the gradient took X rather than Y. Technically, the result is no different as it's just used for a sign test in the gradient, but updated it to take Y to reduce confusion.
Differential Revision: D5390126
fbshipit-source-id: d0c428abbb2797eb33902a7d2a2f59d5e85daaa6
Summary: Added a CUDA implementation of the PiecewiseLinearTransformOp.
Differential Revision: D5378537
fbshipit-source-id: 38857f59f5cc52e16e1ecc97983a0b0b82a46c74
Summary:
# Added the gradients of the operation for both CPU and CUDA kernels.
# Unified variable names across all ops.
# Added reference implementation in numpy.
# The gradient check needs a larger stepsize to succeed, is that normal?
Reviewed By: akyrola
Differential Revision: D5313682
fbshipit-source-id: aceb92649e01c5caeba8774e678f9095502d396c
Summary: Added two operators that can be used to transfer data into the input format of RNN and back.
Reviewed By: kittipatv
Differential Revision: D5329886
fbshipit-source-id: 07eac29416427b08c49989d4eeed50a6f18493a1
Summary:
This bug in the test was exposed by https://github.com/caffe2/caffe2/pull/861 (previously, the test was always using the cuDNN engine, regardless of the value of `engine`). This bug is now blocking https://github.com/caffe2/caffe2/pull/817.
```
____________________ TestConvolution.test_convolution_sync _____________________
...
if use_cudnn and requested_engine != 'CUDNN':
raise ValueError(
> 'When use_cudnn=True, the only engine you can specify is '
E ValueError: When use_cudnn=True, the only engine you can specify is "CUDNN"
```
https://travis-ci.org/caffe2/caffe2/jobs/247605579
Closes https://github.com/caffe2/caffe2/pull/881
Differential Revision: D5332619
Pulled By: akyrola
fbshipit-source-id: 63737768a155359ddbbef1da424fcbb94f86bd4e
Summary: This should make it so we no longer have super hacky DAG chains just to generate vectors of indices that could be specified at model creation time
Reviewed By: akyrola
Differential Revision: D5316707
fbshipit-source-id: 97bb3868b69e0c5a7f465c95f2e16ae0485dcc56
Summary: Implement slice gradient for CPU. Will soon port this over to GPU so NMT can use it
Reviewed By: akyrola
Differential Revision: D5309305
fbshipit-source-id: 8fb5f4e665f236ecce9227c5c0c302f5076b01ad
Summary: Adding a test to check computational integrity of networks constructed with AttentionCell using UnrolledCell.
Reviewed By: salexspb
Differential Revision: D5306915
fbshipit-source-id: 02acfd1011f7d3ee5fac21cc2778c4a486190c43
Summary: softmax_ops_test occasionally fails with gradient checks. Stabilize by setting the numpy random seed. Also reduce some dimensions for the large input test to make it run faster.
Reviewed By: harouwu
Differential Revision: D5292106
fbshipit-source-id: a21eec89e18d30ac7c5609dacf5d413e841841a6
Summary:
kmatzen why did you set the stepsize in ff84e7dea6?
The test is flaky before this change. Solid afterwards.
Closes https://github.com/caffe2/caffe2/pull/841
Differential Revision: D5292112
Pulled By: akyrola
fbshipit-source-id: c84715261194ff047606d4ec659b7f89dac3cbb1
Summary: As title. Pretty straightforward. Could actually run each kernel in parallel, but we can optimize later if needed.
Reviewed By: Yangqing
Differential Revision: D5278415
fbshipit-source-id: 29f59afe28f37fc4152ec7eb7cd6c1ab65f2cb8c
Summary: Ran into it while working on a dper benchmark. Apparently it works harmlessly even with empty tensors.
Reviewed By: akyrola
Differential Revision: D5273672
fbshipit-source-id: a968ae03a659d6c1a215f12cc35f7ba68448e833
Summary:
Working towards https://github.com/caffe2/caffe2/pull/817.
`E InvalidArgument: Insufficient bytes of entropy to draw requested array. shape=(20, 12, 22), dtype=float32. Can you reduce the size or dimensions of the array? What about using a smaller dtype? If slow test runs and minimisation are acceptable, you could increase settings().buffer_size from 8192 to at least 43253760.`
https://travis-ci.org/caffe2/caffe2/jobs/243867951
/cc kittipatv
Closes https://github.com/caffe2/caffe2/pull/830
Differential Revision: D5276639
Pulled By: akyrola
fbshipit-source-id: 0c21be25ecd931837dc8b0c2cc17048f531350d1
Summary:
This is a real implementation (not GPUFallbackOp) of the TopKOp for GPU.
There are two algorithm implementations:
-for k <= 512, it maps to a warp-wide min-heap implementation, which requires only a single scan of the input data.
-for k > 512, it maps to a multi-pass radix selection algorithm that I originally wrote in cutorch. I took the recent cutorch code and removed some cutorch-specific things as it made sense.
Also added several utility files that one or the other implementations use, some from the Faiss library and some from the cutorch library.
Reviewed By: jamesr66a
Differential Revision: D5248206
fbshipit-source-id: ae5fa3451473264293516c2838f1f40688781cf3
Summary: The old version used one block with 128 threads. Throughput was too low for the NMT use case (calculating squared gradient norms for every parameter), so this increases the throughput. Shaves 7% off CNN model training time per step
Reviewed By: wickedfoo
Differential Revision: D5263748
fbshipit-source-id: adc3bacd11e49ea00c60381d613d993050e899be
Summary: This makes it easier to gather top-K by group of rows. This is useful in the situation where we want to pick the top-K from a batch of fixed-length sessions. Let `N` be the number of sessions, and `M` be the number of examples in a session. We would have a batch of `N * M` rows. We can reshape the score blob to `N x M` and use it as input to `TopK` to select the top score for each session. However, without the new output, it would be inconvenient to gather the rows corresponding to the top scores. The indices are in the `[0, K-1)` range. The new output can be used directly as input to `Gather`.
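A NumPy sketch of that use case (names are illustrative): reshape the N * M scores to [N, M], take the per-session top-K, and use the flattened-index output to gather the matching rows from the original batch.
```
import numpy as np

N, M, K = 3, 4, 2
scores = np.random.rand(N * M)
rows = np.arange(N * M).reshape(N * M, 1)      # stand-in for the feature rows

per_session = scores.reshape(N, M)
topk_idx = np.argsort(-per_session, axis=1)[:, :K]   # per-session indices
flat_idx = topk_idx + np.arange(N)[:, None] * M      # the new flattened output
gathered = rows[flat_idx.reshape(-1)]                # rows behind the top scores
print(gathered.shape)  # (N * K, 1)
```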
Reviewed By: chocjy
Differential Revision: D5171459
fbshipit-source-id: 69f7b41456c3f9670650ae07afc8fef8328485e9
Summary:
The global StatRegistry doesn't get reset when the workspace is reset.
```
> self.assertTrue(len(workspace.FetchBlob('k3')) == 2)
E AssertionError: False is not true
```
https://travis-ci.org/lukeyeager/caffe2/jobs/240162665
/cc azzolini
NOTE: this error doesn't show up if you just run `stats_ops_test.py` directly. It shows up when you run other tests in the same session before this test:
```
pytest -v caffe2/python/
```
Closes https://github.com/caffe2/caffe2/pull/788
Differential Revision: D5259232
Pulled By: salexspb
fbshipit-source-id: 3c72633af6bb61c4fda62195298b1e9574b4cbef
Summary: Implementation of the SliceOp for CUDA
Reviewed By: akyrola
Differential Revision: D5254287
fbshipit-source-id: 0a1660e1aa161fd088a2d8f886e019c05a1919a2
Summary:
```
File "/data/caffe2/install/caffe2/python/hypothesis_test.py", line 1911, in test_batch_to_space
(w + 2 * pad) / block_size).astype(np.float32)
File "mtrand.pyx", line 1404, in mtrand.RandomState.randn (numpy/random/mtrand/mtrand.c:19843)
File "mtrand.pyx", line 1534, in mtrand.RandomState.standard_normal (numpy/random/mtrand/mtrand.c:20368)
File "mtrand.pyx", line 167, in mtrand.cont0_array (numpy/random/mtrand/mtrand.c:6127)
TypeError: 'float' object cannot be interpreted as an index
```
```
File "/data/caffe2/install/caffe2/python/operator_test/tile_op_test.py", line 101, in tile_ref
tiled_data = np.tile(X, tuple(dims))
File "/data/caffe2/venv/local/lib/python2.7/site-packages/numpy/lib/shape_base.py", line 881, in tile
return c.reshape(shape_out)
TypeError: only integer scalar arrays can be converted to a scalar index
```
I also tested to make sure this still works with 0.11.
Closes https://github.com/caffe2/caffe2/pull/787
Differential Revision: D5248087
Pulled By: salexspb
fbshipit-source-id: eff69482a8eabb8ace330003fa326c832b53865f
Summary:
We need to support RNNs explicitly in ExtractPredictorNet, because they store sub-nets as strings in special arguments. When the netdef argument arrives, we can generalize this a bit.
Added a test under rnn_cell_test to test that extracting an LSTM predictor net works correctly and sets the device option properly for the step net ops.
Reviewed By: yqwangustc
Differential Revision: D5236334
fbshipit-source-id: cd653427f8c440a14d94195a532d18276f94749a
Summary: added an operator that converts key/value blobs into a blob containing a map pointer, unittest passed.
Differential Revision: D5224449
fbshipit-source-id: 2f60754ed3ba6ed16039c09019117ae3c3646ab2
Summary: added an operator that converts key/value blobs into a blob containing a map pointer, unittest passed.
Differential Revision: D5166513
fbshipit-source-id: 748527c423a163fe55f914c08fff3adfc74a540c
Summary:
Static RNN allows unrolling an RNN into a Caffe2 graph using all existing cell abstractions. In this diff I introduce several new tests that already caught a few bugs in our RecurrentNetworkOp gradient accumulation logic by comparing it to an unrolled version.
Another use case is perf - potentially we can run an unrolled net faster because DAGNet will have access to the whole graph. The same goes for memonger. But this work is not part of this diff.
Reviewed By: akyrola
Differential Revision: D5200943
fbshipit-source-id: 20f16fc1b2ca500d06ccc60c4cec6e81839149dc
Summary:
`brew_test.py` is just plain broken. `core_test.py` doesn't work with pytest. `apmeter_test.py` and `top_k_test.py` don't work for CUDA builds.
Closes https://github.com/caffe2/caffe2/pull/765
Differential Revision: D5211817
Pulled By: Yangqing
fbshipit-source-id: 78ec5af35a3fa870978e4c9590210ade9e3bc5ac
Summary:
Neither dependency is required by the core Python modules.
OpenCV, in particular, is a pain to install (no pip package). Conditionally skipping this test will make TravisCI integration easier.
Closes https://github.com/caffe2/caffe2/pull/739
Differential Revision: D5211799
Pulled By: Yangqing
fbshipit-source-id: c6bdc8a17977f64f34e968fd9ab8c65161d2624d
Summary: Implements an APMeter operator (APMeterOp) to calculate AP for multiclass classification given prediction scores and labels. The op takes a score tensor [nsamples x nclasses] and a label tensor [nsamples x nclasses], and outputs a float tensor of size nclasses as the AP for each class.
Reviewed By: akyrola
Differential Revision: D5082565
fbshipit-source-id: ae7304bc8fc999c361245b9aec38eb9a5f5eef4b
Summary:
Input of topK op: X (dense)
Output of topK op: Value and Indices (sparse representation)
Value will have a gradient in some cases.
We backprop (copy) the gradient from the sparse (d Value) to the dense (d X).
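A NumPy sketch of that backward rule: scatter the gradient of the sparse Values output into a zero-initialized dense gradient for X at the positions recorded in Indices.
```
import numpy as np

def topk_values_grad_to_dense(d_values, indices, x_shape):
    d_x = np.zeros(x_shape, dtype=d_values.dtype)
    rows = np.arange(x_shape[0])[:, None]
    d_x[rows, indices] = d_values           # copy sparse grads into dense d_X
    return d_x

d_values = np.ones((2, 3), dtype=np.float32)
indices = np.array([[0, 2, 4], [1, 3, 5]])
print(topk_values_grad_to_dense(d_values, indices, (2, 6)))
```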
Differential Revision: D5133461
fbshipit-source-id: 7bad55b60e8a22dfe0e51357ce2099d7f752c133
Summary:
It's causing problems inside docker containers:
`InvalidArgument: Insufficient bytes of entropy to draw requested array. shape=(5, 9, 10, 5), dtype=float32. Can you reduce the size or dimensions of the array? What about using a smaller dtype? If slow test runs and minimisation are acceptable, you could increase settings().buffer_size from 8192 to at least 18432000.`
Closes https://github.com/caffe2/caffe2/pull/707
Differential Revision: D5162621
Pulled By: Yangqing
fbshipit-source-id: 55544210961cbc80828dca2cbeba6a5ace8cf8d1
Summary:
This warning becomes an error with https://github.com/numpy/numpy/pull/6271 (`>=0.12.0`).
```
caffe2/python/operator_test/tile_op_test.py::TestTile::test_tilewinput
/opt/caffe2/caffe2/python/operator_test/tile_op_test.py:100: VisibleDeprecationWarning: converting an array with ndim > 0 to an index will result in an error in the future
dims[axis] = tiles
/usr/lib/python2.7/dist-packages/numpy/lib/shape_base.py:873: VisibleDeprecationWarning: converting an array with ndim > 0 to an index will result in an error in the future
return c.reshape(shape_out)
```
Closes https://github.com/caffe2/caffe2/pull/710
Differential Revision: D5160776
Pulled By: Yangqing
fbshipit-source-id: b264e0e389de5817a289db878c15e655f9fa2f09
Summary: If ConstantFill (or another fill op) is used in CUDAContext with input_as_shape, the code crashes, as it expects the shape to be in CUDAContext but accesses the array in host code... We could fix this by copying the values from the CUDA tensor, but it is probably best to enforce that the shape param is in CPU context. This is what this diff does.
Differential Revision: D5152766
fbshipit-source-id: 0629a189bd1d800c0b7c9dbc324b78d279efac0b
Summary: These return views in Python 3 which would not do anything in a lot of usages currently present in Caffe2. This diff simply removes (almost) all usages of these two in Caffe2 and sub projects in favor of comprehensions which are also easier to read/understand
Reviewed By: akyrola
Differential Revision: D5142049
fbshipit-source-id: e800631d2df7d0823fed698cae46c486038007dc
Summary:
I'll let y'all decide how you want to fix this (probably need a persistent curand buffer). Here's a test to verify the fix.
Closes https://github.com/caffe2/caffe2/pull/495
Differential Revision: D5148815
Pulled By: akyrola
fbshipit-source-id: e80dabe65230ddd32340f2d872cd8786ac960bf8
Summary: Refactored SoftmaxWithLoss by removing the code for spatial=1 mode and created a new op SpatialSoftmaxWithLoss that has the spatial mode implemented.
Reviewed By: viswanathgs
Differential Revision: D5104120
fbshipit-source-id: 8ab999e32c916b2a39a670a7b2a3365401535f24
Summary:
To make the optimizer for sparse gradients work with CUDA, we need UnsortedSegmentSum and Mean implemented for CUDA. Unique was already implemented by harouwu.
Pretty straightforward implementations, should be fast enough -- and I don't know a faster way anyway.
Added some tests as well.
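A simple NumPy reference for UnsortedSegmentSum (segment ids need not be sorted; each output row accumulates the input rows mapped to it; Mean would divide by the per-segment counts):
```
import numpy as np

def unsorted_segment_sum(data, segment_ids, num_segments):
    out = np.zeros((num_segments,) + data.shape[1:], dtype=data.dtype)
    np.add.at(out, segment_ids, data)   # unbuffered scatter-add per segment id
    return out

data = np.arange(8, dtype=np.float32).reshape(4, 2)
print(unsorted_segment_sum(data, np.array([1, 0, 1, 0]), 2))
```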
Reviewed By: asaadaldien
Differential Revision: D5124548
fbshipit-source-id: 63ae72f45fc2f07470603f7b2de12f34635dbb3d
Summary:
Implement SizeOp, which returns the number of elements in the input
tensor.
The output is a 1D tensor that contains the number of elements.
Reviewed By: akyrola
Differential Revision: D5101061
fbshipit-source-id: d1c56053b6f3b41c65ac574dd748482775d1ea0d
Summary: The gradient test for the tile op was flaky because I had made the dimensions too large. This caused push-blocking errors. Also I noticed my test_grad_tile was incorrect.
Reviewed By: asaadaldien
Differential Revision: D5126476
fbshipit-source-id: ae9ce5d9041648d7a4535fc88d4013e669bd6f02
Summary: As noted by salexspb, MultiRNNCell had unreliable gradient computation. The problem was that the recurrent gradient and the gradient computed within the backward step net were not being accumulated during the backward pass, but rather written to the same blob, thus overwriting each other. This diff fixes that by artificially introducing an extra blob for the internal output, and then accumulating it into the gradient coming from the recurrent connection.
Reviewed By: salexspb
Differential Revision: D5110059
fbshipit-source-id: 16add50989fe8866361bbc21afce5f214c5292fd
Summary:
I had "optimized" the number of threads / block, but cub::BlockReduce has a static template parameter for the number of threads, and this must match. Probably tests still passed because typically the initial numbers are zeros.
Also added a stronger test.
Thanks ves for the report.
Differential Revision: D5110901
fbshipit-source-id: c1169b1286e204c202b0727448ddb51b4965eacb
Summary:
deprecate CNNModelHelper in python/operator_test dir
BTW I found that there are 2 mkl_speed_test files. I am confused...
Reviewed By: salexspb
Differential Revision: D5094122
fbshipit-source-id: f6526f4de334f2245eb4c1f204a8ec9f23750d78
Summary:
CUDNN dilated convolution was added to V6. This version of CUDNN does not support NHWC for dilated convolution.
Fix conv_test.py so that it does not test CUDNN for dilated convolution in NHWC format.
Closes https://github.com/caffe2/caffe2/pull/598
Reviewed By: akyrola
Differential Revision: D5084835
Pulled By: asaadaldien
fbshipit-source-id: 3c0c5ed02c5d9232fca567e387ab6260d71e5aaf
Summary: I noticed that Sigmoid was taking an inordinate amount of time in our NMT benchmark, so I looked at the implementation and it didn't seem optimal. I replaced the implementation with an Eigen version so that when the Eigen update goes through, we will get proper AVX(2) vectorization.
Differential Revision: D5082464
fbshipit-source-id: aa951f7d730fc05198f7dd04076ec58d471b74c8
Summary: Added L1Distance Operator for CUDA, as well as tests.
Reviewed By: bwasti
Differential Revision: D5071966
fbshipit-source-id: 4c3d862605e9123d955bf091efa67d0731bd816a
Summary:
Migrate the experiments folder to the fb/sparse folder. Keep FunHashOp and SparseFunHashOp because they are now assumed to be default ops in depr. What I did:
# Migrate FunHashOp and SparseFunHashOp and their unit tests to core caffe2, make sure tests pass.
# Migrate other ops in the experiments folder to the fb/sparse folder. Write new TARGETS files for them. Make sure tests pass.
# Make sure all related tests pass.
# Fix the MKL definition btw. Make sure that FC_Sparse is not compiled when there is no MKL support.
Reviewed By: salexspb
Differential Revision: D4952993
fbshipit-source-id: 86c03676ab4e47f04d2d0dd438a4a1c849bbbff0
Summary:
Generalize SpatialBatchNorm CPU Op to compute Spatial batch normalization for
1D, 2D & 3D input tensors.
Reviewed By: dutran
Differential Revision: D5043563
fbshipit-source-id: 7fcb933a628dd47f13aa622f63601a87382f09cd
Summary:
Added several features to the ImageInputOp:
- bounding box (per image as well as default for the operator). For per-image, it
only works in Caffe2 format and is passed as the third tensor in the form
(ymin, xmin, height, width). For the operator, pass bounding_xmin, bounding_ymin,
bounding_width and bounding_height as parameters.
- per-channel mean/std. You can use the usual mean/std to pass a single
value to be used for all channels or also pass mean_per_channel and std_per_channel
to specify different values per channel. Order of channels is BGR.
- A minimum size parameter that can be specified instead of the scale parameter.
The minsize parameter will only scale the image if it is smaller than required.
This differs from scale which will scale up as well as down. You can only specify
one of scale or minsize.
Added a test case to test some of the features
Differential Revision: D4874988
fbshipit-source-id: 437191052a46e9916defe8b100d7cc7864373f61
Summary:
cuDNN versions of dropout and LRN (for native fp16 support), port of Caffe's max pooling algo that uses an explicit mask to store locations (also supports fp16 storage)
Closes https://github.com/caffe2/caffe2/pull/396
Reviewed By: akyrola
Differential Revision: D4990880
Pulled By: asaadaldien
fbshipit-source-id: a716acffb656843e9b31e3e6808bd2d8aa959d03
Summary:
Incorporating a definition of the cell's output and illustrating its usage by adding dropout to all types of cell.
I think that we should try to get rid of aliases in RecurrentNetwork, so that the output of applied_over_sequence is also always (state_1_all, state_2_all, ...). This way we can merge get_output_from_single_step, get_output_from_sequence and get_outputs_with_grads into a single method.
Let me know what you think!
Reviewed By: jhcross
Differential Revision: D4992913
fbshipit-source-id: 737939be336ad145f84e8733cd255d4f7188ef70
Summary:
Specialized implementation of ResizeNearest for width_scale=2 and height_scale=2. This implementation doesn't use divides or calls to std::min, and is unrolled 2x over the width dimension. Also add a correctness test.
About 6x faster.
Reviewed By: ajtulloch
Differential Revision: D4928579
fbshipit-source-id: 5cc92a52bd688690fee907b4333d9c84b666f9c9
Summary: Adding a simple video data layer which allows reading video data from frames and videos, and outputs a 5D tensor. It also allows multiple labels. The current implementation is based on ffmpeg.
Differential Revision: D4801798
fbshipit-source-id: 46448e9c65fb055c2d71855447383a33ade0e444
Summary:
This diff creates a generalized AttentionCell class, which will allow us to construct attention decoders out of arbitrary RNNCell components (with a particular view to using stacked, multi-layer RNNs).
In order to do this, we introduce a new optional input for RNNCell._apply which allows us to provide an additional input that is not processed by prepare_input(). Note that this is an argument only to _apply, not apply, since it is only meant to be used for additional recurrent connections to "embedded" cells, not for standalone RNNs.
Reviewed By: urikz
Differential Revision: D4998465
fbshipit-source-id: 473009ea4917e86e365f9d23aa2f11a46a94fd65
Summary:
Added the possibility to provide 'tiles' and 'axis' as inputs,
as opposed to arguments, for the Tile operator. If provided, the input
values will override the argument values. Now with proper CUDA code.
Differential Revision: D4930347
fbshipit-source-id: b44b032b327c7d7bddfce63abf4e3289d7e74bfb