Commit Graph

1367 Commits

Author SHA1 Message Date
Qinqing Zheng
ce62c65c18 momentum sgd
Summary: Add support for SparseMomentumSGDUpdate and tests for momentum SGD in both dense and sparse cases
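
For reference, a minimal numpy sketch of the dense and sparse momentum updates being tested (an illustration only; the helper names and signatures are not the ops' actual interface):

    import numpy as np

    def momentum_sgd_update(grad, moment, lr, momentum=0.9):
        # Dense case: the adjusted gradient becomes the new moment,
        # and the caller applies param -= adjusted.
        adjusted = lr * grad + momentum * moment
        return adjusted, adjusted  # (param delta, new moment)

    def sparse_momentum_sgd_update(param, moment, grad, indices, lr,
                                   momentum=0.9):
        # Sparse case: only the rows named by `indices` are touched.
        for i, idx in enumerate(indices):
            adjusted = lr * grad[i] + momentum * moment[idx]
            moment[idx] = adjusted
            param[idx] -= adjusted

    param = np.random.rand(5, 3).astype(np.float32)
    moment = np.zeros_like(param)
    grad = np.random.rand(2, 3).astype(np.float32)
    sparse_momentum_sgd_update(param, moment, grad, indices=[0, 4], lr=0.1)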

Reviewed By: akyrola

Differential Revision: D6234834

fbshipit-source-id: 9848c29ea06794ef35f1ebaff0f5e81eac4f4db9
2017-11-03 16:17:17 -07:00
Alexander Sidorov
20feef45bc NNFC operator: an FC with noTrans noTrans options
Summary:
This seems to be faster in a bunch of cases. Prefer to keep it as a separate op instead of MatMul + Add so it's easy to compare perf on a per-op basis between this one and the baseline (normal FC).
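
For context, a numpy sketch of the two layouts (a hedged illustration: Caffe2's standard FC stores W as (M, K) and computes Y = X * W^T + b; the noTrans variant presumably stores W pre-transposed so the GEMM needs no transpose):

    import numpy as np

    def fc_reference(X, W, b):
        # Standard FC: W is (M, K), so the GEMM transposes it.
        return X.dot(W.T) + b

    def fc_notrans(X, W_t, b):
        # noTrans/noTrans: W_t is stored as (K, M); no transpose needed.
        return X.dot(W_t) + b

    X = np.random.rand(8, 16)          # (N, K)
    W = np.random.rand(4, 16)          # (M, K)
    b = np.random.rand(4)              # (M,)
    assert np.allclose(fc_reference(X, W, b), fc_notrans(X, W.T, b))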

Reviewed By: akyrola

Differential Revision: D6169187

fbshipit-source-id: 09b96325d44bd181896f396aec88b27314c435b0
2017-11-03 15:08:39 -07:00
Philipp Keller
68ed66a2c5 Faster BatchBoxCox Operator using MKL
Summary: Use MKL VML vsPow() and row-major iteration for faster BatchBoxCox operator.
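
The Box-Cox transform is y = (x^lambda - 1) / lambda for lambda != 0 and y = log(x) for lambda = 0. A hedged numpy sketch of a row-major batch version (the two lambda inputs follow the op's interface; exact semantics may differ):

    import numpy as np

    def batch_box_cox(data, lambda1, lambda2):
        # data: (N, D); lambda1/lambda2: (D,) per-feature parameters.
        shifted = data + lambda2          # broadcast across rows
        out = np.empty_like(shifted)
        nz = lambda1 != 0
        out[:, nz] = (np.power(shifted[:, nz], lambda1[nz]) - 1.0) / lambda1[nz]
        out[:, ~nz] = np.log(shifted[:, ~nz])
        return out

    data = np.random.rand(4, 3) + 1.0
    out = batch_box_cox(data, np.array([0.5, 0.0, 2.0]), np.zeros(3))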

Reviewed By: kennyhorror

Differential Revision: D6042052

fbshipit-source-id: 54fc6b9184cb341672183a77730d79a271d09207
2017-11-03 12:04:03 -07:00
Aapo Kyrola
b71cebb11f Fix LoadModel() in resnet50_trainer
Summary:
resnet50 trainer will save the 'optimizer_iteration' blob in checkpoints, but loads it in GPU context. This fails because AtomicIter/Iter expect the blob to be in CPU context. So manually reset the optimizer_iteration in CPU context.

I am thinking of making the iter-operators automatically do this switch, but in the meantime this unbreaks the trainer.
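
A sketch of the workaround described (the blob name comes from the summary; the rest is an illustration, not the trainer's exact code):

    import numpy as np
    from caffe2.python import core, workspace
    from caffe2.proto import caffe2_pb2

    # AtomicIter/Iter expect 'optimizer_iteration' on CPU, so after
    # loading a checkpoint, re-feed it under an explicit CPU scope.
    with core.DeviceScope(core.DeviceOption(caffe2_pb2.CPU)):
        workspace.FeedBlob('optimizer_iteration',
                           np.array([0], dtype=np.int64))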

Reviewed By: sf-wind

Differential Revision: D6232626

fbshipit-source-id: da7c183a87803e008f94c86b6574b879c3b76438
2017-11-03 11:15:25 -07:00
Xianjie Chen
1b5c843a9c cleaner logic on sparse feature hashing
Reviewed By: kennyhorror

Differential Revision: D6195525

fbshipit-source-id: f687ac3d4914c3dbb0d35679e3a3d3a64a71ac53
2017-11-03 07:27:45 -07:00
Ilia Cherniavskii
1149b9bbb5 Polling async net executor
Summary:
Implementation of polling async net executor.
Notes:
- New net executor async_polling - schedules CPU and GPU ops asynchronously, uses a single polling thread
- Events: update to Caffe2 events to support async CPU events, adding new methods:
  Query() - non-blocking check of an event's state: INITIALIZED -> RECORDED -> SUCCESS/FAILED
  ErrorMessage() - when an operation runs asynchronously and fails, calling this on the event gives the error message
- Tasks: using the existing DAGNet algorithm to compute CPU and GPU chains, with a separate task for each chain
- Polling: a single thread queries the state of events (sketched below) - for CPU tasks it atomically queries task state, for GPU tasks it uses cudaEventQuery
- Scheduling of CPU ops: using global thread pools
- Scheduling of GPU ops: using a GPU thread pool per GPU device
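
A toy Python sketch of the polling scheme above (the real executor is C++; Task and its states here are stand-ins for op chains and Caffe2 events):

    import time

    class Task:
        """Stand-in for a chain of ops with an async completion event."""
        def __init__(self, name, parents=()):
            self.name, self.parents, self.done = name, list(parents), False
        def query(self):
            # Non-blocking state check, like Event::Query().
            return 'SUCCESS' if self.done else 'RECORDED'
        def schedule(self):
            self.done = True  # real executor hands off to a thread pool

    def polling_loop(tasks):
        # Single polling thread: query states, schedule ready tasks.
        pending = set(tasks)
        while pending:
            for task in list(pending):
                if all(p.query() == 'SUCCESS' for p in task.parents):
                    task.schedule()
                if task.query() == 'SUCCESS':
                    pending.discard(task)
            time.sleep(0.001)  # avoid busy-spinning

    a = Task('cpu_chain')
    b = Task('gpu_chain', parents=[a])
    polling_loop([a, b])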

Reviewed By: dzhulgakov

Differential Revision: D5985110

fbshipit-source-id: a9de7fcbb71d046a3aa1b573072b89a65dfeee8c
2017-11-03 07:27:44 -07:00
Dmytro Dzhulgakov
583bc63c98 Fix boundary checking in 8-bit sparselengthssum ops
Summary: Previously, the boundary checking for the 8-bit ops happened only after the first access.

Reviewed By: Yangqing

Differential Revision: D6206753

fbshipit-source-id: 07ab240cae8c67b3048f03aa79af0b6399b9940b
2017-11-03 05:19:57 -07:00
Andrew Tulloch
ebae2f6c71 MKL Sigmoid op wrapper
Reviewed By: Yangqing

Differential Revision: D6222910

fbshipit-source-id: 92d0825a6a35a4bf6a12636e3d5dd8affcffeef3
2017-11-02 17:30:29 -07:00
Andrew Tulloch
a7644e4f4b Extend rewrite functionality to handle multiple outputs.
Summary: Still assumes a complete subgraph, but slightly more generic.

Reviewed By: Yangqing

Differential Revision: D6103228

fbshipit-source-id: bfa0d46067e05baa0478a4c37a67ccf8f81f34ec
2017-11-02 17:30:27 -07:00
Andrew Tulloch
7244d27220 Add a EmptyDeviceScope (i.e. allow setting CurrentDeviceScope() to None)
Summary:
See comments for where this can be useful (disabling the
OperatorDef::DeviceOption(...) so we can control the scope at the
NetDef::DeviceOption(...) level).
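
A hedged usage sketch (assuming the new scope is exposed as core.EmptyDeviceScope; the net and op are illustrative):

    from caffe2.python import core
    from caffe2.proto import caffe2_pb2

    net = core.Net('example')
    with core.DeviceScope(core.DeviceOption(caffe2_pb2.CUDA, 0)):
        with core.EmptyDeviceScope():
            # Ops created here carry no OperatorDef::DeviceOption, so
            # the device is controlled at the NetDef level instead.
            net.ConstantFill([], 'blob', shape=[1], value=0.0)
    net.Proto().device_option.CopyFrom(core.DeviceOption(caffe2_pb2.CUDA, 0))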

Reviewed By: viswanathgs

Differential Revision: D6103412

fbshipit-source-id: 75a9be54275760132f6d1e71acbe9190e7099289
2017-11-02 11:25:48 -07:00
Aapo Kyrola
14f95c2782 Updated brew SpatialBN to use initializers
Summary: Updated brew SpatialBN to use initializers, similar to other brew ops such as conv and fc, instead of initializing all of its parameters itself within the brew call.

Reviewed By: asaadaldien

Differential Revision: D5840359

fbshipit-source-id: 9f3d688d4957605eaf7ecd2488bc26bfb1da3f78
2017-11-02 11:25:45 -07:00
Junjie Bai
7c2804ee90 Add support for doing broadcast with single elem dimensions at both ends
Summary: Closes https://github.com/caffe2/caffe2/pull/1413
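
A shape example of what this enables (numpy broadcasting semantics, for illustration):

    import numpy as np

    A = np.random.rand(2, 3, 4, 5)
    B = np.random.rand(1, 3, 4, 1)   # single-elem dims at BOTH ends
    C = A + B                        # B broadcasts to (2, 3, 4, 5)
    assert C.shape == (2, 3, 4, 5)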

Reviewed By: jamesr66a

Differential Revision: D6201556

Pulled By: bddppq

fbshipit-source-id: 1d443e895dbb3f5b67a5a0e027977b7807df3de1
2017-11-01 18:33:11 -07:00
Aapo Kyrola
b5c053b1c4 fix fp16 issues with resnet trainer
Summary:
My commit bab5bc broke things with fp16 compute, as I had tested it only with the null input, which actually produced fp32 data (even though dtype was given as float16). Also, I had confused the concepts of "float16 compute" and fp16 data. Issue #1408.

This fixes those issues, tested with both Volta and M40 GPUs. Basically restored much of the previous code and fixed the null input to do FloatToHalf.

Reviewed By: pietern

Differential Revision: D6211849

fbshipit-source-id: 5b41cffdd605f61a438a4c34c56972ede9eee28e
2017-11-01 13:30:08 -07:00
James Cross
397793d61c simplify beam search code
Summary: This cleans up the _hack_get_slice_end() using the Conditional operator.

Reviewed By: jmp84

Differential Revision: D6177797

fbshipit-source-id: 5ce0b76b8472123415bba39488aa2c69aad96111
2017-10-31 16:59:20 -07:00
Yongqiang Wang
db25f8602f Remove order by clause if it is not needed. Increasing timeout from 10mins to
Reviewed By: asaadaldien

Differential Revision: D6167599

fbshipit-source-id: 3e6bdd55d0aa5b497cc1871f237074b3b9ef6f29
2017-10-31 14:51:39 -07:00
Aapo Kyrola
cec27b8134 AddDistributedBlobsSync
Summary: Added a simple function to synchronize a blob across machines (but not across devices), i.e. for blobs that are not synced over devices.

Reviewed By: yqwangustc

Differential Revision: D6192922

fbshipit-source-id: a4d653c9fb09f06b0c42330bdae07b42f5e6346c
2017-10-30 22:33:29 -07:00
Dong Li
3bfabb4d5f support float16 input for operator SparseAdagrad
Summary:
Implemented a new CUDA class for operator SparseAdagrad. The param and moment inputs can now be float or float16.
The functions for mixed-precision add/mult/store are defined in a separate header file ("caffe2/core/float16_util.h") for reuse.
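
A numpy sketch of the sparse Adagrad update with fp16 storage (a hedged illustration: the float32 round-trip stands in for the mixed-precision helpers; names are made up):

    import numpy as np

    def sparse_adagrad_fp16(param, moment, indices, grad, lr, eps=1e-5):
        # param/moment may be stored as float16; compute in float32
        # and round back on store, as the mixed-precision helpers do.
        for i, idx in enumerate(indices):
            g = grad[i].astype(np.float32)
            m = moment[idx].astype(np.float32) + g * g
            p = param[idx].astype(np.float32) - lr * g / (np.sqrt(m) + eps)
            moment[idx] = m.astype(moment.dtype)
            param[idx] = p.astype(param.dtype)

    param = np.random.rand(10, 4).astype(np.float16)
    moment = np.zeros((10, 4), dtype=np.float16)
    grad = np.random.rand(2, 4).astype(np.float16)
    sparse_adagrad_fp16(param, moment, [3, 7], grad, lr=0.01)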

Reviewed By: azzolini

Differential Revision: D5880200

fbshipit-source-id: dca227f38629a03a9d771f42efe2c0b673075c4d
2017-10-30 19:32:30 -07:00
Aapo Kyrola
669ec0ccba Added FP16 compute support to FC Op
Summary: Allow the GEMMs in the FC/FCGradient Op to do FP16 compute instead of FP32 if the appropriate op flag is set.

Reviewed By: asaadaldien

Differential Revision: D5839777

fbshipit-source-id: 8051daedadf72bf56c298c1cf830b019b7019f43
2017-10-30 17:03:51 -07:00
Aapo Kyrola
86e3e008e0 optimize RNN executor subnet construction for forward-only models
Summary:
The RNN executor had a disadvantage over plain nets when running in forward-only mode: for plain nets, we only create two workspaces and two nets and alternate between them. With the RNN executor, we had only four workspaces (4 > 2 because it was faster in some cases), but the nets (or rather the ops) were created for each of the timesteps. This has significant overhead. This diff changes this so that if the executor is in forward-only mode (i.e. has a limited parallelism setting), it will use the same operators as the t - 4'th net -- excluding the ops that require the timestep blob. The latter exception is required because the RNN executor needs a different timestep blob for each timestep, as it cannot modify the value of the timestep blob the way running nets in a loop can.

Also removed redundancy in the dependency computation and added a debug flag to the executor that outputs a description of the RNN contents.

Reviewed By: salexspb

Differential Revision: D6155510

fbshipit-source-id: c47f727d2128649b081270d15020a08d41e5748d
2017-10-30 12:24:12 -07:00
Junjie Bai
b7a9f51de3 In BatchMatMul, add support for accepting inputs >=2d
Summary: Closes https://github.com/caffe2/caffe2/pull/1399
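
A shape illustration of batched matmul on >=2d inputs (numpy semantics; Caffe2's exact broadcasting rules may differ in detail):

    import numpy as np

    # 2d inputs degenerate to a plain matmul.
    A2, B2 = np.random.rand(3, 4), np.random.rand(4, 5)
    assert np.matmul(A2, B2).shape == (3, 5)

    # >2d inputs treat all leading dims as batch dims.
    A4 = np.random.rand(6, 2, 3, 4)
    B4 = np.random.rand(6, 2, 4, 5)
    assert np.matmul(A4, B4).shape == (6, 2, 3, 5)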

Differential Revision: D6183083

Pulled By: bddppq

fbshipit-source-id: 5c8f17c2de212fbc39a66c90aa2599b714f5ceb4
2017-10-29 23:38:33 -07:00
Qinqing Zheng
42ffb1ae07 support non-normalized weights
Reviewed By: akyrola

Differential Revision: D6158290

fbshipit-source-id: 4d54e5c0d0f91f23deab18da047df4d209d4c312
2017-10-27 23:18:25 -07:00
Aapo Kyrola
86dc6e0837 Added inverted FP16 Initializer
Summary: Added an initializer which sets up the ParameterInfo object in the opposite format to pFP16Initializer. This is needed when the op requires the initialized blob to be FP32 but an FP16 copy of the weights is needed.

Reviewed By: wesolwsk

Differential Revision: D5840832

fbshipit-source-id: 439e87f41a1dbc58bf63a5c0e7f7fc4cb00b4d65
2017-10-27 10:20:04 -07:00
Jiyan Yang
ee3baa2ed4 Add shape checks and print more info in parameter sharing
Summary: As titled.

Reviewed By: kittipatv

Differential Revision: D6145747

fbshipit-source-id: 39a212bb6bebbbf3164cade2f95db22ddb2d2c87
2017-10-27 01:22:06 -07:00
Tilak Sharma
7b7dcaf269 Initialize presence tensor if data is empty.
Summary: See https://fb.facebook.com/groups/811605488888068/permalink/1645450575503551.

Differential Revision: D6116836

fbshipit-source-id: 3072643eaf6f134bda7d224af3d5f8339da1f39d
2017-10-27 01:05:42 -07:00
Qing He
0b0d5b2b1d Add tensor output that gives the sampled values
Summary: Given an additional input tensor containing the values corresponding to the weighted samples, add a tensor output that contains the values selected by the sampled indexes.
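
A numpy sketch of the behavior (a hedged illustration; the function name and shapes are assumptions, not the op's actual interface):

    import numpy as np

    def weighted_sample_with_values(weights, values):
        # weights: (N, D) per-row sampling weights; values: (N, D).
        # Returns the sampled index per row plus the value it selects.
        probs = weights / weights.sum(axis=1, keepdims=True)
        idx = np.array([np.random.choice(p.size, p=p) for p in probs])
        return idx, values[np.arange(len(values)), idx]

    w = np.array([[0.1, 0.9], [0.5, 0.5]])
    v = np.array([[10.0, 20.0], [30.0, 40.0]])
    indices, sampled_values = weighted_sample_with_values(w, v)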

Reviewed By: akyrola

Differential Revision: D6050094

fbshipit-source-id: 1eccc641b99e30d36ae83d49f630b018a53e4147
2017-10-26 16:04:57 -07:00
Kittipat Virochsiri
879e39ea5c Distill loss with SigmoidCrossEntropyWithLogits
Summary: Sigmoid + CrossEntropy has a numerical stability issue. The gradient of sigmoid is `dx = dy * y * (1-y)`. When `label=0` and `x` is large, `1-y` can round to (near) 0 and we lose `dx`. Switching to `SigmoidCrossEntropyWithLogits` solves the issue because the gradient is not dependent on `y`.
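
A numeric illustration of the failure mode (numpy; the formulas are the standard ones, not the ops' exact kernels):

    import numpy as np

    one = np.float32(1.0)
    x = np.float32(20.0)            # large logit
    label = np.float32(0.0)
    y = one / (one + np.exp(-x))    # sigmoid saturates: y == 1.0 in fp32

    # Chained sigmoid + cross entropy: dx = dy * y * (1 - y).
    # With y rounded to 1.0, the (1 - y) factor is exactly 0.
    dx_chained = one * y * (one - y)   # -> 0.0, gradient lost

    # SigmoidCrossEntropyWithLogits gradient: y - label.
    dx_logits = y - label              # -> 1.0, gradient preserved
    print(dx_chained, dx_logits)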

Reviewed By: chocjy

Differential Revision: D6086950

fbshipit-source-id: f990ae726802aa5c56fa62cf5e23f2e61ee047fa
2017-10-26 15:18:34 -07:00
Bor-Yiing Su
e0fa72455d Fixes the checkpoint test.
Summary:
We need to use Cluster to isolate the definition of the nodes. Otherwise, the contexts are polluted and the run becomes stateful.

Reviewed By: Yangqing

Differential Revision: D6140404

fbshipit-source-id: 09d1c86ef12bb01eaa16b1dade4d2e1e93be287a
2017-10-26 13:18:21 -07:00
Jiyan Yang
6e33ae79df Add gradient op for WeightedSum op
Reviewed By: dzhulgakov

Differential Revision: D6149163

fbshipit-source-id: 0e8cf400323233d001243bc5cb25a0025115a564
2017-10-26 00:16:51 -07:00
Aapo Kyrola
63297e1a1f RunNetOnce->RunNet (removes rnn_executor overhead)
Summary:
seq2seq/translate.py was running much slower on RNNExecutor. This was because RNNExecutor has significant init overhead (I have another diff to reduce, but not completely eliminate, it), and translate was calling the decoder with RunNetOnce -- thus always recreating the net and the ops. Changing this to RunNet() makes translate run faster than without the executor. RunNet uses the net name and the already created net, while RunNetOnce passes the whole protobuffer.

Noticed a similar bug in the seq2seq ensemble beam model, which also calls CreateNet() but uses RunNetOnce() instead of RunNet().
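
For reference, the two usage patterns side by side (standard caffe2 workspace API; the net here is a trivial stand-in):

    from caffe2.python import core, workspace

    net = core.Net('decoder_step')
    net.ConstantFill([], 'out', shape=[1], value=1.0)

    # Slow: RunNetOnce ships the whole proto and recreates the net
    # (and, with the RNN executor, all of its ops) on every call.
    for _ in range(10):
        workspace.RunNetOnce(net)

    # Fast: create once, then run by name; ops are reused.
    workspace.CreateNet(net)
    for _ in range(10):
        workspace.RunNet(net.Proto().name)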

Reviewed By: jhcross

Differential Revision: D6156566

fbshipit-source-id: a933453e36a0d8fd163d0584186fda427a680687
2017-10-25 22:06:02 -07:00
Ahmed Taei
5bb8ed67e3 Compute GLU for an arbitrary axis
Summary: As in title
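
GLU splits the input into two halves a and b along the chosen axis and computes a * sigmoid(b); a minimal numpy sketch:

    import numpy as np

    def glu(x, axis=-1):
        # Split along `axis`, gate the first half with the sigmoid
        # of the second: GLU(x) = a * sigmoid(b).
        a, b = np.split(x, 2, axis=axis)
        return a * (1.0 / (1.0 + np.exp(-b)))

    out = glu(np.random.rand(2, 6, 4), axis=1)   # split the size-6 axis
    assert out.shape == (2, 3, 4)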

Differential Revision: D6151804

fbshipit-source-id: bd0fa08be1676ebd1abd9720711c221c61c11ad1
2017-10-25 19:49:55 -07:00
Yan Shang
39359afc84 Add rank loss for retrieval models with random negative sample
Summary:
In order to reproduce the StarSpace model using the architecture of the Two Tower model, we need to implement the ranking loss that is used in StarSpace as well as in the Filament model. In both the StarSpace and Filament models, all negative samples come from random negative sampling, so the number of negative samples per positive record is fixed (say 64). To calculate the total loss, for each positive record, the hinge distance between the positive score and each negative score (the 64 scores in the example) is calculated. This diff implements this loss in the Dper framework.

The main idea is to add an option so that negative_sampling.py can output random negative samples as an independent field rather than merging them with the original input_record. In this way, we can calculate the positive score and negative scores separately, which will eventually be used when calculating the ranking loss.
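
A numpy sketch of the loss described (a hedged illustration; the margin and names are assumptions):

    import numpy as np

    def ranking_hinge_loss(pos_score, neg_scores, margin=1.0):
        # pos_score: (N,) one positive score per record.
        # neg_scores: (N, K) scores of K random negatives per record.
        gaps = margin - pos_score[:, None] + neg_scores
        return np.maximum(gaps, 0.0).sum()

    loss = ranking_hinge_loss(np.array([2.0, 1.5]),
                              np.random.rand(2, 64))  # 64 negatives each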

(Note: this ignores all push blocking failures!)

Reviewed By: kittipatv

Differential Revision: D5854486

fbshipit-source-id: f8a5b77be744a6cc8a2b86433282b3b5c7e1ab4a
2017-10-25 16:19:41 -07:00
Frank Jiang
d67624173b Change RowWiseSparseAdagrad assertion message
Summary: Made the assertion message clearer, to let people know that rowwise is not supported for dense adagrad.

Differential Revision: D6135363

fbshipit-source-id: d706135a335305627310c69a2a6d7721b0a47f0e
2017-10-25 10:54:33 -07:00
Aapo Kyrola
241b9f6c14 disable rnn executor for beam search
Summary:
The RNN executor has significant overhead from creating the timestep nets the first time, and this is especially bad with beam search, which is complex.
So disable the RNN executor for now, until the perf regression is fixed (I have a pending diff for it).

Reviewed By: salexspb

Differential Revision: D6138878

fbshipit-source-id: ce63ab9ce9cc1c0f67097aea1e370494ca98c680
2017-10-24 20:49:56 -07:00
Aapo Kyrola
2e4d8aa530 Added FP16/FP32 MomentumSGD + WeightDecay Update Ops
Summary:
Added two new ops, FP16MomentumSGDUpdate and FP32MomentumSGDUpdate, which perform both the momentum SGD and weight decay updates to a given parameter in a single op -- thus being more efficient.

Also updated the standard momentum SGD test to check that Nesterov momentum works.

Reviewed By: asaadaldien

Differential Revision: D5837837

fbshipit-source-id: 5ad487b9c59434491d3a4fcfdeed820db6083f57
2017-10-24 12:28:16 -07:00
Bram Wasti
a0aa6d0e24 expose flop annotation to python
Summary: expose the flop annotation framework to python functions

Reviewed By: Maratyszcza, Yangqing

Differential Revision: D6135705

fbshipit-source-id: 2eed80b6cbda7b3ee3fe0e019a0f1fc4b0aa320b
2017-10-24 11:35:24 -07:00
Aapo Kyrola
388a1b1e66 Added FP16SgdOptimizer
Summary:
Added FP16SgdOptimizer to the optimizers. The optimizer updates the params using the FP16MomentumSGDUpdate and FP32MomentumSGDUpdate ops. To determine which update op to call, the optimizer expects either the fp32_update flag to be set, or the blobs to be in a recognized format created by initializers.py.

These requirements can be loosened if the blob DataType can be queried in Python, though I am unsure of how to do this.

It also forces FP32 updates for SpatialBN, as CuDNN does not support FP16 params for SpatialBN.

Reviewed By: asaadaldien

Differential Revision: D5840806

fbshipit-source-id: 84ab8dc11a6e91a198ed72c00287f4809607079d
2017-10-24 10:44:04 -07:00
Junjie Bai
ed08533a1e Add CUDA version of ScatterAssign
Reviewed By: houseroad

Differential Revision: D6128352

fbshipit-source-id: ea59f4bc723ef929b0f6ed15797df776d8054422
2017-10-24 10:20:03 -07:00
Aapo Kyrola
1b71bf1d36 Updated resnet50_trainer and resnet for more FP16 support
Summary: Added FP16SgdOptimizer to resnet50_trainer

Reviewed By: wesolwsk

Differential Revision: D5841408

fbshipit-source-id: 3c8c0709fcd115377c13ee58d5bb35f1f83a7105
2017-10-24 09:19:06 -07:00
Ahmed Taei
512a8015b8 Gated Linear Unit implementation
Summary: As titled

Differential Revision: D6117600

fbshipit-source-id: 84b0154dc4cf77cc9c9146e9a534c7485989346b
2017-10-23 18:14:57 -07:00
Yarik Markov
c6ef04db04 Add "dtype" parameter for GivenTensorOp
Summary: Adding a "dtype" parameter for the GivenTensorOp, and providing backwards compatibility for the existing code by supporting the templating if "dtype" is not provided.

Reviewed By: bddppq

Differential Revision: D6090049

fbshipit-source-id: f5deaa57b49f2280289975f4583aba5bc064a2bc
2017-10-23 16:06:37 -07:00
Soumith Chintala
891f41c14b Upgrade to 2.2.1
Summary:
Update pybind from 1.8.1 to 2.2.1
aarch64 platform updates pending.

Reviewed By: houseroad, kmatzen

Differential Revision: D6089712

fbshipit-source-id: 80ce09c381717f4317e2e698479ff604cf28c709
2017-10-22 13:26:56 -07:00
Qinqing Zheng
6a4182eead weighted sample op cuda
Summary: CUDA version of weighted sampling operator; minor changes for CPU version

Reviewed By: asaadaldien

Differential Revision: D6106668

fbshipit-source-id: 42d7607bd845a4a39cf5b89d7476904cb5928431
2017-10-21 18:49:59 -07:00
Badri Narayan Bhaskar
25bfffeafe Swish Activation Function
Summary:
Swish: A self-gated activation function.
https://arxiv.org/pdf/1710.05941.pdf
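
Swish is f(x) = x * sigmoid(x); a minimal numpy sketch with its gradient:

    import numpy as np

    def swish(x):
        # f(x) = x * sigmoid(x) = x / (1 + exp(-x))
        return x / (1.0 + np.exp(-x))

    def swish_grad(x):
        # f'(x) = f(x) + sigmoid(x) * (1 - f(x))
        s = 1.0 / (1.0 + np.exp(-x))
        f = x * s
        return f + s * (1.0 - f)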

Reviewed By: ajtulloch

Differential Revision: D6100424

fbshipit-source-id: 0103d6d82e9ffb50106c98a8785e62b8808e9af1
2017-10-20 10:37:43 -07:00
Junjie Bai
ee62a595fc ScatterAssign int types
Summary: Closes https://github.com/caffe2/caffe2/pull/1357

Reviewed By: dzhulgakov

Differential Revision: D6107036

Pulled By: bddppq

fbshipit-source-id: 9278dae988c3c0656b4e4fd08bf7ca1e2eec3348
2017-10-19 23:22:54 -07:00
Dmytro Dzhulgakov
623f2bf815 Add GivenTensorInt64Fill on gpu
Summary: Before we fix it properly with 'type' argument.

Reviewed By: bddppq

Differential Revision: D6103973

fbshipit-source-id: 8c00a93c373dd0ad0bbfe59944495f6574223ab6
2017-10-19 18:32:41 -07:00
Huazhong Ning
f7ad13694c support model init
Summary:
A parameter can be initialized multiple times in init_net if parameter sharing is enabled. With the original implementation, only the first parameter init is replaced by pre-trained parameters, and the rest are still unchanged. This diff overwrites the initialization with pre-trained parameters, fixing the issue, and also supports model init for the ads-intent project.

Reviewed By: dragonxlwang

Differential Revision: D5991291

fbshipit-source-id: 36173f6239c56bd0d604a77bd94e36072f32faa7
2017-10-19 15:56:37 -07:00
Hassan Eslami
db6a9d2ae4 Fixes type inference for Slice and GivenTensor*Fill operators
Summary:
Currently, the type inference infers FLOAT as the type for all GivenTensor*Fill operators. However, the inferred type should match the actual operator.

Also, for the `Slice` operator, there is a corner case where type inference fails.

Reviewed By: azzolini

Differential Revision: D6096813

fbshipit-source-id: d65b7c0f42436138cbc49d8a5a62374fa5e927e1
2017-10-19 14:02:21 -07:00
Bangsheng Tang
7b30436201 remove Alias in SparseFeatureHash
Summary: remove Alias in SparseFeatureHash

Reviewed By: kennyhorror

Differential Revision: D6094663

fbshipit-source-id: f313aeb17bf6cfdacae62b2c1ad6b4175d0882dd
2017-10-19 13:24:20 -07:00
Hassan Eslami
8d8cebd6be Fixes the net-rewriting pipeline for model with rowwise adagrad
Summary: A model with rowwise RMSProp does not work in the net-rewriting pipeline (fbl 29841194). This diff solves the issue by changing the way the Slice op is used in the model and adding a rule to `parallelize.py` to cover the needed cases.

Reviewed By: azzolini

Differential Revision: D6096022

fbshipit-source-id: c4f615b2ba99da9f77a1d49c9fb898e0e59401f8
2017-10-18 20:05:37 -07:00
Junjie Bai
43b303bfc0 Expose Predictor::run_map to Python
Reviewed By: jerryzh168

Differential Revision: D6087316

fbshipit-source-id: d90e20429645391f17f0c56c8a8a60685097f801
2017-10-18 19:32:56 -07:00