Commit Graph

1520 Commits

Author SHA1 Message Date
Aarti Basant
fc56e86c7d Introduce init API for the optional Checkpoint Metadata Handler object
Summary:
Every call to the checkpoint_metadata_handler write() API requires passing all params such as db_prefix, db_type, etc.
This introduces an init API in the checkpoint_metadata_handler so that these params can be saved once and need not be passed in every API call.
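
A minimal sketch of the init-once pattern described above (the class and method names here are hypothetical, not the actual handler interface):

```
class CheckpointMetadataHandler(object):
    """Hypothetical sketch of the init-once pattern described above."""

    def init(self, db_prefix, db_type):
        # Save the common params once instead of threading them
        # through every write() call.
        self._db_prefix = db_prefix
        self._db_type = db_type

    def write(self, checkpoint_path, epoch):
        # db_prefix/db_type no longer need to be passed here.
        print("writing metadata for %s/%s (epoch %d) to %s db"
              % (self._db_prefix, checkpoint_path, epoch, self._db_type))

handler = CheckpointMetadataHandler()
handler.init(db_prefix="/checkpoints/job123", db_type="minidb")
handler.write("model.ckpt", epoch=5)
```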

Reviewed By: mraway, anshulverma

Differential Revision: D6792651

fbshipit-source-id: 059fa4309e8fce1ee5ab009af3e0570573c24245
2018-01-24 15:19:55 -08:00
Lukasz Wesolowski
29a4c942fe Add support for multi-device batch normalization through an option to data_parallel_model
Summary: Stage 3 in a stack of diffs for supporting multi-device batch normalization. Adds an input parameter to data_parallel_model to enable multi-device batch normalization. Depends on D6699258.

Reviewed By: pietern

Differential Revision: D6700387

fbshipit-source-id: 24ed62915483fa4da9b1760eec0c1ab9a64b94f8
2018-01-24 13:24:06 -08:00
Lukasz Wesolowski
9414072159 Add operators to support batch normalization across multiple devices on the same node
Summary: This is the first in a series of diffs to enable batch normalization across multiple devices on the same node with data parallel model. The diff contains the ops for computing the per-channel statistics required to obtain the mean and variance across multiple devices on the same node on the forward pass, and the gradient of the bias and scale during backpropagation. The actual modifications to SpatialBN and SpatialBNGradient to make use of these results will be in a separate diff.
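
As a rough illustration of the statistics involved (plain NumPy, not the actual operator code): each device contributes per-channel sums and sums of squares, which are combined into a single mean and variance across devices.

```
import numpy as np

# Per-device NCHW activations (2 devices, 4 channels).
xs = [np.random.randn(8, 4, 5, 5).astype(np.float32) for _ in range(2)]

# Each device reports per-channel sum, sum of squares, and element count.
sums   = [x.sum(axis=(0, 2, 3)) for x in xs]
sumsqs = [(x * x).sum(axis=(0, 2, 3)) for x in xs]
count  = sum(x.shape[0] * x.shape[2] * x.shape[3] for x in xs)

# Combine across devices to get the multi-device mean and variance.
mean = sum(sums) / count
var  = sum(sumsqs) / count - mean ** 2

# Matches computing the statistics over the concatenated batch.
ref = np.concatenate(xs, axis=0)
assert np.allclose(mean, ref.mean(axis=(0, 2, 3)), atol=1e-4)
assert np.allclose(var, ref.var(axis=(0, 2, 3)), atol=1e-4)
```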

Reviewed By: rbgirshick

Differential Revision: D6697336

fbshipit-source-id: 0de2750fe7e851795f238d9f625aeb4d74023dc2
2018-01-24 13:24:04 -08:00
Pieter Noordhuis
7a232aae49 Add random seed to NGramFromCategorical test
Summary: TSIA

Reviewed By: Yangqing, Maratyszcza, dzhulgakov

Differential Revision: D6797213

fbshipit-source-id: e1132229cda09d1fbde63686aaec81b995989c03
2018-01-24 13:05:28 -08:00
Xiaolong Wang
29c7c682d8 add NGramFromCategorical Op
Summary: as titled

Differential Revision: D6783763

fbshipit-source-id: 78280cf15c2cdc3c308562d3f27a81b61ef8d662
2018-01-23 15:08:25 -08:00
Xue Feng
0e9b0cf779 add error msg in fc input_record
Summary: as titled

Reviewed By: xianjiec

Differential Revision: D6787879

fbshipit-source-id: 4bbdd11455480b25fa18121fa4527a9f0a03addc
2018-01-23 14:48:15 -08:00
Anders Papitto
0aa1a6387e Add a seed to the gru unit test
Summary:
as it calls np.random and sometimes fails unreproducibly
Closes https://github.com/caffe2/caffe2/pull/1779

Reviewed By: pietern

Differential Revision: D6779802

Pulled By: anderspapitto

fbshipit-source-id: 2ad069f8a15f70a8110b1a6bdb06f81577c53ad4
2018-01-23 13:47:43 -08:00
Xianjie Chen
76a141f016 add error msg in get_key
Summary: as titled

Differential Revision: D6782896

fbshipit-source-id: bd29f6d085e56f51deb4bf6ad81771787fd85a5a
2018-01-23 11:04:05 -08:00
Dániel Simig
2dd79eb53a Visualize distribution of activation functions
Summary:
This is a first attempt at completing bootcamp task T24449916. This diff contains 3 major changes:
1) Changed LayerModelHelper to allow exposing the output and parameters of any layer to metrics
2) Added a runner that allows metrics to draw arbitrary plots to a matplotlib axes object
3) Implemented a metric that aggregates distributions of values in a blob over training, and tried this out in a notebook (a minimal sketch of the idea follows below)
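
A rough sketch of the kind of metric in (3), using plain NumPy and matplotlib; the metric interface and blob handling here are illustrative, not the actual LayerModelHelper/runner API:

```
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

class BlobHistogramMetric(object):
    """Illustrative metric: aggregate a histogram of a blob's values over training."""

    def __init__(self, bins=50, value_range=(-5.0, 5.0)):
        self.bins = bins
        self.range = value_range
        self.counts = np.zeros(bins, dtype=np.int64)

    def update(self, blob_values):
        hist, _ = np.histogram(blob_values, bins=self.bins, range=self.range)
        self.counts += hist

    def draw(self, ax):
        # Draw the aggregated distribution onto an arbitrary axes object.
        edges = np.linspace(self.range[0], self.range[1], self.bins + 1)
        ax.bar(edges[:-1], self.counts, width=np.diff(edges), align="edge")
        ax.set_xlabel("activation value")
        ax.set_ylabel("count")

metric = BlobHistogramMetric()
for _ in range(10):                      # stand-in for training iterations
    metric.update(np.tanh(np.random.randn(1024)))
fig, ax = plt.subplots()
metric.draw(ax)
```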

Reviewed By: kennyhorror

Differential Revision: D6671273

fbshipit-source-id: b8961837395e89c957edbf5c7c862bdb845ccf4b
2018-01-23 10:36:40 -08:00
Lin Yang
8e0177255e Test for PositionWeighted
Summary: Add a test for SparseLookup with PositionWeighted.

Reviewed By: kennyhorror

Differential Revision: D6771612

fbshipit-source-id: b4b3bfd514f366f579b4192643330ae73843d4f9
2018-01-22 19:20:46 -08:00
Viswanath Sivakumar
231d6f7b09 Add SqueezeOp in MKLDNN
Summary:
SqueezeOp support to drop dims of size 1. MKLMemory now supports Reshape()
if the buffer is in plain layout, in which case just the dims and layouts are
modified similar to caffe2::Tensor. SqueezeOp takes care of converting the
input to plain layout if needed via an intermediate buffer before calling
Reshape().

Differential Revision: D6735656

fbshipit-source-id: 953309498370e1b8986e8c593bc6963f38036255
2018-01-22 18:39:42 -08:00
Wei Zhang
1d4e996b87 Separate parameter downloading tasks from training tasks and run them in a different group
Summary:
At the end of distributed training, the trainer needs to download the parameters back from the parameter servers for saving the model. Currently, this parameter downloading happens at the end of the job's epoch task group, which creates several problems when checkpointing is enabled for distributed training:

1. When checkpointing is enabled, we run multiple training epochs. At the end of each epoch, the model download tasks will run to collect parameters, but we won't save the model until the true end of training, so this is a big waste of resources.
2. After trainer0 downloads the parameters, these parameters take a lot of memory, so trainer0 can easily run out of memory in the next epoch of training.

Our solution is to insert a parameter download task group between the job's training epoch_group and the job's exit_group.

Reviewed By: azzolini

Differential Revision: D6765393

fbshipit-source-id: 5a4f556fc3c1cd7834a7c406a3c0de3fccd50c49
2018-01-22 14:04:12 -08:00
Pieter Noordhuis
d618c05174 Increase lower bound for values in div test
Summary:
This should translate to a 1% error margin. The gradient checker uses a .5% threshold.
Closes https://github.com/caffe2/caffe2/pull/1766
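
A rough numerical illustration (NumPy) of why bounding the denominator away from zero tightens the numerical-gradient error for division; the step size and values here are illustrative, not the test's actual parameters:

```
import numpy as np

def rel_error_of_numeric_grad(b, eps=1e-2):
    # d(a/b)/db = -a / b^2; compare central difference against the analytic gradient.
    a = 1.0
    numeric = (a / (b + eps) - a / (b - eps)) / (2 * eps)
    analytic = -a / (b * b)
    return abs(numeric - analytic) / abs(analytic)

print(rel_error_of_numeric_grad(0.05))  # denominator near zero: ~4% relative error
print(rel_error_of_numeric_grad(0.5))   # larger lower bound: ~0.04%, well under a .5% threshold
```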

Differential Revision: D6774077

Pulled By: pietern

fbshipit-source-id: f97c7ffb2ef34fdd71d69320a7fdcf4a6a457715
2018-01-22 09:06:12 -08:00
Viswanath Sivakumar
b5d513b1f9 Add Add op in MKLDNN
Summary:
Just redirects to MKLSumOp. Doesn't support broadcast though since dnnSumCreate
expects identical dims.

Differential Revision: D6729788

fbshipit-source-id: 3e189465ad9d026bec4954648562ffe4e67fc393
2018-01-21 08:21:43 -08:00
James Cross
91066559a8 truthy check for empty string in NameScope()
Summary:
As in the title. The LATTE translation team, while moving some code from Python 2 to 3, uncovered a case where comparison between unicode and str types led NameScope('') to prepend a separator to the beginning of blob names. This fixes it.

Thank you so much to dzhulgakov for tracking down the cause of this so quickly!
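
An illustrative standalone version of the truthy check (not the actual core.NameScope implementation): testing the scope for truthiness is robust whether the empty name arrives as a str or a unicode object, so no stray separator gets prepended.

```
# Illustrative sketch; the separator constant mirrors Caffe2's '/' scope separator.
_NAMESCOPE_SEPARATOR = '/'

def scoped_name(scope, blob_name):
    # Truthiness treats '', u'' and None the same, so an empty unicode scope
    # coming from Python 2/3 mixed code cannot sneak past an equality check
    # and cause a leading separator.
    if scope:
        return scope + _NAMESCOPE_SEPARATOR + blob_name
    return blob_name

assert scoped_name('', 'fc_w') == 'fc_w'
assert scoped_name(u'', 'fc_w') == 'fc_w'
assert scoped_name('gpu_0', 'fc_w') == 'gpu_0/fc_w'
```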

Reviewed By: dzhulgakov

Differential Revision: D6766866

fbshipit-source-id: fbe46cff581f425ba10e8668400915ea40baab94
2018-01-19 21:34:09 -08:00
Ilia Cherniavskii
4ce4bc5c7f Fix occasional test timeouts
Summary: Make test less computationally expensive

Reviewed By: Yangqing, dzhulgakov

Differential Revision: D6766236

fbshipit-source-id: 59e51faa1331d804b11da9f7237ee9ce0cb27df8
2018-01-19 20:08:58 -08:00
Yangqing Jia
ced2c7e2b2 Remove Set/GetDefaultGPUID and move to use current gpu id instead.
Summary:
Reason for this change:

(1) Setting/getting the default gpu id doesn't seem to be used at all.
(2) It is actually confusing compared to the CUDA_VISIBLE_DEVICES option etc.
(3) When setting cuda_gpu_id=-1 in the CUDAContext arg, it used to use the
default gpu id, but we should probably use the current gpu - so that the caller
will be able to control the device placement.

One use case is for TensorRT - if we have a custom callback layer, then it would
be easier for TRT or whatever caller to set the running device.

Reviewed By: dzhulgakov

Differential Revision: D6740357

fbshipit-source-id: 2ea710e434b10220d5a198e31c93847304636863
2018-01-19 18:03:21 -08:00
Peter Goldsborough
cded9683ad Implement fused 8bit rowwise sparse lengths reductions
Summary: Building on D6710785 (float <-> fused_8bit_rowwise conversions) and D6710843 (`FusedEmbeddingLookup`), this diff implements the new reduction operations for the fused 8-bit rowwise storage. I mostly followed the [old 8-bit quantized code](diffusion/FBS/browse/master/fbcode/caffe2/caffe2/operators/lengths_reducer_rowwise_8bit_ops.h) and [full-precision code](diffusion/FBS/browse/master/fbcode/caffe2/caffe2/operators/lengths_reducer_ops.h).
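
As a rough NumPy sketch of the lengths-reducer semantics (illustrative, not the operator implementation): for each segment, gather `lengths[i]` rows by index and sum them. The fused 8-bit version performs the same gather-and-sum, first dequantizing each gathered row with its stored scale and bias (see the conversion diff below).

```
import numpy as np

def lengths_sum(table, indices, lengths):
    """Illustrative lengths-based sum: gather rows per segment and reduce."""
    out, offset = [], 0
    for n in lengths:
        out.append(table[indices[offset:offset + n]].sum(axis=0))
        offset += n
    return np.stack(out)

table = np.arange(20, dtype=np.float32).reshape(5, 4)   # 5 rows, dim 4
indices = np.array([0, 2, 4, 1])
lengths = np.array([3, 1])          # segment 0 sums rows {0, 2, 4}; segment 1 sums row {1}
print(lengths_sum(table, indices, lengths))
```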

Reviewed By: kennyhorror

Differential Revision: D6710844

fbshipit-source-id: b9e85db7437bd32dd44d01733c3749f35c00b06e
2018-01-19 15:44:35 -08:00
Peter Goldsborough
8dc0702af5 Add float32 <-> fused_rowwise_8bit conversion Caffe2 operators
Summary: This first diff adds the conversion operators that go from float to our fused 8bit rowwise quantized storage and back again. For now I've put the scale and bias in front of each row because it makes the pointer arithmetic nicer here and in the EmbeddingLookup perfkernel. If benchmarks or other reasons point out that this is a bad idea we can change it easily.
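
A rough NumPy sketch of the row layout described above (per-row scale and bias stored in front of the uint8 values); the exact byte layout and rounding are illustrative, not the operator code:

```
import numpy as np

def quantize_row(row):
    # Per-row affine quantization: value ~= uint8 * scale + bias.
    bias = row.min()
    scale = (row.max() - bias) / 255.0
    if scale == 0.0:
        scale = 1.0
    quantized = np.round((row - bias) / scale).astype(np.uint8)
    # Fused layout: scale and bias packed in front of the quantized bytes.
    header = np.array([scale, bias], dtype=np.float32).view(np.uint8)
    return np.concatenate([header, quantized])

def dequantize_row(fused):
    scale, bias = fused[:8].view(np.float32)
    return fused[8:].astype(np.float32) * scale + bias

row = np.random.randn(16).astype(np.float32)
roundtrip = dequantize_row(quantize_row(row))
assert np.abs(roundtrip - row).max() <= (row.max() - row.min()) / 255.0 + 1e-6
```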

Reviewed By: kennyhorror

Differential Revision: D6710785

fbshipit-source-id: 086ab91c12d3b472564a06eff6329be6cb9e680e
2018-01-19 15:44:33 -08:00
Heng Wang
c052eb6bbb update the video input op in caffe2
Summary:
This brings the video input op in caffe2 up to date.
It adds support for:
1. optical flow and early fusion
2. different ways of sampling clips from a video
3. different ways of resizing the input video

Reviewed By: dutran

Differential Revision: D6752788

fbshipit-source-id: 0cbd4d4bbbe97b0ada4cba7a55adc91a7af60d5f
2018-01-19 09:52:25 -08:00
Lin Yang
4ea6e6a556 testSparseLookup
Summary: add basic test for SparseLookup

Reviewed By: kennyhorror

Differential Revision: D6749915

fbshipit-source-id: f97af785e4f89f36788a992843066fd1ec2b75a9
2018-01-19 09:27:20 -08:00
Orion Reblitz-Richardson
b28d5a3586 Build doxygen docs with cmake and fix catalog generation
Summary:
This updates https://github.com/caffe2/caffe2/pull/1096/ to build doxygen docs with cmake and fixes operator catalog generation. See the new README.md for details, but you can run

```
mkdir build && cd build
cmake -DBUILD_DOCS=ON .. && make
```
and

```
python caffe2/python/docs/github.py ~/c2docs/_docs/operators-catalogue.md
```

to generate docs.

There was one weird issue in `generator.py` that we sometimes receive tuples and sometimes objects. I handled this just by testing `isinstance`, but we might want to be more principled in the future.
Closes https://github.com/caffe2/caffe2/pull/1758

Reviewed By: pietern

Differential Revision: D6752127

Pulled By: orionr

fbshipit-source-id: 9ba9ad8efc920b27a57327f8a7d3050f3650d4ce
2018-01-18 18:47:59 -08:00
Anders Papitto
e3e6680b48 Add ElmanCell and ElmanRNN
Summary: Closes https://github.com/caffe2/caffe2/pull/1742

Reviewed By: dzhulgakov

Differential Revision: D6706809

Pulled By: anderspapitto

fbshipit-source-id: 15a05786a26aeb719ea4377f4dbbb62738d9e697
2018-01-18 12:14:02 -08:00
Anirban Roychowdhury
158e001238 Checking for positive epoch size before running epoch
Summary: Checking for positive epoch size before running epoch

Reviewed By: pietern

Differential Revision: D6738966

fbshipit-source-id: 64e1fb461d784786b20a316999e4c037787f3a14
2018-01-18 11:48:35 -08:00
Frank Jiang
6f0bb28afb Stop running RowWiseSparseAdam test on GPU
Reviewed By: pietern

Differential Revision: D6739194

fbshipit-source-id: 0892cdc6a575a84147f86984c67e7b4bf605a197
2018-01-17 15:05:21 -08:00
Frank Jiang
61356cbadc RowWiseSparseAdam operator
Summary: Added RowWise functionality for SparseAdam, which saves roughly 2/3 of the memory usage by keeping only one first and second moment term for each row of the parameter tensor, rather than one for each individual parameter.
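
Rough arithmetic behind the memory saving, following the summary's description (one first and one second moment per row); the sizes are illustrative:

```
rows, dim = 10**6, 64
full_state    = 2 * rows * dim   # one first and one second moment per element
rowwise_state = 2 * rows         # one first and one second moment per row
total_full    = rows * dim + full_state      # params + optimizer state
total_rowwise = rows * dim + rowwise_state
print(1.0 - float(total_rowwise) / total_full)   # ~0.66, i.e. roughly 2/3 saved
```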

Differential Revision: D6679342

fbshipit-source-id: ce6fb27e35ce41a890c66f6089cd2748d10e7a44
2018-01-16 19:39:31 -08:00
Leon Masopust
81898e5d47 Fix for wrong newline in caffe_translator.py (Crop layer translation)
Summary:
- fixed the wrong newline in the initialization of the crop layer translation, which caused the exceptions described in issue #1215
Closes https://github.com/caffe2/caffe2/pull/1746

Differential Revision: D6716228

Pulled By: Yangqing

fbshipit-source-id: dd93b06b3b903f96505d6e6f8e67caeb6981fe66
2018-01-12 16:17:53 -08:00
Anders Papitto
db6777eaf4 fix gru_cell bug
Summary:
the fc needs to be in the output_gate_t scope so it can find its input
weights correctly
Closes https://github.com/caffe2/caffe2/pull/1739

Reviewed By: dzhulgakov

Differential Revision: D6705443

Pulled By: anderspapitto

fbshipit-source-id: 139e83ac77589a203ffe404fedab98eea5b1a51c
2018-01-12 15:34:23 -08:00
Viswanath Sivakumar
b2964a92d9 Add MKLConcatOp
Summary:
MKLConcatOp along the channel dim of NCHW tensors. Spec:
https://software.intel.com/en-us/mkl-developer-reference-c-dnnconcatcreate

Reviewed By: ajtulloch

Differential Revision: D6689716

fbshipit-source-id: 492bc440474f8ce37caa85509789496659b03e79
2018-01-11 14:19:22 -08:00
Xue Feng
dda33ca53a enable setting model initialization seed
Summary: This diff enables setting the model initialization seed, instead of a random seed, when reproducible results are desired.

Reviewed By: xianjiec

Differential Revision: D6642971

fbshipit-source-id: 387b1ee2ecef4f8f66570c882498fb97d7007e17
2018-01-11 14:04:03 -08:00
Aarti Basant
33d734fcf1 Generalize construction of db_name in checkpoint manager
Summary:
Instead of constructing db_name as a member of checkpoint_manager, generalize
this function

Reviewed By: anshulverma

Differential Revision: D6671088

fbshipit-source-id: c528538def66933619f2fdf67820bca5d13571ea
2018-01-10 11:49:17 -08:00
Di Yu
cd3e90c16f Fix failed test due to D6665466
Summary: The test in Jenkins fails because test_global_pooling_3d filtered out too many examples. We made use of the inferred value of global_pooling (pad and stride will be constant) to reduce the number of test samples generated.

Reviewed By: pietern

Differential Revision: D6686840

fbshipit-source-id: d316c0e9f9070b12770170ab9f36e33de68a9ab9
2018-01-09 16:40:35 -08:00
Di Yu
82198831e7 Fix pool op custom path issue 2, wrongful routing to global pooling
Summary:
In D5681122, the condition used when routing to global max pool and average pool is not correct.
See T24876217 for discussion.

Reviewed By: Yangqing

Differential Revision: D6665466

fbshipit-source-id: dcb5b4686249e6ee8e1e976ab66b003ef09b32fd
2018-01-09 00:54:45 -08:00
Anders Papitto
12309f4aa6 GRU cell: add linear_before_reset boolean parameter
Summary:
This matches the semantics of cudnn (and others, like pytorch)
Closes https://github.com/caffe2/caffe2/pull/1695
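
A small NumPy sketch of the difference, following the usual cuDNN/PyTorch convention (weight names are illustrative): with linear_before_reset, the hidden-to-hidden linear transform for the candidate state is applied before the reset gate multiplies it.

```
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_candidate(x, h, Wn_x, Wn_h, bn_x, bn_h, r, linear_before_reset):
    if linear_before_reset:
        # n = tanh(Wx.x + bx + r * (Wh.h + bh))   -- cuDNN-style
        return np.tanh(Wn_x @ x + bn_x + r * (Wn_h @ h + bn_h))
    else:
        # n = tanh(Wx.x + bx + Wh.(r * h) + bh)   -- "classic" GRU
        return np.tanh(Wn_x @ x + bn_x + Wn_h @ (r * h) + bn_h)

d = 4
x, h = np.random.randn(d), np.random.randn(d)
Wn_x, Wn_h = np.random.randn(d, d), np.random.randn(d, d)
bn_x, bn_h = np.random.randn(d), np.random.randn(d)
r = sigmoid(np.random.randn(d))      # reset gate (normally computed from x and h)
print(gru_candidate(x, h, Wn_x, Wn_h, bn_x, bn_h, r, True))
print(gru_candidate(x, h, Wn_x, Wn_h, bn_x, bn_h, r, False))
```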

Reviewed By: dzhulgakov

Differential Revision: D6658208

Pulled By: anderspapitto

fbshipit-source-id: 00e1716fba47b0ac296d1e9e0131165f4997ac7d
2018-01-08 13:22:56 -08:00
Yan Shang
41bb662d96 add dense regularization
Reviewed By: xianjiec

Differential Revision: D5617571

fbshipit-source-id: 875d7c8753bdb3b6847d5e3f47ad8568cdf172f8
2018-01-08 13:03:17 -08:00
Viswanath Sivakumar
073312eade Updates to MKL conversion script
Summary: Handling some special cases.

Reviewed By: ajtulloch

Differential Revision: D6647011

fbshipit-source-id: 6a434442da5e0a63d355242cb8df9418885c6fb4
2018-01-08 12:25:23 -08:00
Manoj Krishnan
6d32e36682 Caffe2 Operator: GPU implementation of Swish Activation
Summary: GPU (CUDA) implementation of the Swish activation function in Caffe2.
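
For reference, Swish is f(x) = x * sigmoid(x); a one-line NumPy sketch:

```
import numpy as np

def swish(x):
    # Swish activation: x * sigmoid(x) == x / (1 + exp(-x))
    return x / (1.0 + np.exp(-x))

print(swish(np.array([-2.0, 0.0, 2.0])))
```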

Reviewed By: Yangqing, xianjiec

Differential Revision: D6656907

fbshipit-source-id: f5f2c667055abf679728d2b5d43998895ddec708
2018-01-05 12:04:25 -08:00
Yangqing Jia
3725d8ea97 Disable the python op test numba import in asan
Summary:
Some installations of numba seem to be incompatible with ASAN, so we
will disable its import.

Reviewed By: dzhulgakov

Differential Revision: D6664055

fbshipit-source-id: 311774667e54bdbf328ef280ab2a52ecba1361f2
2018-01-04 17:49:21 -08:00
Alexander Sidorov
64b0039ef9 rnn_cell_test: make it deterministic and speed it up
Summary:
In this PR I do the following:

1. split lstm_test_main into several tests for LSTM, MiLSTM and various Norm-based versions
2. instead of looping over various gradient / optimization parameters, they are now drawn as random inputs through hypothesis.
3. These changes make the tests faster and let us avoid limiting the number of examples
4. Fix a minor bug with the gradient checker in the RNN unroll test running twice
5. Generate a seed for numpy in hypothesis. This keeps hypothesis from producing flaky tests (a minimal sketch of the pattern follows below)

Also note that the Norm tests sometimes fail. I haven't looked into it much; it could just be precision issues. The new test split should help identify these issues.
Closes https://github.com/caffe2/caffe2/pull/1678
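
A minimal sketch of the seeding pattern from point 5 (the test body is illustrative, not the actual rnn_cell_test): draw the numpy seed through hypothesis so a failing input is reported and replayed reproducibly.

```
import numpy as np
from hypothesis import given
import hypothesis.strategies as st

@given(seed=st.integers(0, 2**31 - 1), t=st.integers(1, 8))
def test_rnn_shapes(seed, t):
    # Seeding numpy from a hypothesis-drawn value keeps the random inputs
    # reproducible: hypothesis reports the seed for any failing example.
    np.random.seed(seed)
    inputs = np.random.randn(t, 4, 8).astype(np.float32)
    assert inputs.shape == (t, 4, 8)

test_rnn_shapes()
```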

Reviewed By: pietern

Differential Revision: D6657076

Pulled By: salexspb

fbshipit-source-id: 9f59c71ccd2c818156e9d2424c3423d450b8c8e2
2018-01-04 15:00:42 -08:00
Lei Tian
3329f36f1a Move load_save_test.py from caffe2/python/ to caffe2/python/operator_test/
Summary: Move load_save_test.py from caffe2/python to caffe2/python/operator_test/

Reviewed By: boryiingsu

Differential Revision: D6657724

fbshipit-source-id: 030942316444ec93c3bc2970902d7b3980e60cfc
2018-01-03 17:42:55 -08:00
Pieter Noordhuis
9835ca9bac Ensure indices list in sparse optimizer tests is unique
Summary:
There were no dimensionality constraints on the generated indices
array, causing many examples to be generated and then filtered out. Instead,
we should ensure the probability of unique indices is high.

There is a better fix for this by using the `unique` keyword argument
to `hypothesis.extra.numpy.arrays`, but this is available only in
hypothesis version 3.28.0 and later.

This is related to #1536 and #1599.

Once this change has proven to be OK, we can modify the other tests
that now have health check suppression enabled as well.
Closes https://github.com/caffe2/caffe2/pull/1686
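
A small sketch of the two approaches mentioned above (illustrative strategies, not the actual test code): either draw indices from a range large enough that duplicates are unlikely, or, with hypothesis >= 3.28.0, request uniqueness directly.

```
import numpy as np
import hypothesis.strategies as st
from hypothesis.extra.numpy import arrays

# Draw indices from a range much larger than the array, so the chance of
# duplicates (and of the example being filtered out) stays small.
likely_unique = arrays(np.int64, 10, elements=st.integers(0, 10**6))

# With hypothesis >= 3.28.0, uniqueness can be requested directly.
strictly_unique = arrays(np.int64, 10, elements=st.integers(0, 10**6), unique=True)

print(likely_unique.example())
print(strictly_unique.example())
```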

Reviewed By: Yangqing

Differential Revision: D6651789

Pulled By: pietern

fbshipit-source-id: d80886c9ccf0a7a842a7580a279f33a2d6cca97c
2018-01-03 12:19:14 -08:00
Lei Tian
56508566a1 Enhance Caffe2 Load op to support loading blobs from multiple files.
Summary: The current Load op can only load blobs from one file. We need to make the Load op support loading blobs from a list of dbs.

Reviewed By: boryiingsu

Differential Revision: D6596034

fbshipit-source-id: 906fa48b0ad61c83e247d497b6b079c04fed499f
2018-01-02 18:02:19 -08:00
Yangqing Jia
77484ecc45 Manually applying cudnn5 pull request.
Summary: TSIA. Closes #1631

Reviewed By: pietern, Maratyszcza

Differential Revision: D6626887

fbshipit-source-id: 1a2dc7c47bc6ce794fdf598fbd547c04029edce4
2018-01-02 15:31:33 -08:00
Tiangao Gou
bc50510016 use numerically stable version of BatchLRLoss
Summary: change all use cases of BatchLRLoss to the numerically stable version. This includes the uses of the function build_loss defined in fbcode/caffe2/caffe2/fb/dper/layer_models/loss.py and the class BatchLRLoss defined in fbcode/caffe2/caffe2/python/layers/batch_lr_loss.py.
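
For reference, the standard numerically stable form of the sigmoid cross-entropy behind such a loss (plain NumPy, illustrative): instead of computing log(sigmoid(z)) directly, which blows up for large |z|, use the max-based rewrite max(z, 0) - z*y + log(1 + exp(-|z|)).

```
import numpy as np

def naive_lr_loss(z, y):
    p = 1.0 / (1.0 + np.exp(-z))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def stable_lr_loss(z, y):
    # max(z, 0) - z*y + log(1 + exp(-|z|)) equals -y*log(sigmoid(z)) - (1-y)*log(1-sigmoid(z))
    return np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))

z = np.array([-50.0, -1.0, 0.0, 1.0, 50.0])
y = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
print(naive_lr_loss(z, y))    # log(0) at z=50, y=0: inf (with a runtime warning)
print(stable_lr_loss(z, y))   # finite everywhere
```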

Reviewed By: xianjiec

Differential Revision: D6643074

fbshipit-source-id: b5678556b03cbdd380cab8a875974a87c33d7f12
2018-01-02 13:18:36 -08:00
Kutta Srinivasan
bb04034bf7 Adding a time limit reader
Summary: Add a ReaderWithTimeLimit() class that stops reading after a certain amount of time

Reviewed By: boryiingsu

Differential Revision: D6477623

fbshipit-source-id: 165874c9344b0c9c7e0b33e12e72e24c46669cb2
2018-01-02 11:33:53 -08:00
Viswanath Sivakumar
4cf13cf417 Fix crash due to copying empty tensors into MKLMemory
Summary:
Ran into a scenario where if the CPU op in MKLFallbackOp outputs an empty
tensor, attempting to copy the output to MKLMemory (https://fburl.com/www2mtt4)
crashes. Modify MKLMemory to gracefully handle this. This is done at the
MKLMemory level because we want to make sure that its members such as dims and
layout are Reset() correctly.

Interestingly, MKL calls fail at different points for dims {0} and dims {0,N} despite
the buffer size being empty for both - the former in dnnAllocateBuffer and
the latter in dnnConversionExecute (likely due to some difference in
layout?).

Also fixed CopyTo in addition to CopyFrom and tested all scenarios.

Reviewed By: ajtulloch

Differential Revision: D6646320

fbshipit-source-id: 61df585f610a949f312f05308baf310241dc9cb2
2017-12-30 15:36:48 -08:00
Ves Stoyanov
1a0eefd5fc Parallelize batcher
Summary: Still WIP, but works for the universal encoder. The other ones are currently broken.

Differential Revision: D6492786

fbshipit-source-id: 232e0058eb3a0c036de3adf0295db5efd624cca7
2017-12-22 20:23:26 -08:00
Ilia Cherniavskii
5d3fc364aa Fix OSS build
Summary: Add missing .cc file into CMakeLists for pybind

Reviewed By: pjh5, houseroad

Differential Revision: D6625894

fbshipit-source-id: 900f10bf7d9abd1e2a1b8cdf56f098664a575889
2017-12-21 19:04:25 -08:00
Ilia Cherniavskii
a7ac591d3b Support for DLPack in Python op
Summary: Adding support for DLPack tensors to Python op

Reviewed By: Yangqing

Differential Revision: D6577702

fbshipit-source-id: e14ef213fcdb2930ffe164667971a92aa8db503c
2017-12-21 17:02:16 -08:00
Aarti Basant
8af9f0da99 Saving checkpoint failure should not cause job failure
Summary:
If we encounter failures while writing a checkpoint, ensure that the job does
not fail.
A job can still make progress even if writing a checkpoint fails.
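
A minimal illustration of the intended behavior (hypothetical function names, not the actual checkpoint manager code): the save is wrapped so that a failure is logged and the job continues.

```
import logging

def save_checkpoint_best_effort(save_fn, epoch):
    """Hypothetical sketch: try to write a checkpoint, but never let a
    failure propagate and kill the training job."""
    try:
        save_fn(epoch)
        return True
    except Exception:
        logging.exception("Writing checkpoint for epoch %d failed; continuing", epoch)
        return False

def flaky_save(epoch):
    raise IOError("checkpoint db unavailable")

assert save_checkpoint_best_effort(flaky_save, epoch=3) is False
```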

Reviewed By: anshulverma, boryiingsu

Differential Revision: D6615163

fbshipit-source-id: 01f790422e1a81bab1fe73f86750eaf75a72bb77
2017-12-21 10:32:55 -08:00