Summary:
Replaced std::copysign(x) with (x > 0 ? 1 : -1).
std::copysign is not available on some Android platforms, which was detected by the Travis CI builds on GitHub:
"/home/travis/build/caffe2/caffe2/caffe2/sgd/yellowfin_op.cc:57:23: error: 'copysign' is not a member of 'std'"
Reviewed By: akyrola
Differential Revision: D5756384
fbshipit-source-id: 56bc220d2c6216ff45b9cc47ed02aebf6ad439a5
Summary: Disabling a YellowFin test that does not pass in Travis. The difference comes from numerical precision; the test passes on my CPU / math libraries. Decide whether to merge it.
Reviewed By: Yangqing
Differential Revision: D5754144
fbshipit-source-id: b6ed6628f962d6904a8d522f0cf4080d7878acad
Summary: Make a CUDA version of SparseToDense and register EnsureDense (which is trivial) on CUDA. We need to use atomics because indices can be duplicated. We can later add an option to indicate that the indices are unique and take a faster path in that case.
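A rough numpy sketch of the operator's semantics (not the CUDA kernel itself): values at duplicated indices accumulate into the same output row, which is why atomic adds are required on GPU. Names and shapes are illustrative.

  import numpy as np

  indices = np.array([0, 2, 2, 5])                 # note the duplicated index 2
  values = np.array([[1., 1.], [2., 2.], [3., 3.], [4., 4.]])
  dense = np.zeros((6, 2))
  np.add.at(dense, indices, values)                # unbuffered scatter-add, like atomics on GPU
  # dense[2] == [5., 5.] because both contributions to index 2 are summed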
Reviewed By: jhcross
Differential Revision: D5750893
fbshipit-source-id: 005d1675b127a571aac8474fca62d9633f0c7bff
Summary:
Implementation of a new variant of the attention module, which contains a recurrent decoder state with a vector for each source-side word and strictly increasing values, thus enabling it to model the degree to which source words have been translated.
The approach is a variant of the approaches described in https://arxiv.org/pdf/1601.04811.pdf. We simply include the sum of all previous attention weights for encoder words as a new recurrent state (coverage_t). A new linear transform on encoder_outputs is used to produce coverage_weights, which has the same dimensionality as encoder_outputs and implicitly models the fertility of source-side words (putting this extra informational burden on the encoder network).
Thus the encoder output, the decoder state, and the coverage weights have the same dimensionality for a given source word, and attention logits are calculated as v * tanh(coverage * coverage_weights + encoder_output + decoder_state).
Note: the entire coverage state for each translation instance is of shape (encoder_length, coverage_units), but the states for the RecurrentNetwork operator, used to train the decoder, must be flat in the data dimension. This state is therefore initialized with shape (encoder_length * coverage_units) [not shown in the open-source library] and reshaped appropriately within the apply_soft_coverage_attention() function.
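A minimal numpy sketch of one decoder step, assuming a single coverage unit per source word; all names and dimensions below are illustrative, not the library's API.

  import numpy as np

  enc_len, d = 7, 16
  encoder_outputs = np.random.randn(enc_len, d)
  coverage_weights = np.random.randn(enc_len, d)   # stands in for the linear transform of encoder_outputs
  decoder_state = np.random.randn(d)
  coverage = np.zeros(enc_len)                     # running sum of past attention weights
  v = np.random.randn(d)

  logits = np.tanh(coverage[:, None] * coverage_weights
                   + encoder_outputs
                   + decoder_state) @ v
  attention = np.exp(logits - logits.max())
  attention /= attention.sum()
  coverage += attention                            # coverage is non-decreasing from step to step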
Differential Revision: D5593617
fbshipit-source-id: 7d0522b5eb0b26f22e8429e4461a459f2f16ed46
Summary: basic little op benchmark generator -- outputs init_net.pb and predict_net.pb for use with speed_benchmark or mobile_speed_benchmark
Reviewed By: Maratyszcza
Differential Revision: D5728534
fbshipit-source-id: 3e912fa63548497ca65ab34c8bb967694c46815b
Summary: Adding support for using kernels, strides, pads, etc. as arguments.
Reviewed By: houseroad
Differential Revision: D5710699
fbshipit-source-id: 8b63af4c4a76cd06b637a376aeb29a34c659be2e
Summary: This will allow reading data in small batches and concatenating the batches later on.
Reviewed By: kennyhorror
Differential Revision: D5739129
fbshipit-source-id: 66a8087e5f9d10d654e367c6111ac90cbf54224e
Summary:
Added YellowFin optimizer to Caffe2.
This implementation differs from the original: it has a separate alpha and mu for each parameter, and it uses a different version of momentum SGD.
Tests / benchmarks for the optimizer are still to be done, and some refactoring is needed before pushing, but this is a working version.
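For orientation, a minimal sketch of a per-parameter momentum SGD step (generic momentum, not the exact variant used here, and without YellowFin's automatic tuning of alpha and mu):

  import numpy as np

  def momentum_sgd_step(w, grad, v, alpha, mu):
      v_new = mu * v - alpha * grad                # per-parameter alpha and mu
      return w + v_new, v_new

  params = {"fc_w": np.random.randn(4, 4), "fc_b": np.zeros(4)}
  state = {k: {"v": np.zeros_like(p), "alpha": 0.01, "mu": 0.9} for k, p in params.items()}
  grads = {k: np.random.randn(*p.shape) for k, p in params.items()}

  for k, p in params.items():
      s = state[k]
      params[k], s["v"] = momentum_sgd_step(p, grads[k], s["v"], s["alpha"], s["mu"])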
Reviewed By: akyrola
Differential Revision: D5652689
fbshipit-source-id: c10dc0424f47c3051b454aede1d121902cb759a8
Summary:
1) Adds monitoring of CPU utilization in trainers and PS's, and reports the utilization to global statistics
2) Adds the plan execution time to global stats
3) Uses the CPU utilization and network utilization observed in the performance estimation job to calculate the optimal number of parameter servers for the actual job. The optimal number of parameter servers is the minimum number such that the parameter servers are not the bottleneck during execution.
//Note: The calculation assumes that parameter shards are assigned to PS's uniformly and that accesses to the shards follow a uniform access pattern. In reality, the shards' access pattern may be skewed. As a next step, we should monitor the shard access pattern in the performance estimation job and distribute the shards optimally.//
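A hedged, purely illustrative arithmetic sketch of the sizing rule in (3): scale the single estimation PS's observed utilization up to the real trainer count and pick the smallest PS count that keeps the busiest resource below 100%. All numbers are made up.

  est_ps_cpu = 0.90            # CPU utilization of the single estimation-job PS
  est_ps_net = 0.70            # network utilization of the single estimation-job PS
  num_trainers = 16
  num_estimation_trainers = 1

  scale = num_trainers / float(num_estimation_trainers)   # PS load grows with trainer count
  num_ps = 1
  while max(est_ps_cpu, est_ps_net) * scale / num_ps > 1.0:
      num_ps += 1
  # num_ps is the minimum server count at which the PS tier is not the bottleneck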
Reviewed By: sf-wind
Differential Revision: D5674398
fbshipit-source-id: 67a07cb9ed4e4d61ff5e81a0ecfe519b8feb2352
Summary:
Currently the loss ops do not run on GPU even when the ALL strategy is selected.
This diff enables that.
Reviewed By: xianjiec
Differential Revision: D5671255
fbshipit-source-id: 033863f171e1f89c8d75430d3af6a1e6d0d2eff2
Summary:
This diff adds control flow operators in Caffe2 (starting with If, While):
- Added If operator that executes then/else subnet
- Branch subnet is executed in a separate isolated workspace, with some of the blobs transparently forwarded from the outer workspace
- Adding a new NetBuilder subclass to construct nets using the new operators
- NetBuilder also keeps track of outer blob names and automatically sets blob bindings between the outer and inner workspace, implementing a generic convention for handling local/global variables in blocks (see the sketch below)
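A plain-Python sketch of the execution semantics described above (not the Caffe2 API): the chosen branch runs against an isolated inner workspace, with selected outer blobs forwarded in and bound back out afterwards. Names are illustrative.

  def run_if(cond, then_branch, else_branch, outer_workspace, forwarded):
      inner = {name: outer_workspace[name] for name in forwarded}   # transparent forwarding
      branch = then_branch if cond else else_branch
      branch(inner)                                                  # execute the subnet
      for name in forwarded:
          outer_workspace[name] = inner[name]                        # bind results back out

  ws = {"x": 0.0}
  run_if(True,
         then_branch=lambda w: w.update(x=1.0),
         else_branch=lambda w: w.update(x=-1.0),
         outer_workspace=ws, forwarded=["x"])
  # ws["x"] == 1.0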
Reviewed By: volkhin
Differential Revision: D5720644
fbshipit-source-id: a674cde0c789f6a6ffdcd9d80159d1e42e49133f
Summary: While there is currently support for scaling the base learning rate when loading a model, there is no support for scaling the base learning rate during training. This is needed for LATTE's seq2seq translation models, as the learning schedule is not predefined and is modified at runtime.
Reviewed By: jhcross
Differential Revision: D5701391
fbshipit-source-id: ae3bec45f238db1a2be7af9c04d720067e9095d5
Summary: When we ported memonger to C++ in D5544219, we forgot to include the special handling of RecurrentNetwork ops. This fixes that and adds a test.
Reviewed By: asaadaldien
Differential Revision: D5692407
fbshipit-source-id: 4e739b5dd6c7298303eee9bfa1aa4d19359eb7b5
Summary:
Before this diff, we were not respecting in-place blobs. E.g. if we had:

  with DeviceOption(CPU):
      blob = net.MyOpA([])
  with DeviceOption(CUDA):
      net.MyOpB([blob], [blob])

then after InjectCrossDeviceCopies we would get:

  blob = net.MyOpA([], device=CPU)
  blob_cuda0 = net.Copy([blob], [blob_cuda0], device=CUDA)
  net.MyOpB([blob_cuda0], [blob], device=CUDA)

i.e. MyOpB's in-place relationship between input and output was lost. After this diff, we keep the in-place blob.
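For illustration, the expected rewrite after this change would keep MyOpB in-place on the CUDA blob, along these lines (a sketch, not verbatim output):

  blob = net.MyOpA([], device=CPU)
  blob_cuda0 = net.Copy([blob], [blob_cuda0], device=CUDA)
  net.MyOpB([blob_cuda0], [blob_cuda0], device=CUDA)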
Reviewed By: harouwu
Differential Revision: D5671867
fbshipit-source-id: 6ad68c612dae19d7e1f45f4988d929644100b4d5
Summary:
This diff adds control flow operators in Caffe2 (starting with If, While):
- Added If operator that executes then/else subnet
- Branch subnet is executed in a separate isolated workspace, with some of the
blobs transparently forwarded from the outer workspace
- Adding a new NetBuilder subclass to construct nets using the new operators
- NetBuilder also keeps track of outer blob names and automatically sets
blob bindings between the outer and inner workspace, implementing a generic
convention for handling local/global variables in blocks
Reviewed By: azzolini
Differential Revision: D5641588
fbshipit-source-id: f9e04429961c3da7da4ebca3e8163bfcc2a09ec9
Summary:
_LSTM helper is a legacy piece we had before all the RNNCell awesomeness landed. Now we need to pull it apart and create separate building blocks that people can use for any RNNs.
Please note the changes to a test with double scoping. That should go away once we change the RNNCell scoping logic so that each cell adds its own name to the scope for all of its outputs (see another diff: D5613139).
Reviewed By: jhcross
Differential Revision: D5632276
fbshipit-source-id: 1cb568ab995c4c0b3dd1b4bad2d028e34bded9c1
Summary: These were missing and are required for some seq2seq models. Unit tested. The previous implementation of ReduceBackMean shape inference was incorrect, so it was removed.
Reviewed By: asaadaldien
Differential Revision: D5691262
fbshipit-source-id: 76f868b298440f988635966a410f0232301ca6c4
Summary:
Split the first dimension of a tensor into two parts, the first of which has a fixed size given as an argument.
This is then used to split a batch into smaller batches and distribute them across workers.
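A small numpy sketch of the splitting behaviour (names and sizes are illustrative): the first chunk's size is fixed by the argument and the second chunk gets the remainder.

  import numpy as np

  batch = np.arange(20).reshape(10, 2)   # first dimension of size 10
  first_size = 4
  head, tail = batch[:first_size], batch[first_size:]
  # head.shape == (4, 2), tail.shape == (6, 2)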
Reviewed By: harouwu
Differential Revision: D5702175
fbshipit-source-id: 02bb93e49bf9db411b516e149c8e647301dd2ca5
Summary: This test was failing on non-GPU builds because it refers to operator CopyGPUToCPU. Thanks pietern for catching this.
Reviewed By: asaadaldien
Differential Revision: D5698763
fbshipit-source-id: 0bde0f3e99c58647dba2ea6da4d51938e763d10c
Summary: Moved code for global norm-based gradient clipping from fb-specific workflows (seq2seq) to the open-source Caffe2 optimizer library.
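For reference, a minimal numpy sketch of global norm-based clipping (illustrative only, not the optimizer library's interface): if the L2 norm of all gradients taken together exceeds the threshold, every gradient is scaled by the same factor.

  import numpy as np

  def clip_by_global_norm(grads, clip_norm):
      global_norm = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
      if global_norm > clip_norm:
          grads = [g * (clip_norm / global_norm) for g in grads]
      return grads

  grads = [np.random.randn(3, 3), np.random.randn(5)]
  grads = clip_by_global_norm(grads, clip_norm=1.0)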
Reviewed By: jhcross
Differential Revision: D5637453
fbshipit-source-id: 7e73c9a1c97c28a152c188467b27a6449f79242e
Summary: Currently, it's not easy to track down which tensor is missing type and shape info. Print it out for easier debugging.
Reviewed By: volkhin, xianjiec
Differential Revision: D5695223
fbshipit-source-id: 7f0be0be777a35bb5a71b3799b29b91f0763c159
Summary:
Today, the PS's weirdly store the entire embedding and not just their
subsection of it. This was simply an oversight on the part of the original
author, and this diff fixes that.
1. The sparse params are sharded across the PS's, and each PS stores only its
section of the embedding. The trainer requests the ids as-is from the PS, but
the PS divides the id by num_of_shards before looking it up in the embedding
table blob. This happens on both the backward and the forward pass. During the
model download step, however, the PS multiplies the embeddings by num_of_shards
before returning them to the trainer. The upshot is that the trainer does not
know anything about how the embeddings are scaled on the PS; the PS adds the
extra divide and multiply steps to achieve that.
2. During estimation time, we allocate just one PS. So in order to make all of
the embeddings fit on that single PS, we additionally scale the hash table
sizes (proportionally and equally for all the sparse params) so that they fit.
This scaling is handled analogously to (1).
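A hedged arithmetic sketch of the id handling in (1), with made-up numbers and assuming rows are routed to shard (id % num_of_shards); only the divide-on-lookup relationship is shown.

  num_of_shards = 4
  ids_seen_by_shard_2 = [2, 6, 10, 14]                              # all have id % 4 == 2
  local_rows = [i // num_of_shards for i in ids_seen_by_shard_2]    # [0, 1, 2, 3]
  # Lookups and gradient updates use local_rows, so each PS only allocates
  # roughly table_size / num_of_shards rows; the inverse multiply by
  # num_of_shards is applied when embeddings are returned at model download.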
Reviewed By: boryiingsu
Differential Revision: D5664093
fbshipit-source-id: 92f501f61566f939c41ce0b614a1b499669f978a
Summary:
This adds a fast path for global max pooling with NCHW. Compared to equivalent ReduceBackMean, this is about 3.5x faster.
Based on D5533059.
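A quick numpy sketch of what global max pooling computes for an NCHW tensor (one max per sample and channel over the whole spatial extent); purely illustrative.

  import numpy as np

  x = np.random.randn(2, 3, 8, 8)       # N, C, H, W
  global_max = x.max(axis=(2, 3))       # shape (2, 3)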
Reviewed By: akyrola
Differential Revision: D5681122
fbshipit-source-id: 7a4df934044c7dd01888f095f7dd46654aaf4eae
Summary: Extend the pairwise dot product to support different numbers of embeddings in the x & y dimensions.
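A short numpy sketch of the intended shape behaviour (names and sizes are illustrative): x and y may now have different numbers of embeddings, giving an nx-by-ny score matrix per example.

  import numpy as np

  batch, nx, ny, d = 2, 3, 5, 8
  x = np.random.randn(batch, nx, d)
  y = np.random.randn(batch, ny, d)
  pairwise = np.einsum('bxd,byd->bxy', x, y)   # shape (2, 3, 5)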
Differential Revision: D5663553
fbshipit-source-id: 1743a2c101cb8c0fc1f0f3d89c19530802400ec6
Summary: Making it more convenient to wrap code in a context.
Reviewed By: boryiingsu
Differential Revision: D5680991
fbshipit-source-id: 07b7e4d5aa657184039a7d18192b68fe11c1a570