Commit Graph

94 Commits

Author SHA1 Message Date
Qinqing Zheng
d013e16cf4 [C2] Enable LARS on GPU (#2115) 2018-03-02 18:06:19 -08:00
Qinqing Zheng
7cafdab69b [C2] Implement Layer-wise Adaptive Rate Scaling (LARS) (#2034)
* [C2] Implement Layer-wise Adaptive Rate Scaling (LARS)

* [C2] Implement Layer-wise Adaptive Rate Scaling (LARS)

* add unit test for Lars

* set default value for lars to be None

* remove lars for subclasses of SgdOptimizer
2018-02-25 14:58:31 -08:00
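
The idea behind LARS is a per-parameter (layer-wise) rescaling of the learning rate by the ratio of the weight norm to the gradient norm. A minimal NumPy sketch of that rescaling, offered as an illustration rather than the Caffe2 Lars op itself (the `trust` and `eps` names here are assumptions):
```py
import numpy as np

def lars_local_lr(w, grad, trust=0.001, eps=1e-9):
    # Layer-wise adaptive rate: scale the step by ||w|| / ||grad||.
    # Illustrative sketch; the actual Caffe2 Lars op may differ in its
    # exact formula and argument names.
    w_norm = np.linalg.norm(w)
    g_norm = np.linalg.norm(grad)
    if w_norm > 0 and g_norm > 0:
        return trust * w_norm / (g_norm + eps)
    return 1.0

# Usage: multiply the base learning rate by the layer-wise factor.
w = np.random.randn(256, 128).astype(np.float32)
g = np.random.randn(256, 128).astype(np.float32) * 1e-3
base_lr = 0.1
w -= base_lr * lars_local_lr(w, g) * g
```
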
Frank Jiang
c809d89810 Fix RowWiseSparseAdam implementation
Summary: The original implementation averaged the momentum across the embedding dimensions, which doesn't make any sense. This meant all the embedding dimensions received the same update, becoming a very memory-expensive one-dimensional embedding.

Differential Revision: D7003135

fbshipit-source-id: ed54e3427bc13895a4e949e96b4b17f6ebfb6d53
2018-02-16 13:28:26 -08:00
Frank Jiang
61356cbadc RowWiseSparseAdam operator
Summary: Added the RowWise functionality for SparseAdam, which saves roughly 2/3 of the memory usage by keeping only one first and second moment term for each row of the parameter tensor, rather than one for each individual parameter.

Differential Revision: D6679342

fbshipit-source-id: ce6fb27e35ce41a890c66f6089cd2748d10e7a44
2018-01-16 19:39:31 -08:00
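
The row-wise trick replaces the per-element second-moment tensor with a single averaged term per embedding row, which is where the memory saving comes from. A hedged NumPy sketch of one plausible reading of such an update (per-element first moment, per-row second moment); this is an illustration, not the exact Caffe2 operator semantics:
```py
import numpy as np

def rowwise_sparse_adam(param, m, v_row, indices, grads, lr,
                        beta1=0.9, beta2=0.999, eps=1e-8):
    # Illustrative row-wise sparse Adam step (assumed layout, not Caffe2's):
    #   param: (rows, dim) embedding table
    #   m:     (rows, dim) per-element first moment
    #   v_row: (rows,)     one averaged second moment per row
    # Only the rows touched by `indices` are updated.
    for i, g in zip(indices, grads):
        m[i] = beta1 * m[i] + (1 - beta1) * g
        # Row-wise second moment: a single scalar per row, the mean of g**2.
        v_row[i] = beta2 * v_row[i] + (1 - beta2) * np.mean(g * g)
        param[i] -= lr * m[i] / (np.sqrt(v_row[i]) + eps)

rows, dim = 1000, 64
param = np.random.randn(rows, dim).astype(np.float32)
m = np.zeros_like(param)
v_row = np.zeros(rows, dtype=np.float32)
idx = np.array([3, 17])
grads = np.random.randn(2, dim).astype(np.float32)
rowwise_sparse_adam(param, m, v_row, idx, grads, lr=0.01)
```
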
Hassan Eslami
8da31c240d Revert changes in blob name in optimizer
Summary: A while ago, we had to change some blob names in `optimizer.py` (more specifically, names of `iteration_mutex` and `optimizer_iteration`) to handle corner cases when preparing a net for parallel execution.

Reviewed By: azzolini

Differential Revision: D6480819

fbshipit-source-id: a03a7aa9fad322a50e7785914b0eb0f8654e6d90
2017-12-04 19:32:45 -08:00
Matthew Chan
bcc8c8f696 Support RMSProp in Caffe2.
Summary:
Add `RmsPropOptimizer` to `optimizer.py` so RMSProp can be used as an optimizer.

`RmsPropOptimizer` uses `RmsPropOp` to adjust the gradient and `MomentumSGDUpdateOp` to update the model parameters.

Differential Revision: D6118279

fbshipit-source-id: e38b8380ff74c1d1bb1e87fc300b6b55e32cd2e0
2017-11-08 16:43:18 -08:00
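
As described above, the RMSProp path combines a running average of squared gradients with a momentum-style parameter update. A small NumPy sketch of a standard RMSProp-with-momentum step, offered as an illustration rather than a line-for-line port of `RmsPropOp` / `MomentumSGDUpdateOp`:
```py
import numpy as np

def rmsprop_momentum_step(w, grad, ms, mom, lr,
                          decay=0.9, momentum=0.9, eps=1e-5):
    # Illustrative RMSProp + momentum update (standard formulation).
    #   ms:  running mean of squared gradients
    #   mom: momentum buffer used for the final parameter update
    ms[:] = decay * ms + (1 - decay) * grad * grad   # RMSProp-style scaling
    adjusted = grad / (np.sqrt(ms) + eps)
    mom[:] = momentum * mom + lr * adjusted          # momentum-SGD-style update
    w -= mom

w = np.random.randn(10).astype(np.float32)
ms = np.zeros_like(w)
mom = np.zeros_like(w)
rmsprop_momentum_step(w, np.random.randn(10).astype(np.float32), ms, mom, lr=0.01)
```
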
Andrey Malevich
84067bc17d Make RowWiseSparseAdagrad type/shape inference compatible.
Summary:
The current version of the code does not support type and shape inference, which is going to make all places that rely on it fail miserably.

I'm still leaving the option of doing init in the old way, in case some places are already failing this inference logic.

Reviewed By: ffjiang

Differential Revision: D6241270

fbshipit-source-id: e9080ffe93d610b5ada58ebe66579acfa57c6b3c
2017-11-06 00:50:44 -08:00
Qinqing Zheng
ce62c65c18 momentum sgd
Summary: Add support for SparseMomentumSGDUpdate and tests for momentum SGD in both dense and sparse cases

Reviewed By: akyrola

Differential Revision: D6234834

fbshipit-source-id: 9848c29ea06794ef35f1ebaff0f5e81eac4f4db9
2017-11-03 16:17:17 -07:00
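
For reference, the dense and sparse variants both apply the classic momentum update; the sparse case only touches the rows selected by the gradient's indices. A hedged sketch of both cases (not the Caffe2 ops themselves):
```py
import numpy as np

def momentum_sgd(w, grad, mom, lr, momentum=0.9, nesterov=False):
    # Plain momentum SGD update (illustrative).
    mom[:] = momentum * mom + lr * grad
    w -= (lr * grad + momentum * mom) if nesterov else mom

def sparse_momentum_sgd(w, indices, grads, mom, lr, momentum=0.9):
    # Sparse variant: update only the rows named by `indices`.
    for i, g in zip(indices, grads):
        mom[i] = momentum * mom[i] + lr * g
        w[i] -= mom[i]

w = np.random.randn(100, 8).astype(np.float32)
mom = np.zeros_like(w)
sparse_momentum_sgd(w, [2, 5], np.random.randn(2, 8).astype(np.float32), mom, lr=0.1)
```
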
Frank Jiang
d67624173b Change RowWiseSparseAdagrad assertion message
Summary: Made the assertion message clearer to let people know that rowwise is not supported for dense Adagrad.

Differential Revision: D6135363

fbshipit-source-id: d706135a335305627310c69a2a6d7721b0a47f0e
2017-10-25 10:54:33 -07:00
Aapo Kyrola
388a1b1e66 Added FP16SgdOptimizer
Summary:
Added FP16SgdOptimizer to optimizers. The optimizer updates the params using the FP16MomentumSGDUpdate and FP32MomentumSGDUpdate ops. To determine which update op to call, the optimizer expects either the fp32_update flag to be set, or that the blobs are in a recognized format created by initializers.py.

These requirements can be loosened if the blob DataType can be queried in python, though I am unsure of how to do this.

It also forces FP32 updates for SpatialBN, as cuDNN does not support FP16 params for SpatialBN.

Reviewed By: asaadaldien

Differential Revision: D5840806

fbshipit-source-id: 84ab8dc11a6e91a198ed72c00287f4809607079d
2017-10-24 10:44:04 -07:00
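
The multi-precision pattern behind an FP16 SGD optimizer is to keep an FP32 master copy of each parameter, apply the momentum update in FP32, and write the result back as FP16 for the forward/backward pass. A sketch of that pattern under those assumptions (names like `w_fp32_master` are illustrative, not Caffe2 blob names):
```py
import numpy as np

def fp16_momentum_sgd_step(w_fp16, w_fp32_master, grad_fp16, mom_fp32,
                           lr, momentum=0.9):
    # Multi-precision SGD sketch: FP32 master weights, FP16 working copy.
    grad = grad_fp16.astype(np.float32)            # upcast the gradient
    mom_fp32[:] = momentum * mom_fp32 + lr * grad
    w_fp32_master -= mom_fp32                      # update in full precision
    w_fp16[:] = w_fp32_master.astype(np.float16)   # refresh the FP16 copy

w_fp32 = np.random.randn(64).astype(np.float32)
w_fp16 = w_fp32.astype(np.float16)
mom = np.zeros_like(w_fp32)
fp16_momentum_sgd_step(w_fp16, w_fp32, np.random.randn(64).astype(np.float16),
                       mom, lr=0.01)
```
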
Hassan Eslami
8d8cebd6be Fixes the net-rewriting pipeline for model with rowwise adagrad
Summary: A model with rowwise Adagrad does not work in the net-rewriting pipeline (fbl 29841194). This diff solves the issue by changing the way the Slice op is used in the model and adding a rule to `parallelize.py` to cover the needed cases.

Reviewed By: azzolini

Differential Revision: D6096022

fbshipit-source-id: c4f615b2ba99da9f77a1d49c9fb898e0e59401f8
2017-10-18 20:05:37 -07:00
Aapo Kyrola
43adc5ba05 Add nodename to ONE, iteration_mutex etc.
Summary: Similar to what was done for Iter and LR.

Reviewed By: azzolini

Differential Revision: D6005817

fbshipit-source-id: 6d1260791d1acb3df957315eb9156eac183ee25c
2017-10-07 22:06:11 -07:00
Ellie Wen
463bcd00ea add None check for scope.CurrentDeviceScope()
Summary: add None check for scope.CurrentDeviceScope()

Reviewed By: akyrola

Differential Revision: D6005320

fbshipit-source-id: 05e2515736dcb2bddbb47fa423f892091c4577d7
2017-10-07 17:38:30 -07:00
Ellie Wen
44a0f6805e fix get_cpu_blob_name()
Summary: Add `def get_cpu_blob_name(self, base_str)` back, as it existed before D6001124.

Reviewed By: akyrola

Differential Revision: D6004994

fbshipit-source-id: 318581d2b2c22878929993160da8edcb7d7a58e6
2017-10-07 11:56:15 -07:00
Aapo Kyrola
dcfed49e96 fix multiple issues with multiple PS, learning rates, iter;
Summary: 1. iteration and LR must be node-name specific in optimizer

Reviewed By: azzolini

Differential Revision: D6001124

fbshipit-source-id: 0fa53fb3347e89401f62125865166356ac56796b
2017-10-06 19:21:16 -07:00
Yangqing Jia
8286ce1e3a Re-license to Apache
Summary: Closes https://github.com/caffe2/caffe2/pull/1260

Differential Revision: D5906739

Pulled By: Yangqing

fbshipit-source-id: e482ba9ba60b5337d9165f28f7ec68d4518a0902
2017-09-28 16:22:00 -07:00
Frank Jiang
0a5ee1e806 Implemented RowWiseSparseAdagrad operator that only keeps one moment term per embedding
Summary: Implemented version of SparseAdagrad that only keeps track of an average sum of squared gradients term for each row of the parameter tensor, rather than a sum of squared gradients term for each individual parameter.

Differential Revision: D5881918

fbshipit-source-id: bd96ccf25554b457baaaca9309fc8048adbb37f7
2017-09-26 13:34:44 -07:00
Huazhong Ning
1a89c6e1ec Decayed adagrad
Summary: When training on billions of examples, the Adagrad squared-gradient sum can become very big, creating the issue of adding small numbers to big numbers. This diff allows decaying the Adagrad squared-gradient sum.

Reviewed By: queqichao

Differential Revision: D5825932

fbshipit-source-id: 570224483b77d42ae53410fa2f767af86de167eb
2017-09-15 00:35:21 -07:00
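
Decaying the accumulated squared-gradient sum keeps it bounded, which addresses the numeric issue the summary describes (adding tiny new terms to a huge accumulator). A sketch of Adagrad with such a decay factor, where `decay=1.0` would recover plain Adagrad (illustrative, not the Caffe2 op):
```py
import numpy as np

def decayed_adagrad_step(w, grad, sq_sum, lr, decay=0.999, eps=1e-8):
    # Adagrad with a decayed squared-gradient accumulator (illustrative).
    sq_sum[:] = decay * sq_sum + grad * grad   # decay keeps the sum bounded
    w -= lr * grad / (np.sqrt(sq_sum) + eps)

w = np.random.randn(10).astype(np.float32)
sq_sum = np.zeros_like(w)
decayed_adagrad_step(w, np.random.randn(10).astype(np.float32), sq_sum, lr=0.01)
```
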
Wojciech Glogowski
5ed5be71b1 YellowFin GPU class and Python optimizer
Summary: YellowFin GPU implementation in a .cu file, Python optimizer in optimizer.py

Reviewed By: asaadaldien, akyrola

Differential Revision: D5727450

fbshipit-source-id: 42a878e5fd35e288e0e6eeaa0bf980a9db96e5a7
2017-08-30 18:32:24 -07:00
Christopher Hay
cc3662e939 Added support for scaling learning rate of Caffe2 optimizers during training
Summary: While there is currently support for scaling the base learning rate when loading the model, there is no support for scaling the base learning rate during training. This is needed for LATTE's seq2seq translation models, as the learning schedule is not predefined and is modified at runtime.

Reviewed By: jhcross

Differential Revision: D5701391

fbshipit-source-id: ae3bec45f238db1a2be7af9c04d720067e9095d5
2017-08-25 19:04:47 -07:00
Christopher Hay
ad07f5f05d Added norm-based gradient clipping to optimizer library
Summary: Moved the code for global norm-based gradient clipping from FB-specific workflows (seq2seq) to the open-source Caffe2 optimizer library.

Reviewed By: jhcross

Differential Revision: D5637453

fbshipit-source-id: 7e73c9a1c97c28a152c188467b27a6449f79242e
2017-08-24 10:17:50 -07:00
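
Global norm-based clipping rescales all gradients by a common factor whenever their combined L2 norm exceeds a threshold, preserving the gradient direction. A minimal sketch of the technique (not the exact API added to the optimizer library):
```py
import numpy as np

def clip_by_global_norm(grads, max_norm):
    # Rescale every gradient by the same factor if the global L2 norm is too large.
    global_norm = np.sqrt(sum(np.sum(g * g) for g in grads))
    if global_norm > max_norm:
        scale = max_norm / (global_norm + 1e-12)
        grads = [g * scale for g in grads]
    return grads

grads = [np.random.randn(3, 4), np.random.randn(10)]
clipped = clip_by_global_norm(grads, max_norm=1.0)
```
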
Ahmed Taei
804ebf7c41 Populate learning rate blob name into data_parallel_model and fix resnet50_trainer example.
Reviewed By: akyrola

Differential Revision: D5463772

fbshipit-source-id: 10b8963af778503a3de6edbabb869747bd1e986d
2017-07-21 16:24:10 -07:00
Ahmed Taei
13980d2bb5 Set device to the default device(CPU) when DeviceContext is None.
Summary:
Fix case when optimizer isn't called within a device scope context.
Fix OptimizerContext lr blob names

Reviewed By: volkhin

Differential Revision: D5421046

fbshipit-source-id: 186a0d05f40d4442c5ba5736084626da73a0c0f1
2017-07-13 17:54:36 -07:00
Luke Yeager
82e318cf8b Optimizer: one LR op per (device, optimizer)
Summary:
Try running this script through `nvprof`:
```py
import numpy as np
from caffe2.proto import caffe2_pb2
from caffe2.python import brew, core, optimizer, workspace
from caffe2.python.model_helper import ModelHelper

do = core.DeviceOption(caffe2_pb2.CUDA, 0)
with core.DeviceScope(do):
    model = ModelHelper(arg_scope={'order': 'NCHW'})
    conv1 = brew.conv(model, 'data', 'conv1', 1, 20, 5)
    pool1 = brew.max_pool(model, conv1, 'pool1', kernel=2, stride=2)
    conv2 = brew.conv(model, pool1, 'conv2', 20, 50, 5)
    pool2 = brew.max_pool(model, conv2, 'pool2', kernel=2, stride=2)
    fc3 = brew.fc(model, pool2, 'fc3', 50 * 4 * 4, 500)
    fc3 = brew.relu(model, fc3, fc3)
    pred = brew.fc(model, fc3, 'pred', 500, 10)
    softmax, loss = model.SoftmaxWithLoss([pred, 'label'], ['softmax', 'loss'])
    model.AddGradientOperators([loss])
    optimizer.build_sgd(model, 0.01,
                        policy='step', stepsize=1, gamma=0.999,
                        momentum=0.9, nesterov=False)
    workspace.FeedBlob('data', np.zeros((1, 1, 28, 28), dtype=np.float32))
    workspace.FeedBlob('label', np.zeros((1, 1), dtype=np.int32))

workspace.RunNetOnce(model.param_init_net)
workspace.CreateNet(model.net)

for _ in range(100):
    workspace.RunNet(model.net)
```
Before this change:
```
                    1.55%  1.4185ms       837  1.6940us  1.6630us  2.4000us  [CUDA memcpy HtoD]
                    0.72%  656.03us       200  3.2800us  3.1350us  3.5840us  [CUDA memcpy DtoD]
                    0.39%  7.1574ms      1034  6.9220us  3.8300us  18.677us  cudaMemcpyAsync
                    0.00%  34.180us         3  11.393us  9.0960us  12.910us  cudaMemcpy
```
And after it (look at the third column):
```
                    0.73%  657.15us       200  3.2850us  3.1040us  3.6160us  [CUDA memcpy DtoD]
                    0.26%  235.07us       137  1.7150us  1.6640us  2.3680us  [CUDA memcpy HtoD]
                    0.20%  3.4493ms       334  10.327us  6.4220us  16.958us  cudaMemcpyAsync
                    0.00%  37.376us         3  12.458us  9.4120us  15.412us  cudaMemcpy
```
That makes a pretty big difference in performance. Is there any particular reason you decided to have a separate `LearningRate` op for every parameter in 1317e3498c?
Closes https://github.com/caffe2/caffe2/pull/893

Reviewed By: kennyhorror

Differential Revision: D5372541

Pulled By: asaadaldien

fbshipit-source-id: 57357e1be2d58ce294058e9422fb3b1eddfca24d
2017-07-12 21:17:49 -07:00
Tao Wu
b9e64ecef1 allow param_info to set optimizer
Summary: This diff adds an optimizer field to param_info, along with the associated implementations in model_helper and brew to set an optimizer for each individual parameter.

Reviewed By: kennyhorror

Differential Revision: D5385432

fbshipit-source-id: 5d682f9d1ab077e04a5d76a24d71470f4e64fc92
2017-07-12 08:49:48 -07:00
Jiyan Yang
00e5afea6a Adding dedup aggregator options to sgd optimizer
Summary: As desc.

Reviewed By: xianjiec

Differential Revision: D5324671

fbshipit-source-id: 27f3a58f618cd5ea11c2ea2e756df3f73635c2c8
2017-07-04 02:10:18 -07:00
Yiming Wu
1fce3eac4e single trainer hybrid device
Summary:
First attempt at single-trainer hybrid-device training for SparseNN.

Comparison results with CPU training:
https://our.intern.facebook.com/intern/fblearner/run/compare/?compare_to[0]=20016969&compare_to[1]=19660293&baseline_run=19660293&all_runs[0]=20016969&all_runs[1]=19660293

Reviewed By: dzhulgakov

Differential Revision: D5205723

fbshipit-source-id: 4a024324ac2efc3248dd470d4c533cf2ecec2e92
2017-06-27 22:06:30 -07:00
Simon Layton
eaacfc7e25 Fix multi-precision SGD outputs
Summary:
salexspb This fixes a major perf issue (40% boost on alexnet end-to-end perf) in the multi-precision SGD optimizer - it was causing repeated cudaMalloc / cudaFree calls during training iterations due to the changing size of the `grad` blob as it moved from fp16 <-> fp32.
Closes https://github.com/caffe2/caffe2/pull/797

Differential Revision: D5246978

Pulled By: salexspb

fbshipit-source-id: ec3d7ef18445e19eaf5aac908d0a7bcd5957eb60
2017-06-14 11:36:43 -07:00
Wael Abdelghani
ebecafbcca Support for position weighted in distributed PS
Summary: Title

Reviewed By: azzolini

Differential Revision: D5081871

fbshipit-source-id: 68a97c2112522fbcbcdfd9e0f717b8bce60fe028
2017-06-05 17:04:42 -07:00
Aapo Kyrola
401908d570 add_weight_decay + restore weight decay to resnet50_trainer
Summary:
Add add_weight_decay to optimizer + test.

In D5142973 I accidentally removed weight decay from resnet50 trainer, so this restores it.

Reviewed By: asaadaldien

Differential Revision: D5173594

fbshipit-source-id: c736d8955eddff151632ae6be11afde0883f7531
2017-06-02 14:16:56 -07:00
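
Weight decay in this setting amounts to adding `wd * w` to a weight's gradient before the SGD step (typically applied to weights only, not biases). A hedged one-function sketch of the effect, independent of the Caffe2 helper's exact signature:
```py
import numpy as np

def sgd_with_weight_decay(w, grad, lr, weight_decay=1e-4):
    # L2 weight decay folded into the gradient before a plain SGD step
    # (illustrative; add_weight_decay in Caffe2 wires this up per parameter).
    w -= lr * (grad + weight_decay * w)

w = np.random.randn(20).astype(np.float32)
sgd_with_weight_decay(w, np.random.randn(20).astype(np.float32), lr=0.1)
```
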
Simon Layton
58874ad5bf Fp16 training initializers
Summary:
Re-open for re-importing :)
Closes https://github.com/caffe2/caffe2/pull/721

Differential Revision: D5164345

Pulled By: akyrola

fbshipit-source-id: e80b32556cd25610602df91a4225b93edc0ca40b
2017-06-01 08:34:46 -07:00
Aapo Kyrola
ffbba0fae7 add model_helper Validate() + sprinkle around
Summary:
A recent diff introduced a duplicate parameter into the model, which would hurt performance and also affect correctness (duplicate momentum updates, for example). We unfortunately had no checks for duplicate params outside of data_parallel_model, which fortunately brought this to our attention.

But it is better to have a Validate() function in model_helper and call it before adding gradient ops and querying for parameters. Added it to brew_test calls as well.

Reviewed By: kennyhorror

Differential Revision: D5163458

fbshipit-source-id: 35692e8bfcc359d4e8bc73e6f2358659f6e45ceb
2017-06-01 02:36:47 -07:00
Aapo Kyrola
0f8c8f37a8 Revert D5159712: [caffe2][PR] Fp16 training initializers
Summary: This reverts commit 60a889494d2e2f4df1d720331e19f638c5eb95cc

Differential Revision: D5159712

fbshipit-source-id: 16040c911b260648857f656f92b165f92c2daae0
2017-06-01 00:17:14 -07:00
Simon Layton
2bfacff426 Fp16 training initializers
Summary:
Adds support for generating and training pfp16 models. Added SGD optimizer for multi-precision trainers and a new callback to data_parallel_model in order to help multi-precision models keep their different copies of parameters in sync during training.
Closes https://github.com/caffe2/caffe2/pull/697

Differential Revision: D5159712

Pulled By: salexspb

fbshipit-source-id: 60a889494d2e2f4df1d720331e19f638c5eb95cc
2017-05-31 17:46:58 -07:00
Aapo Kyrola
ce7ce46ca1 fix secondary device check by gradient, if it is sparse
Summary: Fix an issue where, when a parameter is not created in param_init_net or net, we fall back to checking which device's op outputs the gradient. This fallback did not work if the gradient was a GradientSlice.

Reviewed By: harouwu

Differential Revision: D5153102

fbshipit-source-id: 20eae660ea32e5a9ea484bf93c04c8f8c71a51ed
2017-05-30 20:47:17 -07:00
Aapo Kyrola
cdb50fbf2b add optimizer support to data_parallel_model; Use MomentumSGDUpdate
Summary:
This diff does two things:
- add support for optimizers to data_parallel_model. The user can supply optimizer_builder_fun instead of param_update_builder_fun. The latter is called for each GPU separately with the proper namescope and devicescope, while the optimizer builder is called only once and adds optimizers to the whole model.

- use MomentumSGDUpdate instead of MomentumSGD + WeightedSum. This brings major perf benefits.

Changes resnet50 trainer to use optimizer.

This relies on D5133652

Reviewed By: dzhulgakov

Differential Revision: D5142973

fbshipit-source-id: 98e1114f5fae6c657314b3296841ae2dad0dc0e2
2017-05-30 12:49:57 -07:00
Aapo Kyrola
44257ea5ed automatically infer device scope for param
Summary:
hankun is using the optimizer but has a mixed set of GPU and CPU operators. Currently this won't work with the optimizer, since it adds optimizers for all parameters in the current device scope. But we can actually infer the device a param belongs to by looking at the device option in the param_init_net.

Added a test as well.

Reviewed By: salexspb

Differential Revision: D5133652

fbshipit-source-id: ad8689d75ac1f5c78981bae1b6978fe91e40ef0f
2017-05-30 12:02:19 -07:00
Alexander Sidorov
016f72537a ModelHelper.create_param, Initializer abstraction and ParameterInfo for optimizers
Summary:
This is going to unblock Nvidia in their work on adding fp16
support to Caffe2. I discussed this with kennyhorror before to make
sure this fits into his work on parameter sharing.

Reviewed By: kennyhorror

Differential Revision: D5127797

fbshipit-source-id: 4db155d320b1862570c23b77c4252bdacbf2296f
2017-05-25 22:03:15 -07:00
Yiming Wu
0aeffa985e make sure mutex is on CPU too
Summary: The mutex is only supported on CPU. We need to make sure the mutex and the following AtomicIter are both on CPU. This is critical for GPU SparseNN training.

Differential Revision: D5093184

fbshipit-source-id: 021e6ba699a3208449fa4761cad6b0ec4544957e
2017-05-19 12:17:17 -07:00
Xiaolong Wang
add840510f Refactor Optimizer to Allow scale_learning_rate
Summary:
In transfer learning, parameters initialized from a pretrained model might require a different learning rate than parameters initialized otherwise. To this end, we implement a Python solution here where `base_learning_rate` is scaled by `scale`, which is in turn set by `scale_learning_rate`. Alternatively, we could achieve the same effect by rewriting the LearningRate operator in C++.

Reviewed By: kennyhorror

Differential Revision: D4992827

fbshipit-source-id: 8d7e87a61c95b3eb8ef733ec436f4060e865c0ac
2017-05-09 13:16:21 -07:00
Alisson Gusatti Azzolini
20d8de8d51 Parameter cost estimation job
Summary:
Adds a parameter cost estimation step before the actual training starts. The costs are later used in order to better shard the parameters across instances of the parameter server.

Things I needed to modify:
- A few changes to make ModelLayerHelper picklable
- Add support for stopping a distributed job after a number of stats reporting steps.
- Refactored run_dist_job to support collocating the reader with the trainer even when PS are present.
- Option to disable dense updates (when num_dense_servers=0).

Currently there's a huge overhead posed by having to launch a child workflow. I'll try to address that in a subsequent diff.

This is WIP because the other workflows need to be migrated as well.

I can break this down into smaller diffs if reviewers would prefer it.

Reviewed By: kennyhorror

Differential Revision: D4974752

fbshipit-source-id: 04c336acb2945f8f11324a221ffc6967818c0672
2017-05-09 13:02:24 -07:00
Bor-Yiing Su
7270471ed6 Returns auxiliary parameters in the optimizers.
Summary:
1. Adds a function to return auxiliary parameters for each optimizer. This function can be used to serialize the optimizers so that they can be recovered.
2. Fixes a bug where the iteration blob is not incremented by one each iteration: if there are k parameters using the Adam learning-rate optimizer, the iteration blob was incremented by k under the original implementation.

Reviewed By: azzolini

Differential Revision: D4872397

fbshipit-source-id: d86711feedda2ba83af5f2a18141b06a6a473733
2017-04-17 10:16:32 -07:00
Aaron Markham
58f7f2b441 doxygen python block added
Summary: Closes https://github.com/caffe2/caffe2/pull/226

Differential Revision: D4793550

Pulled By: JoelMarcey

fbshipit-source-id: cc33e58186304fa8dcac2ee9115dcc271d785b1e
2017-03-29 06:46:16 -07:00
Huazhong Ning
83437853ad refactor and modularize optimizers
Summary:
The current optimizer code in c2/python has the following issues:
(1) the optimizers in sgd.py cannot be configured per param-blob;
(2) sgd.py is a bad file name; optimizer.py is a better name;
(3) layer_model_helper.py has another set of optimizer code (which does support a per param-blob optimizer).

This diff does the following:
(1) creates optimizer objects so that we can configure a per param-blob optimizer, while remaining compatible with the existing optimizer code;
(2) makes the new optimizer code much more modular;
(3) moves the optimizer code to a file with a better name (optimizer.py);
(4) replaces the optimizer imports in the existing code.

Will be done in subsequent diffs:
(1) optimizers with structured parameters for dper2;
(2) getting rid of the optimizer code in layer_model_helper.py.

Reviewed By: salexspb

Differential Revision: D4609013

fbshipit-source-id: 2e2d6dfa8685d10498f89069157453d9feca3f27
2017-03-07 18:46:47 -08:00