Commit Graph

45 Commits

Author SHA1 Message Date
Yinghai Lu
ef8f556212
[Caffe2] Changes done inside Facebook (#6378)
* fix unit test for sqrt op

From the error logging:

[idx, grad, grad_estimate] are:
[[ 146.            0.5           0.45776367]
 [ 147.            0.5           0.45776367]

The gradient == 0.5 is correct, which means the SqrtOp and its gradient are doing the right job. (Because y = sqrt(x), loss = y^2/2 = x/2, and then d(loss)/dx = 1/2 = 0.5.)

The test failed because of a numerical problem in grad_estimate (in the unit test). It can happen because the step_size is small and float precision is not high (when there are multiple elements in the tensor, we do sum(y^2) to compute the loss).
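
For illustration, a minimal numpy sketch of the gradient check in question (illustrative values and step size, not the actual test code):

```python
import numpy as np

# y = sqrt(x), loss = sum(y^2) / 2 = sum(x) / 2, so d(loss)/dx = 0.5 everywhere.
def loss(v):
    return np.sum(np.sqrt(v) ** 2) / 2.0

x = np.array([4.0, 9.0, 16.0], dtype=np.float32)
analytic_grad = np.full_like(x, 0.5)

# Central-difference estimate; a too-small step in float32 is what made the
# original test flaky, hence the larger step and test points away from 0.
step = 1e-2
numeric_grad = np.array([
    (loss(x + step * e) - loss(x - step * e)) / (2 * step)
    for e in np.eye(len(x), dtype=np.float32)
])

assert np.allclose(analytic_grad, numeric_grad, atol=1e-2)
```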

This diff
- increases the step size, and also moves the test cases further away from 0 (where the gradient of sqrt(x) is not well defined) to be safe :)
- also cleans up and merges the test cases for inplace vs. non-inplace

Tested with:

`CAFFE2_HYPOTHESIS_PROFILE=debug ai_bt caffe2/caffe2/python/operator_test:elementwise_ops_test -- "test_sqrt"`

* CompositeReader & CompositeReaderBuilder

A new type of reader that glues multiple readers together.

* Back out "Revert D7394363: [GanH]: Log D Trick for Cross Entropy with Sigmoid"

Original commit changeset: 9325a4356dbe

* [dai][WIP] convert params to int8 on ps before sending to trainer

Add float->uint8 conversion in addition to float->fp16 conversion in model_saver.
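
As a rough sketch of the kind of float->uint8 conversion described (a per-tensor scale/offset scheme is assumed here for illustration; this is not the actual model_saver code):

```python
import numpy as np

def quantize_to_uint8(params):
    """Affine-quantize a float tensor to uint8 with a per-tensor scale and bias."""
    lo, hi = params.min(), params.max()
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    q = np.clip(np.round((params - lo) / scale), 0, 255).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, bias):
    return q.astype(np.float32) * scale + bias

w = np.random.randn(4, 8).astype(np.float32)
q, scale, bias = quantize_to_uint8(w)
print(np.abs(dequantize(q, scale, bias) - w).max())  # small quantization error
```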

* [easy] improve unit test for sparse length sum ops

as desc.

#accept2ship

* Update GitHub upstream to 771fcb3455

* move sparse hash unique ops to OOS and add unit tests

- move the SparseHash version to OOS, since 'sparsehash' is already a dependency of caffe2 OOS: https://fburl.com/arssw4n1
- The 'SparseHash' engine is also being used in OOS, so the SparseHash version should live in OOS to reduce confusion: https://fburl.com/o5ea7ah2

- fix the CUDA UniqueOp for the case when the batch is empty.
- add a unit test

* group_norm_op for caffe2

This is the cuda op for Group Normalization (GN): https://arxiv.org/abs/1803.08494

This code implements GN in a single op that computes Y = gamma * (X - mu) / sigma + beta and also its gradients. It is expected to have minimal memory consumption (similar to the BN op), avoiding the extra blobs that would be created if GN were implemented as several ops (e.g., reshape, norm_mean/std, affine_channel).
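
For reference, a numpy sketch of the GN forward computation that the fused op implements (forward only; the actual CUDA op also computes the gradients):

```python
import numpy as np

def group_norm_forward(X, gamma, beta, num_groups, eps=1e-5):
    """Y = gamma * (X - mu) / sigma + beta, with mu/sigma per (sample, group)."""
    N, C, H, W = X.shape
    G = num_groups
    x = X.reshape(N, G, C // G, H, W)
    mu = x.mean(axis=(2, 3, 4), keepdims=True)
    var = x.var(axis=(2, 3, 4), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return x_hat.reshape(N, C, H, W) * gamma.reshape(1, C, 1, 1) + beta.reshape(1, C, 1, 1)

X = np.random.randn(2, 8, 4, 4).astype(np.float32)
Y = group_norm_forward(X, np.ones(8, np.float32), np.zeros(8, np.float32), num_groups=4)
```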

* Resubmit D7405233: disappeared in D7464958

The OOS publish caused the op to go missing -- however, the test was still there.

* [c2] add sparse hash engine for cuda unique op

The SparseHash version of UniqueOp copies the input tensor to CPU, uses a sparse hash map to compute the unique output, and then copies the result back to GPU.
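
Conceptually, the SparseHash path does something like the following (a plain Python dict stands in for google::sparse_hash_map, and numpy arrays stand in for the GPU tensors):

```python
import numpy as np

def unique_via_hash(indices_on_gpu):
    # 1) copy to CPU (here: just a numpy array), 2) hash-based unique, 3) copy back.
    host = np.asarray(indices_on_gpu)
    seen = {}
    unique, remap = [], np.empty_like(host)
    for i, v in enumerate(host):
        if v not in seen:
            seen[v] = len(unique)
            unique.append(v)
        remap[i] = seen[v]
    return np.array(unique, dtype=host.dtype), remap  # would be copied back to GPU

u, r = unique_via_hash(np.array([3, 7, 3, 1, 7], dtype=np.int64))
print(u, r)  # [3 7 1] [0 1 0 2 1]
```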

* [dper][gpu] enable unit testing gpu trainer for sparse nn

This lets us debug the GPU trainer using mock data in a unit test.

It makes it easier to develop the GPU trainer for new models.

* Reuse Gloo context for Synchronize() calls

Previously we were creating (and leaking) the Gloo context on each call to Synchronize(). Now only run the common world op and create the barrier net once, then run the barrier net on each Synchronize() call. Since timeout is associated with the Gloo context, assert that the timeout is fixed instead of trying to handle the complexity of multiple timeouts (and associated contexts).
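
A minimal sketch of the reuse pattern described above, with placeholder objects instead of the real common world op and barrier net:

```python
class SynchronizeHelper:
    """Build the barrier (and its Gloo-style context) once, then reuse it."""

    def __init__(self):
        self._barrier_net = None
        self._timeout = None

    def synchronize(self, timeout_sec):
        if self._barrier_net is None:
            # Expensive setup happens exactly once: run the common world op
            # and create the barrier net (placeholder object here).
            self._barrier_net = object()
            self._timeout = timeout_sec
        # The timeout is tied to the Gloo context, so it must not change.
        assert timeout_sec == self._timeout, "timeout is fixed after the first call"
        # ... run the cached barrier net here instead of recreating it ...

helper = SynchronizeHelper()
helper.synchronize(30)
helper.synchronize(30)  # reuses the same context; nothing is leaked
```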

* [GanH/WGAN][1/n]: add FC param clipping

as titled

* [mobile] minimizing changes between caffe2_benchmark and speed_benchmark

* [GanH]: enable diagnose within model

Avoid having to find blob names; instead, enable diagnosis directly inside the model.

* Add `net_transformer_fun` option to DPM

This callback allows for various transformations to be made to the
model after gradient operators have been added. The immediate motivation for
this is to allow transformations such as "checkpoint-and-recompute", which
trade off memory for additional compute.
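
A hedged usage sketch of the new option (the callback signature and the builder functions shown are assumptions for illustration):

```python
from caffe2.python import data_parallel_model

def rewrite_net(model, *args):
    # Invoked after gradient operators have been added; a real transform could
    # rewrite model.net here, e.g. to checkpoint-and-recompute activations.
    print("ops after gradients:", len(model.net.Proto().op))

# Hypothetical call, assuming the usual Parallelize_GPU builder arguments:
# data_parallel_model.Parallelize_GPU(
#     model,
#     input_builder_fun=add_inputs,
#     forward_pass_builder_fun=build_forward,
#     optimizer_builder_fun=build_optimizer,
#     net_transformer_fun=rewrite_net,  # the new callback option
#     devices=range(4),
# )
```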

Adding several callbacks like this has made DPM's API less than ideal at this
stage. However, I could not find any reasonable alternative.

* [DT] [33/n] Compile flow task groups

Task groups need to be compiled in order to pickle the object in fblearner. However, I also changed the Job's compile function, since creating a new object is not necessary.

* Initial commit for sparse_normalize vectorization and benchmark

* [GanH]: LB Calibration for JSD

as titled

* Tracing event in async executor

Adding event tracing through TRACE_EVENT macro in async executor

* [Resubmit] D7409751 Reseting book-keeping blobs when the reservoir is reset

D7409751 got lost in D7464958

* Visualizing realtime weights values

We want to visualize the weights values as the optimizer is iterating. This diff adds support for visualizing the weights at a given index.
Currently, we assume the blob is 2-dimensional.

* [GanH][Easy]: Fix Homotopy Weighting

Apparently, there was a bug in the homotopy weight (alpha, beta) update.

* [c2] move sparse hash unique op out of oss

so that OSS does not need to depend on the Google hash map.

* Get rid of std::round as it's not supported on Android

* Revert changes on setup.py

* Skip shaky test on Dataio

* fix
2018-04-10 21:11:43 -07:00
Orion Reblitz-Richardson
1d5780d42c Remove Apache headers from source.
* LICENSE file contains details, so removing from individual source files.
2018-03-27 13:10:18 -07:00
Lukasz Wesolowski
29a4c942fe Add support for multi-device batch normalization through an option to data_parallel_model
Summary: Stage 3 in stack of diffs for supporting multi-device batch normalization. Adds input parameter to data_parallel_model to enable multi-device batch normalization. Depends on D6699258.

Reviewed By: pietern

Differential Revision: D6700387

fbshipit-source-id: 24ed62915483fa4da9b1760eec0c1ab9a64b94f8
2018-01-24 13:24:06 -08:00
Pieter Noordhuis
1351152362 Skip DeviceShiftTest if host has < 4 GPU devices
Summary: Closes https://github.com/caffe2/caffe2/pull/1563

Differential Revision: D6471667

Pulled By: pietern

fbshipit-source-id: 99efd21b98c00eb0a846ca8b395bdfd550fe02f1
2017-12-03 16:02:05 -08:00
Aapo Kyrola
2caca70a37 Allow shifting of activations / ops to other GPUs in data parallel model
Summary:
(Work in progress). This diff will allow shifting of activations to other GPUs, in case the model does not fit into memory. To see the API, check the code in data_parallel_model_test, which tests shifting two activations from gpus 0 and 1 to gpu 4, and from gpus 2 and 3 to gpu 5.

I will need to further test on ResNets, and probably add copy operations to handle device change points.

Reviewed By: asaadaldien

Differential Revision: D5591674

fbshipit-source-id: eb12d23651a56d64fa4db91090c6474218705270
2017-11-29 21:17:00 -08:00
Qinqing Zheng
4471e15b76 BMUF cpu support
Summary: change the interface so BMUF can run on CPUs

Reviewed By: asaadaldien

Differential Revision: D6356026

fbshipit-source-id: f58a4da9f800d969145a1a376e118b0f3581f8c1
2017-11-19 23:41:25 -08:00
Aapo Kyrola
998a1b6d74 fix memonger after D5994548
Summary: memonger.cc's support for RNNs was broken in D5994548, because it changed a .n argument to a .s argument. That made data_parallel_model_test fail (but tests were not run for the blame diff, so this was not noticed).

Reviewed By: kennyhorror

Differential Revision: D6043948

fbshipit-source-id: d29abd6927c519227a28b41c1ef70fb1756904bf
2017-10-12 17:36:27 -07:00
Aapo Kyrola
d748c43f71 Fix for dpm.GetLearningRateBlobNames
Summary:
I broke dpm.GetLearningRateBlobNames() when adding a new nodename param in optimizer.
Fixing it.

Reviewed By: asaadaldien

Differential Revision: D6043828

fbshipit-source-id: b3a79dd0dfae144187bcb359e2374eab6b32c485
2017-10-12 17:20:33 -07:00
Yangqing Jia
8286ce1e3a Re-license to Apache
Summary: Closes https://github.com/caffe2/caffe2/pull/1260

Differential Revision: D5906739

Pulled By: Yangqing

fbshipit-source-id: e482ba9ba60b5337d9165f28f7ec68d4518a0902
2017-09-28 16:22:00 -07:00
Junjie Bai
d9b0bcd7a4 Make all existing (except in RoIPool) "is_test" arguments required
Reviewed By: akyrola

Differential Revision: D5830168

fbshipit-source-id: 8634e9cfe308ba0ee90cd8a5c4b09a47b0b5f015
2017-09-25 23:46:12 -07:00
Aapo Kyrola
9ec981b866 for CPU-data parallel, allow sharing model
Summary: On CPU, there is no need to replicate parameters, so try using only one copy (cpu_0) for the parameters. Made resnet50_trainer use the shared model in CPU mode.

Reviewed By: wesolwsk

Differential Revision: D5812181

fbshipit-source-id: 93254733edbc4a62bd74a629a68f5fa23f7e96ea
2017-09-15 16:19:37 -07:00
Luke Yeager
f775149205 tests: use assertRaises, not expectedFail
Summary:
I would expect tests marked "expected failure" to mean that there is a known issue in the code which will be fixed later. Both of these tests simply verify proper error-checking - nothing needs fixing.

Before (looks like something is wrong):
```
======================================= 2 xfailed in 0.27 seconds =======================================
```
After:
```
======================================= 2 passed in 0.28 seconds ========================================
```
/cc akyrola gsethi523
Closes https://github.com/caffe2/caffe2/pull/1209
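
For comparison, a small standalone example of the assertRaises pattern (not the actual caffe2 tests):

```python
import unittest

class ErrorCheckingTest(unittest.TestCase):
    def test_bad_input_raises(self):
        # Verify the error path explicitly instead of marking the whole
        # test as an expected failure.
        with self.assertRaises(ValueError):
            int("not a number")

if __name__ == "__main__":
    unittest.main()
```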

Differential Revision: D5825373

Pulled By: akyrola

fbshipit-source-id: 1b98f503e4e406f69567d02425532f43bd16a465
2017-09-13 11:39:35 -07:00
Aapo Kyrola
93bd3c77f8 AddBlobsSync()
Summary: Explicit function to sync blobs. Notice that this must be called before CreateNet(), and it syncs the blobs on every run.

Reviewed By: asaadaldien, jay-mahadeokar

Differential Revision: D5805891

fbshipit-source-id: 58a1bb47805d75d5cbead136e2e0e9fe663ea954
2017-09-12 10:33:22 -07:00
Aapo Kyrola
b7997a0f41 support device ids>10
Summary: Data parallel model failed with device numbers 10, 11, ... because it used string sorting of the blob names. Changed the sorting to be based on the device number and then the blob name. Also added reduction for 16 devices.
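
A small sketch of the sorting fix (the gpu_<id>/<blob> naming convention is assumed here for illustration):

```python
import re

blobs = ["gpu_10/w", "gpu_2/w", "gpu_1/w", "gpu_11/w"]

# Plain string sort puts gpu_10 and gpu_11 before gpu_2:
print(sorted(blobs))  # ['gpu_1/w', 'gpu_10/w', 'gpu_11/w', 'gpu_2/w']

# Sorting by (device number, blob name) keeps devices in numeric order:
def device_key(name):
    m = re.match(r"gpu_(\d+)/(.*)", name)
    return (int(m.group(1)), m.group(2))

print(sorted(blobs, key=device_key))  # ['gpu_1/w', 'gpu_2/w', 'gpu_10/w', 'gpu_11/w']
```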

Reviewed By: wesolwsk

Differential Revision: D5781521

fbshipit-source-id: 16be0984ecb55340604c82893be366c0528e822c
2017-09-07 00:01:33 -07:00
Pieter Noordhuis
6d5c3eaeb7 Add CloneCommonWorld op
Summary:
Cloning was previously done by overloading CreateCommonWorld op.
Closes https://github.com/caffe2/caffe2/pull/1159

Reviewed By: andrewwdye

Differential Revision: D5757580

Pulled By: pietern

fbshipit-source-id: 9e80b295e390bf92623bafb72be21cbafdcf2ff4
2017-09-06 13:32:30 -07:00
Christopher Hay
cc3662e939 Added support for scaling learning rate of Caffe2 optimizers during training
Summary: While there is currently support for scaling the base learning rate when loading the model, there is no support for scaling the base learning rate during training. This is needed for LATTE's seq2seq translation models, as the learning schedule is not predefined and is modified at runtime.

Reviewed By: jhcross

Differential Revision: D5701391

fbshipit-source-id: ae3bec45f238db1a2be7af9c04d720067e9095d5
2017-08-25 19:04:47 -07:00
Christopher Hay
ad07f5f05d Added norm-based gradient clipping to optimizer library
Summary: Moved code for global norm-based gradient clipping from fb specific workflows (seq2seq) to the open-source caffe2 optimizer library
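
A numpy sketch of global norm-based gradient clipping as commonly defined (illustrative; not the optimizer library code itself):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients by max_norm / global_norm when the global norm exceeds max_norm."""
    global_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if global_norm > max_norm:
        scale = max_norm / global_norm
        grads = [g * scale for g in grads]
    return grads, global_norm

grads = [np.ones((3, 3), np.float32) * 10, np.ones(5, np.float32) * 10]
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
```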

Reviewed By: jhcross

Differential Revision: D5637453

fbshipit-source-id: 7e73c9a1c97c28a152c188467b27a6449f79242e
2017-08-24 10:17:50 -07:00
Yangqing Jia
1db7a99249 disable travis test for dpm test
Summary:
After this, we should have test going back to all green.
Closes https://github.com/caffe2/caffe2/pull/1058

Reviewed By: harouwu

Differential Revision: D5637495

Pulled By: Yangqing

fbshipit-source-id: ac3ab5a27bc56e3bb08fa81aa8ed186cb7e8832b
2017-08-15 19:17:41 -07:00
Aapo Kyrola
26645154bb warn about using test/val model with init_params=True + fixed some cases
Summary: It is a common mistake to create a test/validation model with init_params=True. When its param_init_net is run, it will overwrite the training model's params, and with DPM, those won't be synchronized to all GPUs. I don't want to make this an assertion yet, since it might break people's trainers (it is ok to have init_params=True if you never run the param_init_net...).

Reviewed By: asaadaldien

Differential Revision: D5509963

fbshipit-source-id: 63b1a16ec0af96e3790e226850f6e0e64689143f
2017-07-27 13:20:27 -07:00
Aapo Kyrola
af1e45c1e1 support appending net and converting them
Summary:
As per rushabhmshah99's request: he wants to append a pre-trained model (without training it) to the model.
So I added data_parallel_model.ConvertNetForDevice() to enable that. The unit test shows an example of how to use this with
AppendNet, and I also added a blurb to the function.

Differential Revision: D5503335

fbshipit-source-id: b2a5db5c1739dc97f46dd0d7606ed555d99255b8
2017-07-27 11:07:48 -07:00
Ahmed Taei
804ebf7c41 Populate learning rate blob name into data_parallel_model and fix resnet50_trainer example.
Reviewed By: akyrola

Differential Revision: D5463772

fbshipit-source-id: 10b8963af778503a3de6edbabb869747bd1e986d
2017-07-21 16:24:10 -07:00
Geet Sethi
11c4647447 Allow CPU device scope in data_parallel_model and data_parallel_rendevous device scope checks
Summary: Allowing CPU device scope instead of enforcing no device scope in data_parallel_model and data_parallel_rendevous.

Reviewed By: akyrola

Differential Revision: D5440492

fbshipit-source-id: bcd4344d64c710ea50ec8a65e3e9d102e35c66ea
2017-07-18 15:47:41 -07:00
Geet Sethi
a68bb5e3f9 Added device scope checks to data_parallel_model and data_parallel_rendevous
Summary:
Added device scope checks to data_parallel_model and data_parallel_rendevous

Added test to check that checks are working correctly to data_parallel_model_test

Fixed device_scope error in test_synchronization_barrier

Reviewed By: akyrola

Differential Revision: D5403936

fbshipit-source-id: 849c1cd7452692efbc5ef74d2d60ede090c9c017
2017-07-12 10:47:28 -07:00
Andrew Dye
31f394f8b3 Add synchronization barrier API to data parallel model
Summary: Add synchronization barrier API with configurable timeout. Users can call Synchronize() to join variable length execution before resuming multi-machine communication steps, i.e., resuming distributed training iterations after validation on a single machine.

Reviewed By: akyrola

Differential Revision: D5348387

fbshipit-source-id: 5826da10e6a60c50394c36c7cf47624f10191d11
2017-07-06 09:21:19 -07:00
Luke Yeager
be7725b0ba Tests: fix dpm test when only 1 GPU present
Summary:
b33894e95d removed this line:
```py
unittest.skipIf(workspace.NumCudaDevices() < 2, "Need at least 2 GPUs.")
```
but forgot to add it back later.
```
_________________________________ DataParallelModelTest.test_equiv __________________________________
...
            if p2p_access_pattern is not None and not p2p_access_pattern[
>               devices[0], peer
            ]:
E           IndexError: index 1 is out of bounds for axis 1 with size 1
...
WARNING:data_parallel_model:** Only 1 GPUs available, GPUs [0, 1] requested
```

/cc akyrola
Closes https://github.com/caffe2/caffe2/pull/888

Reviewed By: akyrola

Differential Revision: D5341310

Pulled By: harouwu

fbshipit-source-id: 8d7f06913c7b5a42009a4033dbb6a48a8e812822
2017-07-05 14:32:12 -07:00
Thomas Dudziak
5355634dac Dict fixes/improvements and unittest targets for Python 3 in caffe2 core
Summary: As title

Reviewed By: salexspb

Differential Revision: D5316104

fbshipit-source-id: aee43819d817842e5ce6ba3d045a55b1a2491c30
2017-06-29 17:05:41 -07:00
Thomas Dudziak
342de07231 Core unit test fixes for Python 3
Summary: As title

Differential Revision: D5291327

fbshipit-source-id: 7dd9279c53ba55d3422c31973ffcec5705787fdf
2017-06-23 13:22:16 -07:00
Ahmed Taei
ffd32c8ab7 Add distributed BMUF implementation.
Summary:
Refactor data_parallel_model all_reduce and broadcast methods to work for
a given parameter set not only gradients and reuse them for BMUF distributed
implementation.
Add a distributed test (multiprocessing) to BMUF.

Reviewed By: akyrola

Differential Revision: D5267083

fbshipit-source-id: 8dcc7527d0a755b903d693d8071585f0b54d3403
2017-06-21 16:18:11 -07:00
Aapo Kyrola
34eaa19d27 CPU data parallel model
Summary:
CPU version of data parallel model. The great thing is that now we can run data_parallel_model_test in Sandcastle (as it does not have GPUs).

Pretty simple change, really. I did not change all variable names with "gpu" in them, to reduce risk (and because I was being a bit lazy). Can improve later.

Reviewed By: wesolwsk

Differential Revision: D5277350

fbshipit-source-id: 682e0c5f9f4ce94a8f5bd089905b0f8268bd2210
2017-06-20 23:19:08 -07:00
Thomas Dudziak
60c78d6160 Fixes range/xrange for Python 3
Summary: As title

Differential Revision: D5151894

fbshipit-source-id: 7badce5d3122e8f2526a7170fbdcf0d0b66e2638
2017-06-07 00:04:26 -07:00
Aapo Kyrola
cdb50fbf2b add optimizer support to data_parallel_model; Use MomentumSGDUpdate
Summary:
This diff does two things:
- adds support for optimizers to data_parallel_model. The user can supply optimizer_builder_fun instead of param_update_builder_fun. The latter is called for each GPU separately with the proper namescope and devicescope, while the optimizer builder is called only once and adds optimizers to the whole model (see the usage sketch below).

- uses MomentumSGDUpdate instead of MomentumSGD + WeightedSum. This brings major perf benefits.

Changes resnet50 trainer to use optimizer.

This relies on D5133652
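
A hedged usage sketch of the two builder styles (helper names are placeholders; optimizer.build_sgd is used under the assumption that it accepts these arguments):

```python
from caffe2.python import optimizer

# Old style: param_update_builder_fun is called once per GPU, inside that
# device's name scope, and adds per-parameter update ops itself.
def param_update_builder_fun(model):
    for param in model.GetParams():
        grad = model.param_to_grad[param]
        # per-parameter MomentumSGD + WeightedSum ops would be added here
        pass

# New style: optimizer_builder_fun is called once for the whole model.
def optimizer_builder_fun(model):
    return optimizer.build_sgd(model, base_learning_rate=0.1,
                               policy="fixed", momentum=0.9)

# data_parallel_model.Parallelize_GPU(..., optimizer_builder_fun=optimizer_builder_fun)
```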

Reviewed By: dzhulgakov

Differential Revision: D5142973

fbshipit-source-id: 98e1114f5fae6c657314b3296841ae2dad0dc0e2
2017-05-30 12:49:57 -07:00
Luke Yeager
6b1cf26380 Fix for dpm when GPUs don't have p2p access
Summary:
See discussion at https://github.com/caffe2/caffe2/pull/633#issuecomment-303536902

Tested with a TitanX (Pascal) and a TitanZ (Kepler) with this access pattern.
```
Checking GPU(s) for support of peer to peer memory access...
> Peer access from TITAN X (Pascal) (GPU0) -> GeForce GTX TITAN Z (GPU1) : No
> Peer access from TITAN X (Pascal) (GPU0) -> GeForce GTX TITAN Z (GPU2) : No
> Peer access from GeForce GTX TITAN Z (GPU1) -> TITAN X (Pascal) (GPU0) : No
> Peer access from GeForce GTX TITAN Z (GPU1) -> GeForce GTX TITAN Z (GPU2) : Yes
> Peer access from GeForce GTX TITAN Z (GPU2) -> TITAN X (Pascal) (GPU0) : No
> Peer access from GeForce GTX TITAN Z (GPU2) -> GeForce GTX TITAN Z (GPU1) : Yes
```
All combinations pass:
* `0,1`
* `0,2`
* `1,2`
* `0,1,2`
Closes https://github.com/caffe2/caffe2/pull/659

Differential Revision: D5148779

Pulled By: akyrola

fbshipit-source-id: 6263edfe8b36623983f1946b5c3f4a3fef415a45
2017-05-30 12:02:19 -07:00
Deepak Gopinath
33c40e8a6e Handling shared indices in sparse gradient updates
Summary: When two or more blobs are gathered by the same indices blob in a data parallel model, we used to concatenate multiple times and re-write to the same indices blob. This leads to illegal memory access at times because the gradientslice indices blob is longer than its corresponding gradientslice values blob. This diff adds a check in order to avoid this.

Reviewed By: akyrola

Differential Revision: D5116817

fbshipit-source-id: 1c086d092eb6d48926d600f9408f578f5ddc41c7
2017-05-24 22:47:00 -07:00
Aapo Kyrola
a2c01e830b fix duplicate init blob issue + fix test
Summary:
Address KaimingHe's comments in D5093689 about the same blob being initialized twice, causing the internal consistency check to fail. Also, I noticed that my new test for test_checkpoint_params was completely botched due to an indentation issue (it did not actually execute any test), so this fixes that as well.
Modified the test to add a duplicate param initializer, so that this bug is tested for.

Reviewed By: KaimingHe

Differential Revision: D5101304

fbshipit-source-id: 72f343035c1b4953e7bb9a1a1c171cf05d3ead26
2017-05-20 09:18:29 -07:00
Aapo Kyrola
0af0cba2b7 Refactor data_parallel_model initial sync and checkpointing
Summary:
Major improvements. Before, we only synced the "params" and "computed params" of the model after initialization and after loading a checkpoint. But actually we want to sync all blobs that are generated in the param_init_net. For example, the _momentum blobs were missed by the previous implementation and had to be manually included in checkpoint finalization.

I also added GetCheckpointParams() to data_parallel_model because it is now fully general. Also added a unit test.

Reviewed By: andrewwdye

Differential Revision: D5093689

fbshipit-source-id: 8154ded0c73cd6a0f54ee024dc5f2c6826ed7e42
2017-05-19 12:48:06 -07:00
Ahmed Taei
25fd005dd9 Initial implementation of Blockwise Model Update Filtering (BMUF)
Summary:
A single-machine multi-GPU version of the BMUF algorithm. BMUF is a modification to model averaging where the update to the global model is implemented as a filter:

param_t = param_(t-1) + delta_t
delta_t = beta * delta_(t-1) + alpha * (average(param_t) - param_(t-1))
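
A small numpy sketch of one block update, assuming the conventional BMUF form in which alpha scales the whole averaged block step:

```python
import numpy as np

def bmuf_step(param_prev, worker_params, delta_prev, alpha, beta):
    """One BMUF filter step over the per-worker parameter copies."""
    block_grad = np.mean(worker_params, axis=0) - param_prev  # average(param_t) - param_(t-1)
    delta = beta * delta_prev + alpha * block_grad            # block momentum filter
    return param_prev + delta, delta

param = np.zeros(4)
delta = np.zeros(4)
workers = np.array([[1.0, 2.0, 3.0, 4.0],
                    [3.0, 2.0, 1.0, 0.0]])
param, delta = bmuf_step(param, workers, delta, alpha=0.5, beta=0.875)
print(param)  # [1. 1. 1. 1.]
```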

Reviewed By: akyrola

Differential Revision: D4995057

fbshipit-source-id: 48176ba66d67eaf3fa4dee16d50d9589825ddba4
2017-05-15 18:18:15 -07:00
Yury Zemlyanskiy
4bf559eddb RNNCell, LSTMCell, LSTMWithAttentionCell
Summary: This is a nice way to re-use RNN layers for training and for inference.

Reviewed By: salexspb

Differential Revision: D4825894

fbshipit-source-id: 779c69758cee8caca6f36bc507e3ea0566f7652a
2017-04-18 00:47:20 -07:00
Aapo Kyrola
f94f43fd6e Working sparse gradients for data parallel model
Summary: This diff enables sparse gradient synchronization between GPUs. The test case is now a bit too convoluted, but once D4871680 is landed, we can simplify it a bit.

Reviewed By: dzhulgakov

Differential Revision: D4877087

fbshipit-source-id: 37bbb07051cbaf3a6e3c54b0eead97f3e02337d5
2017-04-13 17:39:23 -07:00
Aapo Kyrola
02f0c1c9d7 make memonger work with RecurrentNetwork(Gradient)
Summary:
This diff enables support of recurrent networks for memonger:
1. Memonger descends into the step-nets and renames the blobs accordingly
2. Memonger tells the gradient op about the renamed blobs by adding a parameter "paramname.renamed=<new name>"
3. RecurrentNetworkGradientOp applies remapping to links and gradient blobs.

I first thought of refactoring the whole gradient blob management of the recurrent network, but that looks to be very hard without a major revision of the code.

Note, I did not enable memonger for neural_mt, since I think the team should do more testing before enabling this.

Reviewed By: salexspb

Differential Revision: D4812823

fbshipit-source-id: 1ffdf3cfb4fcd00eec5bb0ece3bf416aa6d3e26b
2017-04-05 09:48:25 -07:00
Aapo Kyrola
91f468b15c fixes to make data parallel model work for RecurrentNet + test case
Summary:
First, this diff includes a full test of data-parallel LSTM, which confirms it works correctly. To make it work, some changes had to be made:
 - cell net/step net external inputs must be namespace scoped
 - prevent double-namescoping of cellnet inputs
 - make data parallel model understand recurrentnets so the device-mapping works

Reviewed By: salexspb

Differential Revision: D4708840

fbshipit-source-id: 4b0ddc43642d449076a2b6f67ad1c47f84138ff4
2017-03-14 15:48:07 -07:00
Aapo Kyrola
89c08334bb data_parallel_model support for sparse gradients and CPU ops
Summary:
Data parallel model did not support sparse operations, nor gradients computed on CPU ops.

Currently sparse operations are done on CPU, so there is no point in "data parallelizing" them. I had to make a few changes to data_parallel_model to support this:
 1. Model can have params that are added prior to adding the data parallel part. For example, a lookup table of word vectors would be a parameter that is non-parallel.
 2. Thus, when data parallel model is called, it will separate the non-parallel params and avoid working on them. Note: when we add distributed version, we need to explicitly handle them with AllGather!

This works nicely since Caffe2 automatically adds the backward concat-operator when multiple ops gather from the same blob.

I also added support for data parallel CPU ops, which might be necessary in cases where we don't have a GPU implementation of some ops.

Test in data_parallel_model_test validates the correctness of the code by running the same trainer on different number of gpus and checking the end result is same.

Reviewed By: jhcross

Differential Revision: D4649208

fbshipit-source-id: e3b7ae701ead468dc94c52a976eafec5c9831097
2017-03-09 13:48:41 -08:00
Aapo Kyrola
1c7886701e lr_scale to loss_scale
Summary:
As per discussion in https://www.prod.facebook.com/groups/184236721951559/permalink/354591931582703/, KaimingHe pointed out that scaling the LR is not the same as scaling the loss, since LR scaling will affect the weight decay (which is implemented by modifying the gradient, which is thus not yet correctly 'averaged'). Actually, prigoyal tried to convince me earlier that loss scaling is the way to go, but I was not convinced then :/.

So this diff removes the LR scaling parameter passed by data_parallel_model and instead passes a loss_scale parameter to the model creation function. Unfortunately, this will break all existing code that uses the data parallel model. But that is not entirely a bad thing, since it will bring awareness to this change. I will inform the FB groups about this.

In this diff I modified all my models to work correctly.
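
A hedged sketch of how a model creation function consumes the new loss_scale argument (modeled on the resnet50_trainer pattern; blob names are placeholders):

```python
def create_model_ops(model, loss_scale):
    # ... build the forward pass here ...
    softmax, loss = model.SoftmaxWithLoss(["pred", "label"], ["softmax", "loss"])
    # Scale the loss instead of the LR, so weight decay stays correct per device.
    loss = model.Scale(loss, "loss_scaled", scale=loss_scale)
    return [loss]

# data_parallel_model.Parallelize_GPU(..., forward_pass_builder_fun=create_model_ops)
```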

Reviewed By: Yangqing

Differential Revision: D4507002

fbshipit-source-id: 16c7221663282f71a1b754b34de0c8ccd5c2ca90
2017-02-03 07:44:40 -08:00
Aapo Kyrola
3410939459 pass learning rate scaling factor to parameter update builder function
Summary:
When refactoring data parallel model, the division of the LR by the number of devices was dropped, so we ended up effectively multiplying the gradients by the number of devices. Thus, we need to scale the LR by 1/numgpus.

Created a test to confirm that data_parallel_model produces exactly same results on different number of gpus, given the total batch size.

Reviewed By: prigoyal

Differential Revision: D4248907

fbshipit-source-id: af21ede113e6ac25f12c556de298cb18974548be
2016-12-05 11:53:26 -08:00
Yangqing Jia
238ceab825 fbsync. TODO: check if build files need update. 2016-11-15 00:00:46 -08:00
Yangqing Jia
d1e9215184 fbsync 2016-10-07 13:08:53 -07:00