Commit Graph

137 Commits

Author SHA1 Message Date
Alexander Sidorov
52befa4802 DataParallelModel: take param_init_net into account in _InferBlobDevice
Summary:
Here is my example:

For a static RNN, the timestep blob is created as part of param_init_net. Before, DPM assumed it was a CUDA blob by default and it participated in broadcasting, causing the Copy on line 798 to fail. Having no device mapping is correct for this blob.

Reviewed By: akyrola

Differential Revision: D5631716

fbshipit-source-id: 28c3eb17ecc3080c95c41d69a60bf7262d3907d4
2017-08-15 12:06:46 -07:00
Zhaoming Wu
399fc9fb09 Added Nesterov
Summary: Added Nesterov momentum as an option for BMUF and corresponding tests

Reviewed By: asaadaldien

Differential Revision: D5599888

fbshipit-source-id: 30819c9e689347c8b75daddc7444bea9f54193ae
2017-08-11 13:52:43 -07:00
Priya Goyal
5c77cc8182 Exposing num_workers as parameter and enable recycling activations
Summary: as promised, a separate diff for dpm changes I made in experimental code

Reviewed By: pietern

Differential Revision: D5551304

fbshipit-source-id: 9013aeab6c388b1c415ffb2e36fb8dd6b8cf90b0
2017-08-08 19:48:41 -07:00
Ahmed Taei
647f35e742 Fix SyncAllParamsDistributed for Python 3x
Summary:
In Python 3.x, dictionary values are a view object, not a list, and can't be concatenated to a list;
this diff fixes that.
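The pitfall can be sketched in plain Python (illustrative blob names, not the actual caffe2 code):

```python
# In Python 3, dict.values() returns a view object, not a list, so
# concatenating it to a list with `+` raises TypeError. Materializing
# the view with list() works on both Python 2 and 3.
device_grouped_blobs = {"gpu_0": ["w_grad"], "gpu_1": ["w_grad"]}
extra = ["lr"]

try:
    combined = extra + device_grouped_blobs.values()  # ok in Python 2 only
except TypeError:
    combined = extra + list(device_grouped_blobs.values())  # portable fix

print(combined)  # ['lr', ['w_grad'], ['w_grad']]
```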

Reviewed By: andrewwdye

Differential Revision: D5576724

fbshipit-source-id: c60441857ceceb9c4a71122d2db5e9abad6d3fc2
2017-08-07 14:23:32 -07:00
Aapo Kyrola
26645154bb warn about using test/val model with init_params=True + fixed some cases
Summary: It is a common mistake to create a test/validation model with init_params=True. When its param_init_net is run, it will overwrite the training model's params, and with DPM, those won't be synchronized to all GPUs. I don't want to make this an assertion yet, since it might break people's trainers (it is ok to have init_params=True if you never run the param_init_net...).

Reviewed By: asaadaldien

Differential Revision: D5509963

fbshipit-source-id: 63b1a16ec0af96e3790e226850f6e0e64689143f
2017-07-27 13:20:27 -07:00
Aapo Kyrola
af1e45c1e1 support appending net and converting them
Summary:
As per rushabhmshah99's request: he wants to append a pre-trained model (without training it) to the model.
So I added data_parallel_model.ConvertNetForDevice() to enable that. The unit test shows an example of how to use this with
AppendNet, and I also added a blurb to the function.

Differential Revision: D5503335

fbshipit-source-id: b2a5db5c1739dc97f46dd0d7606ed555d99255b8
2017-07-27 11:07:48 -07:00
Aapo Kyrola
3363681304 enable CreateCommonWorld to bootstrap from existing common world
Summary: Use romain-intel's ContextFactory to create common worlds from existing common worlds, thus bypassing the KV store completely. Changed data_parallel_model to automatically find whether there is already a CW we can work with. CreateCommonWorldOp takes an optional second parameter, which is the existing CW.

Reviewed By: andrewwdye

Differential Revision: D5494956

fbshipit-source-id: 5f7a840bcd5fe4ea756fafeacc746bc2cf5078b0
2017-07-26 22:31:55 -07:00
Ahmed Taei
804ebf7c41 Populate learning rate blob name into data_parallel_model and fix resnet50_trainer example.
Reviewed By: akyrola

Differential Revision: D5463772

fbshipit-source-id: 10b8963af778503a3de6edbabb869747bd1e986d
2017-07-21 16:24:10 -07:00
Geet Sethi
11c4647447 Allow CPU device scope in data_parallel_model and data_parallel_rendevous device scope checks
Summary: Allowing CPU device scope instead of enforcing no device scope in data_parallel_model and data_parallel_rendevous.

Reviewed By: akyrola

Differential Revision: D5440492

fbshipit-source-id: bcd4344d64c710ea50ec8a65e3e9d102e35c66ea
2017-07-18 15:47:41 -07:00
Geet Sethi
ab0d631d6d Adding AllCompare-like function to data_parallel_model
Summary: Added a function _RunComparison to data_parallel_model that checks whether all shards in a given rendezvous have the same value for a given blob_name.

Reviewed By: wesolwsk

Differential Revision: D5394164

fbshipit-source-id: c2b07d0f8d5846fa9887d53b0be091a8c057f106
2017-07-13 13:03:57 -07:00
Geet Sethi
a68bb5e3f9 Added device scope checks to data_parallel_model and data_parallel_rendevous
Summary:
Added device scope checks to data_parallel_model and data_parallel_rendevous

Added test to check that checks are working correctly to data_parallel_model_test

Fixed device_scope error in test_synchronization_barrier

Reviewed By: akyrola

Differential Revision: D5403936

fbshipit-source-id: 849c1cd7452692efbc5ef74d2d60ede090c9c017
2017-07-12 10:47:28 -07:00
Ralph Mao
febae7b20b fix a bug in the report function of Data_Parallel
Summary: replace params with sp, otherwise it will report an empty list

Reviewed By: akyrola

Differential Revision: D5382716

fbshipit-source-id: 34d8e6ee00cbe1718702e3d1f23ea12f8d65063e
2017-07-07 13:03:46 -07:00
Andrew Dye
31f394f8b3 Add synchronization barrier API to data parallel model
Summary: Add synchronization barrier API with configurable timeout. Users can call Synchronize() to join variable length execution before resuming multi-machine communication steps, i.e., resuming distributed training iterations after validation on a single machine.
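The semantics can be illustrated with Python's own threading.Barrier as a hypothetical stand-in for the Gloo-backed barrier, including the configurable timeout:

```python
import threading

# Three workers do variable-length work, then all block on the barrier
# before resuming communication -- analogous to calling Synchronize().
barrier = threading.Barrier(3, timeout=5.0)
results = []

def worker(n):
    results.append(n * n)  # stand-in for variable-length validation work
    barrier.wait()         # no worker proceeds until all have arrived

threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # [0, 1, 4]
```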

Reviewed By: akyrola

Differential Revision: D5348387

fbshipit-source-id: 5826da10e6a60c50394c36c7cf47624f10191d11
2017-07-06 09:21:19 -07:00
Aapo Kyrola
2d133d4627 increase concurrency default
Summary: Huge improvement in my tests, and it does not really hurt either.

Reviewed By: wesolwsk

Differential Revision: D5374925

fbshipit-source-id: c96a4ed2ca653120a82233c0037cbfded8a2d2a1
2017-07-05 21:46:31 -07:00
Simon Layton
090506ac87 Add NCCLBroadcast to correct net
Summary:
Otherwise it was always added to the main net instead of param_init_net when
desired (i.e. for the initial param sync).
Closes https://github.com/caffe2/caffe2/pull/894

Differential Revision: D5367451

Pulled By: akyrola

fbshipit-source-id: 3d82be6da687c736bd15f4852dbd272266eb4811
2017-07-03 16:54:44 -07:00
Aapo Kyrola
8c74c36626 fix reducing device option
Summary: This was broken in a previous diff, fixing it to use model device type.

Reviewed By: asaadaldien

Differential Revision: D5356005

fbshipit-source-id: a4fcc932bae772076b57625a5fcc0d38eb702cc9
2017-06-30 09:19:57 -07:00
Thomas Dudziak
5355634dac Dict fixes/improvements and unittest targets for Python 3 in caffe2 core
Summary: As title

Reviewed By: salexspb

Differential Revision: D5316104

fbshipit-source-id: aee43819d817842e5ce6ba3d045a55b1a2491c30
2017-06-29 17:05:41 -07:00
Yongqiang Wang
ea659b8f2e broadcast to global parameters when using warmup
Reviewed By: asaadaldien, jay-mahadeokar

Differential Revision: D5340692

fbshipit-source-id: 80879847ff71c8d620de502ef95a9ffb4bdf595d
2017-06-28 13:35:27 -07:00
Ahmed Taei
fbe2526343 Allow concurrent execution of GLOO broadcast collectives in
Summary:
This adds a CollectivesConcurrencyControl class to manage creating common contexts and cyclic controls for executing Gloo collectives,
and refactors AllReduce and _AddDistributedParameterSync to use it.

Reviewed By: akyrola

Differential Revision: D5335795

fbshipit-source-id: 5084e0a65cdb989cd949be3868b77a680561022d
2017-06-28 12:49:12 -07:00
Henry Lu
9a14c013c3 Refactor data_parallel_model to take advantage of Gloo broadcast op in broadcasting across machines and GPUs in one operation
Summary: Combine _AddDistributedParameterSync() and _SyncParams() into a single function to broadcast across distributed machines and all local GPUs simultaneously. This is similar to how calls to Allreduce have already been optimized using the functionality of Gloo. All the refactoring work is contained in data_parallel_model.py.

Reviewed By: akyrola, andrewwdye

Differential Revision: D5329277

fbshipit-source-id: 4407b88980cf396f2e0f994d796294fa79fd39ed
2017-06-27 19:35:24 -07:00
Simon Layton
d45f722e43 data_parallel_model: NCCLBroadcast root fix
Summary:
The root is the root _rank_ and not the root _device_. Thus we always
use root=0, regardless of the devices used.

https://github.com/NVIDIA/nccl/blob/v1.3.0-1/src/broadcast.cu#L75

/cc slayton58
Closes https://github.com/caffe2/caffe2/pull/872
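The rank-vs-device distinction can be sketched as follows (hypothetical helper, not the actual fix):

```python
# NCCL's broadcast root is a *rank* -- an index into the ordered list of
# participating devices -- not a CUDA device id. With devices [2, 5, 7],
# broadcasting from the first device always means root=0.
def root_rank_for_device(devices, root_device):
    return devices.index(root_device)

devices = [2, 5, 7]
print(root_rank_for_device(devices, 2))  # 0, regardless of device ids
```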

Differential Revision: D5329564

Pulled By: akyrola

fbshipit-source-id: 5a34be30c1a0046a74f28437cb08333c1fb46098
2017-06-27 09:47:48 -07:00
Jay Mahadeokar
04c9c8c5c2 fix for loading model with bmuf
Summary: - One line fix for loading saved checkpoint when using Parallelize_GPU_BMUF

Reviewed By: asaadaldien

Differential Revision: D5315254

fbshipit-source-id: a20ba6438c8e6b2ef44b65270c1d3f9ab645ded0
2017-06-23 17:16:33 -07:00
Thomas Dudziak
342de07231 Core unit test fixes for Python 3
Summary: As title

Differential Revision: D5291327

fbshipit-source-id: 7dd9279c53ba55d3422c31973ffcec5705787fdf
2017-06-23 13:22:16 -07:00
Ahmed Taei
5ca263fb1c Add a warmup option for BMUF
Reviewed By: yqwangustc

Differential Revision: D5279655

fbshipit-source-id: 7c778a88909580bbe43d4bac4b7d73be0d0e3f27
2017-06-22 14:32:39 -07:00
Ahmed Taei
ffd32c8ab7 Add distributed BMUF implementation.
Summary:
Refactor data_parallel_model's all_reduce and broadcast methods to work on
a given parameter set, not only gradients, and reuse them for the distributed
BMUF implementation.
Add a distributed test (multiprocessing) to BMUF.

Reviewed By: akyrola

Differential Revision: D5267083

fbshipit-source-id: 8dcc7527d0a755b903d693d8071585f0b54d3403
2017-06-21 16:18:11 -07:00
Aapo Kyrola
34eaa19d27 CPU data parallel model
Summary:
CPU version of data parallel model. The great thing is that now we can run data_parallel_model_test in Sandcastle (as it does not have GPUs).

Pretty simple change, really. I did not change all variable names with "gpu" in them, to reduce risk (and being a bit lazy). Can improve later.

Reviewed By: wesolwsk

Differential Revision: D5277350

fbshipit-source-id: 682e0c5f9f4ce94a8f5bd089905b0f8268bd2210
2017-06-20 23:19:08 -07:00
Aapo Kyrola
96f19fefc0 add warning if data parallel model is created for gpus that we don't have
Summary: Don't want to assert since it can be useful to sometimes create models that are not run (for example, unit tests).

Reviewed By: pietern

Differential Revision: D5258905

fbshipit-source-id: f1beee0605bfef235ed0f23f7e78259109720254
2017-06-16 07:02:37 -07:00
Thomas Dudziak
60c78d6160 Fixes range/xrange for Python 3
Summary: As title
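A common portability shim for this class of fix (an illustrative sketch, not the actual diff):

```python
# Python 3 removed xrange; range is lazy there, like Python 2's xrange.
try:
    range_fn = xrange  # Python 2
except NameError:
    range_fn = range   # Python 3

print(list(range_fn(3)))  # [0, 1, 2]
```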

Differential Revision: D5151894

fbshipit-source-id: 7badce5d3122e8f2526a7170fbdcf0d0b66e2638
2017-06-07 00:04:26 -07:00
Aapo Kyrola
5e6bd4fbfc Return predict params from ExtractPredictorNet + test
Summary:
Make it easier for users by returning from ExtractPredictorNet the list of blobs that must be saved/exported to run a predictor net. Added a test for ExtractPredictorNet

Codemod.
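The idea can be sketched with a toy net representation (hypothetical, not caffe2's actual protobuf-based implementation): walk the ops backwards and keep only those needed to produce the predictor outputs.

```python
# ops: list of (name, inputs, outputs) in execution order.
def extract_predictor_ops(ops, outputs):
    needed, kept = set(outputs), []
    for name, ins, outs in reversed(ops):
        if needed & set(outs):          # op produces something we need
            kept.append((name, ins, outs))
            needed |= set(ins)          # now we also need its inputs
    return list(reversed(kept))

ops = [
    ("FC",      ["data", "w"], ["pred"]),
    ("FC_Grad", ["pred"],      ["w_grad"]),  # backward pass, not needed
]
print([name for name, _, _ in extract_predictor_ops(ops, ["pred"])])  # ['FC']
```

The surviving ops' inputs (here `data` and `w`) are exactly the blobs that must be saved/exported alongside the net.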

Reviewed By: asaadaldien

Differential Revision: D5176097

fbshipit-source-id: b1af42132459487b8d94fcdde0e4c514da608243
2017-06-05 15:34:37 -07:00
Andrey Malevich
a8fb85797c Refactoring of the parameters step 0. Add simple tags and unify interface for params and computed_params.
Summary:
This diff is the first step in the effort to refactor all parameters. As a first step, I'm merging the concepts of params and computed_params, which are going to be based on tags instead (in the first version it still uses the old data structs to store all the BlobReferences).

Renaming computed_params to non-trainable/non-backprop params should be done in some other diff.

Reviewed By: salexspb

Differential Revision: D5171159

fbshipit-source-id: 68031ca779f053fb266a7c4a2e5b482a3bd9c832
2017-06-02 17:17:57 -07:00
Simon Layton
58874ad5bf Fp16 training initializers
Summary:
Re-open for re-importing :)
Closes https://github.com/caffe2/caffe2/pull/721

Differential Revision: D5164345

Pulled By: akyrola

fbshipit-source-id: e80b32556cd25610602df91a4225b93edc0ca40b
2017-06-01 08:34:46 -07:00
Aapo Kyrola
0f8c8f37a8 Revert D5159712: [caffe2][PR] Fp16 training initializers
Summary: This reverts commit 60a889494d2e2f4df1d720331e19f638c5eb95cc

Differential Revision: D5159712

fbshipit-source-id: 16040c911b260648857f656f92b165f92c2daae0
2017-06-01 00:17:14 -07:00
Aapo Kyrola
076376f4f6 Revert D5119830: [C2] Refactoring of the parameters step 0. Add simple tags and unify interface for params and computed_params
Summary: This reverts commit 2001090a37346eb12abbb234e13e727c288eb8a7

Differential Revision: D5119830

fbshipit-source-id: bf321868338f0db85dff3237af7eaf74212dbdf6
2017-06-01 00:02:21 -07:00
Andrey Malevich
ff61ed358e Refactoring of the parameters step 0. Add simple tags and unify interface for params and computed_params
Summary:
This diff is the first step in the effort to refactor all parameters. As a
first step, I'm merging the concepts of params and computed_params, which are going
to be based on tags instead (in the first version it still uses the old data
structs to store all the BlobReferences).

Renaming computed_params to non-trainable/non-backprop params should be done in
some other diff.

Reviewed By: salexspb

Differential Revision: D5119830

fbshipit-source-id: 2001090a37346eb12abbb234e13e727c288eb8a7
2017-05-31 22:36:36 -07:00
Simon Layton
2bfacff426 Fp16 training initializers
Summary:
Adds support for generating and training fp16 models. Adds an SGD optimizer for multi-precision trainers and a new callback to data_parallel_model in order to help multi-precision models keep their different copies of parameters in sync during training.
Closes https://github.com/caffe2/caffe2/pull/697

Differential Revision: D5159712

Pulled By: salexspb

fbshipit-source-id: 60a889494d2e2f4df1d720331e19f638c5eb95cc
2017-05-31 17:46:58 -07:00
Ahmed Taei
f0f4c2fc5d Increase the number of DAG execution worker threads.
Reviewed By: akyrola

Differential Revision: D5158414

fbshipit-source-id: add377aec5588076db881a2a3750101710f29732
2017-05-31 15:19:19 -07:00
Aapo Kyrola
73a8a49c7e synchronize re-rendezvousing on node changes + support num_shards=1 rendezvous
Summary:
Currently we can get into broken situations when some nodes finish detectChanges() faster than others, so only some of the nodes start the next iteration of training. This is an inconsistent state. To prevent this, each node now sets a "re-rendezvous flag" that is allreduced after each iteration. Once all nodes agree, re-rendezvous will be done.

Also noticed that num_shards=1 does not work because data parallel model assumed num_shards>1 when rendezvous is not None. Fixed that.

Reviewed By: andrewwdye

Differential Revision: D5156282

fbshipit-source-id: f2ccbd8ad13ed37f7813ff8ad1080d963d0d17e3
2017-05-31 15:19:13 -07:00
Ahmed Taei
f2d9d97008 Add an option to reset momentum-sgd params every time between successive block updates.
Reviewed By: akyrola

Differential Revision: D5149263

fbshipit-source-id: c0a3637a1b48f74ec55c9d13c8fab3456dab809c
2017-05-31 00:32:11 -07:00
Simon Layton
1aa6300696 Option to use NCCL for broadcast
Summary:
Fixes some performance issues when `broadcast_computed_params=True` is passed to Parallelize_GPU. Enabled via the same `use_nccl` flag as AllReduce
Closes https://github.com/caffe2/caffe2/pull/630

Differential Revision: D5149828

Pulled By: akyrola

fbshipit-source-id: 12c9714c7fa078811f1cde61c8523dca8f7f968f
2017-05-30 16:46:38 -07:00
Aapo Kyrola
cdb50fbf2b add optimizer support to data_parallel_model; Use MomentumSGDUpdate
Summary:
This diff does two things:
- adds support for optimizers to data_parallel_model. The user can supply optimizer_builder_fun instead of param_update_builder_fun. The latter is called for each GPU separately with the proper namescope and devicescope, while the optimizer builder is called only once and adds optimizers to the whole model.

- uses MomentumSGDUpdate instead of MomentumSGD + WeightedSum. This brings major perf benefits.

Changes resnet50 trainer to use optimizer.

This relies on D5133652

Reviewed By: dzhulgakov

Differential Revision: D5142973

fbshipit-source-id: 98e1114f5fae6c657314b3296841ae2dad0dc0e2
2017-05-30 12:49:57 -07:00
Luke Yeager
6b1cf26380 Fix for dpm when GPUs don't have p2p access
Summary:
See discussion at https://github.com/caffe2/caffe2/pull/633#issuecomment-303536902

Tested with a TitanX (Pascal) and a TitanZ (Kepler) with this access pattern.
```
Checking GPU(s) for support of peer to peer memory access...
> Peer access from TITAN X (Pascal) (GPU0) -> GeForce GTX TITAN Z (GPU1) : No
> Peer access from TITAN X (Pascal) (GPU0) -> GeForce GTX TITAN Z (GPU2) : No
> Peer access from GeForce GTX TITAN Z (GPU1) -> TITAN X (Pascal) (GPU0) : No
> Peer access from GeForce GTX TITAN Z (GPU1) -> GeForce GTX TITAN Z (GPU2) : Yes
> Peer access from GeForce GTX TITAN Z (GPU2) -> TITAN X (Pascal) (GPU0) : No
> Peer access from GeForce GTX TITAN Z (GPU2) -> GeForce GTX TITAN Z (GPU1) : Yes
```
All combinations pass:
* `0,1`
* `0,2`
* `1,2`
* `0,1,2`
Closes https://github.com/caffe2/caffe2/pull/659

Differential Revision: D5148779

Pulled By: akyrola

fbshipit-source-id: 6263edfe8b36623983f1946b5c3f4a3fef415a45
2017-05-30 12:02:19 -07:00
Ahmed Taei
75a6f909c5 Add option to enable memonger for gradients and add param_names for save_model.
Reviewed By: akyrola

Differential Revision: D5131493

fbshipit-source-id: 7c159ccffa30eb064c157e559f1d8f0350f03ccb
2017-05-26 11:31:35 -07:00
Pieter Noordhuis
a9b5efe3c2 Expose max collective concurrency
Summary:
This was hardcoded at 4 before but should be made
configurable. Can be kept low for big MLPs and higher for convnets.

Reviewed By: akyrola

Differential Revision: D5126138

fbshipit-source-id: 713ee8bbeb243b7de1479808fd6398d397e0b49a
2017-05-25 13:32:40 -07:00
Deepak Gopinath
33c40e8a6e Handling shared indices in sparse gradient updates
Summary: When two or more blobs are gathered by the same indices blob in a data parallel model, we used to concatenate multiple times and re-write to the same indices blob. This leads to illegal memory access at times because the gradientslice indices blob is longer than its corresponding gradientslice values blob. This diff adds a check in order to avoid this.

Reviewed By: akyrola

Differential Revision: D5116817

fbshipit-source-id: 1c086d092eb6d48926d600f9408f578f5ddc41c7
2017-05-24 22:47:00 -07:00
Aapo Kyrola
a2c01e830b fix duplicate init blob issue + fix test
Summary:
Address KaimingHe's comments in D5093689 about the same blob being initialized twice, causing the internal consistency check to fail. Also I noticed that my new test for test_checkpoint_params was completely botched due to an indentation issue (it did not actually execute any test), so this fixes that as well.
 Modified the test to add a duplicate param initializer, so that this bug is tested for.

Reviewed By: KaimingHe

Differential Revision: D5101304

fbshipit-source-id: 72f343035c1b4953e7bb9a1a1c171cf05d3ead26
2017-05-20 09:18:29 -07:00
Aapo Kyrola
6384bae29b call save_to_db in CPUContext + fix a typo in data_parallel_model.
Summary:
If Predictor Exporter save_to_db is called in CUDAContext, a failure occurs since the following FeedBlob() tries to store a string (metadata), but for CUDA blobs we assume they are tensors.
  + fix a typo in data_parallel_model that I bumped into.

Reviewed By: asaadaldien

Differential Revision: D5099837

fbshipit-source-id: 69d01b35a9a1816bf083f13d8a6ce88e1f5aecb7
2017-05-19 18:25:00 -07:00
Aapo Kyrola
0af0cba2b7 Refactor data_parallel_model initial sync and checkpointing
Summary:
Major improvements. Before, we only synced the "params" and "computed params" of the model after initialization and after loading a checkpoint. But actually we want to sync all blobs that are generated in the param_init_net. For example, the _momentum blobs were missed by the previous implementation and had to be manually included in checkpoint finalization.

I also added GetCheckpointParams() to data_parallel_model because it is now fully general. Also added a unit test.
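The principle can be sketched with a toy net representation (hypothetical, not the caffe2 code): the checkpoint set is every blob produced by param_init_net, not a hand-maintained param list.

```python
# Collect all output blobs of param_init_net ops, preserving order and
# deduplicating -- this naturally picks up auxiliary blobs such as the
# _momentum blobs alongside the params themselves.
def checkpoint_blobs(param_init_net_op_outputs):
    blobs = []
    for outs in param_init_net_op_outputs:
        for b in outs:
            if b not in blobs:
                blobs.append(b)
    return blobs

ops = [["fc_w"], ["fc_b"], ["fc_w_momentum"]]
print(checkpoint_blobs(ops))  # ['fc_w', 'fc_b', 'fc_w_momentum']
```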

Reviewed By: andrewwdye

Differential Revision: D5093689

fbshipit-source-id: 8154ded0c73cd6a0f54ee024dc5f2c6826ed7e42
2017-05-19 12:48:06 -07:00
Aapo Kyrola
658c337f41 Error status for Gloo ops, and handling in elastic dpm
Summary: Add a RandomFailureOp and handling to elastic data parallel model of the status code

Reviewed By: andrewwdye

Differential Revision: D5065936

fbshipit-source-id: 24224f9ea414ee535c9e90cc28add5189354b0ef
2017-05-17 00:16:52 -07:00
Ahmed Taei
25fd005dd9 Initial implementation of Blockwise Model Update Filtering (BMUF)
Summary:
A single-machine multi-GPU version of the BMUF algorithm. BMUF is a modification to
model averaging where the update to the global model is implemented as a filter:
param_t = param_(t-1) + delta_t
delta_t = \beta * delta_(t-1) + \alpha * (average(param_t) - param_(t-1))
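A minimal standalone sketch of one block update (hypothetical scalar code, not the caffe2 implementation; the alpha and beta values are made up):

```python
# One BMUF block update: average the workers' copies, filter the delta
# with momentum beta and block learning rate alpha, apply to the global model.
def bmuf_step(global_param, prev_delta, worker_params, alpha=1.0, beta=0.875):
    avg = sum(worker_params) / len(worker_params)            # model average
    delta = beta * prev_delta + alpha * (avg - global_param) # filtered update
    return global_param + delta, delta

gp, d = 1.0, 0.0
gp, d = bmuf_step(gp, d, [1.2, 0.8, 1.4, 1.0])
print(gp, d)  # gp ≈ 1.1, delta ≈ 0.1 on the first step
```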

Reviewed By: akyrola

Differential Revision: D4995057

fbshipit-source-id: 48176ba66d67eaf3fa4dee16d50d9589825ddba4
2017-05-15 18:18:15 -07:00
Aapo Kyrola
282298dd1c Data parallel model: Disable NCCL by default to hopefully reduce deadlocks
Summary: Make NCCL optional in data_parallel_model due to continuing reliability (deadlock) issues.

Reviewed By: pietern

Differential Revision: D4988950

fbshipit-source-id: 8a2192f01b5f3c0e847137cd37aefc69e553a56f
2017-05-02 16:09:17 -07:00
Aapo Kyrola
f82a510be6 share forward activation blobs + pass unused free blobs down all branches + use shape inference
Summary:
Added optional support for using activation blobs for sharing as well. Doing this change revealed a non-optimal implementation in the blob sharing: we need to prefer reusing free blobs that are already shared by many other blobs. Otherwise the memory usage can increase when the pool of 'free blobs' grows.

Also, my first version only passed "free blobs" (i.e blobs in recycling pool) down the first branch when operators forked. But now we pass those blobs that were not used by the first branch down the second branch and so on.

Also added support for blob size information in the heuristic. This uses the shape inference mechanism.

I had to also do some small tweaks:
- use Sum() operator as a way to match shapes of blobs that had otherwise unknown shapes. This is related to the Sum() operator that is added to combine multiple incoming gradient inputs (with _autosplit gradients).
- a couple of random shape inference fixes

This reduces the Resnet-50 memory usage on 64 batch from 9.45 Gig to 8.5 Gig.
For a 32 batch, the memory usage is 4330 MiB, down from 4800 MB, compared to Torch's 6856MiB (thanks prigoyal  for checking this for me).

This is unfortunately quite a bunch to review...
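A toy version of the blob-recycling idea (hypothetical sketch, far simpler than the DFS-based memonger): once a blob's last consumer has run, its storage can back a later output.

```python
# ops: list of (inputs, output) in execution order. Returns a mapping
# from each output blob to the storage blob that can back it.
def share_blobs(ops):
    last_use = {}
    for i, (ins, _out) in enumerate(ops):
        for b in ins:
            last_use[b] = i
    mapping, free = {}, []
    for i, (ins, out) in enumerate(ops):
        mapping[out] = free.pop() if free else out  # reuse a dead blob if any
        for b in ins:
            if last_use.get(b) == i:                # b is dead after this op
                free.append(mapping.get(b, b))
    return mapping

ops = [((), "a"), (("a",), "b"), (("b",), "c")]
print(share_blobs(ops))  # {'a': 'a', 'b': 'b', 'c': 'a'} -- c reuses a
```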

Reviewed By: asaadaldien

Differential Revision: D4393909

fbshipit-source-id: 9c7c94125f96512bea80463ebcb63c215ef95ff9
2017-04-25 14:23:25 -07:00
Xian Li
4c08d6ae3b Allow cpu-only grad update in Parallelize_GPU.
Summary: Instead of requiring gradient updates on GPU, this change will allow the usage when loss computation happens on GPU while all grad updates happen on CPU.

Reviewed By: jhcross

Differential Revision: D4943996

fbshipit-source-id: 1f2144c4277dfdb865877e0d0216ca1ac7dd7309
2017-04-24 18:47:36 -07:00
Yiming Wu
bef6e45f8b rename ModelHelperBase
Summary:
rename ModelHelperBase to ModelHelper.

This is the result of running:

  find . -type f -exec sed -i 's/ModelHelperBase/ModelHelper/g' {} +

We had 19 results when running fbgs on ModelHelperBase. Here there are 20 instances because I added 1 test in model_helpers_test.py

Reviewed By: salexspb

Differential Revision: D4928337

fbshipit-source-id: bc4c12b60b90c167e717de50ea9fe17521e142e3
2017-04-24 15:52:26 -07:00
Pieter Noordhuis
e0a904011b Use gradient name for allreduce op name
Summary: This may help tell different allreduce operations apart during debugging/tracing.

Reviewed By: prigoyal

Differential Revision: D4897921

fbshipit-source-id: bbb2ce02a3e1f467ad54f8a3aed6a4e2b26a9fe4
2017-04-17 23:31:27 -07:00
Pieter Noordhuis
ed1e342860 Reuse common world for allreduce/broadcast
Summary:
The common worlds can be reused without performance impact as long as
there is a guarantee that no two algorithm instances are using it at
any given time. Since we know the ordering and the maximum
parallelism, we can cycle through common worlds, and reuse them
accordingly.

Differential Revision: D4896779

fbshipit-source-id: 164e1727692eab904fa6879a9f91a3e8332a2e30
2017-04-17 23:31:26 -07:00
Aapo Kyrola
f94f43fd6e Working sparse gradients for data parallel model
Summary: This diff enables sparse gradient synchronization between GPUs. The test case is now a bit too convoluted, but once D4871680 is landed, we can simplify it a bit.

Reviewed By: dzhulgakov

Differential Revision: D4877087

fbshipit-source-id: 37bbb07051cbaf3a6e3c54b0eead97f3e02337d5
2017-04-13 17:39:23 -07:00
Aapo Kyrola
4967db0756 sanity checks for data parallel model
Summary: To help dgponinath, and people in general: check that params don't have duplicate entries.
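The check amounts to something like this (hypothetical sketch, not the actual caffe2 assertion):

```python
# Reject a param list with repeated entries -- a duplicate would be
# broadcast and all-reduced twice, silently corrupting the update.
def assert_unique_params(params):
    seen = set()
    for p in params:
        if p in seen:
            raise ValueError("duplicate param: %s" % p)
        seen.add(p)

assert_unique_params(["fc_w", "fc_b"])        # passes silently
try:
    assert_unique_params(["fc_w", "fc_w"])
except ValueError as e:
    print(e)  # duplicate param: fc_w
```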

Differential Revision: D4872132

fbshipit-source-id: 1cca1237fda771eb270227f452ecae0f912d7a33
2017-04-12 09:32:12 -07:00
Aaron Markham
58f7f2b441 doxygen python block added
Summary: Closes https://github.com/caffe2/caffe2/pull/226

Differential Revision: D4793550

Pulled By: JoelMarcey

fbshipit-source-id: cc33e58186304fa8dcac2ee9115dcc271d785b1e
2017-03-29 06:46:16 -07:00
James Cross
79c3a3af54 add gpu support for caffe2-seq2seq
Summary: Adding synchronous optimization on GPUs to the translation training pipeline, via data_parallel_model.Parallelize_GPU, which needs to be updated so there is some way of performing sparse parameter updates (e.g., on embedding tables), whether on GPU or CPU.

Reviewed By: urikz

Differential Revision: D4631914

fbshipit-source-id: 9cdd655f7dbda3f9b2733d459228b3e097892441
2017-03-17 05:19:14 -07:00
Pieter Noordhuis
9e6fd02c28 Use Gloo ops in data_parallel_model
Summary:
No longer need GPU to CPU copies. The allreduce operator no longer
uses 'local allreduce - global allreduce - local broadcast' sequence
when Gloo is used, but passes all input blobs directly.

Depends on D4708860.

Differential Revision: D4709897

fbshipit-source-id: 4d745d5d8bac9c2fcca081dd5d812c902808c3b6
2017-03-14 22:34:51 -07:00
Aapo Kyrola
91f468b15c fixes to make data parallel model work for RecurrentNet + test case
Summary:
First, this diff includes a full test of data-parallel LSTM, which confirms it works correctly. To make it work, some changes had to be made:
 - cell net/step net external inputs must be namespace scoped
 - prevent double-namescoping of cellnet inputs
 - make data parallel model understand recurrentnets so the device-mapping works

Reviewed By: salexspb

Differential Revision: D4708840

fbshipit-source-id: 4b0ddc43642d449076a2b6f67ad1c47f84138ff4
2017-03-14 15:48:07 -07:00
Aapo Kyrola
fc7939c25b add model_helper.ExtractPredictorNet()
Summary:
It has been a pain to save predictor-compatible models from Caffe2. This diff adds function ExtractPredictorNet that takes a training model and outputs a predictor model by removing all operators that are not relevant for prediction, such as backward pass and dequeue-ops for input loading (as in predictor, the input data is external input).

We can also consider including this directly in the predictor exporter for FB usage.

Reviewed By: rpenggithub

Differential Revision: D4693264

fbshipit-source-id: e81abbbec0bd4d717159cf36488d0baaf0130090
2017-03-13 16:32:04 -07:00
Aapo Kyrola
3f682ca699 Fix to data parallel model blob_to_device mapping
Summary: We ran InferToDeviceMapping too early; we should have done it also after running the parameter update function, since that can create new blobs, like the momentum blobs. This fix is maybe not optimal, but it works and is fast enough.

Differential Revision: D4693450

fbshipit-source-id: 4c4cc2396dad371b3fbcd1d8da51133ea09a57e0
2017-03-10 18:03:58 -08:00
Aapo Kyrola
a109cbdfb6 fix bug in data_parallel_model stripParams()
Summary: Thanks to shenpan for detecting this bug. The problem is that FinalizeAfterCheckpoint() can be passed a list of strings, not blob references, and that fails in stripParams() after an assertion I added in D4649208. It is ok to pass strings to that function as well.

Reviewed By: jhcross

Differential Revision: D4691028

fbshipit-source-id: 0bca80d44a5ab641438cc5b26482bca0b1527d69
2017-03-10 13:17:11 -08:00
Aapo Kyrola
89c08334bb data_parallel_model support for sparse gradients and CPU ops
Summary:
Data parallel model did not support sparse operations, nor gradients computed on CPU ops.

Currently sparse operations are done on the CPU, so there is no point in "data parallelizing" them. I had to make a few changes to data_parallel_model to support this:
 1. Model can have params that are added prior to adding the data parallel part. For example, a lookup table of word vectors would be a parameter that is non-parallel.
 2. Thus, when data parallel model is called, it will separate the non-parallel params and avoid working on them. Note: when we add distributed version, we need to explicitly handle them with AllGather!

This works nicely since Caffe2 automatically adds the backward concat-operator when multiple ops gather from the same blob.

I also added support for data parallel CPU ops, which might be necessary in cases when we don't have a GPU implementation of some ops.

Test in data_parallel_model_test validates the correctness of the code by running the same trainer on different number of gpus and checking the end result is same.

Reviewed By: jhcross

Differential Revision: D4649208

fbshipit-source-id: e3b7ae701ead468dc94c52a976eafec5c9831097
2017-03-09 13:48:41 -08:00
Pieter Noordhuis
c115646d71 Use fbcollective
Summary:
Update data parallel model to default to using fbcollective.

Update broadcast op to correctly handle Tensor<long>.

Differential Revision: D4508029

fbshipit-source-id: 7b8d17223e25b3e1098ee3f2a08af61af140729e
2017-02-07 10:48:33 -08:00
Priya Goyal
3c90356499 Add check for num_shards when using distributed training
Summary:
If num_shards = 1 and distributed training is on, then ring reduce fails when it looks for left pair to exchange information.
I also used the opportunity to do a small fix in my data loader benchmark

Differential Revision: D4513545

fbshipit-source-id: 7d3115b871a39b8ce7b55553394b607d16e08b74
2017-02-06 20:19:19 -08:00
Aapo Kyrola
3049bc1fed Fix data parallel model code doc
Summary: Thanks rpenggithub

Reviewed By: rpenggithub

Differential Revision: D4510933

fbshipit-source-id: 25e33ac0ba5a5143fc5bbe1abb615d7512c7ef41
2017-02-06 12:33:33 -08:00
Aapo Kyrola
1c7886701e lr_scale to loss_scale
Summary:
As per discussion in https://www.prod.facebook.com/groups/184236721951559/permalink/354591931582703/, KaimingHe pointed out that scaling the LR is not the same as scaling the loss, since LR scaling will affect the weight decay (which is implemented by modifying the gradient, which thus is not yet correctly 'averaged'). Actually prigoyal tried to convince me earlier that loss scaling is the way to go, but I was not convinced then :/.

So this diff removes the LR scaling parameter passed by data_parallel_model and instead passes a loss_scale parameter to the model creation function. Unfortunately, this will break all existing code that uses the data parallel model. But that is not only a bad thing, since it will bring awareness to this change. I will inform in the FB groups about this.

In this diff I modified all my models to work correctly.
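The difference can be checked numerically (hypothetical values; weight decay folded into the gradient as described above):

```python
# With N devices, dividing the LR by N also shrinks the weight-decay
# term, while dividing the loss (i.e. the data gradient) by N leaves
# the decay term intact -- so the two updates differ.
w, g, lr, wd, n = 1.0, 0.4, 0.1, 0.01, 4

lr_scaled   = w - (lr / n) * (g + wd * w)   # LR scaling: decay shrunk too
loss_scaled = w - lr * (g / n + wd * w)     # loss scaling: decay intact
print(lr_scaled, loss_scaled)  # ≈ 0.98975 vs ≈ 0.989
```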

Reviewed By: Yangqing

Differential Revision: D4507002

fbshipit-source-id: 16c7221663282f71a1b754b34de0c8ccd5c2ca90
2017-02-03 07:44:40 -08:00
Priya Goyal
40ce50e0bd Speed-up training, fast data-augmentation, sync data_parallel_model changes + other small fixes
Summary:
1. Use opencv for data augmentation after benchmarking various image libraries in python
2. Use cuda no-bias conv
3. Use cuda fastest conv (exhaustive search)
4. data_parallel_model had a few changes; syncing them
5. Propagate the errors in threads to make debugging easy

Reviewed By: rbgirshick

Differential Revision: D4341422

fbshipit-source-id: aa4471a2f49dd6d7ca13879999b3c7ceaf818c1e
2017-01-25 11:44:22 -08:00
Aapo Kyrola
b96c2ed6ab fix validation to consider cpu-only ops
Summary: The data parallel model has a sanity check that ensures that operator inputs/outputs do not cross device boundaries. This failed when the operator was a CPU-only operator (such as the new AccuracyOp version). This fixes that.

Reviewed By: prigoyal

Differential Revision: D4417841

fbshipit-source-id: 9bc4e7a2074a544ca4db69ecf24183bbd41f84ca
2017-01-13 18:59:32 -08:00
Aapo Kyrola
95b3309a87 Gradient Input memory sharing using memonger blob sharing
Summary:
This diff brings us roughly to par with Torch on ResNet memory usage. At batch size 32, Resnet-50 took 7497 MiB before and 5010 MiB after. This will thus allow us to handle 64 images / GPU, or 256 images / 4 GPUs.

In addition, I added a special argument to DagNet that causes it to run only one thread for the first iteration. This is needed since there are allocations on the first iteration's backward pass due to gradient sharing, and this will cause NCCL to deadlock.

The sharing of gradient buffers requires inferring which gradients can share memory (i.e that they are not used concurrently). Previous memonger code uses topological sort, but rbgirshick showed that it does not work with tree-like models. Thus, I wrote a new optimization algorithm based on DFS. It takes about 0.25 secs / GPU on resnet-50, so is clearly fast enough.

Module data_parallel_model supports this feature natively.
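The core idea behind the sharing pass can be illustrated with a simplified liveness-interval sketch (the real memonger pass runs a DFS over the Caffe2 gradient graph; the function and blob names here are hypothetical): two gradient blobs may share a buffer if their live ranges do not overlap.

```python
# Sketch: greedy buffer reuse based on first-write / last-use liveness.
# ops is a list of (inputs, outputs) pairs in topological order.

def share_blobs(ops):
    first, last = {}, {}
    for i, (ins, outs) in enumerate(ops):
        for b in outs:
            first.setdefault(b, i)
        for b in list(ins) + list(outs):
            last[b] = max(last.get(b, i), i)
    free, buf_of, n, dying = [], {}, 0, []
    for b in sorted(first, key=lambda x: first[x]):
        # reclaim buffers whose owner's last use is strictly before this write
        dying.sort()
        while dying and dying[0][0] < first[b]:
            free.append(dying.pop(0)[1])
        buf = free.pop() if free else "buf_%d" % n
        n += buf not in buf_of.values()
        buf_of[b] = buf
        dying.append((last[b], buf))
    return buf_of

# A three-op chain: g1 can reuse g3's buffer, since g3 dies before g1 is written.
mapping = share_blobs([([], ["g3"]), (["g3"], ["g2"]), (["g2"], ["g1"])])
```

Here `mapping["g1"] == mapping["g3"]` while `g2` gets its own buffer, mirroring how non-overlapping gradients share memory.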

Reviewed By: prigoyal

Differential Revision: D4363209

fbshipit-source-id: 73b11e7610438098bb11bff0af8075ab0cf2c0f1
2017-01-09 19:44:23 -08:00
Priya Goyal
29f903aaf2 Make computed params broadcast optional
Summary: this was introduced due to the rm and riv params in the SpatialBN layer and the like. We should still be saving these params, but it is not required to broadcast them to all gpus after every epoch.

Differential Revision: D4338749

fbshipit-source-id: d3bbc92cf0cd7d220a51d76aea8bffcfd6e520b7
2016-12-16 07:59:25 -08:00
Aapo Kyrola
fc27f83282 restore control_input
Summary: In D4327024 I accidentally landed the control_input disable for NCCL. This empirically increases the likelihood of deadlocks, although it gives a nice perf boost. Better to disable it until NVIDIA fixes their stuff.

Reviewed By: Yangqing

Differential Revision: D4338537

fbshipit-source-id: d43efb45965a88bcfe38e5f1dc16c04463e2e038
2016-12-15 21:29:29 -08:00
Aapo Kyrola
2bf18f2b1d add inception and dummy input
Summary:
As requested by Yangqing, added the Inception model (copied from convnet_benchmarks) and a dummy data-feed option to the xray trainer, which we use for scalability benchmarking.

+ a couple of mini-changes to the data input framework

Reviewed By: Yangqing

Differential Revision: D4327024

fbshipit-source-id: 86911468456fc13a32d5f437a43347380ec66a68
2016-12-15 13:40:22 -08:00
Priya Goyal
cb918ac727 Implementation of ResNets on imagenet dataset
Summary:
Adds the imagenet dataset as well.
Data augmentation and the model have been added; just need to add the db read.

Differential Revision: D4289150

fbshipit-source-id: b531d3f09e3d0efac5cda5bb75d8146e1bb693e4
2016-12-15 12:01:31 -08:00
Aapo Kyrola
eddf23ca0f Handle parameters that are computed but not optimized
Summary:
prigoyal sharply noticed a bug in the Resnet models: we have not been checkpointing, nor synchronizing between gpus, the moving average and variance computed by the SpatialBN ops. The first problem in particular is serious, since models starting from a checkpoint would have started from a null state for SpatialBN. Not synchronizing with the data parallel model is less tragic since each GPU should see very similar data.

Thus I propose keeping track of "computed params", i.e. params that are computed from data but not optimized. I don't know if there are other examples, but SpatialBN's moving avg and var definitely are one.

- I modified the checkpointing for the xray model to store those blobs + also ensure the synchronization of those blobs
- I modified data parallel model to broadcast those params from gpu0. I first tried averaging, but hit some NCCL deadlocks ... :(
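The broadcast-from-gpu0 behavior can be sketched in plain Python (hypothetical structure; the real code emits Copy/broadcast ops in the net). `params` maps device id to a dict of computed-param blobs such as SpatialBN's running mean and variance.

```python
# Sketch: broadcast computed params (e.g. bn running mean/var) from the
# root device to all replicas. Names are illustrative, not the Caffe2 API.

def broadcast_computed_params(params, root=0):
    for dev in params:
        if dev == root:
            continue
        for name, value in params[root].items():
            params[dev][name] = value
    return params

state = {0: {"bn_rm": 0.9, "bn_riv": 1.1},
         1: {"bn_rm": 0.0, "bn_riv": 0.0}}
broadcast_computed_params(state)
# state[1] now matches state[0]
```

Broadcasting (rather than averaging) sidesteps the NCCL deadlocks mentioned above, at the cost of discarding the other GPUs' statistics.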

Differential Revision: D4281265

fbshipit-source-id: 933311afeec4b7e9344a13cf2d38aa939c50ac31
2016-12-15 12:01:28 -08:00
Byung-Gon Chun
1aba4280d8 Make xray net_type configurable
Summary: Make the xray net_type configurable via a command line argument

Differential Revision: D4262076

fbshipit-source-id: e2ecb9cd5bee5d6aaebe0ea8d2d4d9b378058cba
2016-12-05 11:53:27 -08:00
Aapo Kyrola
96a5e88d63 Fix consequtive checkpoint syncs
Summary: Switching to Pieter-MPI changed the way we set up the network between operators. For synchronizing parameters after a checkpoint load, we run a checkpoint_net that contained operators for creating the common world and broadcast operators. Unfortunately this fails when the checkpoint sync is done a second time, because we would have created a duplicate common world. The solution is to separate the common world op and the broadcast op into an init net and the actual broadcasting net, and to run the init net only once. This problem did not arise in the Flow version since I did only one checkpoint load per operator (process).

Differential Revision: D4251754

fbshipit-source-id: ba030579e651e529e29bbf2d27920075078d8ff9
2016-12-05 11:53:26 -08:00
Aapo Kyrola
3410939459 pass learning rate scaling factor to parameter update builder function
Summary:
When refactoring data parallel model, the division of the LR by the number of devices was dropped, and thus we ended up effectively multiplying the gradients by the number of devices. Thus, we need to scale the LR by 1/num_gpus.

Created a test to confirm that data_parallel_model produces exactly the same results on different numbers of gpus, given the same total batch size.
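The invariance that test checks can be sketched in pure Python (hypothetical model, not the Caffe2 test itself): with per-device mean gradients averaged across devices, one SGD step on a fixed total batch must give the same weight regardless of how many devices the batch is split over.

```python
# Sketch: SGD on a 1-D least-squares loss (w - x)^2, batch split over devices.

def sgd_step(w, batch, lr, num_gpus):
    per_dev = len(batch) // num_gpus
    shards = [batch[i * per_dev:(i + 1) * per_dev] for i in range(num_gpus)]
    # per-device mean gradient of (w - x)^2 is mean of 2*(w - x)
    grads = [sum(2 * (w - x) for x in s) / len(s) for s in shards]
    # average across devices, then apply the update
    return w - lr * sum(grads) / num_gpus

batch = [1.0, 2.0, 3.0, 4.0]
w1 = sgd_step(5.0, batch, 0.1, 1)   # whole batch on one device
w4 = sgd_step(5.0, batch, 0.1, 4)   # one example per device
# w1 and w4 agree
```

Without the 1/num_gpus factor (the dropped LR division), `w4` would move four times as far as `w1`.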

Reviewed By: prigoyal

Differential Revision: D4248907

fbshipit-source-id: af21ede113e6ac25f12c556de298cb18974548be
2016-12-05 11:53:26 -08:00
Aapo Kyrola
5d0167c8e7 Example workflow for running distributed (syncsgd) imagenet training in Flow
Summary:
This diff introduces a simplified Imagenet trainer that uses data_parallel_model to parallelize training over GPUs and nodes in a synchronous manner. Flow's gang scheduling is used to launch the nodes, and data_parallel_model handles the synchronization among the gang members.

This example also uses the operator-per-epoch model where each epoch produces a checkpoint consumed by the followup epoch.

Reviewed By: salexspb

Differential Revision: D4223384

fbshipit-source-id: 8c2c73f4f6b2fdadb98511075ebbd8426c91eadb
2016-11-29 15:18:38 -08:00
Aapo Kyrola
365ca8da1c add sanity check that ops do not cross gpus
Summary: Debugging nets can be tiresome, so it is good to do some sanity checks. This adds a sanity check that all non-NCCL and non-Copy operators do not reference blobs that have a different device scope than the operator. The check is only added to data_parallel_model, so it should be safe. It would have caught a subtle bug in prigoyal's training pipeline.
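A simplified version of the check looks like this (hypothetical op representation; the real check walks Caffe2 OperatorDefs and device options). NCCL and Copy ops are exempt because they legitimately touch blobs on several devices.

```python
# Sketch: assert that every op only touches blobs on its own device.

EXEMPT = ("NCCLAllreduce", "Copy", "CopyGPUToCPU", "CopyCPUToGPU")

def validate_no_cross_device(ops, blob_device):
    for op in ops:
        if op["type"] in EXEMPT:
            continue
        for blob in op["inputs"] + op["outputs"]:
            if blob_device[blob] != op["device"]:
                raise AssertionError(
                    "Op %s on device %s touches blob %s on device %s"
                    % (op["type"], op["device"], blob, blob_device[blob]))

ops = [{"type": "Relu", "device": 0, "inputs": ["x_0"], "outputs": ["y_0"]}]
validate_no_cross_device(ops, {"x_0": 0, "y_0": 0})  # passes silently
```

If `x_0` lived on device 1 instead, the check would raise, pointing directly at the offending op and blob.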

Reviewed By: dzhulgakov

Differential Revision: D4230444

fbshipit-source-id: 3d4a843162134a7a504053d95ff97a552e6b8a6d
2016-11-29 15:18:38 -08:00
Aapo Kyrola
42279a610c use Pieter-MPI and fb.distributed
Summary:
Remove MPI and use the fb.distributed rendezvous and Pieter's new ops.

One can now pass a 'rendezvous' struct to data_parallel_model to initiate distributed SyncSGD. The provided rendezvous implementation uses the kv-store handler of fb.distributed to disseminate information about the other hosts. We can easily add other rendezvous mechanisms, such as file-based, but that is the topic of another diff.

Removing MPI also allowed simplifying the Xray startup scripts, which are included in this diff.

Once accepted, I will work on simple example code so others can use this as well. The Flow implementation is the topic of next week.
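The kv-store rendezvous pattern can be sketched with a plain dict standing in for the fb.distributed kv-store (all names and keys here are hypothetical): each host publishes its address under a known key, then reads everyone else's.

```python
# Toy sketch of kv-store rendezvous; a dict replaces the real kv-store
# and the wait loop stands in for a blocking get.

def rendezvous(kv, rank, num_hosts, address):
    kv["host/%d" % rank] = address      # announce ourselves
    peers = {}
    for r in range(num_hosts):
        key = "host/%d" % r
        while key not in kv:            # in reality: blocking kv-store wait
            pass
        peers[r] = kv[key]
    return peers

kv = {"host/1": "10.0.0.2:1234"}        # pretend the peer already announced
peers = rendezvous(kv, 0, 2, "10.0.0.1:1234")
# peers now maps every rank to an address
```

Once every rank holds the full peer map, the common-world ops can be created without any MPI launcher.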

Differential Revision: D4180012

fbshipit-source-id: 9e74f1fb43eaf7d4bb3e5ac6718d76bef2dfd731
2016-11-29 15:18:36 -08:00
Yangqing Jia
589398950f fbsync at f5a877 2016-11-18 15:41:06 -08:00
Yangqing Jia
238ceab825 fbsync. TODO: check if build files need update. 2016-11-15 00:00:46 -08:00
Yangqing Jia
44509f9f91 fbsync: mostly lint changes, added mkl files 2016-10-11 22:45:06 -07:00
Yangqing Jia
d1e9215184 fbsync 2016-10-07 13:08:53 -07:00