101 Commits

Author SHA1 Message Date
Andrey Malevich
e13f199452 Switch RNNOp to use NetDef argument for step representation.
Summary: Before this diff RNNOp was using TextFormat for representing steps. This diff is changing RNNOp to prefer NetDef argument instead. To be backward compatible it supports TextFormat for existing models, though we can compile RNNs without TextFormat as well.

Reviewed By: salexspb

Differential Revision: D5949330

fbshipit-source-id: 9336a8f5ccf30ad8d8e3a7067b9437e1704b1c9f
2017-10-10 22:01:51 -07:00
Yangqing Jia
8286ce1e3a Re-license to Apache
Summary: Closes https://github.com/caffe2/caffe2/pull/1260

Differential Revision: D5906739

Pulled By: Yangqing

fbshipit-source-id: e482ba9ba60b5337d9165f28f7ec68d4518a0902
2017-09-28 16:22:00 -07:00
Luke Yeager
ec801d535c Fix typo in warning in data_parallel_model
Summary: Closes https://github.com/caffe2/caffe2/pull/1219

Differential Revision: D5898077

Pulled By: Yangqing

fbshipit-source-id: 7ee726ef3399a350a36e77093cbad0f70f8f3dce
2017-09-22 23:03:28 -07:00
Ahmed Taei
c3a3d6ceba Add an option to use dynamic memory optimizer.
Reviewed By: akyrola

Differential Revision: D5869664

fbshipit-source-id: ab11bc27395bf10e8381ebf97e6afb83ae9af81f
2017-09-20 12:52:55 -07:00
Aapo Kyrola
9ec981b866 for CPU-data parallel, allow sharing model
Summary: On CPU, no need to replicate parameters. So try using only one copy (cpu_0) for parameters. Made resnet50_trainer use shared model in cpu mode.

Reviewed By: wesolwsk

Differential Revision: D5812181

fbshipit-source-id: 93254733edbc4a62bd74a629a68f5fa23f7e96ea
2017-09-15 16:19:37 -07:00
Aapo Kyrola
ce36a972b0 fix timeouts in CloneOrCreateCommonWorld
Summary: Default value for timeout in CreateOrCloneCommonWorld does not work properly: if the value of dpm._DEFAULT_TIMEOUT is changed, the default still stays as old 30s. Changed to use None instead as default.
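
The behavior described above follows from Python evaluating default argument values once, when the function is defined. A minimal sketch, assuming a module-level `_DEFAULT_TIMEOUT` that stands in for `dpm._DEFAULT_TIMEOUT`; both function names are hypothetical and not the actual data_parallel_model code:

```python
# Hypothetical stand-in for dpm._DEFAULT_TIMEOUT (seconds).
_DEFAULT_TIMEOUT = 30


def create_world_early_bound(timeout=_DEFAULT_TIMEOUT):
    # The default (30) was captured when this function was defined.
    return timeout


def create_world_late_bound(timeout=None):
    # None is a sentinel: the real default is looked up at call time.
    if timeout is None:
        timeout = _DEFAULT_TIMEOUT
    return timeout


# A caller changes the module-level default after import.
_DEFAULT_TIMEOUT = 120

early = create_world_early_bound()
late = create_world_late_bound()
print(early, late)  # prints "30 120"
```

Using None as the default keeps the signature stable while letting later changes to the module-level value take effect.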

Reviewed By: pietern

Differential Revision: D5813228

fbshipit-source-id: f617ceec40a03893c27d3e13c426e1ca6b2114e2
2017-09-12 13:09:05 -07:00
Aapo Kyrola
93bd3c77f8 AddBlobsSync()
Summary: Explicit function to sync blobs. Notice that this must be called before CreateNet(), and syncs the blobs every run.

Reviewed By: asaadaldien, jay-mahadeokar

Differential Revision: D5805891

fbshipit-source-id: 58a1bb47805d75d5cbead136e2e0e9fe663ea954
2017-09-12 10:33:22 -07:00
Pieter Noordhuis
84167faf0f Enable use of GPUDirect through argument to Gloo AllreduceOp
Summary:
If the Gloo InfiniBand transport is used, the Gloo algorithms can use
GPUDirect to DMA directly from/to GPU memory. This is done through the
CudaDeviceWorkspace. This change adds a "gpu_direct" option to the
Allreduce operator that makes it use GPUDirect if the transport
supports it.
Closes https://github.com/caffe2/caffe2/pull/1203

Reviewed By: wesolwsk

Differential Revision: D5806366

Pulled By: pietern

fbshipit-source-id: 9e9a78f059f2b5c6e4fbf6574b7db4776a94696c
2017-09-11 13:02:58 -07:00
Pieter Noordhuis
d43ab4bec5 Create Gloo common world through MPI rendezvous
Summary:
Before this change there were two ways for machines to rendezvous for a
distributed run: shared file system or Redis. If you're using an MPI
cluster it is much more convenient to simply execute mpirun and expect
the "right thing (tm)" to happen. This change adds the "mpi_rendezvous"
option to the CreateCommonWorld operator. If this is set, the common
world size and rank will be pulled from the MPI context and Gloo
rendezvous takes place using MPI. Note that this does NOT mean the MPI
BTL is used; MPI is only used for rendezvous.
Closes https://github.com/caffe2/caffe2/pull/1190

Reviewed By: akyrola

Differential Revision: D5796060

Pulled By: pietern

fbshipit-source-id: f8276908d3f3afef2ac88594ad377e38c17d0226
2017-09-08 17:18:47 -07:00
Pieter Noordhuis
b8eb8ced7d Add transport/interface arguments to CreateCommonWorld operator
Summary:
These arguments control which Gloo transport (TCP or IB) and which
network interface is used for the common world. If not specified, it
defaults to using TCP and the network interface for the IP that the
machine's hostname resolves to.

The valid values for the transport argument are "tcp" and "ibverbs".
For ibverbs to work, Gloo must have been compiled with ibverbs
support. If Gloo is built as part of Caffe2 (sourced from the
third_party directory), then you can pass -DUSE_IBVERBS=ON to CMake to
enable ibverbs support in Gloo.
Closes https://github.com/caffe2/caffe2/pull/1177

Reviewed By: akyrola

Differential Revision: D5789729

Pulled By: pietern

fbshipit-source-id: 0dea1a115c729e54c5c1f9fdd5fb29c14a834a82
2017-09-08 10:57:41 -07:00
Aapo Kyrola
b7997a0f41 support device ids>10
Summary: Data parallel model failed with device numbers 10, 11.. because it used string sorting of the blob names. Changed to make sorting happen based on device number and then blob name. Also added reduction for 16 devices.
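
The sorting problem can be reproduced in a few lines. This is only a sketch: the `gpu_<id>/<name>` blob naming and the helper `device_and_name` are assumptions made for illustration, not the code from this diff.

```python
def device_and_name(blob):
    # Split "gpu_10/fc_w" into (10, "fc_w"); the naming scheme is assumed.
    scope, _, name = blob.partition("/")
    device_id = int(scope.rsplit("_", 1)[1])
    return device_id, name


blobs = ["gpu_10/fc_w", "gpu_2/fc_w", "gpu_1/fc_w", "gpu_11/fc_w", "gpu_1/fc_b"]

# Plain string sorting places device 10 and 11 before device 2.
by_string = sorted(blobs)

# Sorting by (device number, blob name) restores the intended order.
by_device = sorted(blobs, key=device_and_name)

print(by_string)
print(by_device)
```

With string sorting, `gpu_10/...` and `gpu_11/...` land between `gpu_1/...` and `gpu_2/...`, which breaks any logic that expects blobs grouped in device order.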

Reviewed By: wesolwsk

Differential Revision: D5781521

fbshipit-source-id: 16be0984ecb55340604c82893be366c0528e822c
2017-09-07 00:01:33 -07:00
Pieter Noordhuis
6d5c3eaeb7 Add CloneCommonWorld op
Summary:
Cloning was previously done by overloading CreateCommonWorld op.
Closes https://github.com/caffe2/caffe2/pull/1159

Reviewed By: andrewwdye

Differential Revision: D5757580

Pulled By: pietern

fbshipit-source-id: 9e80b295e390bf92623bafb72be21cbafdcf2ff4
2017-09-06 13:32:30 -07:00
Wojciech Glogowski
a7ec5def7b data_parallel_model names fix
Summary: Updated usage of deprecated functions in data_parallel_model.py

Reviewed By: akyrola

Differential Revision: D5738512

fbshipit-source-id: a7767e518da777ece058bcad480e5df1d91e9b42
2017-08-30 12:47:14 -07:00
Aapo Kyrola
7fad4be4c6 Device-specific memongering
Summary:
Enforce that blobs don't mix between operators on different GPUs or CPU/GPU. Add test.

+ Fix memonger when no namescope is provided.

Reviewed By: asaadaldien

Differential Revision: D5644708

fbshipit-source-id: 0cb361efd6361b6e2138462584bab6b4de039b5d
2017-08-17 13:31:26 -07:00
Alexander Sidorov
52befa4802 DataParallelModel: take param_init_net into account in _InferBlobDevice
Summary:
Here is my example:

For a static RNN, the timestep blob is created as part of param_init_net. Previously, DPM assumed it was a CUDA blob by default, so it participated in broadcasting, which caused the Copy on line 798 to fail. No device mapping is correct for this blob.

Reviewed By: akyrola

Differential Revision: D5631716

fbshipit-source-id: 28c3eb17ecc3080c95c41d69a60bf7262d3907d4
2017-08-15 12:06:46 -07:00
Zhaoming Wu
399fc9fb09 Added Nesterov
Summary: Added Nesterov momentum as an option for BMUF and corresponding tests
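
For reference, one common formulation of a Nesterov momentum update is sketched below on a single scalar parameter. This is a generic illustration, not the BMUF code from this diff: the function name `nesterov_step`, the scalar setting, and the choice of formulation are all assumptions.

```python
def nesterov_step(param, velocity, grad, lr, momentum):
    # Velocity update is the same as in classical momentum.
    new_velocity = momentum * velocity - lr * grad
    # Nesterov correction: step along the updated velocity with a look-ahead term.
    new_param = param - momentum * velocity + (1.0 + momentum) * new_velocity
    return new_param, new_velocity


# With momentum set to 0 the update reduces to plain gradient descent.
plain_param, plain_velocity = nesterov_step(1.0, 0.0, 0.5, lr=0.1, momentum=0.0)

# With momentum 0.9 the first step is scaled by (1 + momentum).
nag_param, nag_velocity = nesterov_step(1.0, 0.0, 0.5, lr=0.1, momentum=0.9)
print(plain_param, nag_param)
```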

Reviewed By: asaadaldien

Differential Revision: D5599888

fbshipit-source-id: 30819c9e689347c8b75daddc7444bea9f54193ae
2017-08-11 13:52:43 -07:00
Priya Goyal
5c77cc8182 Exposing num_workers as parameter and enable recycling activations
Summary: as promised, a separate diff for dpm changes I made in experimental code

Reviewed By: pietern

Differential Revision: D5551304

fbshipit-source-id: 9013aeab6c388b1c415ffb2e36fb8dd6b8cf90b0
2017-08-08 19:48:41 -07:00
Ahmed Taei
647f35e742 Fix SyncAllParamsDistributed for Python 3x
Summary:
In Python 3.x, dictionary values are not a list and cannot be concatenated to a list.
This diff should fix that.
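
The incompatibility is easy to reproduce on Python 3. The sketch below uses made-up blob names and does not show the actual SyncAllParamsDistributed code:

```python
# Hypothetical parameter mapping; the names are for illustration only.
params = {"fc_w": "gpu_0/fc_w", "fc_b": "gpu_0/fc_b"}
extra = ["iter"]

try:
    params.values() + extra  # dict.values() is a view object on Python 3
    concatenation_failed = False
except TypeError:
    concatenation_failed = True

# Converting the view to a list first works on both Python 2 and 3.
combined = list(params.values()) + extra
print(concatenation_failed, combined)
```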

Reviewed By: andrewwdye

Differential Revision: D5576724

fbshipit-source-id: c60441857ceceb9c4a71122d2db5e9abad6d3fc2
2017-08-07 14:23:32 -07:00
Aapo Kyrola
26645154bb warn about using test/val model with init_params=True + fixed some cases
Summary: It is a common mistake to create a test/validation model with init_params=True. When its param_init_net is run, it will overwrite the training model's params, and with DPM, those won't be synchronized to all GPUs. I don't want to make this an assertion yet, since it might break people's trainers (it is ok to have init_params=True if you never run the param_init_net...).

Reviewed By: asaadaldien

Differential Revision: D5509963

fbshipit-source-id: 63b1a16ec0af96e3790e226850f6e0e64689143f
2017-07-27 13:20:27 -07:00
Aapo Kyrola
af1e45c1e1 support appending net and converting them
Summary:
As per rushabhmshah99's request: he wants to append a pre-trained model (without training it) to the model.
So I added data_parallel_model.ConvertNetForDevice() to enable that. The unit test shows an example of how to use this with
AppendNet, and I also added a blurb to the function.

Differential Revision: D5503335

fbshipit-source-id: b2a5db5c1739dc97f46dd0d7606ed555d99255b8
2017-07-27 11:07:48 -07:00
Aapo Kyrola
3363681304 enable CreateCommonWorld to bootstrap from existing common world
Summary: Use romain-intel's ContextFactory to create common worlds from existing common worlds, thus bypassing the KV store completely. Changed data_parallel_model to automatically find whether there is already a CW we can work with. CreateCommonWorldOp takes an optional second parameter, which is an existing CW.

Reviewed By: andrewwdye

Differential Revision: D5494956

fbshipit-source-id: 5f7a840bcd5fe4ea756fafeacc746bc2cf5078b0
2017-07-26 22:31:55 -07:00
Ahmed Taei
804ebf7c41 Populate learning rate blob name into data_parallel_model and fix resnet50_trainer example.
Reviewed By: akyrola

Differential Revision: D5463772

fbshipit-source-id: 10b8963af778503a3de6edbabb869747bd1e986d
2017-07-21 16:24:10 -07:00
Geet Sethi
11c4647447 Allow CPU device scope in data_parallel_model and data_parallel_rendevous device scope checks
Summary: Allowing CPU device scope instead of enforcing no device scope in data_parallel_model and data_parallel_rendevous.

Reviewed By: akyrola

Differential Revision: D5440492

fbshipit-source-id: bcd4344d64c710ea50ec8a65e3e9d102e35c66ea
2017-07-18 15:47:41 -07:00
Geet Sethi
ab0d631d6d Adding AllCompare-like function to data_parallel_model
Summary: Added function _RunComparison to data_parallel_model that checks if all shards in a given rendezvous have the same value for a given blob_name
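
The idea of an all-compare check can be illustrated in plain Python. This sketch does not reproduce `_RunComparison` or any distributed communication; the helper `all_shards_agree` and the shard dictionary are hypothetical.

```python
def all_shards_agree(shard_values):
    """Return True if every shard holds the same value for a blob."""
    values = list(shard_values.values())
    # Compare every shard against the first one; an empty input trivially agrees.
    return all(v == values[0] for v in values[1:])


# Hypothetical per-shard values of one blob, keyed by shard id.
in_sync = {0: [1.0, 2.0], 1: [1.0, 2.0], 2: [1.0, 2.0]}
diverged = {0: [1.0, 2.0], 1: [1.0, 2.5], 2: [1.0, 2.0]}

print(all_shards_agree(in_sync))   # prints "True"
print(all_shards_agree(diverged))  # prints "False"
```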

Reviewed By: wesolwsk

Differential Revision: D5394164

fbshipit-source-id: c2b07d0f8d5846fa9887d53b0be091a8c057f106
2017-07-13 13:03:57 -07:00
Geet Sethi
a68bb5e3f9 Added device scope checks to data_parallel_model and data_parallel_rendevous
Summary:
Added device scope checks to data_parallel_model and data_parallel_rendevous

Added test to check that checks are working correctly to data_parallel_model_test

Fixed device_scope error in test_synchronization_barrier

Reviewed By: akyrola

Differential Revision: D5403936

fbshipit-source-id: 849c1cd7452692efbc5ef74d2d60ede090c9c017
2017-07-12 10:47:28 -07:00
Ralph Mao
febae7b20b fix a bug in the report function of Data_Parallel
Summary: replace params with sp, otherwise it will report an empty list

Reviewed By: akyrola

Differential Revision: D5382716

fbshipit-source-id: 34d8e6ee00cbe1718702e3d1f23ea12f8d65063e
2017-07-07 13:03:46 -07:00
Andrew Dye
31f394f8b3 Add synchronization barrier API to data parallel model
Summary: Add synchronization barrier API with configurable timeout. Users can call Synchronize() to join variable length execution before resuming multi-machine communication steps, i.e., resuming distributed training iterations after validation on a single machine.
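
As a single-process analogy only, the barrier semantics can be sketched with Python's `threading.Barrier`. This is not the Synchronize() implementation; the worker function, worker count, and timeout value are assumptions for illustration.

```python
import threading

NUM_WORKERS = 3
barrier = threading.Barrier(NUM_WORKERS)
events = []
lock = threading.Lock()


def worker(rank, steps):
    # Variable-length work, standing in for e.g. validation on one machine.
    for _ in range(steps):
        pass
    with lock:
        events.append(("arrived", rank))
    # Blocks until all workers arrive; raises BrokenBarrierError on timeout.
    barrier.wait(timeout=5.0)
    with lock:
        events.append(("resumed", rank))


threads = [
    threading.Thread(target=worker, args=(rank, 1000 * (rank + 1)))
    for rank in range(NUM_WORKERS)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(events)
```

No worker records "resumed" until every worker has recorded "arrived", which is the property a barrier provides before communication steps resume.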

Reviewed By: akyrola

Differential Revision: D5348387

fbshipit-source-id: 5826da10e6a60c50394c36c7cf47624f10191d11
2017-07-06 09:21:19 -07:00
Aapo Kyrola
2d133d4627 increase concurrency default
Summary: Huge improvement in my tests, and it does not really hurt either.

Reviewed By: wesolwsk

Differential Revision: D5374925

fbshipit-source-id: c96a4ed2ca653120a82233c0037cbfded8a2d2a1
2017-07-05 21:46:31 -07:00
Simon Layton
090506ac87 Add NCCLBroadcast to correct net
Summary:
Otherwise was always added to main net instead of param_init_net when
desired (i.e. initial param sync)
Closes https://github.com/caffe2/caffe2/pull/894

Differential Revision: D5367451

Pulled By: akyrola

fbshipit-source-id: 3d82be6da687c736bd15f4852dbd272266eb4811
2017-07-03 16:54:44 -07:00
Aapo Kyrola
8c74c36626 fix reducing device option
Summary: This was broken in a previous diff, fixing it to use model device type.

Reviewed By: asaadaldien

Differential Revision: D5356005

fbshipit-source-id: a4fcc932bae772076b57625a5fcc0d38eb702cc9
2017-06-30 09:19:57 -07:00
Thomas Dudziak
5355634dac Dict fixes/improvements and unittest targets for Python 3 in caffe2 core
Summary: As title

Reviewed By: salexspb

Differential Revision: D5316104

fbshipit-source-id: aee43819d817842e5ce6ba3d045a55b1a2491c30
2017-06-29 17:05:41 -07:00
Yongqiang Wang
ea659b8f2e broadcast to global parameters when using warmup
Reviewed By: asaadaldien, jay-mahadeokar

Differential Revision: D5340692

fbshipit-source-id: 80879847ff71c8d620de502ef95a9ffb4bdf595d
2017-06-28 13:35:27 -07:00
Ahmed Taei
fbe2526343 Allow concurrent execution of GLOO broadcast collectives in
Summary:
This adds a CollectivesConcurrencyControl class to manage creating common contexts and cyclic controls to execute GLOO collectives,
and refactors AllReduce and _AddDistributedParameterSync to use it

Reviewed By: akyrola

Differential Revision: D5335795

fbshipit-source-id: 5084e0a65cdb989cd949be3868b77a680561022d
2017-06-28 12:49:12 -07:00
Henry Lu
9a14c013c3 Refactor data_parallel_model to take advantage of Gloo broadcast op in broadcasting across machines and GPUs in one operation
Summary: Combine _AddDistributedParameterSync() and _SyncParams() into a single function to broadcast across distributed machines and all local GPUs simultaneously. This is similar to how calls to Allreduce have already been optimized using the functionality of Gloo. All the refactoring work is contained in data_parallel_model.py.

Reviewed By: akyrola, andrewwdye

Differential Revision: D5329277

fbshipit-source-id: 4407b88980cf396f2e0f994d796294fa79fd39ed
2017-06-27 19:35:24 -07:00
Simon Layton
d45f722e43 data_parallel_model: NCCLBroadcast root fix
Summary:
The root is the root _rank_ and not the root _device_. Thus we always
use root=0, regardless of the devices used.
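
The rank versus device distinction can be shown with a small sketch. The device list and the helper `broadcast_root` are hypothetical and do not call NCCL; they only illustrate that a rank is a position in the participating device list.

```python
def broadcast_root(devices):
    # Ranks index into the list of participating devices; rank 0 is the root.
    root_rank = 0
    return root_rank, devices[root_rank]


# Hypothetical set of GPUs used by the model.
devices = [2, 3, 5]

root_rank, root_device = broadcast_root(devices)
print(root_rank, root_device)  # prints "0 2"

# Passing the first device id (2) as the root would instead select rank 2,
# which here is device 5.
mistaken_root_device = devices[devices[0]]
print(mistaken_root_device)  # prints "5"
```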

https://github.com/NVIDIA/nccl/blob/v1.3.0-1/src/broadcast.cu#L75

/cc slayton58
Closes https://github.com/caffe2/caffe2/pull/872

Differential Revision: D5329564

Pulled By: akyrola

fbshipit-source-id: 5a34be30c1a0046a74f28437cb08333c1fb46098
2017-06-27 09:47:48 -07:00
Jay Mahadeokar
04c9c8c5c2 fix for loading model with bmuf
Summary: - One line fix for loading saved checkpoint when using Parallelize_GPU_BMUF

Reviewed By: asaadaldien

Differential Revision: D5315254

fbshipit-source-id: a20ba6438c8e6b2ef44b65270c1d3f9ab645ded0
2017-06-23 17:16:33 -07:00
Thomas Dudziak
342de07231 Core unit test fixes for Python 3
Summary: As title

Differential Revision: D5291327

fbshipit-source-id: 7dd9279c53ba55d3422c31973ffcec5705787fdf
2017-06-23 13:22:16 -07:00
Ahmed Taei
5ca263fb1c Add a warmup option for BMUF
Reviewed By: yqwangustc

Differential Revision: D5279655

fbshipit-source-id: 7c778a88909580bbe43d4bac4b7d73be0d0e3f27
2017-06-22 14:32:39 -07:00
Ahmed Taei
ffd32c8ab7 Add distributed BMUF implementation.
Summary:
Refactor data_parallel_model all_reduce and broadcast methods to work for
a given parameter set not only gradients and reuse them for BMUF distributed
implementation.
Add a distributed test (multiprocessing) to BMUF.

Reviewed By: akyrola

Differential Revision: D5267083

fbshipit-source-id: 8dcc7527d0a755b903d693d8071585f0b54d3403
2017-06-21 16:18:11 -07:00
Aapo Kyrola
34eaa19d27 CPU data parallel model
Summary:
CPU version of data parallel model. The great thing is that now we can run data_parallel_model_test in Sandcastle (as it does not have GPUs).

Pretty simple change, really. I did not change all variable names with "gpu" in them, to reduce risk (and being a bit lazy). Can improve later.

Reviewed By: wesolwsk

Differential Revision: D5277350

fbshipit-source-id: 682e0c5f9f4ce94a8f5bd089905b0f8268bd2210
2017-06-20 23:19:08 -07:00
Aapo Kyrola
96f19fefc0 add warning if data parallel model is created for gpus that we don't have
Summary: Don't want to assert since it can be useful to sometimes create models that are not run (for example, unit tests).

Reviewed By: pietern

Differential Revision: D5258905

fbshipit-source-id: f1beee0605bfef235ed0f23f7e78259109720254
2017-06-16 07:02:37 -07:00
Thomas Dudziak
60c78d6160 Fixes range/xrange for Python 3
Summary: As title

Differential Revision: D5151894

fbshipit-source-id: 7badce5d3122e8f2526a7170fbdcf0d0b66e2638
2017-06-07 00:04:26 -07:00
Aapo Kyrola
5e6bd4fbfc Return predict params from ExtractPredictorNet + test
Summary:
Make it easier for users by returning from ExtractPredictorNet the list of blobs that must be saved/exported to run a predictor net. Added a test for ExtractPredictorNet

Codemod.

Reviewed By: asaadaldien

Differential Revision: D5176097

fbshipit-source-id: b1af42132459487b8d94fcdde0e4c514da608243
2017-06-05 15:34:37 -07:00
Andrey Malevich
a8fb85797c Refactoring of the parameters step 0. Add simple tags and unify interface for params and computed_params.
Summary:
This diff is the first step in the effort to refactor all parameters. As a first step, I'm merging the concepts of params and computed_params, which is going
to be based on tags instead (in the first version it's still using old data structs to store all the BlobReferences).

Renaming computed_params to non-trainable/non-backprop params should be done in some other diff.

Reviewed By: salexspb

Differential Revision: D5171159

fbshipit-source-id: 68031ca779f053fb266a7c4a2e5b482a3bd9c832
2017-06-02 17:17:57 -07:00
Simon Layton
58874ad5bf Fp16 training initializers
Summary:
Re-open for re-importing :)
Closes https://github.com/caffe2/caffe2/pull/721

Differential Revision: D5164345

Pulled By: akyrola

fbshipit-source-id: e80b32556cd25610602df91a4225b93edc0ca40b
2017-06-01 08:34:46 -07:00
Aapo Kyrola
0f8c8f37a8 Revert D5159712: [caffe2][PR] Fp16 training initializers
Summary: This reverts commit 60a889494d2e2f4df1d720331e19f638c5eb95cc

Differential Revision: D5159712

fbshipit-source-id: 16040c911b260648857f656f92b165f92c2daae0
2017-06-01 00:17:14 -07:00
Aapo Kyrola
076376f4f6 Revert D5119830: [C2] Refactoring of the parameters step 0. Add simple tags and unify interface for params and computed_params
Summary: This reverts commit 2001090a37346eb12abbb234e13e727c288eb8a7

Differential Revision: D5119830

fbshipit-source-id: bf321868338f0db85dff3237af7eaf74212dbdf6
2017-06-01 00:02:21 -07:00
Andrey Malevich
ff61ed358e Refactoring of the parameters step 0. Add simple tags and unify interface for params and computed_params
Summary:
This diff is the first step in the effort to refactor all parameters. As a
first step, I'm merging the concepts of params and computed_params, which is going
to be based on tags instead (in the first version it's still using old data
structs to store all the BlobReferences).

Renaming computed_params to non-trainable/non-backprop params should be done in
some other diff.

Reviewed By: salexspb

Differential Revision: D5119830

fbshipit-source-id: 2001090a37346eb12abbb234e13e727c288eb8a7
2017-05-31 22:36:36 -07:00
Simon Layton
2bfacff426 Fp16 training initializers
Summary:
Adds support for generating and training pfp16 models. Added SGD optimizer for multi-precision trainers and a new callback to data_parallel_model in order to help multi-precision models keep their different copies of parameters in sync during training.
Closes https://github.com/caffe2/caffe2/pull/697

Differential Revision: D5159712

Pulled By: salexspb

fbshipit-source-id: 60a889494d2e2f4df1d720331e19f638c5eb95cc
2017-05-31 17:46:58 -07:00
Ahmed Taei
f0f4c2fc5d Increase the number of DAG execution worker threads.
Reviewed By: akyrola

Differential Revision: D5158414

fbshipit-source-id: add377aec5588076db881a2a3750101710f29732
2017-05-31 15:19:19 -07:00