Commit Graph

111 Commits

Author SHA1 Message Date
Lukasz Wesolowski
29a4c942fe Add support for multi-device batch normalization through an option to data_parallel_model
Summary: Stage 3 in a stack of diffs for supporting multi-device batch normalization. Adds an input parameter to data_parallel_model to enable multi-device batch normalization. Depends on D6699258.
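
The commit message does not name the new parameter; in later versions of data_parallel_model the multi-device batch-norm switch is combine_spatial_bn, so a hedged usage sketch might look like the following (the flag name and the builder functions are assumptions, not confirmed by this commit):

    # Hedged sketch: enabling multi-device (synchronized) batch normalization
    # through data_parallel_model. The flag name `combine_spatial_bn` is assumed.
    from caffe2.python import data_parallel_model, model_helper

    model = model_helper.ModelHelper(name="resnet")

    def input_fn(model):
        pass  # add data input operators here

    def forward_fn(model, loss_scale):
        # build the forward pass (including SpatialBN ops) and return the losses
        return []

    def param_update_fn(model):
        pass  # add optimizer operators here

    data_parallel_model.Parallelize_GPU(
        model,
        input_builder_fun=input_fn,
        forward_pass_builder_fun=forward_fn,
        param_update_builder_fun=param_update_fn,
        devices=[0, 1, 2, 3],
        combine_spatial_bn=True,  # assumed name of the multi-device batch-norm option
    )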

Reviewed By: pietern

Differential Revision: D6700387

fbshipit-source-id: 24ed62915483fa4da9b1760eec0c1ab9a64b94f8
2018-01-24 13:24:06 -08:00
Aapo Kyrola
2caca70a37 Allow shifting of activations / ops to other GPUs in data parallel model
Summary:
(Work in progress.) This diff will allow shifting activations to other GPUs in case the model does not fit into memory. To see the API, check the code in data_parallel_model_test, which tests shifting activations from GPUs 0 and 1 to GPU 4, and from GPUs 2 and 3 to GPU 5.

I will need to test further on ResNets, and probably add copy operations to handle device change points.

Reviewed By: asaadaldien

Differential Revision: D5591674

fbshipit-source-id: eb12d23651a56d64fa4db91090c6474218705270
2017-11-29 21:17:00 -08:00
Matthias Ochs
14cc15e8f4 fixed NCCL bug in data_parallel_model.py
Summary:
Changed the dict view returned by viewvalues() into a Python list

See issue: https://github.com/caffe2/caffe2/issues/1516
Closes https://github.com/caffe2/caffe2/pull/1532
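
The underlying Python issue: future's viewvalues() returns a dict view, while the consumer (here, the NCCL call) needs an indexable list. A minimal illustration of the pattern the fix applies (blob names are made up):

    # Minimal illustration: a dict view is not a list, so it cannot be indexed
    # or concatenated; materialize it first.
    from future.utils import viewvalues

    blobs_per_device = {0: "gpu_0/data", 1: "gpu_1/data"}

    view = viewvalues(blobs_per_device)   # dict view: no indexing, no list concat
    blob_list = list(view)                # what the NCCL call actually needs

    print(blob_list[0])                   # works; view[0] would raise TypeError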

Differential Revision: D6425901

Pulled By: akyrola

fbshipit-source-id: 37988abe29726aea86637e18eedb948b7c281008
2017-11-28 10:50:02 -08:00
Qinqing Zheng
4471e15b76 BMUF cpu support
Summary: Change the interface so BMUF can run on CPUs

Reviewed By: asaadaldien

Differential Revision: D6356026

fbshipit-source-id: f58a4da9f800d969145a1a376e118b0f3581f8c1
2017-11-19 23:41:25 -08:00
Aapo Kyrola
1a02e72254 fix missing DPM .values() and .keys() to viewvalues() and viewkeys()
Summary: Reported by Simon Layton from NVIDIA: we had a couple of py3-incompatible expressions in data_parallel_model

Reviewed By: azzolini

Differential Revision: D6349447

fbshipit-source-id: a09feb69396be43296400591a3bfed5b8c370b0d
2017-11-16 16:08:18 -08:00
Aapo Kyrola
e9cc41885e fix dynamic memory management for distributed execution
Summary: Dynamic memory management in Data Parallel Model was broken for distributed computation because parameter gradients were also freed after being used. That is a problem with Gloo, because it expects tensors to keep the same address over multiple calls. Excluding parameter gradients from recycling is not a huge loss, as they are relatively small for typical convnets.

Reviewed By: asaadaldien

Differential Revision: D6314095

fbshipit-source-id: 949161d8c592927ae2fa82b3262b5f9ee47bed6f
2017-11-13 12:09:11 -08:00
Aapo Kyrola
cec27b8134 AddDistributedBlobsSync
Summary: Added a simple function to synchronize a blob across machines (but not across devices), i.e. blobs that are not synced over devices.
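
A hedged usage sketch; only the function name comes from this commit, the signature and blob name are assumptions:

    # Hedged sketch: sync a blob across machines (but not across devices) after
    # the model has been parallelized. Signature and blob name are assumed.
    from caffe2.python import data_parallel_model, model_helper

    model = model_helper.ModelHelper(name="trainer")  # illustrative, already parallelized
    data_parallel_model.AddDistributedBlobsSync(model, ["epoch_counter"])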

Reviewed By: yqwangustc

Differential Revision: D6192922

fbshipit-source-id: a4d653c9fb09f06b0c42330bdae07b42f5e6346c
2017-10-30 22:33:29 -07:00
Dmytro Dzhulgakov
2972a6ca02 Revert D6026557: [caffe2][PR] Fix "No handlers could be found for logger"
Summary:
This reverts commit 95c634872ac02be721257169e38c8fead04cd66b

bypass-lint

Differential Revision: D6026557

fbshipit-source-id: 663c28583ce3b01070ff5449115ed7e222f71776
2017-10-12 20:21:52 -07:00
Aapo Kyrola
d748c43f71 fix for dpm.GetLearningRateBlobNames
Summary:
I broke dpm.GetLearningRateBlobNames() when adding a new nodename param in optimizer.
Fixing it.

Reviewed By: asaadaldien

Differential Revision: D6043828

fbshipit-source-id: b3a79dd0dfae144187bcb359e2374eab6b32c485
2017-10-12 17:20:33 -07:00
Luke Yeager
75bece6ede Fix "No handlers could be found for logger"
Summary: Closes https://github.com/caffe2/caffe2/pull/1316

Differential Revision: D6026557

Pulled By: Yangqing

fbshipit-source-id: 95c634872ac02be721257169e38c8fead04cd66b
2017-10-10 22:32:13 -07:00
Andrey Malevich
e13f199452 Switch RNNOp to use NetDef argument for step representation.
Summary: Before this diff RNNOp used TextFormat for representing steps. This diff changes RNNOp to prefer a NetDef argument instead. To remain backward compatible it still supports TextFormat for existing models, though we can compile RNNs without TextFormat as well.

Reviewed By: salexspb

Differential Revision: D5949330

fbshipit-source-id: 9336a8f5ccf30ad8d8e3a7067b9437e1704b1c9f
2017-10-10 22:01:51 -07:00
Yangqing Jia
8286ce1e3a Re-license to Apache
Summary: Closes https://github.com/caffe2/caffe2/pull/1260

Differential Revision: D5906739

Pulled By: Yangqing

fbshipit-source-id: e482ba9ba60b5337d9165f28f7ec68d4518a0902
2017-09-28 16:22:00 -07:00
Luke Yeager
ec801d535c Fix typo in warning in data_parallel_model
Summary: Closes https://github.com/caffe2/caffe2/pull/1219

Differential Revision: D5898077

Pulled By: Yangqing

fbshipit-source-id: 7ee726ef3399a350a36e77093cbad0f70f8f3dce
2017-09-22 23:03:28 -07:00
Ahmed Taei
c3a3d6ceba Add an option to use dynamic memory optimizer.
Reviewed By: akyrola

Differential Revision: D5869664

fbshipit-source-id: ab11bc27395bf10e8381ebf97e6afb83ae9af81f
2017-09-20 12:52:55 -07:00
Aapo Kyrola
9ec981b866 for CPU-data parallel, allow sharing model
Summary: On CPU there is no need to replicate parameters, so try using only one copy (cpu_0) for the parameters. Made resnet50_trainer use the shared model in CPU mode.

Reviewed By: wesolwsk

Differential Revision: D5812181

fbshipit-source-id: 93254733edbc4a62bd74a629a68f5fa23f7e96ea
2017-09-15 16:19:37 -07:00
Aapo Kyrola
ce36a972b0 fix timeouts in CloneOrCreateCommonWorld
Summary: The default value for the timeout in CreateOrCloneCommonWorld does not work properly: if the value of dpm._DEFAULT_TIMEOUT is changed, the default still stays at the old 30s. Changed to use None as the default instead.
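
The bug is the standard Python default-argument pitfall: a default is evaluated once at function-definition time, so later changes to the module-level constant are never picked up. A generic illustration of the before/after pattern (names are illustrative):

    _DEFAULT_TIMEOUT = 30  # seconds; illustrative module-level default

    # Broken: the default is captured when the function is defined, so
    # reassigning _DEFAULT_TIMEOUT later has no effect on callers.
    def create_common_world_bad(timeout=_DEFAULT_TIMEOUT):
        return timeout

    # Fixed: use None as a sentinel and read the module constant at call time.
    def create_common_world_good(timeout=None):
        if timeout is None:
            timeout = _DEFAULT_TIMEOUT
        return timeout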

Reviewed By: pietern

Differential Revision: D5813228

fbshipit-source-id: f617ceec40a03893c27d3e13c426e1ca6b2114e2
2017-09-12 13:09:05 -07:00
Aapo Kyrola
93bd3c77f8 AddBlobsSync()
Summary: Explicit function to sync blobs. Note that this must be called before CreateNet(), and it syncs the blobs on every run.
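
A hedged usage sketch showing the ordering constraint; the argument names and blob name are assumptions:

    # Hedged sketch: AddBlobsSync must be wired in before CreateNet(), and the
    # sync then happens on every run of the net.
    from caffe2.python import data_parallel_model, model_helper, workspace

    model = model_helper.ModelHelper(name="trainer")  # illustrative, already parallelized
    data_parallel_model.AddBlobsSync(model, ["custom_stat"])  # hypothetical blob name

    workspace.RunNetOnce(model.param_init_net)
    workspace.CreateNet(model.net)   # must come after AddBlobsSync
    workspace.RunNet(model.net)      # blobs are synced on every run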

Reviewed By: asaadaldien, jay-mahadeokar

Differential Revision: D5805891

fbshipit-source-id: 58a1bb47805d75d5cbead136e2e0e9fe663ea954
2017-09-12 10:33:22 -07:00
Pieter Noordhuis
84167faf0f Enable use of GPUDirect through argument to Gloo AllreduceOp
Summary:
If the Gloo InfiniBand transport is used, the Gloo algorithms can use
GPUDirect to DMA directly from/to GPU memory. This is done through the
CudaDeviceWorkspace. This change adds a "gpu_direct" option to the
Allreduce operator that makes it use GPUDirect if the transport
supports it.
Closes https://github.com/caffe2/caffe2/pull/1203
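
A hedged sketch of how the new argument might be set when building the Allreduce op by hand; only the "gpu_direct" argument name comes from this commit, the inputs and outputs are illustrative:

    # Hedged sketch: requesting GPUDirect for a Gloo Allreduce.
    from caffe2.python import core

    allreduce = core.CreateOperator(
        "Allreduce",
        ["common_world", "gpu_0/param_grad", "gpu_1/param_grad"],  # CW + per-GPU grads
        ["gpu_0/param_grad", "gpu_1/param_grad"],
        gpu_direct=True,  # DMA directly from/to GPU memory if the transport supports it
    )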

Reviewed By: wesolwsk

Differential Revision: D5806366

Pulled By: pietern

fbshipit-source-id: 9e9a78f059f2b5c6e4fbf6574b7db4776a94696c
2017-09-11 13:02:58 -07:00
Pieter Noordhuis
d43ab4bec5 Create Gloo common world through MPI rendezvous
Summary:
Before this change there were two ways for machines to rendezvous for a
distributed run: shared file system or Redis. If you're using an MPI
cluster it is much more convenient to simply execute mpirun and expect
the "right thing (tm)" to happen. This change adds the "mpi_rendezvous"
option to the CreateCommonWorld operator. If this is set, the common
world size and rank will be pulled from the MPI context and Gloo
rendezvous takes place using MPI. Note that this does NOT mean the MPI
BTL is used; MPI is only used for rendezvous.
Closes https://github.com/caffe2/caffe2/pull/1190
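
A hedged sketch of a CreateCommonWorld op using the new option; everything except the "mpi_rendezvous" argument name is illustrative:

    # Hedged sketch: rendezvous over MPI instead of a shared filesystem or Redis.
    # Size and rank come from the MPI context, so they are not passed explicitly.
    from caffe2.python import core

    cw_op = core.CreateOperator(
        "CreateCommonWorld",
        [],                    # no KV-store handler input when rendezvousing via MPI
        ["common_world"],
        mpi_rendezvous=True,   # argument name from the commit; rest is illustrative
    )
    # Launched under e.g.:  mpirun -np 4 python train.py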

Reviewed By: akyrola

Differential Revision: D5796060

Pulled By: pietern

fbshipit-source-id: f8276908d3f3afef2ac88594ad377e38c17d0226
2017-09-08 17:18:47 -07:00
Pieter Noordhuis
b8eb8ced7d Add transport/interface arguments to CreateCommonWorld operator
Summary:
These arguments control which Gloo transport (TCP or IB) and which
network interface is used for the common world. If not specified, it
defaults to using TCP and the network interface for the IP that the
machine's hostname resolves to.

The valid values for the transport argument are "tcp" and "ibverbs".
For ibverbs to work, Gloo must have been compiled with ibverbs
support. If Gloo is built as part of Caffe2 (sourced from the
third_party directory), then you can pass -DUSE_IBVERBS=ON to CMake to
enable ibverbs support in Gloo.
Closes https://github.com/caffe2/caffe2/pull/1177
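
A hedged sketch of the two new arguments; the argument names and valid transport values come from the commit message, the other operator details are illustrative:

    # Hedged sketch: pick the Gloo transport and NIC for the common world.
    # Valid transports per the commit are "tcp" and "ibverbs".
    from caffe2.python import core

    cw_op = core.CreateOperator(
        "CreateCommonWorld",
        ["kv_handler"],        # illustrative rendezvous input
        ["common_world"],
        size=8,
        rank=0,
        transport="ibverbs",   # requires Gloo built with -DUSE_IBVERBS=ON
        interface="ib0",       # illustrative interface name
    )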

Reviewed By: akyrola

Differential Revision: D5789729

Pulled By: pietern

fbshipit-source-id: 0dea1a115c729e54c5c1f9fdd5fb29c14a834a82
2017-09-08 10:57:41 -07:00
Aapo Kyrola
b7997a0f41 support device ids>10
Summary: Data parallel model failed with device numbers 10, 11, ... because it used string sorting of the blob names. Changed sorting to use the device number first and then the blob name. Also added reduction for 16 devices.
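
A minimal illustration of the underlying bug (blob names are illustrative): lexicographic sorting puts "gpu_10" before "gpu_2", so pairing blobs by sorted name breaks once there are more than ten devices.

    # Minimal illustration: string sort vs. (device number, name) sort.
    names = ["gpu_2/w_grad", "gpu_10/w_grad", "gpu_1/w_grad"]

    print(sorted(names))
    # ['gpu_1/w_grad', 'gpu_10/w_grad', 'gpu_2/w_grad']  <- gpu_10 before gpu_2

    def device_then_name(blob):
        prefix, name = blob.split("/", 1)
        return int(prefix.split("_")[1]), name

    print(sorted(names, key=device_then_name))
    # ['gpu_1/w_grad', 'gpu_2/w_grad', 'gpu_10/w_grad']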

Reviewed By: wesolwsk

Differential Revision: D5781521

fbshipit-source-id: 16be0984ecb55340604c82893be366c0528e822c
2017-09-07 00:01:33 -07:00
Pieter Noordhuis
6d5c3eaeb7 Add CloneCommonWorld op
Summary:
Cloning was previously done by overloading CreateCommonWorld op.
Closes https://github.com/caffe2/caffe2/pull/1159

Reviewed By: andrewwdye

Differential Revision: D5757580

Pulled By: pietern

fbshipit-source-id: 9e80b295e390bf92623bafb72be21cbafdcf2ff4
2017-09-06 13:32:30 -07:00
Wojciech Glogowski
a7ec5def7b data_parallel_model names fix
Summary: Updated usage of deprecated functions in data_parallel_model.py

Reviewed By: akyrola

Differential Revision: D5738512

fbshipit-source-id: a7767e518da777ece058bcad480e5df1d91e9b42
2017-08-30 12:47:14 -07:00
Aapo Kyrola
7fad4be4c6 Device-specific memongering
Summary:
Enforce that blobs don't mix between operators on different GPUs or CPU/GPU. Add test.

+ Fix memonger when no namescope is provided.

Reviewed By: asaadaldien

Differential Revision: D5644708

fbshipit-source-id: 0cb361efd6361b6e2138462584bab6b4de039b5d
2017-08-17 13:31:26 -07:00
Alexander Sidorov
52befa4802 DataParallelModel: take param_init_net into account in _InferBlobDevice
Summary:
Here is my example:

For a static RNN, the timestep blob is created as part of param_init_net. Before this diff, DPM assumed it was a CUDA blob by default, so it participated in broadcasting, causing the Copy on line 798 to fail; no device mapping is correct for this blob.

Reviewed By: akyrola

Differential Revision: D5631716

fbshipit-source-id: 28c3eb17ecc3080c95c41d69a60bf7262d3907d4
2017-08-15 12:06:46 -07:00
Zhaoming Wu
399fc9fb09 Added Nesterov
Summary: Added Nesterov momentum as an option for BMUF and corresponding tests

Reviewed By: asaadaldien

Differential Revision: D5599888

fbshipit-source-id: 30819c9e689347c8b75daddc7444bea9f54193ae
2017-08-11 13:52:43 -07:00
Priya Goyal
5c77cc8182 Exposing num_workers as parameter and enable recycling activations
Summary: as promised, a separate diff for dpm changes I made in experimental code

Reviewed By: pietern

Differential Revision: D5551304

fbshipit-source-id: 9013aeab6c388b1c415ffb2e36fb8dd6b8cf90b0
2017-08-08 19:48:41 -07:00
Ahmed Taei
647f35e742 Fix SyncAllParamsDistributed for Python 3x
Summary:
In Python 3.x dictionary values are not a list and cannot be concatenated to a list;
this diff fixes that.

Reviewed By: andrewwdye

Differential Revision: D5576724

fbshipit-source-id: c60441857ceceb9c4a71122d2db5e9abad6d3fc2
2017-08-07 14:23:32 -07:00
Aapo Kyrola
26645154bb warn about using test/val model with init_params=True + fixed some cases
Summary: It is a common mistake to create a test/validation model with init_params=True. When its param_init_net is run, it will overwrite the training model's params, and with DPM those won't be synchronized to all GPUs. I don't want to make this an assertion yet, since it might break people's trainers (it is OK to have init_params=True if you never run the param_init_net...).
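
A hedged sketch of the pattern the warning points to (model names are illustrative; ModelHelper's init_params flag is the relevant switch):

    # Hedged sketch: the test/validation model should reuse the training
    # model's parameters rather than initialize its own.
    from caffe2.python import model_helper

    train_model = model_helper.ModelHelper(name="train", init_params=True)
    test_model = model_helper.ModelHelper(name="test", init_params=False)
    # Running test_model.param_init_net can no longer clobber train_model's params.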

Reviewed By: asaadaldien

Differential Revision: D5509963

fbshipit-source-id: 63b1a16ec0af96e3790e226850f6e0e64689143f
2017-07-27 13:20:27 -07:00
Aapo Kyrola
af1e45c1e1 support appending net and converting them
Summary:
As per rushabhmshah99's request: he wants to append a pre-trained model (without training it) to the model.
So I added data_parallel_model.ConvertNetForDevice() to enable that. The unit test shows an example of how to use this with
AppendNet, and I also added a blurb to the function.
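
A hedged sketch of the intended usage; ConvertNetForDevice comes from this commit, but its exact signature, the device loop, and the AppendNet placement are assumptions based on the description:

    # Hedged sketch: append a frozen pre-trained net to each device copy of a
    # data-parallel model.
    from caffe2.proto import caffe2_pb2
    from caffe2.python import core, data_parallel_model, model_helper

    model = model_helper.ModelHelper(name="trainer")  # illustrative, already parallelized
    pretrained_net = core.Net("pretrained")           # illustrative pre-trained net

    for gpu_id in [0, 1]:
        with core.DeviceScope(core.DeviceOption(caffe2_pb2.CUDA, gpu_id)):
            converted = data_parallel_model.ConvertNetForDevice(pretrained_net, gpu_id)
            model.net.AppendNet(converted)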

Differential Revision: D5503335

fbshipit-source-id: b2a5db5c1739dc97f46dd0d7606ed555d99255b8
2017-07-27 11:07:48 -07:00
Aapo Kyrola
3363681304 enable CreateCommonWorld to bootstrap from existing common world
Summary: Use romain-intel's ContextFactory to create common worlds from existing common worlds, thus bypassing the KV store completely. Changed data_parallel_model to automatically detect whether there is already a CW we can use. CreateCommonWorldOp takes an optional second parameter, which is the existing CW.
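
A hedged operator-level sketch; only the "optional second input is an existing common world" detail comes from this commit, the rest is illustrative:

    # Hedged sketch: clone a new common world from an existing one, bypassing
    # the key-value store rendezvous. Blob names are illustrative.
    from caffe2.python import core

    cw_op = core.CreateOperator(
        "CreateCommonWorld",
        ["kv_handler", "existing_common_world"],  # optional second input: existing CW
        ["new_common_world"],
        size=8,
        rank=0,
    )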

Reviewed By: andrewwdye

Differential Revision: D5494956

fbshipit-source-id: 5f7a840bcd5fe4ea756fafeacc746bc2cf5078b0
2017-07-26 22:31:55 -07:00
Ahmed Taei
804ebf7c41 Populate learning rate blob name into data_parallel_model and fix resnet50_trainer example.
Reviewed By: akyrola

Differential Revision: D5463772

fbshipit-source-id: 10b8963af778503a3de6edbabb869747bd1e986d
2017-07-21 16:24:10 -07:00
Geet Sethi
11c4647447 Allow CPU device scope in data_parallel_model and data_parallel_rendevous device scope checks
Summary: Allowing CPU device scope instead of enforcing no device scope in data_parallel_model and data_parallel_rendevous.

Reviewed By: akyrola

Differential Revision: D5440492

fbshipit-source-id: bcd4344d64c710ea50ec8a65e3e9d102e35c66ea
2017-07-18 15:47:41 -07:00
Geet Sethi
ab0d631d6d Adding AllCompare-like function to data_parallel_model
Summary: Added function _RunComparison to data_parallel_model that checks whether all shards in a given rendezvous have the same value for a given blob_name

Reviewed By: wesolwsk

Differential Revision: D5394164

fbshipit-source-id: c2b07d0f8d5846fa9887d53b0be091a8c057f106
2017-07-13 13:03:57 -07:00
Geet Sethi
a68bb5e3f9 Added device scope checks to data_parallel_model and data_parallel_rendevous
Summary:
Added device scope checks to data_parallel_model and data_parallel_rendevous

Added a test to data_parallel_model_test to verify that the checks work correctly

Fixed device_scope error in test_synchronization_barrier

Reviewed By: akyrola

Differential Revision: D5403936

fbshipit-source-id: 849c1cd7452692efbc5ef74d2d60ede090c9c017
2017-07-12 10:47:28 -07:00
Ralph Mao
febae7b20b fix a bug in the report function of Data_Parallel
Summary: Replace params with sp; otherwise it will report an empty list

Reviewed By: akyrola

Differential Revision: D5382716

fbshipit-source-id: 34d8e6ee00cbe1718702e3d1f23ea12f8d65063e
2017-07-07 13:03:46 -07:00
Andrew Dye
31f394f8b3 Add synchronization barrier API to data parallel model
Summary: Add synchronization barrier API with configurable timeout. Users can call Synchronize() to join variable-length execution before resuming multi-machine communication steps, i.e., resuming distributed training iterations after validation on a single machine.
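
A hedged usage sketch; only the Synchronize() name comes from the commit, the timeout argument name is an assumption:

    # Hedged sketch: re-join all machines before resuming distributed training,
    # e.g. after rank 0 spent a variable amount of time on validation.
    from caffe2.python import data_parallel_model, model_helper

    model = model_helper.ModelHelper(name="trainer")  # illustrative, already parallelized
    # ... single-machine validation work happens here ...
    data_parallel_model.Synchronize(model, timeout_sec=300)  # timeout arg name assumed
    # ... all workers now resume multi-machine training together ...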

Reviewed By: akyrola

Differential Revision: D5348387

fbshipit-source-id: 5826da10e6a60c50394c36c7cf47624f10191d11
2017-07-06 09:21:19 -07:00
Aapo Kyrola
2d133d4627 increase concurrency default
Summary: Huge improvement in my tests, and it does not really hurt either.

Reviewed By: wesolwsk

Differential Revision: D5374925

fbshipit-source-id: c96a4ed2ca653120a82233c0037cbfded8a2d2a1
2017-07-05 21:46:31 -07:00
Simon Layton
090506ac87 Add NCCLBroadcast to correct net
Summary:
Otherwise it was always added to the main net instead of param_init_net when
desired (i.e., for the initial param sync)
Closes https://github.com/caffe2/caffe2/pull/894

Differential Revision: D5367451

Pulled By: akyrola

fbshipit-source-id: 3d82be6da687c736bd15f4852dbd272266eb4811
2017-07-03 16:54:44 -07:00
Aapo Kyrola
8c74c36626 fix reducing device option
Summary: This was broken in a previous diff; fix it to use the model's device type.

Reviewed By: asaadaldien

Differential Revision: D5356005

fbshipit-source-id: a4fcc932bae772076b57625a5fcc0d38eb702cc9
2017-06-30 09:19:57 -07:00
Thomas Dudziak
5355634dac Dict fixes/improvements and unittest targets for Python 3 in caffe2 core
Summary: As title

Reviewed By: salexspb

Differential Revision: D5316104

fbshipit-source-id: aee43819d817842e5ce6ba3d045a55b1a2491c30
2017-06-29 17:05:41 -07:00
Yongqiang Wang
ea659b8f2e broadcast to global parameters when using warmup
Reviewed By: asaadaldien, jay-mahadeokar

Differential Revision: D5340692

fbshipit-source-id: 80879847ff71c8d620de502ef95a9ffb4bdf595d
2017-06-28 13:35:27 -07:00
Ahmed Taei
fbe2526343 Allow concurrent execution of GLOO broadcast collectives in
Summary:
This adds a CollectivesConcurrencyControl class to manage creating common contexts and cyclic controls for executing Gloo collectives,
and refactors AllReduce and _AddDistributedParameterSync to use it

Reviewed By: akyrola

Differential Revision: D5335795

fbshipit-source-id: 5084e0a65cdb989cd949be3868b77a680561022d
2017-06-28 12:49:12 -07:00
Henry Lu
9a14c013c3 Refactor data_parallel_model to take advantage of Gloo broadcast op in broadcasting across machines and GPUs in one operation
Summary: Combine _AddDistributedParameterSync() and _SyncParams() into a single function that broadcasts across distributed machines and all local GPUs simultaneously. This is similar to how Allreduce calls have already been optimized using the functionality of Gloo. All the refactoring work is contained in data_parallel_model.py.

Reviewed By: akyrola, andrewwdye

Differential Revision: D5329277

fbshipit-source-id: 4407b88980cf396f2e0f994d796294fa79fd39ed
2017-06-27 19:35:24 -07:00
Simon Layton
d45f722e43 data_parallel_model: NCCLBroadcast root fix
Summary:
The root is the root _rank_ and not the root _device_. Thus we always
use root=0, regardless of the devices used.

https://github.com/NVIDIA/nccl/blob/v1.3.0-1/src/broadcast.cu#L75

/cc slayton58
Closes https://github.com/caffe2/caffe2/pull/872

Differential Revision: D5329564

Pulled By: akyrola

fbshipit-source-id: 5a34be30c1a0046a74f28437cb08333c1fb46098
2017-06-27 09:47:48 -07:00
Jay Mahadeokar
04c9c8c5c2 fix for loading model with bmuf
Summary: One-line fix for loading a saved checkpoint when using Parallelize_GPU_BMUF

Reviewed By: asaadaldien

Differential Revision: D5315254

fbshipit-source-id: a20ba6438c8e6b2ef44b65270c1d3f9ab645ded0
2017-06-23 17:16:33 -07:00
Thomas Dudziak
342de07231 Core unit test fixes for Python 3
Summary: As title

Differential Revision: D5291327

fbshipit-source-id: 7dd9279c53ba55d3422c31973ffcec5705787fdf
2017-06-23 13:22:16 -07:00
Ahmed Taei
5ca263fb1c Add a warmup option for BMUF
Reviewed By: yqwangustc

Differential Revision: D5279655

fbshipit-source-id: 7c778a88909580bbe43d4bac4b7d73be0d0e3f27
2017-06-22 14:32:39 -07:00
Ahmed Taei
ffd32c8ab7 Add distributed BMUF implementation.
Summary:
Refactor the data_parallel_model all_reduce and broadcast methods to work on
a given parameter set, not only gradients, and reuse them for the distributed
BMUF implementation.
Add a distributed (multiprocessing) test to BMUF.

Reviewed By: akyrola

Differential Revision: D5267083

fbshipit-source-id: 8dcc7527d0a755b903d693d8071585f0b54d3403
2017-06-21 16:18:11 -07:00
Aapo Kyrola
34eaa19d27 CPU data parallel model
Summary:
CPU version of the data parallel model. A great thing is that now we can run data_parallel_model_test in Sandcastle (as it does not have GPUs).

Pretty simple change, really. I did not change all the variable names with "gpu" in them, to reduce risk (and out of a bit of laziness). Can improve later.
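
A hedged sketch of the CPU entry point (the Parallelize_CPU name matches later versions of the module; the builder functions and worker count are illustrative):

    # Hedged sketch: same builder-function API as the GPU path, replicated over
    # CPU worker indices instead of GPU ids.
    from caffe2.python import data_parallel_model, model_helper

    model = model_helper.ModelHelper(name="cpu_trainer")
    data_parallel_model.Parallelize_CPU(
        model,
        input_builder_fun=lambda m: None,                 # illustrative no-op builders
        forward_pass_builder_fun=lambda m, loss_scale: [],
        param_update_builder_fun=lambda m: None,
        devices=[0, 1, 2, 3],                             # CPU worker indices
    )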

Reviewed By: wesolwsk

Differential Revision: D5277350

fbshipit-source-id: 682e0c5f9f4ce94a8f5bd089905b0f8268bd2210
2017-06-20 23:19:08 -07:00