pytorch/caffe2/python
Pieter Noordhuis d43ab4bec5 Create Gloo common world through MPI rendezvous
Summary:
Before this change there were two ways for machines to rendezvous for a
distributed run: shared file system or Redis. If you're using an MPI
cluster it is much more convenient to simply execute mpirun and expect
the "right thing (tm)" to happen. This change adds the "mpi_rendezvous"
option to the CreateCommonWorld operator. If this is set, the common
world size and rank will be pulled from the MPI context and Gloo
rendezvous takes place using MPI. Note that this does NOT mean the MPI
BTL is used; MPI is only used for rendezvous.
Closes https://github.com/caffe2/caffe2/pull/1190
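
A minimal sketch of the argument choice described above, in plain Python with no Caffe2 import. Only the "mpi_rendezvous" option name comes from the commit message; the helper name common_world_args and the "size" and "rank" keys are assumptions made for illustration, not the operator's confirmed signature. It shows that with MPI rendezvous the caller passes no size or rank, since those are pulled from the MPI context, while the other path must supply them explicitly.

```python
def common_world_args(mpi_rendezvous=False, size=None, rank=None):
    """Build illustrative keyword arguments for a CreateCommonWorld-style operator.

    Hypothetical helper: only the "mpi_rendezvous" option name is taken from
    the commit message; "size" and "rank" are assumed key names.
    """
    if mpi_rendezvous:
        # Size and rank are pulled from the MPI context by the operator,
        # so the caller does not pass them here.
        return {"mpi_rendezvous": True}
    if size is None or rank is None:
        raise ValueError("size and rank are required without MPI rendezvous")
    if not 0 <= rank < size:
        raise ValueError("rank must be in [0, size)")
    # Shared file system or Redis rendezvous: the caller states size and rank.
    return {"size": size, "rank": rank}


mpi_args = common_world_args(mpi_rendezvous=True)
store_args = common_world_args(size=4, rank=1)
```

In a real job the resulting dictionary would be forwarded as operator arguments; that forwarding step is omitted here because it depends on the Caffe2 build in use.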

Reviewed By: akyrola

Differential Revision: D5796060

Pulled By: pietern

fbshipit-source-id: f8276908d3f3afef2ac88594ad377e38c17d0226
2017-09-08 17:18:47 -07:00
docs Dict fixes/improvements and unittest targets for Python 3 in caffe2 core 2017-06-29 17:05:41 -07:00
examples Create Gloo common world through MPI rendezvous 2017-09-08 17:18:47 -07:00
helpers brew.concat: don't set both order and axis 2017-09-08 10:34:34 -07:00
layers Remove dot_product layer 2017-09-07 18:48:30 -07:00
mint
mkl Support grouped convolutions in MKL 2017-07-25 14:19:02 -07:00
modeling Scaled training and fetching from the PS 2017-08-23 18:16:03 -07:00
models rectify args btw. train and translate 2017-08-10 15:27:18 -07:00
operator_test fp16: SumSqrElements 2017-09-08 16:36:51 -07:00
predictor Allow specification of num_workers in PredictorExportMeta and enable for NMT beam search model 2017-09-07 22:48:45 -07:00
rnn Revert D5589309: modify _LSTM into _RNN to adapt GRU 2017-08-10 16:42:41 -07:00
_import_c_extension.py
allcompare_test.py Adding AllCompare-like function to data_parallel_model 2017-07-13 13:03:57 -07:00
attention.py soft-coverage attention 2017-08-31 21:21:54 -07:00
benchmark_generator.py A benchmark generator for individual ops 2017-08-31 17:33:21 -07:00
binarysize.py binary size util 2017-07-14 17:49:24 -07:00
brew_test.py Allow passing unsymmetric 2d kernels to brew.conv. 2017-08-10 15:27:16 -07:00
brew.py Layer norm brew wrapper 2017-08-17 11:17:47 -07:00
build.py Strip Operator Schema in mobile build 2017-08-22 13:31:08 -07:00
caffe_translator_test.py
caffe_translator.py Read pretrained weights using binary mode in caffe_translator.py 2017-07-08 10:17:57 -07:00
checkpoint_test.py Changes the checkpoint naming rules. 2017-08-17 22:16:42 -07:00
checkpoint.py Enable reader checkpoint 2017-09-05 14:21:25 -07:00
CMakeLists.txt
cnn.py cnnmodelhelper deprecate warning 2017-05-18 23:35:26 -07:00
context_test.py Add default implementation of __call__ for context manager 2017-08-22 17:46:22 -07:00
context.py Add default implementation of __call__ for context manager 2017-08-22 17:46:22 -07:00
control_ops_util.py Control flow operators 2017-08-28 20:04:43 -07:00
control_test.py
control.py Dict fixes/improvements and unittest targets for Python 3 in caffe2 core 2017-06-29 17:05:41 -07:00
convnet_benchmarks_test.py
convnet_benchmarks.py brew API in convnet benchmark 2017-07-05 10:34:48 -07:00
core_gradients_test.py warn about orphan StopGradient output 2017-07-20 21:41:41 -07:00
core_test.py Support session in distributed realtime trainer 2017-08-16 10:28:55 -07:00
core.py Handle bool's correctly in net.Const 2017-08-31 12:02:58 -07:00
crf.py Deprecate CNNModelHelper in python/crf.py 2017-06-14 08:49:27 -07:00
data_parallel_model_test.py support device ids>10 2017-09-07 00:01:33 -07:00
data_parallel_model.py Create Gloo common world through MPI rendezvous 2017-09-08 17:18:47 -07:00
data_workers_test.py Caffe2: Refactor the core logic from data_workers.py into parallel_workers.py 2017-08-07 10:14:08 -07:00
data_workers.py Caffe2: Refactor the core logic from data_workers.py into parallel_workers.py 2017-08-07 10:14:08 -07:00
dataio_test.py Allow tasks/execution_steps to be cloned at runtime 2017-06-20 22:32:07 -07:00
dataio.py Make piper of PipedReaderBuilder takes arguments 2017-09-08 13:46:29 -07:00
dataset.py Option to enforce batch size 2017-08-01 22:29:55 -07:00
db_test.py String-related fixes for Python 3 2017-05-26 16:04:32 -07:00
device_checker.py Dict fixes/improvements and unittest targets for Python 3 in caffe2 core 2017-06-29 17:05:41 -07:00
dyndep.py
embedding_generation_benchmark.py Benchmark for embedding generation 2017-08-15 14:22:41 -07:00
empty.so Adding video data layer for caffe2 2017-05-05 14:16:38 -07:00
experiment_util.py Dict fixes/improvements and unittest targets for Python 3 in caffe2 core 2017-06-29 17:05:41 -07:00
extension_loader.py
gradient_check_test.py Cos, Sin, and Abs operators 2017-07-03 22:18:32 -07:00
gradient_checker.py Fix a few typos and grammars in comment 2017-06-14 18:22:39 -07:00
gru_cell.py Revert D5589309: modify _LSTM into _RNN to adapt GRU 2017-08-10 16:42:41 -07:00
hsm_util.py
hypothesis_test_util.py Adding a range operator similar to np.arange 2017-08-18 14:45:56 -07:00
hypothesis_test.py EnsureDense/SparseToDense for CUDA 2017-09-01 09:33:05 -07:00
layer_model_helper.py Adding parameter sharing API to Dper2 2017-08-03 00:33:18 -07:00
layer_model_instantiator.py saving/loading CPU/GPU nets 2017-07-23 02:18:15 -07:00
layer_parameter_sharing_test.py Adding parameter sharing API to Dper2 2017-08-03 00:33:18 -07:00
layer_test_util.py Add a method to run a train net multiple times in layer_test_util.py 2017-07-28 19:56:05 -07:00
layers_test.py Create MergeIdListsLayer 2017-08-22 17:00:55 -07:00
lengths_reducer_rowwise_8bit_ops_test.py Rowwise quantization 2017-09-06 10:19:38 -07:00
load_save_test.py Allow Load operator to load into overriden names 2017-04-27 01:18:12 -07:00
lstm_benchmark.py enable setting rnn executor threads and max streams 2017-09-08 16:36:51 -07:00
memonger_test.py insert Free ops when blob used last time + memory allocation estimator 2017-09-05 12:03:04 -07:00
memonger.py insert Free ops when blob used last time + memory allocation estimator 2017-09-05 12:03:04 -07:00
mkl_test_util.py Implement a filler op test 2017-07-25 14:18:57 -07:00
model_device_test.py Deprecate CNNModelHelper in caffe2/python/model_device_test.py 2017-06-22 15:37:17 -07:00
model_helper.py YellowFin GPU class and Python optimizer 2017-08-30 18:32:24 -07:00
modifier_context.py add base class ModifierContext, rewrite OptimizerContext, add RegularizerContext 2017-09-08 11:39:23 -07:00
mpi_python.cc Fix pybind11 module name for MPI helpers 2017-05-02 23:18:50 -07:00
muji_test.py Fixes range/xrange for Python 3 2017-06-07 00:04:26 -07:00
muji.py Fixes range/xrange for Python 3 2017-06-07 00:04:26 -07:00
net_builder_test.py Control flow operators 2017-08-28 20:04:43 -07:00
net_builder.py Tuning number of parameter servers based on performance estimation job 2017-08-30 18:03:59 -07:00
net_drawer.py Dict fixes/improvements and unittest targets for Python 3 in caffe2 core 2017-06-29 17:05:41 -07:00
net_printer_test.py Allow tasks/execution_steps to be cloned at runtime 2017-06-20 22:32:07 -07:00
net_printer.py net_printer.to_string() accepts NetDef 2017-08-01 10:17:29 -07:00
optimizer_context.py add base class ModifierContext, rewrite OptimizerContext, add RegularizerContext 2017-09-08 11:39:23 -07:00
optimizer_test_util.py Added support for scaling learning rate of Caffe2 optimizers during training 2017-08-25 19:04:47 -07:00
optimizer_test.py Disabled test for equivalency between Caffe2's and Numpy's YellowFin 2017-09-06 13:47:45 -07:00
optimizer.py YellowFin GPU class and Python optimizer 2017-08-30 18:32:24 -07:00
parallel_workers_test.py Caffe2: Refactor the core logic from data_workers.py into parallel_workers.py 2017-08-07 10:14:08 -07:00
parallel_workers.py Caffe2 [easy]: Better exception logging in parallel_workers/data_workers 2017-08-10 15:27:19 -07:00
parallelize_gpu_bmuf_distributed_test.py Added Nesterov 2017-08-11 13:52:43 -07:00
pipeline_test.py Ability to dequeue and concat multiple records in a single QueueDequeue op 2017-08-31 10:48:59 -07:00
pipeline.py Enable runtime cloning of tasks. 2017-06-21 03:18:20 -07:00
predictor_constants.py Re-apply #266 2017-04-25 21:17:04 -07:00
pybind_state_gpu.cc
pybind_state_mkl.cc MKL code move 2017-07-26 20:21:55 -07:00
pybind_state.cc Strip Operator Schema in mobile build 2017-08-22 13:31:08 -07:00
pybind_state.h fast simple-net memonger for C++ 2017-07-06 15:17:07 -07:00
python_op_test.py Fix some typos 2017-06-28 13:50:48 -07:00
queue_util.py Ability to dequeue and concat multiple records in a single QueueDequeue op 2017-08-31 10:48:59 -07:00
record_queue.py Fix a few typos and grammars in comment 2017-06-14 18:22:39 -07:00
recurrent.py enable setting rnn executor threads and max streams 2017-09-08 16:36:51 -07:00
regularizer_context.py add base class ModifierContext, rewrite OptimizerContext, add RegularizerContext 2017-09-08 11:39:23 -07:00
regularizer.py add regulariztion in caffe2 and dper 2017-09-08 11:39:22 -07:00
rnn_cell.py Fix cell/hidden init issue, add copy states to test 2017-09-06 14:16:17 -07:00
schema_test.py Return empty Struct when get_field has empty input 2017-08-01 19:49:47 -07:00
schema.py logging the blob that has type error 2017-08-23 21:21:27 -07:00
scope_test.py Fix corruption of NameScope when exception is thrown 2017-04-24 22:46:27 -07:00
scope.py Dict fixes/improvements and unittest targets for Python 3 in caffe2 core 2017-06-29 17:05:41 -07:00
session_test.py Warn on setting blob on Scalar 2017-05-01 20:18:30 -07:00
session.py Adds the master setup plan to the model exporter. 2017-08-25 16:01:24 -07:00
sparse_to_dense_mask_test.py Add more enforces to SparseToDenseMask operator. 2017-09-02 02:16:24 -07:00
task.py Support session in distributed realtime trainer 2017-08-16 10:28:55 -07:00
test_util.py Clear the operator default engines before running operator tests 2017-08-29 17:47:20 -07:00
text_file_reader.py
timeout_guard.py Revert D5655753: [Caffe2] better straggler exit procedure 2017-08-25 14:23:09 -07:00
toy_regression_test.py
tt_core_test.py
tt_core.py Fix a few typos and grammars in comment 2017-06-14 18:22:39 -07:00
utils.py Update proto definition 2017-08-22 19:01:18 -07:00
visualize.py Python 3 compatible integer division 2017-07-06 11:47:12 -07:00
workspace_test.py ApplyTransformIfFaster 2017-08-17 15:36:51 -07:00
workspace.py ApplyTransformIfFaster 2017-08-17 15:36:51 -07:00