pytorch/caffe2/python
Alisson Gusatti Azzolini c24dabb414 Enable runtime cloning of tasks.
Summary:
Funnily, the biggest issue when trying to increase the number of trainers from 5 to 20 is not model convergence (it is worse, but the model still converges without tuning); it is the initialization time: generating the job took around 30 min.

After this diff, job creation time for the standard 5-7 setup goes from 125s to 8s (a 15x speedup).

Another improvement is that ##net_printer.to_string(job)## becomes less complex.

This brings the startup time for 20 trainers down to 32s, which is still not ideal.

The next step will be to allow passing num_instances to Node as well. This way we'll be able to create only one reader prototype and one trainer prototype and let the framework take care of the scheduling. For this we will first need to move some DataStream and PS initialization code to C++. (c.c. aartibasant)

Reviewed By: dzhulgakov

Differential Revision: D5100788

fbshipit-source-id: 7b76bce108f527a96b2bfe7ed43a22ea8679b682
2017-06-21 03:18:20 -07:00
docs fixed operators schema output to work from only this file for OSS 2017-06-02 13:47:25 -07:00
examples CPU data parallel model 2017-06-20 23:19:08 -07:00
helpers fix #790 so model.init_params = False takes effect 2017-06-20 14:08:35 -07:00
layers add rekey in feature_processor 2017-06-20 23:19:09 -07:00
mint doxygen python block added 2017-03-29 06:46:16 -07:00
mkl Deprecate CNNModelHelper - Inception() 2017-06-15 14:03:27 -07:00
modeling Skip fp16 initializer test for CPU-only builds 2017-06-19 12:21:25 -07:00
models improve blob sharing 2017-06-20 12:08:57 -07:00
operator_test Don't require pydot for Python tests 2017-06-19 23:02:00 -07:00
predictor Allow specifying device to load_from_db() 2017-06-14 14:32:24 -07:00
rnn CuDNN comparison mode 2017-05-20 15:19:43 -07:00
_import_c_extension.py doxygen python block added 2017-03-29 06:46:16 -07:00
attention.py use brew for Tranpose --> major perf regression fix 2017-06-16 11:02:48 -07:00
brew_test.py Fixing broken Python tests 2017-06-08 13:34:46 -07:00
brew.py add_helper_function_ElementwiseLinear_op 2017-06-07 13:49:48 -07:00
caffe_translator_test.py Allow test discovery in caffe2/python/ 2017-03-14 18:16:41 -07:00
caffe_translator.py Fixed typo caffe_translator.py, fixes bug #397 2017-05-24 12:18:32 -07:00
checkpoint_test.py Allow tasks/execution_steps to be cloned at runtime 2017-06-20 22:32:07 -07:00
checkpoint.py Adds interfaces to check the existence of a DB 2017-04-11 14:07:49 -07:00
CMakeLists.txt CMake completions work 2017-01-11 16:59:22 -08:00
cnn.py cnnmodelhelper deprecate warning 2017-05-18 23:35:26 -07:00
context_test.py Make ContextManager thread-safe 2017-02-13 19:45:35 -08:00
context.py doxygen python block added 2017-03-29 06:46:16 -07:00
control_test.py fbsync. TODO: check if build files need update. 2016-11-15 00:00:46 -08:00
control.py doxygen python block added 2017-03-29 06:46:16 -07:00
convnet_benchmarks_test.py chunky sync - build scripts to be written 2016-07-21 10:16:42 -07:00
convnet_benchmarks.py doxygen python block added 2017-03-29 06:46:16 -07:00
core_gradients_test.py Revert uuid change to OperatorDef protobuf 2017-06-19 16:47:31 -07:00
core_test.py Revert uuid change to OperatorDef protobuf 2017-06-19 16:47:31 -07:00
core.py Allow tasks/execution_steps to be cloned at runtime 2017-06-20 22:32:07 -07:00
crf.py Deprecate CNNModelHelper in python/crf.py 2017-06-14 08:49:27 -07:00
data_parallel_model_test.py CPU data parallel model 2017-06-20 23:19:08 -07:00
data_parallel_model.py CPU data parallel model 2017-06-20 23:19:08 -07:00
data_workers_test.py add test for input order 2017-05-19 23:46:38 -07:00
data_workers.py Fixes range/xrange for Python 3 2017-06-07 00:04:26 -07:00
dataio_test.py Allow tasks/execution_steps to be cloned at runtime 2017-06-20 22:32:07 -07:00
dataio.py Fix a few typos and grammars in comment 2017-06-14 18:22:39 -07:00
dataset.py Add random shuffle through the data to the benchmark workflow 2017-06-16 13:22:46 -07:00
db_test.py String-related fixes for Python 3 2017-05-26 16:04:32 -07:00
device_checker.py Fix a few typos and grammars in comment 2017-06-14 18:22:39 -07:00
dyndep.py doxygen python block added 2017-03-29 06:46:16 -07:00
empty.so Adding video data layer for caffe2 2017-05-05 14:16:38 -07:00
experiment_util.py Fix a few typos and grammars in comment 2017-06-14 18:22:39 -07:00
extension_loader.py Make extension loader properly handle visibility. 2017-03-30 14:38:38 -07:00
gradient_check_test.py Don't require pydot for Python tests 2017-06-19 23:02:00 -07:00
gradient_checker.py Fix a few typos and grammars in comment 2017-06-14 18:22:39 -07:00
hsm_util.py doxygen python block added 2017-03-29 06:46:16 -07:00
hypothesis_test_util.py Fixes range/xrange for Python 3 2017-06-07 00:04:26 -07:00
hypothesis_test.py Fix entropy error coming from test_div 2017-06-19 13:47:29 -07:00
layer_model_helper.py add rekey in feature_processor 2017-06-20 23:19:09 -07:00
layer_model_instantiator.py Remove map() and filter() in favor of comprehensions 2017-05-30 15:32:58 -07:00
layer_test_util.py Add batch normalization layer 2017-05-26 16:46:52 -07:00
layers_test.py Building dropout as layer 2017-06-19 14:46:52 -07:00
load_save_test.py Allow Load operator to load into overriden names 2017-04-27 01:18:12 -07:00
lstm_benchmark.py Static RNN: gpu support and lstm_benchmark integration 2017-06-16 11:31:43 -07:00
memonger_test.py improve blob sharing 2017-06-20 12:08:57 -07:00
memonger.py improve blob sharing 2017-06-20 12:08:57 -07:00
mkl_test_util.py doxygen python block added 2017-03-29 06:46:16 -07:00
model_device_test.py Comment out NHWC Alexnet test for now 2017-01-23 13:59:29 -08:00
model_helper.py use brew for Tranpose --> major perf regression fix 2017-06-16 11:02:48 -07:00
mpi_python.cc Fix pybind11 module name for MPI helpers 2017-05-02 23:18:50 -07:00
muji_test.py Fixes range/xrange for Python 3 2017-06-07 00:04:26 -07:00
muji.py Fixes range/xrange for Python 3 2017-06-07 00:04:26 -07:00
net_builder_test.py Allow tasks/execution_steps to be cloned at runtime 2017-06-20 22:32:07 -07:00
net_builder.py Allow tasks/execution_steps to be cloned at runtime 2017-06-20 22:32:07 -07:00
net_drawer.py net_drawer: --input is required 2017-05-04 11:45:57 -07:00
net_printer_test.py Allow tasks/execution_steps to be cloned at runtime 2017-06-20 22:32:07 -07:00
net_printer.py Allow tasks/execution_steps to be cloned at runtime 2017-06-20 22:32:07 -07:00
optimizer_test_util.py Fp16 training initializers 2017-06-01 08:34:46 -07:00
optimizer_test.py add_weight_decay + restore weight decay to resnet50_trainer 2017-06-02 14:16:56 -07:00
optimizer.py Fix multi-precision SGD outputs 2017-06-14 11:36:43 -07:00
pipeline.py Enable runtime cloning of tasks. 2017-06-21 03:18:20 -07:00
predictor_constants.py Re-apply #266 2017-04-25 21:17:04 -07:00
pybind_state_gpu.cc Cudnn v6 2017-02-28 17:46:33 -08:00
pybind_state_mkl.cc Expose MKLMemory to the Python Feed and Fetch interface, and misc changes 2016-11-29 15:18:36 -08:00
pybind_state.cc Allow tasks/execution_steps to be cloned at runtime 2017-06-20 22:32:07 -07:00
pybind_state.h Run python op builder at op creation time 2017-06-13 16:29:22 -07:00
python_op_test.py Run python op builder at op creation time 2017-06-13 16:29:22 -07:00
queue_util.py doxygen python block added 2017-03-29 06:46:16 -07:00
record_queue.py Fix a few typos and grammars in comment 2017-06-14 18:22:39 -07:00
recurrent.py Static RNN 2017-06-08 17:48:48 -07:00
rnn_cell.py Static RNN: gpu support and lstm_benchmark integration 2017-06-16 11:31:43 -07:00
schema_test.py Fix from_column_list 2017-06-06 01:17:02 -07:00
schema.py Fixing a small bug in schema where the number of default arguments doesn't match the number of fields 2017-06-15 10:31:56 -07:00
scope_test.py Fix corruption of NameScope when exception is thrown 2017-04-24 22:46:27 -07:00
scope.py quick fix future issue with brew/core/schema/workspace/scope/utils.py 2017-06-05 12:01:48 -07:00
session_test.py Warn on setting blob on Scalar 2017-05-01 20:18:30 -07:00
session.py Allow tasks/execution_steps to be cloned at runtime 2017-06-20 22:32:07 -07:00
sparse_to_dense_mask_test.py String-related fixes for Python 3 2017-05-26 16:04:32 -07:00
task.py Allow tasks/execution_steps to be cloned at runtime 2017-06-20 22:32:07 -07:00
test_util.py doxygen python block added 2017-03-29 06:46:16 -07:00
text_file_reader.py doxygen python block added 2017-03-29 06:46:16 -07:00
timeout_guard.py doxygen python block added 2017-03-29 06:46:16 -07:00
toy_regression_test.py sync 2016-08-10 11:02:15 -07:00
tt_core_test.py sync 2016-08-10 11:02:15 -07:00
tt_core.py Fix a few typos and grammars in comment 2017-06-14 18:22:39 -07:00
utils.py Misc fixes for Python 3 2017-06-13 12:18:43 -07:00
visualize.py doxygen python block added 2017-03-29 06:46:16 -07:00
workspace_test.py Deprecate CNNModelHelper in python/workspace_test.py 2017-06-15 14:17:18 -07:00
workspace.py Revert uuid change to OperatorDef protobuf 2017-06-19 16:47:31 -07:00