pytorch/caffe2/python
Aapo Kyrola eddf23ca0f Handle parameters that are computed but not optimized
Summary:
prigoyal sharply noticed a bug in the ResNet models: we have not been checkpointing, nor synchronizing between GPUs, the moving average and variance computed by the SpatialBN ops. The first problem is particularly serious, since models restarted from a checkpoint would have started from a null-state for SpatialBN. Not synchronizing with the data parallel model is less tragic, since each GPU should see very similar data.

Thus I propose keeping track of "computed params", i.e. params that are computed from data but not optimized. I don't know if there are other examples, but SpatialBN's moving avg and var definitely are one.
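
To illustrate the idea, here is a minimal sketch of keeping computed params in a separate list from optimized params. `ToyModel`, `add_spatial_bn`, `blobs_to_checkpoint`, and the blob name suffixes are hypothetical names chosen for this example; they are not taken from the Caffe2 sources touched by this commit.

```python
class ToyModel:
    """Hypothetical stand-in for a model helper; not the Caffe2 API."""

    def __init__(self):
        self.params = []           # updated by the optimizer
        self.computed_params = []  # computed from data, never optimized

    def add_spatial_bn(self, name):
        # Scale and bias are learned, so they are ordinary params.
        self.params += [name + "_scale", name + "_bias"]
        # The moving average and variance are only accumulated from data.
        self.computed_params += [name + "_mean", name + "_var"]

    def blobs_to_checkpoint(self):
        # A checkpoint needs both kinds of blobs to restore the full state.
        return self.params + self.computed_params


model = ToyModel()
model.add_spatial_bn("bn1")
print(model.blobs_to_checkpoint())
```

Keeping the two lists apart lets the optimizer skip the computed blobs while checkpointing and synchronization code can still find them.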

- I modified the checkpointing for xray model to store those blobs + also ensure the synchronization of those blobs
- I modified data parallel model to broadcast those params from gpu0. I first tried averaging, but hit some NCCL deadlocks ... :(
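
The broadcast from gpu0 described above can be sketched roughly as follows. This is a plain-Python illustration, assuming a dict stands in for the workspace and that per-device blobs are named `gpu_<id>/<blob>`; the function name `broadcast_computed_params` and the naming scheme are assumptions made for this example, not the actual data parallel model code.

```python
def broadcast_computed_params(workspace, computed_params, num_gpus):
    """Copy each computed param from device 0 to every other device."""
    for name in computed_params:
        source = workspace["gpu_0/" + name]
        for gpu in range(1, num_gpus):
            # Copy the values so devices do not share one list object.
            workspace["gpu_{}/{}".format(gpu, name)] = list(source)


# Toy workspace: the two devices have drifted slightly apart.
workspace = {
    "gpu_0/bn1_mean": [0.5, 0.1],
    "gpu_1/bn1_mean": [0.4, 0.2],
}
broadcast_computed_params(workspace, ["bn1_mean"], num_gpus=2)
print(workspace["gpu_1/bn1_mean"])
```

Broadcasting overwrites the statistics on the other devices with those of gpu0 instead of combining them, which is the simpler alternative to averaging mentioned in the list above.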

Differential Revision: D4281265

fbshipit-source-id: 933311afeec4b7e9344a13cf2d38aa939c50ac31
2016-12-15 12:01:28 -08:00
examples LMDB example 2016-12-05 11:53:26 -08:00
layers dper_example use RowMul for speed 2016-12-15 12:01:28 -08:00
mint more build updates: 2016-08-02 23:28:23 -07:00
models Handle parameters that are computed but not optimized 2016-12-15 12:01:28 -08:00
operator_test Incremental MeanReducer segment Ops 2016-11-29 15:18:38 -08:00
_import_c_extension.py fbsync. TODO: check if build files need update. 2016-11-15 00:00:46 -08:00
BREW more build fixing 2016-09-07 23:30:35 -07:00
caffe_translator_test.py protected legacy_pad_, replace DeleteDropout with is_test=True 2016-07-29 11:44:55 -07:00
caffe_translator.py fbsync. TODO: check if build files need update. 2016-11-15 00:00:46 -08:00
cnn.py Handle parameters that are computed but not optimized 2016-12-15 12:01:28 -08:00
context.py fbsync. TODO: check if build files need update. 2016-11-15 00:00:46 -08:00
control_test.py fbsync. TODO: check if build files need update. 2016-11-15 00:00:46 -08:00
control.py Use native reader for evaluation 2016-12-05 11:53:27 -08:00
convnet_benchmarks_test.py chunky sync - build scripts to be written 2016-07-21 10:16:42 -07:00
convnet_benchmarks.py fbsync. TODO: check if build files need update. 2016-11-15 00:00:46 -08:00
core_gradients_test.py automatic aggregation of sparse gradients 2016-12-05 11:53:26 -08:00
core_test.py Proper error message if passing NoneType value for kwargs 2016-11-29 15:18:36 -08:00
core.py Hacky fix for cloned model rewriting 2016-12-05 11:53:26 -08:00
data_parallel_model_test.py pass learning rate scaling factor to parameter update builder function 2016-12-05 11:53:26 -08:00
data_parallel_model.py Handle parameters that are computed but not optimized 2016-12-15 12:01:28 -08:00
dataio_test.py fbsync. TODO: check if build files need update. 2016-11-15 00:00:46 -08:00
dataio.py fbsync at f5a877 2016-11-18 15:41:06 -08:00
dataset.py fbsync. TODO: check if build files need update. 2016-11-15 00:00:46 -08:00
db_test.py Fix db_test under tsan 2016-11-29 15:18:37 -08:00
device_checker.py chunky sync 2016-09-06 15:55:19 -07:00
dyndep.py fbsync. TODO: check if build files need update. 2016-11-15 00:00:46 -08:00
experiment_util.py use Pieter-MPI and fb.distributed 2016-11-29 15:18:36 -08:00
extension_loader.py fbsync 2016-10-07 13:08:53 -07:00
gradient_check_test.py Fix few more operators to handle empty batches correctly. 2016-11-29 15:18:37 -08:00
gradient_checker.py fbsync 2016-10-07 13:08:53 -07:00
hsm_test.py fbsync. TODO: check if build files need update. 2016-11-15 00:00:46 -08:00
hsm_util.py fbsync. TODO: check if build files need update. 2016-11-15 00:00:46 -08:00
hypothesis_test_util.py fbsync at f5a877 2016-11-18 15:41:06 -08:00
hypothesis_test.py fix sliceop for empty batch 2016-11-29 15:18:39 -08:00
layer_model_helper.py fbsync. TODO: check if build files need update. 2016-11-15 00:00:46 -08:00
layer_model_instantiator.py fbsync. TODO: check if build files need update. 2016-11-15 00:00:46 -08:00
load_save_test.py fbsync. TODO: check if build files need update. 2016-11-15 00:00:46 -08:00
memonger_test.py add untracked files 2016-07-21 11:26:41 -07:00
memonger.py add untracked files 2016-07-21 11:26:41 -07:00
model_device_test.py sync 2016-07-28 15:06:43 -07:00
model_helper.py Handle parameters that are computed but not optimized 2016-12-15 12:01:28 -08:00
muji_test.py chunky sync - build scripts to be written 2016-07-21 10:16:42 -07:00
muji.py fbsync 2016-10-07 13:08:53 -07:00
net_builder_test.py fbsync. TODO: check if build files need update. 2016-11-15 00:00:46 -08:00
net_builder.py fbsync. TODO: check if build files need update. 2016-11-15 00:00:46 -08:00
net_drawer.py fbsync. TODO: check if build files need update. 2016-11-15 00:00:46 -08:00
pipeline.py fbsync. TODO: check if build files need update. 2016-11-15 00:00:46 -08:00
pybind_state_gpu.cc fbsync. TODO: check if build files need update. 2016-11-15 00:00:46 -08:00
pybind_state_mkl.cc Expose MKLMemory to the Python Feed and Fetch interface, and misc changes 2016-11-29 15:18:36 -08:00
pybind_state.cc Allow PythonOp to access the workspace 2016-12-05 11:53:26 -08:00
pybind_state.h Allow PythonOp to access the workspace 2016-12-05 11:53:26 -08:00
python_op_test.py Allow PythonOp to access the workspace 2016-12-05 11:53:26 -08:00
queue_util.py fbsync. TODO: check if build files need update. 2016-11-15 00:00:46 -08:00
record_queue.py chunky sync 2016-09-06 15:55:19 -07:00
schema_test.py fbsync. TODO: check if build files need update. 2016-11-15 00:00:46 -08:00
schema.py fbsync at f5a877 2016-11-18 15:41:06 -08:00
scope_test.py fbsync. TODO: check if build files need update. 2016-11-15 00:00:46 -08:00
scope.py fbsync. TODO: check if build files need update. 2016-11-15 00:00:46 -08:00
session_test.py fbsync. TODO: check if build files need update. 2016-11-15 00:00:46 -08:00
session.py fbsync at f5a877 2016-11-18 15:41:06 -08:00
snapshot_test.py fbsync. TODO: check if build files need update. 2016-11-15 00:00:46 -08:00
snapshot.py fbsync at f5a877 2016-11-18 15:41:06 -08:00
sparse_to_dense_mask_test.py Fix few more operators to handle empty batches correctly. 2016-11-29 15:18:37 -08:00
task.py fbsync at f5a877 2016-11-18 15:41:06 -08:00
test_util.py chunky sync 2016-09-06 15:55:19 -07:00
text_file_reader.py fix race condition in text_file_reader.py 2016-11-29 15:18:36 -08:00
timeout_guard.py fbsync. TODO: check if build files need update. 2016-11-15 00:00:46 -08:00
toy_regression_test.py sync 2016-08-10 11:02:15 -07:00
tt_core_test.py sync 2016-08-10 11:02:15 -07:00
tt_core.py sync 2016-08-10 11:02:15 -07:00
utils.py Proper error message if passing NoneType value for kwargs 2016-11-29 15:18:36 -08:00
visualize.py chunky sync 2016-05-13 14:43:48 -07:00
workspace_test.py check that numpy arrays are float32 when CUDA is used 2016-11-29 15:18:37 -08:00
workspace.py pass learning rate scaling factor to parameter update builder function 2016-12-05 11:53:26 -08:00