pytorch/caffe2/python
Devesh Agrawal 16549ed92b Scaled training and fetching from the PS
Summary:
Today, each parameter server (PS) stores the entire embedding rather than just
its own subsection of it. This was an oversight on the part of the original
author, and this diff fixes it.

1. The sparse params are sharded across the PSs, and each PS stores only its
section of the embedding. The trainer requests the ids as is from the PS, but
the PS divides each id by num_of_shards before looking it up in the embedding
table blob. This happens on both the forward and the backward pass. During the
model download part, however, the PS multiplies the embeddings by num_of_shards
before returning them to the trainer. The upshot is that the trainer does not
know anything about how the embeddings are scaled on the PS; the PS adds the
extra divide and multiply steps to achieve that.

2. At estimation time, we allocate just one PS. To make all of the embeddings
fit on that single PS, we additionally scale the hash table sizes
(proportionally and equally for all the sparse params) until they fit. This
scaling is handled analogously to (1).
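
The id handling described above can be sketched roughly as follows. This is a minimal illustration and not the code from this diff: the class `ShardedEmbeddingPS`, its method names, and the assumption that a global id `g` lives on shard `g % num_shards` at local row `g // num_shards` are all hypothetical. In particular, the multiply step on download is read here as mapping local rows back to trainer-visible ids, which is only one possible interpretation of the summary.

```python
import numpy as np


class ShardedEmbeddingPS:
    """Hypothetical parameter-server shard (illustrative only, not caffe2 code)."""

    def __init__(self, shard_id, num_shards, global_rows, dim):
        self.shard_id = shard_id
        self.num_shards = num_shards
        # Store only this shard's slice: ceil(global_rows / num_shards) rows.
        local_rows = -(-global_rows // num_shards)
        self.table = np.zeros((local_rows, dim))

    def _local(self, global_ids):
        # Divide step: the trainer sends ids as is, the PS maps them to local rows.
        return np.asarray(global_ids) // self.num_shards

    def lookup(self, global_ids):
        # Forward pass: fetch rows for the requested ids.
        return self.table[self._local(global_ids)]

    def apply_gradient(self, global_ids, grads, lr=0.1):
        # Backward pass: the same divide step, then a plain SGD update.
        # ufunc.at accumulates correctly when an id appears more than once.
        np.subtract.at(self.table, self._local(global_ids), lr * np.asarray(grads))

    def download(self):
        # Multiply step (assumed reading): map local rows back to global ids.
        local_rows = np.arange(self.table.shape[0])
        global_ids = local_rows * self.num_shards + self.shard_id
        return global_ids, self.table.copy()
```

In this sketch the trainer only ever sees global ids; the divide and multiply both stay inside the PS, which is the property the summary describes.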

Reviewed By: boryiingsu

Differential Revision: D5664093

fbshipit-source-id: 92f501f61566f939c41ce0b614a1b499669f978a
2017-08-23 18:16:03 -07:00
..
docs Dict fixes/improvements and unittest targets for Python 3 in caffe2 core 2017-06-29 17:05:41 -07:00
examples Add fp16 and tensorcore support to resnet50_trainer 2017-08-17 15:16:24 -07:00
helpers Layer norm brew wrapper 2017-08-17 11:17:47 -07:00
layers pairwise dot product with dot_groups support 2017-08-23 15:23:36 -07:00
mint doxygen python block added 2017-03-29 06:46:16 -07:00
mkl Support grouped convolutions in MKL 2017-07-25 14:19:02 -07:00
modeling Scaled training and fetching from the PS 2017-08-23 18:16:03 -07:00
models rectify args btw. train and translate 2017-08-10 15:27:18 -07:00
operator_test Added fast path for CUDNN global max pooling 2017-08-23 16:33:06 -07:00
predictor Dict fixes/improvements and unittest targets for Python 3 in caffe2 core 2017-06-29 17:05:41 -07:00
rnn Revert D5589309: modify _LSTM into _RNN to adapt GRU 2017-08-10 16:42:41 -07:00
_import_c_extension.py doxygen python block added 2017-03-29 06:46:16 -07:00
allcompare_test.py Adding AllCompare-like function to data_parallel_model 2017-07-13 13:03:57 -07:00
attention.py Reduce memory usage for dot attention 2017-08-14 12:35:50 -07:00
binarysize.py binary size util 2017-07-14 17:49:24 -07:00
brew_test.py Allow passing unsymmetric 2d kernels to brew.conv. 2017-08-10 15:27:16 -07:00
brew.py Layer norm brew wrapper 2017-08-17 11:17:47 -07:00
build.py Strip Operator Schema in mobile build 2017-08-22 13:31:08 -07:00
caffe_translator_test.py
caffe_translator.py Read pretrained weights using binary mode in caffe_translator.py 2017-07-08 10:17:57 -07:00
checkpoint_test.py Changes the checkpoint naming rules. 2017-08-17 22:16:42 -07:00
checkpoint.py Adds checkpoint taskgroups to the online trainer. 2017-08-19 04:09:47 -07:00
CMakeLists.txt
cnn.py cnnmodelhelper deprecate warning 2017-05-18 23:35:26 -07:00
context_test.py Add default implementation of __call__ for context manager 2017-08-22 17:46:22 -07:00
context.py Add default implementation of __call__ for context manager 2017-08-22 17:46:22 -07:00
control_test.py
control.py Dict fixes/improvements and unittest targets for Python 3 in caffe2 core 2017-06-29 17:05:41 -07:00
convnet_benchmarks_test.py
convnet_benchmarks.py brew API in convnet benchmark 2017-07-05 10:34:48 -07:00
core_gradients_test.py warn about orphan StopGradient output 2017-07-20 21:41:41 -07:00
core_test.py Support session in distributed realtime trainer 2017-08-16 10:28:55 -07:00
core.py Support session in distributed realtime trainer 2017-08-16 10:28:55 -07:00
crf.py Deprecate CNNModelHelper in python/crf.py 2017-06-14 08:49:27 -07:00
data_parallel_model_test.py disable travis test for dpm test 2017-08-15 19:17:41 -07:00
data_parallel_model.py Device-specific memongering 2017-08-17 13:31:26 -07:00
data_workers_test.py Caffe2: Refactor the core logic from data_workers.py into parallel_workers.py 2017-08-07 10:14:08 -07:00
data_workers.py Caffe2: Refactor the core logic from data_workers.py into parallel_workers.py 2017-08-07 10:14:08 -07:00
dataio_test.py Allow tasks/execution_steps to be cloned at runtime 2017-06-20 22:32:07 -07:00
dataio.py Fix a few typos and grammars in comment 2017-06-14 18:22:39 -07:00
dataset.py Option to enforce batch size 2017-08-01 22:29:55 -07:00
db_test.py String-related fixes for Python 3 2017-05-26 16:04:32 -07:00
device_checker.py Dict fixes/improvements and unittest targets for Python 3 in caffe2 core 2017-06-29 17:05:41 -07:00
dyndep.py doxygen python block added 2017-03-29 06:46:16 -07:00
embedding_generation_benchmark.py Benchmark for embedding generation 2017-08-15 14:22:41 -07:00
empty.so Adding video data layer for caffe2 2017-05-05 14:16:38 -07:00
experiment_util.py Dict fixes/improvements and unittest targets for Python 3 in caffe2 core 2017-06-29 17:05:41 -07:00
extension_loader.py Make extension loader properly handle visibility. 2017-03-30 14:38:38 -07:00
gradient_check_test.py Cos, Sin, and Abs operators 2017-07-03 22:18:32 -07:00
gradient_checker.py Fix a few typos and grammars in comment 2017-06-14 18:22:39 -07:00
gru_cell.py Revert D5589309: modify _LSTM into _RNN to adapt GRU 2017-08-10 16:42:41 -07:00
hsm_util.py doxygen python block added 2017-03-29 06:46:16 -07:00
hypothesis_test_util.py Adding a range operator similar to np.arange 2017-08-18 14:45:56 -07:00
hypothesis_test.py Implement gradients for Col2Im and Im2Col operators 2017-08-07 15:51:30 -07:00
layer_model_helper.py Adding parameter sharing API to Dper2 2017-08-03 00:33:18 -07:00
layer_model_instantiator.py saving/loading CPU/GPU nets 2017-07-23 02:18:15 -07:00
layer_parameter_sharing_test.py Adding parameter sharing API to Dper2 2017-08-03 00:33:18 -07:00
layer_test_util.py Add a method to run a train net multiple times in layer_test_util.py 2017-07-28 19:56:05 -07:00
layers_test.py Create MergeIdListsLayer 2017-08-22 17:00:55 -07:00
load_save_test.py Allow Load operator to load into overriden names 2017-04-27 01:18:12 -07:00
lstm_benchmark.py Revert D5001637: [Caffe2][RNN] Threaded dependency-aware RNNExecutor (frontier/diagonal execution). 2017-08-16 03:21:49 -07:00
memonger_test.py Device-specific memongering 2017-08-17 13:31:26 -07:00
memonger.py Device-specific memongering 2017-08-17 13:31:26 -07:00
mkl_test_util.py Implement a filler op test 2017-07-25 14:18:57 -07:00
model_device_test.py Deprecate CNNModelHelper in caffe2/python/model_device_test.py 2017-06-22 15:37:17 -07:00
model_helper.py ExtractPredictorNet should strip gpu_id prefix from step_net 2017-07-27 16:06:47 -07:00
mpi_python.cc Fix pybind11 module name for MPI helpers 2017-05-02 23:18:50 -07:00
muji_test.py Fixes range/xrange for Python 3 2017-06-07 00:04:26 -07:00
muji.py Fixes range/xrange for Python 3 2017-06-07 00:04:26 -07:00
net_builder_test.py Allow tasks/execution_steps to be cloned at runtime 2017-06-20 22:32:07 -07:00
net_builder.py Allow tasks/execution_steps to be cloned at runtime 2017-06-20 22:32:07 -07:00
net_drawer.py Dict fixes/improvements and unittest targets for Python 3 in caffe2 core 2017-06-29 17:05:41 -07:00
net_printer_test.py Allow tasks/execution_steps to be cloned at runtime 2017-06-20 22:32:07 -07:00
net_printer.py net_printer.to_string() accepts NetDef 2017-08-01 10:17:29 -07:00
optimizer_context.py allow param_info to set optimizer 2017-07-12 08:49:48 -07:00
optimizer_test_util.py Fp16 training initializers 2017-06-01 08:34:46 -07:00
optimizer_test.py Always use assertAlmostEqual for floats when crossing python and C boundaries 2017-08-06 14:51:11 -07:00
optimizer.py Populate learning rate blob name into data_parallel_model and fix resnet50_trainer example. 2017-07-21 16:24:10 -07:00
parallel_workers_test.py Caffe2: Refactor the core logic from data_workers.py into parallel_workers.py 2017-08-07 10:14:08 -07:00
parallel_workers.py Caffe2 [easy]: Better exception logging in parallel_workers/data_workers 2017-08-10 15:27:19 -07:00
parallelize_gpu_bmuf_distributed_test.py Added Nesterov 2017-08-11 13:52:43 -07:00
pipeline.py Enable runtime cloning of tasks. 2017-06-21 03:18:20 -07:00
predictor_constants.py Re-apply #266 2017-04-25 21:17:04 -07:00
pybind_state_gpu.cc
pybind_state_mkl.cc MKL code move 2017-07-26 20:21:55 -07:00
pybind_state.cc Strip Operator Schema in mobile build 2017-08-22 13:31:08 -07:00
pybind_state.h fast simple-net memonger for C++ 2017-07-06 15:17:07 -07:00
python_op_test.py Fix some typos 2017-06-28 13:50:48 -07:00
queue_util.py remove unsed code and bring back single benchmark mode 2017-08-16 14:06:31 -07:00
record_queue.py Fix a few typos and grammars in comment 2017-06-14 18:22:39 -07:00
recurrent.py fix forward_only mode 2017-08-17 10:19:04 -07:00
rnn_cell.py fix forward_only mode 2017-08-17 10:19:04 -07:00
schema_test.py Return empty Struct when get_field has empty input 2017-08-01 19:49:47 -07:00
schema.py Return empty Struct when get_field has empty input 2017-08-01 19:49:47 -07:00
scope_test.py Fix corruption of NameScope when exception is thrown 2017-04-24 22:46:27 -07:00
scope.py Dict fixes/improvements and unittest targets for Python 3 in caffe2 core 2017-06-29 17:05:41 -07:00
session_test.py Warn on setting blob on Scalar 2017-05-01 20:18:30 -07:00
session.py Support session in distributed realtime trainer 2017-08-16 10:28:55 -07:00
sparse_to_dense_mask_test.py String-related fixes for Python 3 2017-05-26 16:04:32 -07:00
task.py Support session in distributed realtime trainer 2017-08-16 10:28:55 -07:00
test_util.py doxygen python block added 2017-03-29 06:46:16 -07:00
text_file_reader.py doxygen python block added 2017-03-29 06:46:16 -07:00
timeout_guard.py Dict fixes/improvements and unittest targets for Python 3 in caffe2 core 2017-06-29 17:05:41 -07:00
toy_regression_test.py
tt_core_test.py
tt_core.py Fix a few typos and grammars in comment 2017-06-14 18:22:39 -07:00
utils.py Update proto definition 2017-08-22 19:01:18 -07:00
visualize.py Python 3 compatible integer division 2017-07-06 11:47:12 -07:00
workspace_test.py ApplyTransformIfFaster 2017-08-17 15:36:51 -07:00
workspace.py ApplyTransformIfFaster 2017-08-17 15:36:51 -07:00