pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-06 12:20:52 +01:00

Author	SHA1	Message	Date
Junjie Bai	2ad5dcbbe4	Make timeout in resnet50_trainer configurable (#17058) Summary: xw285cornell petrex dagamayank Pull Request resolved: https://github.com/pytorch/pytorch/pull/17058 Differential Revision: D14068458 Pulled By: bddppq fbshipit-source-id: 15df4007859067a22df4c6c407df4121e19aaf97	2019-02-13 17:03:48 -08:00
Junjie Bai	f169f398d0	Change the default image size from 227 to 224 in resnet50 trainer (#16924) Summary: cc xw285cornell Pull Request resolved: https://github.com/pytorch/pytorch/pull/16924 Differential Revision: D14018509 Pulled By: bddppq fbshipit-source-id: fdbc9e94816ce6e4b1ca6f7261007bda7b80e1e5	2019-02-09 11:18:58 -08:00
rohithkrn	ddeaa541aa	fix typo in resnet50_trainer.py Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/16219 Differential Revision: D13776742 Pulled By: bddppq fbshipit-source-id: 10a6ab4c58159b3f619b739074f773662722c1d9	2019-01-22 17:28:04 -08:00
Shane Li	620ff25bdb	Enhance cpu support on gloo based multi-nodes mode. (#11330) Summary: 1. Add some gloo communication operators into related fallback list; 2. Work around to avoid compiling errors while using fallback operator whose CPU operator inherits from 'OperatorBase' directly like PrefetchOperator; 3. Add new cpu context support for some python module files and resnet50 training example file. Pull Request resolved: https://github.com/pytorch/pytorch/pull/11330 Reviewed By: yinghai Differential Revision: D13624519 Pulled By: wesolwsk fbshipit-source-id: ce39d57ddb8cd7786db2e873bfe954069d972f4f	2019-01-15 11:47:10 -08:00
Orion Reblitz-Richardson	febc7ff99f	Add __init__.py so files get picked up on install (#14898) Summary: This will let us install tests and other Caffe2 python code as a part of running Caffe2 tests in PyTorch. Broken out of https://github.com/pytorch/pytorch/pull/13733/ cc pjh5 yf225 Pull Request resolved: https://github.com/pytorch/pytorch/pull/14898 Reviewed By: pjh5 Differential Revision: D13381123 Pulled By: orionr fbshipit-source-id: 0ec96629b0570f6cc2abb1d1d6fce084e7464dbe	2018-12-07 13:40:23 -08:00
rohithkrn	0d663cec30	Unify cuda and hip device types in Caffe2 python front end (#14221) Summary: Goal of this PR is to unify cuda and hip device types in caffe2 python front end. Pull Request resolved: https://github.com/pytorch/pytorch/pull/14221 Differential Revision: D13148564 Pulled By: bddppq fbshipit-source-id: ef9bd2c7d238200165f217097ac5727e686d887b	2018-11-29 14:00:16 -08:00
Xiaodong Wang	eb7a298489	Add resnext model to OSS (#11468) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/11468 Add resnext model into OSS Caffe 2 repo. Reviewed By: orionr, kuttas Differential Revision: D9506000 fbshipit-source-id: 236005d5d7dbeb8c2864014b1eea03810618d8e8	2018-09-12 15:59:20 -07:00
Orion Reblitz-Richardson	6223bfdb1d	Update from Facebook (#6692 ) * [GanH][Easy]: Add assertion to adaptive weighting layer 0 weight causes numeric instability and exploding ne * [Easy] Add cast op before computing norm in diagnose options As LpNorm only takes floats we add a manual casting here. * Introduce a new caching device allocator `cudaMalloc` and `cudaFree` calls are slow, and become slower the more GPUs there are. Essentially, they grab a host-wide (not device-wide) lock because GPU memory is transparently shared across all GPUs. Normally, this isn't much of a concern since workloads allocate memory upfront, and reuse it during later computation. However, under some computation models (specifically, memory conserving approaches like checkpoint-and-recompute, see https://medium.com/@yaroslavvb/fitting-larger-networks-into-memory-583e3c758ff9) this assumption is no longer true. In these situations, `cudaMalloc` and `cudaFree` are common and frequent. Furthermore, in data parallel contexts, these calls happen at nearly the same time from all GPUs worsening lock contention. A common solution to this problem is to add a custom allocator. In fact, nVIDIA provides one out of the box: CUB, which Caffe2 already supports. Unfortunately, the CUB allocator suffers from very high fragmentation. This is primarily because it is a "buddy" allocator which neither splits nor merges free cached blocks. Study https://github.com/NVlabs/cub/blob/1.8.0/cub/util_allocator.cuh#L357 if you want to convince yourself. This diff adapts a caching allocator from the Torch codebase https://github.com/torch/cutorch/blob/master/lib/THC/THCCachingAllocator.cpp which does splitting and merging and ends up working really well, at least for workloads like the checkpoint-and-recompute computation models noted above. I simplified the implementation a little bit, made it a bit more C++-like. I also removed a bunch of stream synchronization primitives for this diff. 
I plan to add them back in subsequent diffs. * Report reader progress in fblearner workflows Integrate with fblearner progress reporting API and add support to report training progress from reader nodes. If reader is constructed with batch limits, report based on finished batch vs total batch. The finished batch may be more than total batch because we evaluate if we should stop processing every time we dequeue a split. If no limit for the reader, report based on finished splits (Hive files) vs total splits. This is fairly accurate. * [GanH][Diagnose]: fix plotting 1. ganh diagnose needs to set plot options 2. modifier's blob name is used for metric field can need to be fixed before generating net * Automatic update of fbcode/onnx to 985af3f5a0f7e7d29bc0ee6b13047e7ead9c90c8 * Make CompositeReader stops as soon as one reader finishes Previously, CompositeReader calls all readers before stopping. It results in flaky test since the last batch may be read by different threads; resulting in dropped data. * [dper] make sure loss is not nan as desc. * [rosetta2] [mobile-vision] Option to export NHWC order for RoIWarp/RoIAlign Thanks for finding this @stzpz and @wangyanghan. Looks like NHWC is more optimized. For OCR though it doesn't yet help since NHWC uses more mem b/w but will soon become important. * Intra-op parallel FC operator Intra-op parallel FC operator * [C2 Proto] extra info in device option passing extra information in device option design doc: https://fb.quip.com/yAiuAXkRXZGx * Unregister MKL fallbacks for NCHW conversions * Tracing for more executors Modified Tracer to work with other executors and add more tracing * Remove ShiftActivationDevices() * Check for blob entry iff it is present When processing the placeholders ops, ignore if the blob is not present in the blob_to_device. * Internalize use of eigen tensor Move use of eigen tensor out of the header file so we don't get template partial specialization errors when building other libraries. 
* feature importance for transformed features. * - Fix unused parameter warnings The changes in this diff comments out unused parameters. This will allow us to enable -Wunused-parameter as error. #accept2ship * add opencv dependencies to caffe2 The video input op requires additional opencv packages. This is to add them to cmake so that it can build * Add clip_by_value option in gradient clipping Add clip_by_value option in gradient clipping when the value is bigger than max or smaller than min, do the clip * std::round compat	2018-04-17 23:36:40 -07:00
Orion Reblitz-Richardson	1d5780d42c	Remove Apache headers from source. * LICENSE file contains details, so removing from individual source files.	2018-03-27 13:10:18 -07:00
Kutta Srinivasan	0ee53bf7fe	Fix one more naming issue in resnet50_trainer.py for PR 2205	2018-03-09 13:51:42 -08:00
Kutta Srinivasan	ed05ca9fec	Clean up naming of FP16-related code, add comments	2018-03-09 13:51:42 -08:00
Lukasz Wesolowski	29a4c942fe	Add support for multi-device batch normalization through an option to data_parallel_model Summary: Stage 3 in stack of diffs for supporting multi-device batch normalization. Adds input parameter to data_parallel_model to enable multi-device batch normalization. Depends on D6699258. Reviewed By: pietern Differential Revision: D6700387 fbshipit-source-id: 24ed62915483fa4da9b1760eec0c1ab9a64b94f8	2018-01-24 13:24:06 -08:00
Anirban Roychowdhury	158e001238	Checking for positive epoch size before running epoch Summary: Checking for positive epoch size before running epoch Reviewed By: pietern Differential Revision: D6738966 fbshipit-source-id: 64e1fb461d784786b20a316999e4c037787f3a14	2018-01-18 11:48:35 -08:00
Aapo Kyrola	2caca70a37	Allow shifting of activations / ops to other GPUs in data parallel model Summary: (Work in progress). This diff will allow shifting of activations to other GPUs, in case the model does not fit into memory. To see the API, check the code in data_parallel_model_test, which tests shifting two activations from 0 and 1 to gpu 4, and from gpu 2 and 3 to gpu 5. I will need to further test on ResNets, and probably add copy operations to handle device change points. Reviewed By: asaadaldien Differential Revision: D5591674 fbshipit-source-id: eb12d23651a56d64fa4db91090c6474218705270	2017-11-29 21:17:00 -08:00
Aapo Kyrola	b71cebb11f	Fix LoadModel() in resnet50_trainer Summary: resnet50 trainer will save the 'optimizer_iteration' blob in checkpoints, but loads it in GPU context. This fails because AtomicIter/Iter expect the blob to be in CPU context. So manually reset the optimizer_iteration in CPU context. I am thinking of making the iter-operators automatically do this switch, but in the meantime this unbreaks the trainer. Reviewed By: sf-wind Differential Revision: D6232626 fbshipit-source-id: da7c183a87803e008f94c86b6574b879c3b76438	2017-11-03 11:15:25 -07:00
Aapo Kyrola	b5c053b1c4	fix fp16 issues with resnet trainer Summary: My commit bab5bc broke things with fp16 compute, as I had tested it only with the null-input, that actually produced fp32 data (even dtype was given as float16). Also, I had confused the concepts of "float16 compute" and fp16 data. Issue #1408. This fixes those issues, tested with both Volta and M40 GPUs. Basically restored much of the previous code and fixed the null input to do FloatToHalf. Reviewed By: pietern Differential Revision: D6211849 fbshipit-source-id: 5b41cffdd605f61a438a4c34c56972ede9eee28e	2017-11-01 13:30:08 -07:00
Aapo Kyrola	669ec0ccba	Added FP16 compute support to FC Op Summary: Allow the GEMMs in the FC/FCGradient Op to do FP16 compute instead of FP32 if the appropriate op flag is set. Reviewed By: asaadaldien Differential Revision: D5839777 fbshipit-source-id: 8051daedadf72bf56c298c1cf830b019b7019f43	2017-10-30 17:03:51 -07:00
Aapo Kyrola	1b71bf1d36	Updated resnet50_trainer and resnet for more FP16 support Summary: Added FP16SgdOptimizer to resnet50_trainer Reviewed By: wesolwsk Differential Revision: D5841408 fbshipit-source-id: 3c8c0709fcd115377c13ee58d5bb35f1f83a7105	2017-10-24 09:19:06 -07:00
Luke Yeager	ee143d31ef	Fix ImageInput op in resnet50_trainer.py Summary: Fix #1269 (from fa0fcd4053dd42a4ec3a2a12085662179f0e11df). Closes https://github.com/caffe2/caffe2/pull/1314 Reviewed By: bwasti Differential Revision: D6021171 Pulled By: bddppq fbshipit-source-id: 7d7c45f8b997c25f34530f826729d700a9c522d4	2017-10-10 11:20:52 -07:00
Junjie Bai	d894a6362f	Add missing is_test argument in ImageInput ops Summary: reported in Github Issue https://github.com/caffe2/caffe2/issues/1269 Reviewed By: salexspb Differential Revision: D6004461 fbshipit-source-id: 03f4bccfe085010b30109ab7b6fe7325caa160ef	2017-10-10 10:03:13 -07:00
Yangqing Jia	8286ce1e3a	Re-license to Apache Summary: Closes https://github.com/caffe2/caffe2/pull/1260 Differential Revision: D5906739 Pulled By: Yangqing fbshipit-source-id: e482ba9ba60b5337d9165f28f7ec68d4518a0902	2017-09-28 16:22:00 -07:00
Aapo Kyrola	9ec981b866	for CPU-data parallel, allow sharing model Summary: On CPU, no need to replicate parameters. So try using only one copy (cpu_0) for parameters. Made resnet50_trainer use shared model in cpu mode. Reviewed By: wesolwsk Differential Revision: D5812181 fbshipit-source-id: 93254733edbc4a62bd74a629a68f5fa23f7e96ea	2017-09-15 16:19:37 -07:00
Pieter Noordhuis	27dde63358	Allow run of example resnet50_trainer without training data Summary: This is useful for pure throughput tests where we don't care about training a real model. Reviewed By: akyrola Differential Revision: D5834293 fbshipit-source-id: dab528c9269fb713e6f6b42457966219c06e0a35	2017-09-15 09:45:11 -07:00
Aapo Kyrola	1e37145872	Resnet50 should param init net before creating test net Summary: Otherwise weights, biases are not created and test creation fails Reviewed By: gsethi523 Differential Revision: D5836438 fbshipit-source-id: 32a75313b6b9ebecbfaa43ebd39f19c8eaba8cd1	2017-09-14 16:06:01 -07:00
Pieter Noordhuis	d43ab4bec5	Create Gloo common world through MPI rendezvous Summary: Before this change there were two ways for machines to rendezvous for a distributed run: shared file system or Redis. If you're using an MPI cluster it is much more convenient to simply execute mpirun and expect the "right thing (tm)" to happen. This change adds the "mpi_rendezvous" option to the CreateCommonWorld operator. If this is set, the common world size and rank will be pulled from the MPI context and Gloo rendezvous takes place using MPI. Note that this does NOT mean the MPI BTL is used; MPI is only used for rendezvous. Closes https://github.com/caffe2/caffe2/pull/1190 Reviewed By: akyrola Differential Revision: D5796060 Pulled By: pietern fbshipit-source-id: f8276908d3f3afef2ac88594ad377e38c17d0226	2017-09-08 17:18:47 -07:00
Pieter Noordhuis	b8eb8ced7d	Add transport/interface arguments to CreateCommonWorld operator Summary: These arguments control which Gloo transport (TCP or IB) and which network interface is used for the common world. If not specified, it defaults to using TCP and the network interface for the IP that the machine's hostname resolves to. The valid values for the transport argument are "tcp" and "ibverbs". For ibverbs to work, Gloo must have been compiled with ibverbs support. If Gloo is built as part of Caffe2 (sourced from the third_party directory), then you can pass -DUSE_IBVERBS=ON to CMake to enable ibverbs support in Gloo. Closes https://github.com/caffe2/caffe2/pull/1177 Reviewed By: akyrola Differential Revision: D5789729 Pulled By: pietern fbshipit-source-id: 0dea1a115c729e54c5c1f9fdd5fb29c14a834a82	2017-09-08 10:57:41 -07:00
Luke Yeager	f7ece79949	Add fp16 and tensorcore support to resnet50_trainer Summary: Use like `--dtype=float16 --enable-tensor-core` Closes https://github.com/caffe2/caffe2/pull/1093 Differential Revision: D5634840 Pulled By: harouwu fbshipit-source-id: 18c1e70236ba5ef8661ff55fb524caae1be19310	2017-08-17 15:16:24 -07:00
Luke Yeager	f92fdd850d	Important typo in resnet50_trainer Summary: Closes https://github.com/caffe2/caffe2/pull/1092 Reviewed By: Yangqing Differential Revision: D5637489 Pulled By: harouwu fbshipit-source-id: 13609a3e14a45e640849268821fd8565fd7aae4d	2017-08-15 19:03:15 -07:00
Pieter Noordhuis	d177846dbf	Add prefix argument to FileStoreHandler Summary: This brings it up to par with how the RedisStoreHandler works. The store handler configuration does not have to change and only the run ID parameter changes across runs. This was inconsistent and came up in https://github.com/caffe2/caffe2/issues/984. Reviewed By: Yangqing Differential Revision: D5539299 fbshipit-source-id: 3b5f31c6549b46c24bbd70ebc0bec150eac8b76c	2017-08-03 10:37:26 -07:00
Aapo Kyrola	26645154bb	warn about using test/val model with init_params=True + fixed some cases Summary: It is common mistake to create test/validation model with init_params=True. When its param_init_net is run, it will overwrite training models' params, and with DPM, those won't be synchronized to all GPUs. I don't want to make this an assertion yet, since it might break people's trainers (it is ok to have init_params=True if you never run the param_init_net...). Reviewed By: asaadaldien Differential Revision: D5509963 fbshipit-source-id: 63b1a16ec0af96e3790e226850f6e0e64689143f	2017-07-27 13:20:27 -07:00
Ahmed Taei	804ebf7c41	Populate learning rate blob name into data_parallel_model and fix resnet50_trainer example. Reviewed By: akyrola Differential Revision: D5463772 fbshipit-source-id: 10b8963af778503a3de6edbabb869747bd1e986d	2017-07-21 16:24:10 -07:00
Hyungsuk Kang	ca2b608f83	Fixed typo Summary: peaces -> pieces, peace -> piece Closes https://github.com/caffe2/caffe2/pull/819 Differential Revision: D5312417 Pulled By: aaronmarkham fbshipit-source-id: 59d2c3f475197a5f29dc7cf3ecaf675a242d3cdf	2017-06-23 14:02:40 -07:00
Aapo Kyrola	34eaa19d27	CPU data parallel model Summary: CPU version of data parallel model. Great thing is that now we can run data_parallel_model_test in Sandcastle (as it does not have GPUs). Pretty simple change, really. I did not change all variable names with "gpu" in them, to reduce risk (and being a bit lazy). Can improve later. Reviewed By: wesolwsk Differential Revision: D5277350 fbshipit-source-id: 682e0c5f9f4ce94a8f5bd089905b0f8268bd2210	2017-06-20 23:19:08 -07:00
Xiaoti Hu	969831ea33	Deprecate CNNModelHelper in lmdb_create_example Reviewed By: akyrola Differential Revision: D5233793 fbshipit-source-id: bae745791f071bc36fd45bd81145ce86c8ba9ed0	2017-06-19 13:04:02 -07:00
Aapo Kyrola	feba1eed00	resnet50: fetch right lr Summary: I broke resnet50 when switching to use optimizer, which uses LR per parameter. This only happens after each epoch, and I did not test patiently enough. For a stop-gap, while asaadaldien works on a better solution, just fetch the lr of a conv1_w param. Reviewed By: asaadaldien Differential Revision: D5207552 fbshipit-source-id: f3474cd5eb0e291a59880e2834375491883fddfc	2017-06-07 21:46:35 -07:00
Thomas Dudziak	60c78d6160	Fixes range/xrange for Python 3 Summary: As title Differential Revision: D5151894 fbshipit-source-id: 7badce5d3122e8f2526a7170fbdcf0d0b66e2638	2017-06-07 00:04:26 -07:00
Xiangyu Wang	c9c862fa8f	16117716 [Caffe2 OSS] make char-rnn example use build_sgd Summary: replace hand made sgd with build_sgd Reviewed By: salexspb Differential Revision: D5186331 fbshipit-source-id: 3c7b4b370e29a1344b95819766463bae3812c9a6	2017-06-06 13:54:59 -07:00
Aapo Kyrola	401908d570	add_weight_decay + restore weight decay to resnet50_trainer Summary: Add add_weight_decay to optimizer + test. In D5142973 I accidentally removed weight decay from resnet50 trainer, so this restores it. Reviewed By: asaadaldien Differential Revision: D5173594 fbshipit-source-id: c736d8955eddff151632ae6be11afde0883f7531	2017-06-02 14:16:56 -07:00
Aapo Kyrola	cdb50fbf2b	add optimizer support to data_parallel_model; Use MomentumSGDUpdate Summary: This diff does two things: - add support for optimizer to data_parallel_model. User can supply optimizer_builder_fun instead of param_update_builder_fun. The latter is called for each GPU separately with proper namescope and devicescope, while optimizer builder only is called once and adds optimizers to the whole model. - use MomentumSGDUpdate instead of MomentumSGD + WeightedSum. This brings major perf benefits. Changes resnet50 trainer to use optimizer. This relies on D5133652 Reviewed By: dzhulgakov Differential Revision: D5142973 fbshipit-source-id: 98e1114f5fae6c657314b3296841ae2dad0dc0e2	2017-05-30 12:49:57 -07:00
Aapo Kyrola	0af0cba2b7	Refactor data_parallel_model initial sync and checkpointing Summary: Major improvements. Before we only synced "params" and "computed params" of model after initialization and after loading a checkpoint. But actually we want to sync all blobs that are generated in the param_init_net. For example the _momentum blobs were missed by the previous implementation and had to be manually included in checkpoint finalization. I also added GetCheckpointParams() to data_parallel_model because it is now fully general. Also added a unit test. Reviewed By: andrewwdye Differential Revision: D5093689 fbshipit-source-id: 8154ded0c73cd6a0f54ee024dc5f2c6826ed7e42	2017-05-19 12:48:06 -07:00
Yiming Wu	a28b01c155	rnn with brew Summary: Update rnn_cell.py and char_rnn.py example with new `brew` model. - Deprecated CNNModelHelper - replace all helper functions with brew helper functions - Use `model.net.<SingleOp>` format to create bare bone Operator for better clarity. Reviewed By: salexspb Differential Revision: D5062963 fbshipit-source-id: 254f7b9059a29621027d2b09e932f3f81db2e0ce	2017-05-16 13:33:44 -07:00
Yiming Wu	64d43dbb6e	new resnet building with brew Summary: new resnet building with brew Reviewed By: akyrola Differential Revision: D4945418 fbshipit-source-id: d90463834cbba2c35d625053ba8812e192df0adf	2017-05-15 22:47:24 -07:00
Heng Wang	8a2433eacb	Add model saving and loading to resnet50_trainer.py Summary: Script caffe2/caffe2/python/examples/resnet50_trainer.py can be used to train a ResNet-50 model with Imagenet data (or similar). However, currently the script does not actually save the model, so it is kind of useless. Task 1: After each Epoch, save the model in a file "<filename>_X.mdl' where X is the epoch number and <filename> is given as a command line parameter. By default, use "resnet50_model" as filename. Task 2: Add a functionality to restore the model from a previous file: - add a command line parameter "load_model", which user can use to specify a filename. - if this parameter is set, load the model parameters from the previous file Reviewed By: prigoyal Differential Revision: D4984340 fbshipit-source-id: 333e92679ba52a7effe9917fdfc2d55d652b868f	2017-05-05 10:08:37 -07:00
Yury Zemlyanskiy	31643d5ecb	Inference code for seq2seq model Summary: Beam search implementation Differential Revision: D4975939 fbshipit-source-id: 67d8b73390221583f36b4367f23626a2aa80f4b4	2017-05-02 22:47:28 -07:00
Yiming Wu	885f906e67	resnet train print loss and accuracy Summary: printing resnet training loss and accuracy for each batch so that people will have better idea of what is going on Reviewed By: pietern Differential Revision: D4945390 fbshipit-source-id: 0fcd60f4735e81641355aba6e6cbf0e57e886e38	2017-04-25 16:03:58 -07:00
Jay Mahadeokar	4dafb608e7	Fix char_rnn LSTM import Summary: Fix for char_rnn.py with latest LSTM changes in rFBS779c69758cee8caca6f36bc507e3ea0566f7652a. Fixed some linting issues. Reviewed By: salexspb Differential Revision: D4927018 fbshipit-source-id: cda760a170056b8bc237b4c565cc34800992c8e0	2017-04-20 22:46:19 -07:00
inspire99	f750a2d2df	fix a few typos Summary: fix typo: Dimention, probablity Closes https://github.com/caffe2/caffe2/pull/310 Differential Revision: D4915798 Pulled By: Yangqing fbshipit-source-id: 3a16d3adc469c9930ce0dad8584c4678b3c3b5c0	2017-04-19 13:31:33 -07:00
Yury Zemlyanskiy	4bf559eddb	RNNCell, LSTMCell, LSTMWithAttentionCell Summary: This is the nice way to re-use RNN layers for training and for inference. Reviewed By: salexspb Differential Revision: D4825894 fbshipit-source-id: 779c69758cee8caca6f36bc507e3ea0566f7652a	2017-04-18 00:47:20 -07:00
Pieter Noordhuis	8c9f4d8c3b	Add throughput information to resnet50_trainer Summary: TSIA Makes it easier for throughput debugging. Differential Revision: D4879634 fbshipit-source-id: 8d479d51b0ec51ad3d86ad5500fc3095400cf095	2017-04-12 17:46:14 -07:00
Pieter Noordhuis	c907c7c7dc	Update resnet50_trainer example Summary: A few fixes in this commit: the epoch size is now rounded down to the closest integer multiple of the global batch size (batch per GPU * GPUs per hosts * hosts per run). The num_shards and shard_id parameters are now passed to CreateDB so multiple processes actually train on different subsets of data. The LR step size is scaled by the number of hosts in the run. The test accuracy is only determined after each epoch instead of after every so many iterations. Differential Revision: D4871505 fbshipit-source-id: d2703dc7cf1e0f76710d9d7c09cd362a42fe0598	2017-04-12 14:03:51 -07:00
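
The epoch-size rounding described in commit c907c7c7dc can be illustrated with a minimal sketch. The function and parameter names below are hypothetical and are not taken from resnet50_trainer.py; the sketch only assumes the rule stated in the commit message, namely that the global batch size is batch per GPU * GPUs per host * hosts per run, and that the epoch size is rounded down to an integer multiple of it.

```python
def round_epoch_size(epoch_size, batch_per_gpu, gpus_per_host, num_hosts):
    """Round epoch_size down to a multiple of the global batch size.

    Hypothetical helper for illustration only; the names are assumptions
    and do not come from the trainer script itself.
    """
    # Global batch = batch per GPU * GPUs per host * hosts per run.
    global_batch = batch_per_gpu * gpus_per_host * num_hosts
    if global_batch <= 0:
        raise ValueError("global batch size must be positive")
    # Integer division drops the partial batch at the end of the epoch.
    return (epoch_size // global_batch) * global_batch


if __name__ == "__main__":
    # Example values chosen for illustration: 32 images per GPU,
    # 8 GPUs per host, 2 hosts, so the global batch is 512.
    print(round_epoch_size(1281167, 32, 8, 2))
```

With this rounding, every iteration in an epoch processes a full global batch, so all hosts run the same number of iterations per epoch.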

73 Commits