Summary:
See discussion at https://github.com/caffe2/caffe2/pull/633#issuecomment-303536902
Tested with a TITAN X (Pascal) and a TITAN Z (Kepler, a dual-GPU card that shows up as GPU1 and GPU2) with this access pattern:
```
Checking GPU(s) for support of peer to peer memory access...
> Peer access from TITAN X (Pascal) (GPU0) -> GeForce GTX TITAN Z (GPU1) : No
> Peer access from TITAN X (Pascal) (GPU0) -> GeForce GTX TITAN Z (GPU2) : No
> Peer access from GeForce GTX TITAN Z (GPU1) -> TITAN X (Pascal) (GPU0) : No
> Peer access from GeForce GTX TITAN Z (GPU1) -> GeForce GTX TITAN Z (GPU2) : Yes
> Peer access from GeForce GTX TITAN Z (GPU2) -> TITAN X (Pascal) (GPU0) : No
> Peer access from GeForce GTX TITAN Z (GPU2) -> GeForce GTX TITAN Z (GPU1) : Yes
```
All combinations pass:
* `0,1`
* `0,2`
* `1,2`
* `0,1,2`
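The pattern above can also be queried programmatically. A minimal sketch, assuming `workspace.GetCudaPeerAccessPattern()` is available in `caffe2.python.workspace` and returns an NxN boolean matrix (entry [i][j] is True iff GPU i can peer-access GPU j); the device group below is just an example:
```
from caffe2.python import workspace

# NxN matrix of peer-access capability between the visible GPUs.
pattern = workspace.GetCudaPeerAccessPattern()
print(pattern)

# Check whether a candidate device group has full peer access within itself.
devices = [1, 2]
has_peer_access = all(
    pattern[i][j] for i in devices for j in devices if i != j
)
print("full peer access within", devices, ":", has_peer_access)
```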
Closes https://github.com/caffe2/caffe2/pull/659
Differential Revision: D5148779
Pulled By: akyrola
fbshipit-source-id: 6263edfe8b36623983f1946b5c3f4a3fef415a45
Summary: When two or more blobs are gathered by the same indices blob in a data parallel model, we used to concatenate multiple times and re-write to the same indices blob. This sometimes leads to illegal memory access, because the gradient-slice indices blob ends up longer than its corresponding gradient-slice values blob. This diff adds a check to avoid that.
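For illustration, a minimal sketch (not from this diff; blob names are hypothetical) of the pattern that triggers the issue, i.e. two lookup tables gathered with the same indices blob:
```
from caffe2.python import model_helper

model = model_helper.ModelHelper(name="shared_indices")
# Two tables gathered with the *same* indices blob; on the backward pass their
# gradient slices end up sharing one indices blob.
rows_a = model.net.Gather(["table_a", "indices"], "rows_a")
rows_b = model.net.Gather(["table_b", "indices"], "rows_b")
```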
Reviewed By: akyrola
Differential Revision: D5116817
fbshipit-source-id: 1c086d092eb6d48926d600f9408f578f5ddc41c7
Summary:
Address KaimingHe's comments in D5093689 about the same blob being initialized twice, which caused the internal consistency check to fail. I also noticed that my new test for test_checkpoint_params was completely botched due to an indentation issue (it did not actually execute any test), so this fixes that as well.
Modified the test to add a duplicate param initializer, so that this bug is tested for.
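A rough sketch (illustrative names, not the actual test) of the duplicate-initializer case that the updated test now covers:
```
from caffe2.python import model_helper

model = model_helper.ModelHelper(name="dup_init")
# The same blob initialized twice in param_init_net used to trip the
# internal consistency check when syncing/checkpointing params.
model.param_init_net.ConstantFill([], "w", shape=[10], value=0.0)
model.param_init_net.ConstantFill([], "w", shape=[10], value=0.0)  # duplicate
```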
Reviewed By: KaimingHe
Differential Revision: D5101304
fbshipit-source-id: 72f343035c1b4953e7bb9a1a1c171cf05d3ead26
Summary:
Major improvements. Previously we only synced the "params" and "computed params" of the model after initialization and after loading a checkpoint, but we actually want to sync all blobs that are generated in the param_init_net. For example, the _momentum blobs were missed by the previous implementation and had to be manually included in checkpoint finalization.
I also added GetCheckpointParams() to data_parallel_model, since it is now fully general, and added a unit test.
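A minimal usage sketch, assuming the new GetCheckpointParams() takes the parallelized model and returns the full set of blobs to checkpoint (including param_init_net outputs such as the _momentum blobs):
```
from caffe2.python import data_parallel_model

# train_model is assumed to have been built and parallelized with
# data_parallel_model.Parallelize_GPU(...) beforehand.
checkpoint_blobs = data_parallel_model.GetCheckpointParams(train_model)
for blob in sorted(str(b) for b in checkpoint_blobs):
    print(blob)
```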
Reviewed By: andrewwdye
Differential Revision: D5093689
fbshipit-source-id: 8154ded0c73cd6a0f54ee024dc5f2c6826ed7e42
Summary:
A single-machine, multi-GPU version of the BMUF (Block-wise Model Update Filtering) algorithm. BMUF is a modification of model averaging in which the update to the global model is implemented as a filter:
param_t = param_(t-1) + delta_t
delta_t = \beta * delta_(t-1) + \alpha * (average(param_t) - param_(t-1))
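An illustrative NumPy sketch of this update rule (not the diff's code; the alpha/beta defaults below are arbitrary):
```
import numpy as np

def bmuf_update(global_param, delta, local_params, alpha=1.0, beta=0.875):
    # One block update: filter the block-level change to the global model.
    block_avg = np.mean(local_params, axis=0)              # average(param_t)
    delta = beta * delta + alpha * (block_avg - global_param)
    global_param = global_param + delta                    # param_t = param_(t-1) + delta_t
    return global_param, delta
```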
Reviewed By: akyrola
Differential Revision: D4995057
fbshipit-source-id: 48176ba66d67eaf3fa4dee16d50d9589825ddba4
Summary: This provides a clean way to re-use RNN layers for both training and inference.
Reviewed By: salexspb
Differential Revision: D4825894
fbshipit-source-id: 779c69758cee8caca6f36bc507e3ea0566f7652a
Summary: This diff enables sparse gradient synchronization between GPUs. The test case is currently a bit convoluted, but once D4871680 lands, we can simplify it.
Reviewed By: dzhulgakov
Differential Revision: D4877087
fbshipit-source-id: 37bbb07051cbaf3a6e3c54b0eead97f3e02337d5
Summary:
This diff enables memonger support for recurrent networks:
1. Memonger descends into the step nets and renames the blobs accordingly.
2. Memonger tells the gradient op about the renamed blobs by adding a parameter "paramname.renamed=<new name>".
3. RecurrentNetworkGradientOp applies the remapping to links and gradient blobs.
I first considered refactoring the whole gradient-blob management of the recurrent network, but that looks very hard to do without a major rewrite of the code.
Note: I did not enable memonger for neural_mt, since I think the team should do more testing before enabling this.
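A rough usage sketch of applying memonger to a training net that contains recurrent ops (the signature of memonger.share_grad_blobs, the "gpu_0" name scope, and the loss blob name are assumed here):
```
from caffe2.python import memonger

# train_model is assumed to be a parallelized model whose net includes
# RecurrentNetwork / RecurrentNetworkGradient ops.
optimized_proto = memonger.share_grad_blobs(
    train_model.net,
    losses=["gpu_0/loss"],
    param_grads=set(str(g) for g in train_model.param_to_grad.values()),
    namescope="gpu_0",
)
train_model.net._net = optimized_proto  # swap in the memory-optimized net
```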
Reviewed By: salexspb
Differential Revision: D4812823
fbshipit-source-id: 1ffdf3cfb4fcd00eec5bb0ece3bf416aa6d3e26b
Summary:
First, this diff includes a full test of data-parallel LSTM, which confirms it works correctly. To make it work, some changes had to be made:
- cell-net / step-net external inputs must be namespace-scoped (see the scoping sketch after this list)
- prevent double-namescoping of cell-net inputs
- make data_parallel_model understand recurrent nets so the device mapping works
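A small sketch of the scoping convention involved (blob names are illustrative): data_parallel_model builds each replica inside a device name scope, so step-net external inputs must carry the same prefix for the device mapping to resolve.
```
from caffe2.python import core

with core.NameScope("gpu_0"):
    hidden_init = core.ScopedBlobReference("hidden_init")

print(hidden_init)  # gpu_0/hidden_init
```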
Reviewed By: salexspb
Differential Revision: D4708840
fbshipit-source-id: 4b0ddc43642d449076a2b6f67ad1c47f84138ff4
Summary:
Data parallel model did not support sparse operations, nor gradients computed by CPU ops.
Currently sparse operations are done on the CPU, so there is no point in "data parallelizing" them. I had to make a few changes to data_parallel_model to support this:
1. The model can have params that are added prior to adding the data-parallel part. For example, a lookup table of word vectors would be a non-parallel parameter.
2. Thus, when the data parallel model is built, it separates out the non-parallel params and avoids working on them. Note: when we add the distributed version, we need to handle them explicitly with AllGather!
This works nicely since Caffe2 automatically adds the backward concat operator when multiple ops gather from the same blob.
I also added support for data-parallel CPU ops, which might be necessary when we don't have a GPU implementation of some op.
The test in data_parallel_model_test validates the correctness of the code by running the same trainer on different numbers of GPUs and checking that the end result is the same.
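A simplified sketch of the intended usage pattern (builder bodies are omitted, and the Parallelize_GPU argument names input_builder_fun / forward_pass_builder_fun / param_update_builder_fun / devices are assumed): a sparse/CPU parameter such as a lookup table can be created before the call and is then treated as a non-parallel param.
```
from caffe2.python import cnn, data_parallel_model

model = cnn.CNNModelHelper(name="sparse_trainer")

def add_input(model):
    pass  # add readers / data ops here

def add_forward(model, loss_scale):
    # build the net; sparse lookups (Gather from a shared CPU table) are not
    # replicated per GPU, only the dense part is
    return []  # return the list of losses, scaled by loss_scale

def add_param_update(model):
    pass  # add optimizer ops for the parallel (per-GPU) params

data_parallel_model.Parallelize_GPU(
    model,
    input_builder_fun=add_input,
    forward_pass_builder_fun=add_forward,
    param_update_builder_fun=add_param_update,
    devices=[0, 1],
)
```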
Reviewed By: jhcross
Differential Revision: D4649208
fbshipit-source-id: e3b7ae701ead468dc94c52a976eafec5c9831097
Summary:
As per the discussion in https://www.prod.facebook.com/groups/184236721951559/permalink/354591931582703/, KaimingHe pointed out that scaling the LR is not the same as scaling the loss, since LR scaling also affects weight decay (which is implemented by modifying the gradient, which thus is not yet correctly 'averaged'). Actually, prigoyal tried to convince me earlier that loss scaling is the way to go, but I was not convinced at the time :/.
So this diff removes the LR-scaling parameter passed by data_parallel_model and instead passes a loss_scale parameter to the model creation function. Unfortunately, this will break all existing code that uses the data parallel model. But that is not entirely a bad thing, since it will bring awareness to this change. I will announce it in the FB groups.
In this diff I modified all my models to work correctly.
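A hedged sketch of how the new loss_scale argument might be consumed inside a forward-pass builder (blob names are illustrative, not from this diff):
```
def add_forward_pass(model, loss_scale):
    # ... build the forward net up to an unscaled loss blob ...
    loss = model.net.Scale("unscaled_loss", "loss", scale=loss_scale)
    return [loss]
```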
Reviewed By: Yangqing
Differential Revision: D4507002
fbshipit-source-id: 16c7221663282f71a1b754b34de0c8ccd5c2ca90
Summary:
When refactoring data parallel model, the division of the LR by the number of devices was dropped, so we ended up effectively multiplying the gradients by the number of devices. Thus, we need to scale the LR by 1/num_gpus.
Created a test to confirm that data_parallel_model produces exactly the same results on different numbers of GPUs, given the same total batch size.
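Back-of-the-envelope arithmetic for the scaling (illustrative numbers only): if per-device gradients are summed rather than averaged during the reduction, dividing the LR by the device count restores the single-GPU step size.
```
num_gpus = 4
base_lr = 0.1
grads = [0.5, 0.5, 0.5, 0.5]                 # per-device gradients for one weight
summed = sum(grads)                          # what the cross-GPU reduction produces
step_wanted = base_lr * 0.5                  # the single-GPU step for that weight
step_scaled = (base_lr / num_gpus) * summed  # step with LR scaled by 1/num_gpus
assert abs(step_wanted - step_scaled) < 1e-12
```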
Reviewed By: prigoyal
Differential Revision: D4248907
fbshipit-source-id: af21ede113e6ac25f12c556de298cb18974548be