Summary:
Added optional support for sharing activation blobs as well. Making this change revealed a non-optimal implementation in the blob sharing: when reusing free blobs, we should prefer those blobs that are already shared by many other blobs. Otherwise memory usage can increase as the pool of 'free blobs' grows.
Also, my first version passed "free blobs" (i.e. blobs in the recycling pool) only down the first branch when operators forked. Now we pass the blobs that were not used by the first branch down the second branch, and so on.
Also added support for blob size information in the heuristic, using the shape inference mechanism.
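The reuse heuristic above can be sketched in plain Python. This is an illustrative stand-in, not the actual memonger code; all names and structures here are hypothetical. The idea: among the free blobs, prefer one whose inferred size fits and that already has the most other blobs mapped onto it, so the pool of distinct buffers stays small.

```python
def pick_free_blob(free_blobs, share_counts, needed_size, blob_sizes):
    """Return the best free blob to reuse, or None if the pool is empty.

    free_blobs:   set of blob names currently unused (recycling pool)
    share_counts: dict blob -> number of blobs already sharing its buffer
    needed_size:  required size from shape inference, or None if unknown
    blob_sizes:   dict blob -> known size (unknown blobs are absent)
    """
    best, best_key = None, None
    for b in free_blobs:
        size = blob_sizes.get(b)
        # Prefer blobs whose inferred size fits; then prefer the most-shared.
        fits = size is not None and needed_size is not None and size >= needed_size
        key = (fits, share_counts.get(b, 0))
        if best_key is None or key > best_key:
            best, best_key = b, key
    return best
```

For example, with two fitting candidates, the one already shared three times wins over the one shared once, so a new request does not spread across the pool.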
I also had to make some small tweaks:
- use the Sum() operator to match shapes of blobs whose shapes were otherwise unknown. This relates to the Sum() operator that is added to combine multiple incoming gradient inputs (with _autosplit gradients).
- a couple of random shape inference fixes
This reduces the Resnet-50 memory usage on 64 batch from 9.45 Gig to 8.5 Gig.
For a 32 batch, the memory usage is 4330 MiB, down from 4800 MiB, compared to Torch's 6856 MiB (thanks prigoyal for checking this for me).
This is unfortunately quite a lot to review...
Reviewed By: asaadaldien
Differential Revision: D4393909
fbshipit-source-id: 9c7c94125f96512bea80463ebcb63c215ef95ff9
Summary: Instead of requiring gradient updates on GPU, this change allows the case where loss computation happens on GPU while all gradient updates happen on CPU.
Reviewed By: jhcross
Differential Revision: D4943996
fbshipit-source-id: 1f2144c4277dfdb865877e0d0216ca1ac7dd7309
Summary:
rename ModelHelperBase to ModelHelper.
This is the result of running:
find . -type f -exec sed -i 's/ModelHelperBase/ModelHelper/g' {} +
fbgs ModelHelperBase gave 19 results; there are now 20 instances because I added one test in model_helpers_test.py.
Reviewed By: salexspb
Differential Revision: D4928337
fbshipit-source-id: bc4c12b60b90c167e717de50ea9fe17521e142e3
Summary: This may help tell different allreduce operations apart during debugging/tracing.
Reviewed By: prigoyal
Differential Revision: D4897921
fbshipit-source-id: bbb2ce02a3e1f467ad54f8a3aed6a4e2b26a9fe4
Summary:
The common worlds can be reused without performance impact as long as
there is a guarantee that no two algorithm instances are using it at
any given time. Since we know the ordering and the maximum
parallelism, we can cycle through common worlds, and reuse them
accordingly.
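The cycling scheme can be illustrated with a small sketch (hypothetical names, not the actual Gloo/Caffe2 code): with a known ordering and a maximum parallelism of K, operation i can safely use common world i mod K, since no two in-flight operations ever share one.

```python
class CommonWorldPool:
    """Round-robin reuse of at most `max_parallelism` common worlds."""

    def __init__(self, max_parallelism):
        self.max_parallelism = max_parallelism
        self.created = []  # lazily created common-world handles

    def common_world_for(self, op_index):
        slot = op_index % self.max_parallelism
        while len(self.created) <= slot:
            # Stand-in for creating a real communication context.
            self.created.append("common_world_%d" % len(self.created))
        return self.created[slot]
```

With max_parallelism=4, operations 0 and 5 map to slots 0 and 1, and only as many common worlds are created as are ever concurrently in use.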
Differential Revision: D4896779
fbshipit-source-id: 164e1727692eab904fa6879a9f91a3e8332a2e30
Summary: This diff enables sparse gradient synchronization between GPUs. The test case is now a bit too convoluted, but once D4871680 is landed, we can simplify it a bit.
Reviewed By: dzhulgakov
Differential Revision: D4877087
fbshipit-source-id: 37bbb07051cbaf3a6e3c54b0eead97f3e02337d5
Summary: To help dgponinath, and people in general: check that params don't have duplicate entries.
Differential Revision: D4872132
fbshipit-source-id: 1cca1237fda771eb270227f452ecae0f912d7a33
Summary: Adding synchronous optimization on GPUs to the translation training pipeline, via data_parallel_model.Parallelize_GPU, which needs to be updated so there is some way of performing sparse parameter updates (e.g., on embedding tables), whether on GPU or CPU.
Reviewed By: urikz
Differential Revision: D4631914
fbshipit-source-id: 9cdd655f7dbda3f9b2733d459228b3e097892441
Summary:
No longer need GPU to CPU copies. The allreduce operator no longer
uses 'local allreduce - global allreduce - local broadcast' sequence
when Gloo is used, but passes all input blobs directly.
Depends on D4708860.
Differential Revision: D4709897
fbshipit-source-id: 4d745d5d8bac9c2fcca081dd5d812c902808c3b6
Summary:
First, this diff includes a full test of data-parallel LSTM, which confirms it works correctly. To make it work, some changes had to be made:
- cell net / step net external inputs must be namespace-scoped
- prevent double-namescoping of cell net inputs
- make data parallel model understand recurrent nets so the device mapping works
Reviewed By: salexspb
Differential Revision: D4708840
fbshipit-source-id: 4b0ddc43642d449076a2b6f67ad1c47f84138ff4
Summary:
It has been a pain to save predictor-compatible models from Caffe2. This diff adds a function, ExtractPredictorNet, that takes a training model and outputs a predictor model by removing all operators that are not relevant for prediction, such as the backward pass and dequeue ops for input loading (in the predictor, the input data is an external input).
We can also consider including this directly in the predictor exporter for FB usage.
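A rough sketch of what an ExtractPredictorNet-style pass does, as a simplified stand-in (this is not the actual Caffe2 implementation; op tuples here are hypothetical): drop gradient ops and ops that produce the declared external inputs (e.g. dequeue ops), then prune backwards from the outputs.

```python
def extract_predictor_ops(ops, input_blobs, output_blobs):
    """ops: list of (op_type, inputs, outputs) tuples in execution order."""
    external = set(input_blobs)
    kept = []
    for op_type, ins, outs in ops:
        if op_type.endswith("Gradient"):          # backward-pass operator
            continue
        if any(o in external for o in outs):      # input-loading op (dequeue)
            continue
        kept.append((op_type, ins, outs))
    # Walk backwards: keep only ops that actually contribute to the outputs.
    needed = set(output_blobs)
    pruned = []
    for op_type, ins, outs in reversed(kept):
        if any(o in needed for o in outs):
            pruned.append((op_type, ins, outs))
            needed.update(ins)
    return list(reversed(pruned))
```

On a toy net with a dequeue op, a forward pass, and a gradient op, only the forward ops between the declared inputs and outputs survive.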
Reviewed By: rpenggithub
Differential Revision: D4693264
fbshipit-source-id: e81abbbec0bd4d717159cf36488d0baaf0130090
Summary: We computed the InferToDeviceMapping too early; it must also be done after running the parameter update function, since that can create new blobs such as the momentum blobs. This fix is maybe not optimal, but it works and is fast enough.
Differential Revision: D4693450
fbshipit-source-id: 4c4cc2396dad371b3fbcd1d8da51133ea09a57e0
Summary: Thanks to shenpan for detecting this bug. The problem is that FinalizeAfterCheckpoint() can be passed a list of strings, not blob references, and that fails in stripParam() after the assertion I added in D4649208. It is OK to pass strings to that function as well.
Reviewed By: jhcross
Differential Revision: D4691028
fbshipit-source-id: 0bca80d44a5ab641438cc5b26482bca0b1527d69
Summary:
Data parallel model did not support sparse operations, nor gradients computed on CPU ops.
Currently sparse operations are done on CPU, so there is no point in "data parallelizing" them. I had to make a few changes to data_parallel_model to support this:
1. Model can have params that are added prior to adding the data parallel part. For example, a lookup table of word vectors would be a parameter that is non-parallel.
2. Thus, when the data parallel model is called, it will separate out the non-parallel params and avoid working on them. Note: when we add the distributed version, we need to handle them explicitly with AllGather!
This works nicely since Caffe2 automatically adds the backward concat-operator when multiple ops gather from the same blob.
I also added support for data-parallel CPU ops, which might be necessary when we don't have a GPU implementation of some ops.
A test in data_parallel_model_test validates the correctness of the code by running the same trainer on different numbers of GPUs and checking that the end result is the same.
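The parallel / non-parallel split can be sketched using the device-scoped naming convention ('gpu_<id>/blob'); this is an illustrative fragment, not the actual data_parallel_model code:

```python
def split_params(params, num_gpus):
    """Split params into per-device replicas and non-parallel (shared) ones.

    Params scoped like 'gpu_0/fc_w' are per-device replicas; anything
    without a device prefix (e.g. a CPU-side embedding table) is treated
    as non-parallel and excluded from per-GPU gradient synchronization.
    """
    device_prefixes = tuple("gpu_%d/" % g for g in range(num_gpus))
    parallel = [p for p in params if p.startswith(device_prefixes)]
    non_parallel = [p for p in params if not p.startswith(device_prefixes)]
    return parallel, non_parallel
```

So a word-vector lookup table added before the data-parallel section stays out of the allreduce set, while the replicated fc weights stay in it.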
Reviewed By: jhcross
Differential Revision: D4649208
fbshipit-source-id: e3b7ae701ead468dc94c52a976eafec5c9831097
Summary:
Update data parallel model to default to using fbcollective.
Update broadcast op to correctly handle Tensor<long>.
Differential Revision: D4508029
fbshipit-source-id: 7b8d17223e25b3e1098ee3f2a08af61af140729e
Summary:
If num_shards = 1 and distributed training is on, ring reduce fails when it looks for the left pair to exchange information with.
I also used the opportunity to make a small fix in my data loader benchmark.
Differential Revision: D4513545
fbshipit-source-id: 7d3115b871a39b8ce7b55553394b607d16e08b74
Summary:
As per discussion in https://www.prod.facebook.com/groups/184236721951559/permalink/354591931582703/, KaimingHe pointed out that scaling the LR is not the same as scaling the loss, since LR scaling also affects the weight decay (which is implemented by modifying the gradient, and thus is not yet correctly 'averaged'). Actually, prigoyal tried to convince me earlier that loss scaling is the way to go, but I was not convinced then :/.
So this diff removes the LR scaling parameter passed by data_parallel_model and instead passes a loss_scale parameter to the model creation function. Unfortunately, this will break all existing code that uses the data parallel model. But that is not only a bad thing, since it will bring awareness to this change. I will post about it in the FB groups.
In this diff I modified all my models to work correctly.
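The shape of the new API can be sketched as follows (function names here are hypothetical): the model-creation function receives loss_scale and multiplies the loss by it, so with N devices and loss_scale = 1/N the gradients, including the weight-decay contribution, average correctly.

```python
def forward_pass_builder(add_loss_fn, loss_scale):
    """Build the forward pass, scaling the loss instead of the LR."""
    raw_loss = add_loss_fn()        # unscaled per-device loss
    return raw_loss * loss_scale    # scaled so summed gradients average

num_gpus = 4
# Each device scales its loss by 1/num_gpus before gradients are summed.
scaled = forward_pass_builder(lambda: 8.0, loss_scale=1.0 / num_gpus)
```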
Reviewed By: Yangqing
Differential Revision: D4507002
fbshipit-source-id: 16c7221663282f71a1b754b34de0c8ccd5c2ca90
Summary:
1. Use OpenCV for data augmentation, after benchmarking various image libraries in Python
2. Use CUDA no-bias conv
3. Use CUDA fastest conv (exhaustive search)
4. data_parallel_model had a few changes; syncing them
5. Propagate errors in threads to make debugging easy
Reviewed By: rbgirshick
Differential Revision: D4341422
fbshipit-source-id: aa4471a2f49dd6d7ca13879999b3c7ceaf818c1e
Summary: Data parallel model has a sanity check that ensures operator inputs/outputs do not cross device boundaries. This failed when the operator was a CPU-only operator (such as the new AccuracyOp version). This fixes that.
Reviewed By: prigoyal
Differential Revision: D4417841
fbshipit-source-id: 9bc4e7a2074a544ca4db69ecf24183bbd41f84ca
Summary:
This diff brings us roughly to par with Torch on ResNet memory usage. On batch size 32, Resnet-50 took 7497 MiB before and 5010 MiB after. This will thus allow us to handle 64 images/GPU, or 256 images on 4 GPUs.
In addition, I added a special argument to DagNet that causes it to run only one thread for the first iteration. This is needed since there are allocations on the first iteration's backward pass due to gradient sharing, and this will cause NCCL to deadlock.
The sharing of gradient buffers requires inferring which gradients can share memory (i.e. that they are not used concurrently). The previous memonger code used a topological sort, but rbgirshick showed that it does not work with tree-like models. Thus, I wrote a new optimization algorithm based on DFS. It takes about 0.25 secs/GPU on Resnet-50, so it is clearly fast enough.
Module data_parallel_model supports this feature natively.
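The core idea of the sharing analysis can be shown with a simplified stand-in: blobs whose live ranges (first producer to last consumer, in execution order) do not overlap can share one buffer. The real diff uses a DFS over the op graph; this linear-order sketch with hypothetical structures only illustrates the liveness-based assignment.

```python
def assign_shared_buffers(ops):
    """ops: list of (inputs, outputs) pairs; returns dict blob -> buffer id."""
    first, last = {}, {}
    for t, (ins, outs) in enumerate(ops):
        for b in outs:
            first.setdefault(b, t)      # birth: first time produced
        for b in ins + outs:
            last[b] = t                 # death: last time referenced
    free, in_use, assignment = [], {}, {}   # in_use: buffer -> release time
    next_buf = 0
    for b in sorted(first, key=lambda b: first[b]):
        # Release buffers whose owners died before this blob is born.
        for buf, until in list(in_use.items()):
            if until < first[b]:
                free.append(buf)
                del in_use[buf]
        buf = free.pop() if free else next_buf
        if buf == next_buf:
            next_buf += 1
        assignment[b] = buf
        in_use[buf] = last[b]
    return assignment
```

In a simple chain a -> b -> c, blob c can reuse a's buffer because a is dead by the time c is produced, so three blobs need only two buffers.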
Reviewed By: prigoyal
Differential Revision: D4363209
fbshipit-source-id: 73b11e7610438098bb11bff0af8075ab0cf2c0f1
Summary: this was introduced due to the rm and riv params in the SpatialBN layer and the like. We should be saving these params as well, but it is not required to broadcast them to all GPUs after every epoch.
Differential Revision: D4338749
fbshipit-source-id: d3bbc92cf0cd7d220a51d76aea8bffcfd6e520b7
Summary: I accidentally landed the control_input disable for NCCL in D4327024. It empirically increases the likelihood of deadlocks, although it gives a nice perf boost. Better to turn it off until NVIDIA fixes their stuff.
Reviewed By: Yangqing
Differential Revision: D4338537
fbshipit-source-id: d43efb45965a88bcfe38e5f1dc16c04463e2e038
Summary:
As requested by Yangqing, added an Inception model (copied from convnet_benchmarks) and a dummy-data feed option to the xray trainer, which we use for scalability benchmarking.
+ a couple of mini-changes to the data input framework
Reviewed By: Yangqing
Differential Revision: D4327024
fbshipit-source-id: 86911468456fc13a32d5f437a43347380ec66a68
Summary:
adding the imagenet dataset as well
data augmentation and the model have been added; just need to add the db read
Differential Revision: D4289150
fbshipit-source-id: b531d3f09e3d0efac5cda5bb75d8146e1bb693e4
Summary:
prigoyal sharply noticed a bug in the Resnet models: we have not been checkpointing, nor synchronizing between GPUs, the moving average and variance computed by the SpatialBN ops. The first problem in particular is serious, since models starting from a checkpoint would have started from a null state for SpatialBN. Not synchronizing within the data parallel model is less tragic, since each GPU should see very similar data.
Thus I propose keeping track of "computed params", i.e. params that are computed from data but not optimized. I don't know if there are other examples, but SpatialBN's moving average and variance definitely qualify.
- I modified the checkpointing for the xray model to store those blobs and also to ensure their synchronization
- I modified data parallel model to broadcast those params from gpu0. I first tried averaging, but hit some NCCL deadlocks ... :(
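The bookkeeping can be sketched like this (a hypothetical simplification, not the actual model-helper code): computed params live in their own list, are included in checkpoints alongside optimized params, but are broadcast rather than averaged.

```python
class TrackedModel:
    """Track optimized params separately from data-computed params."""

    def __init__(self):
        self.params = []           # optimized by the parameter update
        self.computed_params = []  # e.g. SpatialBN moving mean/var

    def add_param(self, name, computed=False):
        (self.computed_params if computed else self.params).append(name)

    def checkpoint_blobs(self):
        # Both kinds must be saved, or a restored model starts SpatialBN
        # statistics from a null state.
        return self.params + self.computed_params
```

Synchronization then broadcasts `computed_params` from gpu 0 while `params` go through the usual gradient allreduce.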
Differential Revision: D4281265
fbshipit-source-id: 933311afeec4b7e9344a13cf2d38aa939c50ac31
Summary: Make the xray net_type configurable via a command-line argument
Differential Revision: D4262076
fbshipit-source-id: e2ecb9cd5bee5d6aaebe0ea8d2d4d9b378058cba
Summary: Switching to Pieter-MPI changed the way we set up the network between operators. For synchronizing parameters after a checkpoint load, we run a checkpoint_net that contained operators for creating the common world and broadcast operators. Unfortunately this fails when the checkpoint sync is done a second time, because we would have created a duplicate common world. The solution is to separate the common world op and the broadcast ops into an init net and the actual broadcasting net, and run the init net only once. This problem did not arise in the Flow version, since I did only one checkpoint load per operator (process).
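The split can be sketched as follows (hypothetical names, not the actual checkpoint code): common-world creation goes into an init net that runs exactly once, while the broadcast ops live in a sync net that can be re-run after every checkpoint load.

```python
class CheckpointSync:
    """Run common-world creation once; re-run parameter broadcast freely."""

    def __init__(self):
        self.init_ran = False
        self.common_world = None

    def run_init_net(self, create_common_world):
        if not self.init_ran:  # guard against duplicate common worlds
            self.common_world = create_common_world()
            self.init_ran = True

    def run_sync_net(self, broadcast):
        # Safe to call after every checkpoint load.
        return broadcast(self.common_world)
```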
Differential Revision: D4251754
fbshipit-source-id: ba030579e651e529e29bbf2d27920075078d8ff9
Summary:
When refactoring data parallel model, the division of the LR by the number of devices was dropped, so we ended up effectively multiplying gradients by the number of devices. We therefore need to scale the LR by 1/numgpus.
Created a test to confirm that data_parallel_model produces exactly the same results on different numbers of GPUs, given the same total batch size.
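A small numeric illustration of the fix (names are illustrative): with per-device gradients summed, not averaged, across N GPUs, the effective step is N times too large unless the LR is scaled by 1/N.

```python
def sgd_step(w, per_device_grads, base_lr):
    """One SGD step where devices sum their gradients (no averaging)."""
    num_gpus = len(per_device_grads)
    total_grad = sum(per_device_grads)  # summed, not averaged
    lr = base_lr / num_gpus             # the 1/numgpus scaling from this diff
    return w - lr * total_grad
```

With two GPUs each contributing gradient 0.5, the step matches the single-GPU step with gradient 0.5, which is exactly what the new consistency test checks.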
Reviewed By: prigoyal
Differential Revision: D4248907
fbshipit-source-id: af21ede113e6ac25f12c556de298cb18974548be
Summary:
This diff introduces a simplified Imagenet trainer that uses data_parallel_model to parallelize training over GPUs and nodes in a synchronous manner. Flow's gang scheduling is used to launch the nodes, and data_parallel_model handles the synchronization among the gang members.
This example also uses the operator-per-epoch model, where each epoch produces a checkpoint consumed by the following epoch.
Reviewed By: salexspb
Differential Revision: D4223384
fbshipit-source-id: 8c2c73f4f6b2fdadb98511075ebbd8426c91eadb
Summary: Debugging nets can be tiresome, so it is good if we can do some sanity checks. This adds a sanity check that all non-NCCL and non-Copy operators do not reference blobs that have a different device scope than the operator. The check is only added to data_parallel_model, so it should be safe. It would have caught a subtle bug in prigoyal's training pipeline.
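The check can be sketched like this, assuming the 'gpu_<id>/' blob-name prefix that device scoping produces (an illustrative fragment, not the exact Caffe2 code):

```python
def check_op_device_scope(op_type, device_id, blobs):
    """True if the op only touches blobs in its own device scope."""
    if op_type.startswith("NCCL") or op_type == "Copy":
        return True  # these ops cross device boundaries by design
    prefix = "gpu_%d/" % device_id
    return all(b.startswith(prefix) for b in blobs)
```

An op on gpu 0 that reads a gpu_1-scoped blob fails the check, while NCCL allreduce ops are exempt.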
Reviewed By: dzhulgakov
Differential Revision: D4230444
fbshipit-source-id: 3d4a843162134a7a504053d95ff97a552e6b8a6d
Summary:
Remove MPI and use the fb.distributed rendezvous and Pieter's new ops.
One can now pass a 'rendezvous' struct to data_parallel_model to initiate distributed SyncSGD. The provided rendezvous implementation uses the kv-store handler of fb.distributed to disseminate information about the other hosts. We can easily add other rendezvous mechanisms, such as file-based, but that is the topic of another diff.
Removing MPI also allowed simplifying the Xray startup scripts, which are included in this diff.
Once accepted, I will work on simple example code so others can use this stuff as well. A Flow implementation will be the topic of next week.
Differential Revision: D4180012
fbshipit-source-id: 9e74f1fb43eaf7d4bb3e5ac6718d76bef2dfd731