pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-06 12:20:52 +01:00

Author	SHA1	Message	Date
Lukasz Wesolowski	29a4c942fe	Add support for multi-device batch normalization through an option to data_parallel_model Summary: Stage 3 in stack of diffs for supporting multi-device batch normalization. Adds input parameter to data_parallel_model to enable multi-device batch normalization. Depends on D6699258. Reviewed By: pietern Differential Revision: D6700387 fbshipit-source-id: 24ed62915483fa4da9b1760eec0c1ab9a64b94f8	2018-01-24 13:24:06 -08:00
Aapo Kyrola	2caca70a37	Allow shifting of activations / ops to other GPUs in data parallel model Summary: (Work in progress). This diff will allow shifting of activations to other GPUs, in case the model does not fit into memory. To see the API, check the code in data_parallel_model_test, which tests shifting two activations from 0 and 1 to gpu 4, and from gpu 2 and 3 to gpu 5. I will need to further test on ResNets, and probably add copy operations to handle device change points. Reviewed By: asaadaldien Differential Revision: D5591674 fbshipit-source-id: eb12d23651a56d64fa4db91090c6474218705270	2017-11-29 21:17:00 -08:00
Matthias Ochs	14cc15e8f4	fixed NCCL bug in data_parallel_model.py Summary: Changed the dict of viewvalues into a python list See issue: https://github.com/caffe2/caffe2/issues/1516 Closes https://github.com/caffe2/caffe2/pull/1532 Differential Revision: D6425901 Pulled By: akyrola fbshipit-source-id: 37988abe29726aea86637e18eedb948b7c281008	2017-11-28 10:50:02 -08:00
Qinqing Zheng	4471e15b76	BMUF cpu support Summary: change the interface so BMUF can run on cpus Reviewed By: asaadaldien Differential Revision: D6356026 fbshipit-source-id: f58a4da9f800d969145a1a376e118b0f3581f8c1	2017-11-19 23:41:25 -08:00
Aapo Kyrola	1a02e72254	fix missing DPM .values() and .keys() to viewvalues() and viewkeys() Summary: Reported by Simon Layton from NVIDIA: we had a couple of py3-incompatible expressions in data_parallel_model Reviewed By: azzolini Differential Revision: D6349447 fbshipit-source-id: a09feb69396be43296400591a3bfed5b8c370b0d	2017-11-16 16:08:18 -08:00
Aapo Kyrola	e9cc41885e	fix dynamic memory management for distributed execution Summary: Dynamic memory management in Data Parallel Model was broken for distributed computation because the parameter gradients were also freed after being used. That is a problem with GLOO because it expects the tensors to have the same address over multiple calls. It is not a huge loss to remove parameter gradients from recycling as they are relatively small for typical convnets. Reviewed By: asaadaldien Differential Revision: D6314095 fbshipit-source-id: 949161d8c592927ae2fa82b3262b5f9ee47bed6f	2017-11-13 12:09:11 -08:00
Aapo Kyrola	cec27b8134	AddDistributedBlobsSync Summary: Added a simple function to synchronize a blob across machines (but not across devices), i.e. blobs that are not synced over devices. Reviewed By: yqwangustc Differential Revision: D6192922 fbshipit-source-id: a4d653c9fb09f06b0c42330bdae07b42f5e6346c	2017-10-30 22:33:29 -07:00
Dmytro Dzhulgakov	2972a6ca02	Revert D6026557: [caffe2][PR] Fix "No handlers could be found for logger" Summary: This reverts commit 95c634872ac02be721257169e38c8fead04cd66b bypass-lint Differential Revision: D6026557 fbshipit-source-id: 663c28583ce3b01070ff5449115ed7e222f71776	2017-10-12 20:21:52 -07:00
Aapo Kyrola	d748c43f71	for dpm.GetLearningRateBlobNames Summary: I broke dpm.GetLearningRateBlobNames() when adding a new nodename param in optimizer. Fixing it. Reviewed By: asaadaldien Differential Revision: D6043828 fbshipit-source-id: b3a79dd0dfae144187bcb359e2374eab6b32c485	2017-10-12 17:20:33 -07:00
Luke Yeager	75bece6ede	Fix "No handlers could be found for logger" Summary: Closes https://github.com/caffe2/caffe2/pull/1316 Differential Revision: D6026557 Pulled By: Yangqing fbshipit-source-id: 95c634872ac02be721257169e38c8fead04cd66b	2017-10-10 22:32:13 -07:00
Andrey Malevich	e13f199452	Switch RNNOp to use NetDef argument for step representation. Summary: Before this diff RNNOp was using TextFormat for representing steps. This diff is changing RNNOp to prefer NetDef argument instead. To be backward compatible it supports TextFormat for existing models, though we can compile RNNs without TextFormat as well. Reviewed By: salexspb Differential Revision: D5949330 fbshipit-source-id: 9336a8f5ccf30ad8d8e3a7067b9437e1704b1c9f	2017-10-10 22:01:51 -07:00
Yangqing Jia	8286ce1e3a	Re-license to Apache Summary: Closes https://github.com/caffe2/caffe2/pull/1260 Differential Revision: D5906739 Pulled By: Yangqing fbshipit-source-id: e482ba9ba60b5337d9165f28f7ec68d4518a0902	2017-09-28 16:22:00 -07:00
Luke Yeager	ec801d535c	Fix typo in warning in data_parallel_model Summary: Closes https://github.com/caffe2/caffe2/pull/1219 Differential Revision: D5898077 Pulled By: Yangqing fbshipit-source-id: 7ee726ef3399a350a36e77093cbad0f70f8f3dce	2017-09-22 23:03:28 -07:00
Ahmed Taei	c3a3d6ceba	Add an option to use dynamic memory optimizer. Reviewed By: akyrola Differential Revision: D5869664 fbshipit-source-id: ab11bc27395bf10e8381ebf97e6afb83ae9af81f	2017-09-20 12:52:55 -07:00
Aapo Kyrola	9ec981b866	for CPU-data parallel, allow sharing model Summary: On CPU, no need to replicate parameters. So try using only one copy (cpu_0) for parameters. Made resnet50_trainer use shared model in cpu mode. Reviewed By: wesolwsk Differential Revision: D5812181 fbshipit-source-id: 93254733edbc4a62bd74a629a68f5fa23f7e96ea	2017-09-15 16:19:37 -07:00
Aapo Kyrola	ce36a972b0	fix timeouts in CloneOrCreateCommonWorld Summary: Default value for timeout in CreateOrCloneCommonWorld does not work properly: if the value of dpm._DEFAULT_TIMEOUT is changed, the default still stays as old 30s. Changed to use None instead as default. Reviewed By: pietern Differential Revision: D5813228 fbshipit-source-id: f617ceec40a03893c27d3e13c426e1ca6b2114e2	2017-09-12 13:09:05 -07:00
Aapo Kyrola	93bd3c77f8	AddBlobsSync() Summary: Explicit function to sync blobs. Notice that this must be called before CreateNet(), and syncs the blobs every run. Reviewed By: asaadaldien, jay-mahadeokar Differential Revision: D5805891 fbshipit-source-id: 58a1bb47805d75d5cbead136e2e0e9fe663ea954	2017-09-12 10:33:22 -07:00
Pieter Noordhuis	84167faf0f	Enable use of GPUDirect through argument to Gloo AllreduceOp Summary: If the Gloo InfiniBand transport is used, the Gloo algorithms can use GPUDirect to DMA directly from/to GPU memory. This is done through the CudaDeviceWorkspace. This change adds a "gpu_direct" option to the Allreduce operator that makes it use GPUDirect if the transport supports it. Closes https://github.com/caffe2/caffe2/pull/1203 Reviewed By: wesolwsk Differential Revision: D5806366 Pulled By: pietern fbshipit-source-id: 9e9a78f059f2b5c6e4fbf6574b7db4776a94696c	2017-09-11 13:02:58 -07:00
Pieter Noordhuis	d43ab4bec5	Create Gloo common world through MPI rendezvous Summary: Before this change there were two ways for machines to rendezvous for a distributed run: shared file system or Redis. If you're using an MPI cluster it is much more convenient to simply execute mpirun and expect the "right thing (tm)" to happen. This change adds the "mpi_rendezvous" option to the CreateCommonWorld operator. If this is set, the common world size and rank will be pulled from the MPI context and Gloo rendezvous takes place using MPI. Note that this does NOT mean the MPI BTL is used; MPI is only used for rendezvous. Closes https://github.com/caffe2/caffe2/pull/1190 Reviewed By: akyrola Differential Revision: D5796060 Pulled By: pietern fbshipit-source-id: f8276908d3f3afef2ac88594ad377e38c17d0226	2017-09-08 17:18:47 -07:00
Pieter Noordhuis	b8eb8ced7d	Add transport/interface arguments to CreateCommonWorld operator Summary: These arguments control which Gloo transport (TCP or IB) and which network interface is used for the common world. If not specified, it defaults to using TCP and the network interface for the IP that the machine's hostname resolves to. The valid values for the transport argument are "tcp" and "ibverbs". For ibverbs to work, Gloo must have been compiled with ibverbs support. If Gloo is built as part of Caffe2 (sourced from the third_party directory), then you can pass -DUSE_IBVERBS=ON to CMake to enable ibverbs support in Gloo. Closes https://github.com/caffe2/caffe2/pull/1177 Reviewed By: akyrola Differential Revision: D5789729 Pulled By: pietern fbshipit-source-id: 0dea1a115c729e54c5c1f9fdd5fb29c14a834a82	2017-09-08 10:57:41 -07:00
Aapo Kyrola	b7997a0f41	support device ids>10 Summary: Data parallel model failed with device numbers 10, 11.. because it used string sorting of the blob names. Changed to make sorting happen based on device number and then blob name. Also added reduction for 16 devices. Reviewed By: wesolwsk Differential Revision: D5781521 fbshipit-source-id: 16be0984ecb55340604c82893be366c0528e822c	2017-09-07 00:01:33 -07:00
Pieter Noordhuis	6d5c3eaeb7	Add CloneCommonWorld op Summary: Cloning was previously done by overloading CreateCommonWorld op. Closes https://github.com/caffe2/caffe2/pull/1159 Reviewed By: andrewwdye Differential Revision: D5757580 Pulled By: pietern fbshipit-source-id: 9e80b295e390bf92623bafb72be21cbafdcf2ff4	2017-09-06 13:32:30 -07:00
Wojciech Glogowski	a7ec5def7b	data_parallel_model names fix Summary: Updated usage of deprecated functions in data_parallel_model.py Reviewed By: akyrola Differential Revision: D5738512 fbshipit-source-id: a7767e518da777ece058bcad480e5df1d91e9b42	2017-08-30 12:47:14 -07:00
Aapo Kyrola	7fad4be4c6	Device-specific memongering Summary: Enforce that blobs don't mix between operators on different GPUs or CPU/GPU. Add test. + Fix memonger when no namescope is provided. Reviewed By: asaadaldien Differential Revision: D5644708 fbshipit-source-id: 0cb361efd6361b6e2138462584bab6b4de039b5d	2017-08-17 13:31:26 -07:00
Alexander Sidorov	52befa4802	DataParallelModel: take param_init_net into account in _InferBlobDevice Summary: Here is my example: For static RNN timestep is created as a part of param_init_net. Before DPM assumed that it is CUDA blob by default and it participated in broadcasting causing Copy on line 798 to fail. No device mapping is correct for this blob. Reviewed By: akyrola Differential Revision: D5631716 fbshipit-source-id: 28c3eb17ecc3080c95c41d69a60bf7262d3907d4	2017-08-15 12:06:46 -07:00
Zhaoming Wu	399fc9fb09	Added Nesterov Summary: Added Nesterov momentum as an option for BMUF and corresponding tests Reviewed By: asaadaldien Differential Revision: D5599888 fbshipit-source-id: 30819c9e689347c8b75daddc7444bea9f54193ae	2017-08-11 13:52:43 -07:00
Priya Goyal	5c77cc8182	Exposing num_workers as parameter and enable recycling activations Summary: as promised, a separate diff for dpm changes I made in experimental code Reviewed By: pietern Differential Revision: D5551304 fbshipit-source-id: 9013aeab6c388b1c415ffb2e36fb8dd6b8cf90b0	2017-08-08 19:48:41 -07:00
Ahmed Taei	647f35e742	Fix SyncAllParamsDistributed for Python 3x Summary: In Python 3x, dictionary values aren't a list and can't be concatenated to a list; this diff should fix that. Reviewed By: andrewwdye Differential Revision: D5576724 fbshipit-source-id: c60441857ceceb9c4a71122d2db5e9abad6d3fc2	2017-08-07 14:23:32 -07:00
Aapo Kyrola	26645154bb	warn about using test/val model with init_params=True + fixed some cases Summary: It is a common mistake to create a test/validation model with init_params=True. When its param_init_net is run, it will overwrite the training model's params, and with DPM, those won't be synchronized to all GPUs. I don't want to make this an assertion yet, since it might break people's trainers (it is ok to have init_params=True if you never run the param_init_net...). Reviewed By: asaadaldien Differential Revision: D5509963 fbshipit-source-id: 63b1a16ec0af96e3790e226850f6e0e64689143f	2017-07-27 13:20:27 -07:00
Aapo Kyrola	af1e45c1e1	support appending net and converting them Summary: As per rushabhmshah99's request: he wants to append a pre-trained model (without training it) to the model. So added data_parallel_model.ConvertNetForDevice() to enable that. The unit test shows an example of how to use this with AppendNet, and I also added a blurb to the function. Differential Revision: D5503335 fbshipit-source-id: b2a5db5c1739dc97f46dd0d7606ed555d99255b8	2017-07-27 11:07:48 -07:00
Aapo Kyrola	3363681304	enable CreateCommonWorld to bootstrap from existing common world Summary: Use romain-intel's ContextFactory to create common worlds from existing common worlds, thus bypassing KV store completely. Changed data_parallel_model to automatically find if there is already a CW we can work with. CreateCommonWorldOp takes an optional second parameter, which is an existing CW. Reviewed By: andrewwdye Differential Revision: D5494956 fbshipit-source-id: 5f7a840bcd5fe4ea756fafeacc746bc2cf5078b0	2017-07-26 22:31:55 -07:00
Ahmed Taei	804ebf7c41	Populate learning rate blob name into data_parallel_model and fix resnet50_trainer example. Reviewed By: akyrola Differential Revision: D5463772 fbshipit-source-id: 10b8963af778503a3de6edbabb869747bd1e986d	2017-07-21 16:24:10 -07:00
Geet Sethi	11c4647447	Allow CPU device scope in data_parallel_model and data_parallel_rendevous device scope checks Summary: Allowing CPU device scope instead of enforcing no device scope in data_parallel_model and data_parallel_rendevous. Reviewed By: akyrola Differential Revision: D5440492 fbshipit-source-id: bcd4344d64c710ea50ec8a65e3e9d102e35c66ea	2017-07-18 15:47:41 -07:00
Geet Sethi	ab0d631d6d	Adding AllCompare-like function to data_parallel_model Summary: Added function _RunComparison to data_parallel_model that checks if all shards in a given rendezvous have the same value for a given blob_name Reviewed By: wesolwsk Differential Revision: D5394164 fbshipit-source-id: c2b07d0f8d5846fa9887d53b0be091a8c057f106	2017-07-13 13:03:57 -07:00
Geet Sethi	a68bb5e3f9	Added device scope checks to data_parallel_model and data_parallel_rendevous Summary: Added device scope checks to data_parallel_model and data_parallel_rendevous Added test to check that checks are working correctly to data_parallel_model_test Fixed device_scope error in test_synchronization_barrier Reviewed By: akyrola Differential Revision: D5403936 fbshipit-source-id: 849c1cd7452692efbc5ef74d2d60ede090c9c017	2017-07-12 10:47:28 -07:00
Ralph Mao	febae7b20b	fix a bug in the report function of Data_Parallel Summary: replace params with sp, otherwise it will report an empty list Reviewed By: akyrola Differential Revision: D5382716 fbshipit-source-id: 34d8e6ee00cbe1718702e3d1f23ea12f8d65063e	2017-07-07 13:03:46 -07:00
Andrew Dye	31f394f8b3	Add synchronization barrier API to data parallel model Summary: Add synchronization barrier API with configurable timeout. Users can call Synchronize() to join variable length execution before resuming multi-machine communication steps, i.e., resuming distributed training iterations after validation on a single machine. Reviewed By: akyrola Differential Revision: D5348387 fbshipit-source-id: 5826da10e6a60c50394c36c7cf47624f10191d11	2017-07-06 09:21:19 -07:00
Aapo Kyrola	2d133d4627	increase concurrency default Summary: Huge improvement in my tests, and it does not really hurt either. Reviewed By: wesolwsk Differential Revision: D5374925 fbshipit-source-id: c96a4ed2ca653120a82233c0037cbfded8a2d2a1	2017-07-05 21:46:31 -07:00
Simon Layton	090506ac87	Add NCCLBroadcast to correct net Summary: Otherwise was always added to main net instead of param_init_net when desired (i.e. initial param sync) Closes https://github.com/caffe2/caffe2/pull/894 Differential Revision: D5367451 Pulled By: akyrola fbshipit-source-id: 3d82be6da687c736bd15f4852dbd272266eb4811	2017-07-03 16:54:44 -07:00
Aapo Kyrola	8c74c36626	fix reducing device option Summary: This was broken in a previous diff, fixing it to use model device type. Reviewed By: asaadaldien Differential Revision: D5356005 fbshipit-source-id: a4fcc932bae772076b57625a5fcc0d38eb702cc9	2017-06-30 09:19:57 -07:00
Thomas Dudziak	5355634dac	Dict fixes/improvements and unittest targets for Python 3 in caffe2 core Summary: As title Reviewed By: salexspb Differential Revision: D5316104 fbshipit-source-id: aee43819d817842e5ce6ba3d045a55b1a2491c30	2017-06-29 17:05:41 -07:00
Yongqiang Wang	ea659b8f2e	broadcast to global parameters when using warmup Reviewed By: asaadaldien, jay-mahadeokar Differential Revision: D5340692 fbshipit-source-id: 80879847ff71c8d620de502ef95a9ffb4bdf595d	2017-06-28 13:35:27 -07:00
Ahmed Taei	fbe2526343	Allow concurrent execution of GLOO broadcast collectives in Summary: This adds CollectivesConcurrencyControl class to manage creating common context and cyclic controls to execute GLOO collectives and refactors AllReduce and _AddDistributedParamterSync to use it Reviewed By: akyrola Differential Revision: D5335795 fbshipit-source-id: 5084e0a65cdb989cd949be3868b77a680561022d	2017-06-28 12:49:12 -07:00
Henry Lu	9a14c013c3	Refactor data_parallel_model to take advantage of Gloo broadcast op in broadcasting across machines and GPUs in one operation Summary: Combine _AddDistributedParameterSync() and _SyncParams() into a single function to broadcast across distributed machines and all local GPUs simultaneously. This is similar to how calls to Allreduce have already been optimized using the functionalities of Gloo. All the refactoring work is contained in data_parallel_model.py. Reviewed By: akyrola, andrewwdye Differential Revision: D5329277 fbshipit-source-id: 4407b88980cf396f2e0f994d796294fa79fd39ed	2017-06-27 19:35:24 -07:00
Simon Layton	d45f722e43	data_parallel_model: NCCLBroadcast root fix Summary: The root is the root _rank_ and not the root _device_. Thus we always use root=0, regardless of the devices used. https://github.com/NVIDIA/nccl/blob/v1.3.0-1/src/broadcast.cu#L75 /cc slayton58 Closes https://github.com/caffe2/caffe2/pull/872 Differential Revision: D5329564 Pulled By: akyrola fbshipit-source-id: 5a34be30c1a0046a74f28437cb08333c1fb46098	2017-06-27 09:47:48 -07:00
Jay Mahadeokar	04c9c8c5c2	fix for loading model with bmuf Summary: - One line fix for loading saved checkpoint when using Parallelize_GPU_BMUF Reviewed By: asaadaldien Differential Revision: D5315254 fbshipit-source-id: a20ba6438c8e6b2ef44b65270c1d3f9ab645ded0	2017-06-23 17:16:33 -07:00
Thomas Dudziak	342de07231	Core unit test fixes for Python 3 Summary: As title Differential Revision: D5291327 fbshipit-source-id: 7dd9279c53ba55d3422c31973ffcec5705787fdf	2017-06-23 13:22:16 -07:00
Ahmed Taei	5ca263fb1c	Add a warmup option for BMUF Reviewed By: yqwangustc Differential Revision: D5279655 fbshipit-source-id: 7c778a88909580bbe43d4bac4b7d73be0d0e3f27	2017-06-22 14:32:39 -07:00
Ahmed Taei	ffd32c8ab7	Add distributed BMUF implementation. Summary: Refactor data_parallel_model all_reduce and broadcast methods to work for a given parameter set not only gradients and reuse them for BMUF distributed implementation. Add a distributed test (multiprocessing) to BMUF. Reviewed By: akyrola Differential Revision: D5267083 fbshipit-source-id: 8dcc7527d0a755b903d693d8071585f0b54d3403	2017-06-21 16:18:11 -07:00
Aapo Kyrola	34eaa19d27	CPU data parallel model Summary: CPU version of data parallel model. The great thing is that now we can run data_parallel_model_test in Sandcastle (as it does not have GPUs). Pretty simple change, really. I did not change all variable names with "gpu" in them, to reduce risk (and being a bit lazy). Can improve later. Reviewed By: wesolwsk Differential Revision: D5277350 fbshipit-source-id: 682e0c5f9f4ce94a8f5bd089905b0f8268bd2210	2017-06-20 23:19:08 -07:00
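
Commit b7997a0f41 above reports that plain string sorting of blob names broke once device ids reached 10, and that the fix sorts by device number first and blob name second. The sketch below illustrates that idea only; it is not the code from data_parallel_model.py. The helper name `device_sort_key` and the `gpu_<id>/<blob>` name layout are assumptions made for the illustration.

```python
def device_sort_key(blob_name):
    """Return (device_id, blob) for a name like 'gpu_10/fc_w'.

    Hypothetical helper: assumes every name has the form
    '<prefix>_<device id>/<blob name>'.
    """
    scope, _, blob = blob_name.partition("/")
    _, _, device_id = scope.rpartition("_")
    return (int(device_id), blob)


blobs = ["gpu_10/fc_w", "gpu_2/fc_w", "gpu_1/fc_w", "gpu_2/fc_b"]

# Lexicographic order puts device 10 between devices 1 and 2.
string_sorted = sorted(blobs)

# Numeric device id first, then blob name.
device_sorted = sorted(blobs, key=device_sort_key)

print(string_sorted)
print(device_sorted)
```

With the string sort, `gpu_10/fc_w` lands before `gpu_2/fc_b`, so any logic that assumes blobs are grouped in device order sees them in the wrong positions; the tuple key keeps devices in numeric order.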


111 Commits
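
Commit ce36a972b0 in the log describes a default timeout that stayed at the old 30s even after `dpm._DEFAULT_TIMEOUT` was changed, fixed by using `None` as the default. The sketch below shows the general Python behaviour behind that kind of bug; the function names are hypothetical and this is not the actual Caffe2 code.

```python
_DEFAULT_TIMEOUT = 30  # seconds; stand-in for a module-level default


def create_world_early(timeout=_DEFAULT_TIMEOUT):
    # The default was evaluated once, when this function was defined.
    return timeout


def create_world_late(timeout=None):
    # None is a sentinel: the module-level value is read at call time.
    if timeout is None:
        timeout = _DEFAULT_TIMEOUT
    return timeout


# A caller later overrides the module-level default.
_DEFAULT_TIMEOUT = 120

early = create_world_early()
late = create_world_late()
print(early, late)
```

Here `early` is still 30 while `late` picks up 120, which is why a `None` sentinel is the usual choice when a default should track a setting that can change at run time.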