Summary: Before this diff, RNNOp used TextFormat to represent steps. This diff changes RNNOp to prefer a NetDef argument instead. For backward compatibility it still supports TextFormat for existing models, though RNNs can now be compiled without TextFormat as well.
Reviewed By: salexspb
Differential Revision: D5949330
fbshipit-source-id: 9336a8f5ccf30ad8d8e3a7067b9437e1704b1c9f
Summary: On CPU there is no need to replicate parameters, so use only one copy (on cpu_0) for parameters. Made resnet50_trainer use the shared model in CPU mode.
Reviewed By: wesolwsk
Differential Revision: D5812181
fbshipit-source-id: 93254733edbc4a62bd74a629a68f5fa23f7e96ea
Summary: The default value for timeout in CreateOrCloneCommonWorld does not work properly: if the value of dpm._DEFAULT_TIMEOUT is changed, the default still stays at the old 30s, because the default is captured when the function is defined. Changed to use None as the default instead (see the sketch below).
Reviewed By: pietern
Differential Revision: D5813228
fbshipit-source-id: f617ceec40a03893c27d3e13c426e1ca6b2114e2
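A minimal, generic Python sketch of the gotcha described above; the names (_DEFAULT_TIMEOUT, create_common_world_*) are illustrative, not the actual dpm code:

```python
_DEFAULT_TIMEOUT = 30.0

# Broken pattern: the default is evaluated once, at function definition time,
# so later changes to _DEFAULT_TIMEOUT have no effect on callers.
def create_common_world_broken(timeout=_DEFAULT_TIMEOUT):
    return timeout

# Fixed pattern: use None as the sentinel and resolve it at call time.
def create_common_world_fixed(timeout=None):
    if timeout is None:
        timeout = _DEFAULT_TIMEOUT
    return timeout

_DEFAULT_TIMEOUT = 60.0
print(create_common_world_broken())  # still 30.0
print(create_common_world_fixed())   # 60.0
```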
Summary: Added an explicit function to sync blobs. Note that it must be called before CreateNet(), and it syncs the blobs on every run.
Reviewed By: asaadaldien, jay-mahadeokar
Differential Revision: D5805891
fbshipit-source-id: 58a1bb47805d75d5cbead136e2e0e9fe663ea954
Summary:
If the Gloo InfiniBand transport is used, the Gloo algorithms can use
GPUDirect to DMA directly from/to GPU memory. This is done through the
CudaDeviceWorkspace. This change adds a "gpu_direct" option to the
Allreduce operator that makes it use GPUDirect if the transport
supports it.
Closes https://github.com/caffe2/caffe2/pull/1203
Reviewed By: wesolwsk
Differential Revision: D5806366
Pulled By: pietern
fbshipit-source-id: 9e9a78f059f2b5c6e4fbf6574b7db4776a94696c
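A hedged sketch of what enabling the new option might look like from Python; the blob names are placeholders and the exact operator arguments should be checked against the Gloo operator schema:

```python
from caffe2.python import core

net = core.Net("allreduce_gpu_direct")
# Assumes "cw" is a common world created with the ibverbs transport and that
# "grad_gpu_0"/"grad_gpu_1" are gradient blobs living on two GPUs.
net.Allreduce(
    ["cw", "grad_gpu_0", "grad_gpu_1"],
    ["grad_gpu_0", "grad_gpu_1"],
    engine="GLOO",
    gpu_direct=1,  # use GPUDirect if the transport supports it
)
```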
Summary:
Before this change there were two ways for machines to rendezvous for a
distributed run: shared file system or Redis. If you're using an MPI
cluster it is much more convenient to simply execute mpirun and expect
the "right thing (tm)" to happen. This change adds the "mpi_rendezvous"
option to the CreateCommonWorld operator. If this is set, the common
world size and rank will be pulled from the MPI context and Gloo
rendezvous takes place using MPI. Note that this does NOT mean the MPI
BTL is used; MPI is only used for rendezvous.
Closes https://github.com/caffe2/caffe2/pull/1190
Reviewed By: akyrola
Differential Revision: D5796060
Pulled By: pietern
fbshipit-source-id: f8276908d3f3afef2ac88594ad377e38c17d0226
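A hedged sketch of how the new option might be used; the argument list for CreateCommonWorld below is illustrative and should be checked against the operator schema:

```python
from caffe2.python import core

net = core.Net("mpi_rendezvous_example")
# With mpi_rendezvous=True, size and rank come from the MPI context, so no
# KV-store handler (file or Redis) input is needed for rendezvous.
net.CreateCommonWorld(
    [],
    ["comm_world"],
    engine="GLOO",
    mpi_rendezvous=True,
)
```

The job would then be launched with something like `mpirun -np <world_size> python trainer.py`; MPI is used only for rendezvous, not for the collectives themselves.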
Summary:
These arguments control which Gloo transport (TCP or IB) and which
network interface are used for the common world. If not specified, they
default to TCP and the network interface for the IP address that the
machine's hostname resolves to (see the sketch below).
The valid values for the transport argument are "tcp" and "ibverbs".
For ibverbs to work, Gloo must have been compiled with ibverbs
support. If Gloo is built as part of Caffe2 (sourced from the
third_party directory), then you can pass -DUSE_IBVERBS=ON to CMake to
enable ibverbs support in Gloo.
Closes https://github.com/caffe2/caffe2/pull/1177
Reviewed By: akyrola
Differential Revision: D5789729
Pulled By: pietern
fbshipit-source-id: 0dea1a115c729e54c5c1f9fdd5fb29c14a834a82
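A hedged sketch of passing these arguments when creating the common world; the blob names and surrounding setup are illustrative:

```python
from caffe2.python import core

net = core.Net("ibverbs_common_world")
# "store_handler" is assumed to be a file- or Redis-based KV store handler
# created earlier for rendezvous.
net.CreateCommonWorld(
    ["store_handler"],
    ["comm_world"],
    size=2,               # number of machines
    rank=0,               # this machine's rank
    engine="GLOO",
    transport="ibverbs",  # "tcp" (default) or "ibverbs"
    interface="",         # empty = pick the interface the hostname resolves to
)
```

When Gloo is built from the third_party directory as part of Caffe2, ibverbs support is enabled by passing -DUSE_IBVERBS=ON to CMake, as noted above.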
Summary: Data parallel model failed with device numbers 10, 11, ... because it sorted blob names as strings. Changed the sorting to use the device number first and then the blob name (see the sketch below). Also added reduction for 16 devices.
Reviewed By: wesolwsk
Differential Revision: D5781521
fbshipit-source-id: 16be0984ecb55340604c82893be366c0528e822c
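A small Python illustration of the bug and the fix; the blob naming scheme is a simplified stand-in for the real one:

```python
names = ["gpu_2/fc_w", "gpu_10/fc_w", "gpu_11/fc_w", "gpu_1/fc_w"]

# String sorting puts gpu_10 and gpu_11 before gpu_2:
print(sorted(names))
# ['gpu_1/fc_w', 'gpu_10/fc_w', 'gpu_11/fc_w', 'gpu_2/fc_w']

# Sorting by (device number, blob name) gives the intended order:
def device_then_name(blob):
    device, _, name = blob.partition("/")
    return int(device.split("_")[1]), name

print(sorted(names, key=device_then_name))
# ['gpu_1/fc_w', 'gpu_2/fc_w', 'gpu_10/fc_w', 'gpu_11/fc_w']
```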
Summary:
Enforce that blobs don't mix between operators on different GPUs or CPU/GPU. Add test.
+ Fix memonger when no namescope is provided.
Reviewed By: asaadaldien
Differential Revision: D5644708
fbshipit-source-id: 0cb361efd6361b6e2138462584bab6b4de039b5d
Summary:
Here is my example: for a static RNN, the timestep blob is created as part of param_init_net. Previously DPM assumed it was a CUDA blob by default, so it participated in broadcasting, causing the Copy on line 798 to fail. No device mapping is correct for this blob.
Reviewed By: akyrola
Differential Revision: D5631716
fbshipit-source-id: 28c3eb17ecc3080c95c41d69a60bf7262d3907d4
Summary: Added Nesterov momentum as an option for BMUF and corresponding tests
Reviewed By: asaadaldien
Differential Revision: D5599888
fbshipit-source-id: 30819c9e689347c8b75daddc7444bea9f54193ae
Summary: As promised, a separate diff for the dpm changes I made in experimental code.
Reviewed By: pietern
Differential Revision: D5551304
fbshipit-source-id: 9013aeab6c388b1c415ffb2e36fb8dd6b8cf90b0
Summary:
In Python 3.x, dict.values() returns a view rather than a list, so it can't be concatenated to a list directly. This diff fixes that (see the sketch below).
Reviewed By: andrewwdye
Differential Revision: D5576724
fbshipit-source-id: c60441857ceceb9c4a71122d2db5e9abad6d3fc2
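The failure mode in plain Python (independent of Caffe2):

```python
params = {"fc_w": 1, "fc_b": 2}
extra = ["lr"]

# Python 2: params.values() was a list, so extra + params.values() worked.
# Python 3: params.values() is a dict view, so the same expression raises
#   TypeError: can only concatenate list (not "dict_values") to list

# Portable fix: materialize the view first.
combined = extra + list(params.values())
print(combined)  # ['lr', 1, 2]
```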
Summary: It is a common mistake to create a test/validation model with init_params=True. When its param_init_net is run, it overwrites the training model's params, and with DPM those won't be synchronized to all GPUs. I don't want to make this an assertion yet, since it might break people's trainers (it is OK to have init_params=True if you never run the param_init_net...).
Reviewed By: asaadaldien
Differential Revision: D5509963
fbshipit-source-id: 63b1a16ec0af96e3790e226850f6e0e64689143f
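A minimal sketch of the recommended pattern; the model names and structure are illustrative:

```python
from caffe2.python import model_helper

# The training model owns parameter initialization.
train_model = model_helper.ModelHelper(name="trainer", init_params=True)

# The test/validation model should reuse the trained parameters instead of
# re-initializing (and overwriting) them with its own param_init_net.
test_model = model_helper.ModelHelper(name="tester", init_params=False)
```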
Summary:
As per rushabhmshah99's request: he wants to append a pre-trained model (without training it) to the model.
Added data_parallel_model.ConvertNetForDevice() to enable that. The unit test shows an example of how to use this with
AppendNet, and I also added a blurb to the function.
Differential Revision: D5503335
fbshipit-source-id: b2a5db5c1739dc97f46dd0d7606ed555d99255b8
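A hedged sketch of the intended usage, based on the summary above; the exact signature of ConvertNetForDevice and its device handling may differ from what is shown:

```python
from caffe2.proto import caffe2_pb2
from caffe2.python import core, data_parallel_model, model_helper

# Stand-in for a pre-trained net (normally loaded from a saved NetDef).
pretrained_net = core.Net("pretrained")
pretrained_net.Relu(["embedding"], ["embedding_out"])

model = model_helper.ModelHelper(name="trainer")
device_opt = core.DeviceOption(caffe2_pb2.CUDA, 0)

# Rewrite the pre-trained net for one device and append it to the model,
# without adding its parameters to the trainable set.
converted = data_parallel_model.ConvertNetForDevice(pretrained_net, device_opt)
model.net.AppendNet(converted)
```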
Summary: Use romain-intel's ContextFactory to create common worlds from existing common worlds, thus bypassing the KV store completely. Changed data_parallel_model to automatically detect whether there is already a CW it can reuse. CreateCommonWorldOp takes an optional second parameter, which is an existing CW.
Reviewed By: andrewwdye
Differential Revision: D5494956
fbshipit-source-id: 5f7a840bcd5fe4ea756fafeacc746bc2cf5078b0
Summary: Allowing CPU device scope instead of enforcing no device scope in data_parallel_model and data_parallel_rendevous.
Reviewed By: akyrola
Differential Revision: D5440492
fbshipit-source-id: bcd4344d64c710ea50ec8a65e3e9d102e35c66ea
Summary: Added a function _RunComparison to data_parallel_model that checks whether all shards in a given rendezvous have the same value for a given blob_name.
Reviewed By: wesolwsk
Differential Revision: D5394164
fbshipit-source-id: c2b07d0f8d5846fa9887d53b0be091a8c057f106
Summary:
Added device scope checks to data_parallel_model and data_parallel_rendevous.
Added a test to data_parallel_model_test to verify that the checks work correctly.
Fixed a device_scope error in test_synchronization_barrier.
Reviewed By: akyrola
Differential Revision: D5403936
fbshipit-source-id: 849c1cd7452692efbc5ef74d2d60ede090c9c017
Summary: Replace params with sp; otherwise it will report an empty list.
Reviewed By: akyrola
Differential Revision: D5382716
fbshipit-source-id: 34d8e6ee00cbe1718702e3d1f23ea12f8d65063e
Summary: Add a synchronization barrier API with a configurable timeout. Users can call Synchronize() to join variable-length execution before resuming multi-machine communication steps, e.g., resuming distributed training iterations after validation on a single machine.
Reviewed By: akyrola
Differential Revision: D5348387
fbshipit-source-id: 5826da10e6a60c50394c36c7cf47624f10191d11
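A hedged sketch of how the barrier might be used in a training loop. The name Synchronize() comes from the summary; the exact arguments (including how the configurable timeout is passed) are not shown here and the surrounding setup is assumed:

```python
from caffe2.python import data_parallel_model, workspace

def train_epoch(model, num_iters, is_master, run_validation):
    # "model" is assumed to have been set up with
    # data_parallel_model.Parallelize_GPU(...) and a multi-machine rendezvous.
    for _ in range(num_iters):
        workspace.RunNet(model.net)

    if is_master:
        run_validation()  # single-machine work of variable length

    # Barrier: every machine waits here before resuming distributed training.
    data_parallel_model.Synchronize(model)
```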
Summary: Huge improvement in my tests, and it does not really hurt either.
Reviewed By: wesolwsk
Differential Revision: D5374925
fbshipit-source-id: c96a4ed2ca653120a82233c0037cbfded8a2d2a1
Summary:
Otherwise it was always added to the main net instead of param_init_net when
desired (i.e., for the initial param sync).
Closes https://github.com/caffe2/caffe2/pull/894
Differential Revision: D5367451
Pulled By: akyrola
fbshipit-source-id: 3d82be6da687c736bd15f4852dbd272266eb4811
Summary: This was broken in a previous diff; fixing it to use the model's device type.
Reviewed By: asaadaldien
Differential Revision: D5356005
fbshipit-source-id: a4fcc932bae772076b57625a5fcc0d38eb702cc9
Summary:
This adds a CollectivesConcurrencyControl class to manage creating common contexts and the cyclic controls used to execute Gloo collectives,
and refactors AllReduce and _AddDistributedParameterSync to use it.
Reviewed By: akyrola
Differential Revision: D5335795
fbshipit-source-id: 5084e0a65cdb989cd949be3868b77a680561022d
Summary: Combine _AddDistributedParameterSync() and _SyncParams() into a single function that broadcasts across distributed machines and all local GPUs simultaneously. This is similar to how calls to Allreduce have already been optimized using the functionality of Gloo. All the refactoring work is contained in data_parallel_model.py.
Reviewed By: akyrola, andrewwdye
Differential Revision: D5329277
fbshipit-source-id: 4407b88980cf396f2e0f994d796294fa79fd39ed
Summary: One-line fix for loading a saved checkpoint when using Parallelize_GPU_BMUF.
Reviewed By: asaadaldien
Differential Revision: D5315254
fbshipit-source-id: a20ba6438c8e6b2ef44b65270c1d3f9ab645ded0
Summary:
Refactor the data_parallel_model all_reduce and broadcast methods to work on a
given parameter set, not only gradients, and reuse them for the distributed BMUF
implementation.
Add a distributed (multiprocessing) test to BMUF.
Reviewed By: akyrola
Differential Revision: D5267083
fbshipit-source-id: 8dcc7527d0a755b903d693d8071585f0b54d3403
Summary:
CPU version of data parallel model. The great thing is that now we can run data_parallel_model_test in Sandcastle (as it does not have GPUs).
Pretty simple change, really. I did not change all variable names with "gpu" in them, to reduce risk (and being a bit lazy). Can improve later.
Reviewed By: wesolwsk
Differential Revision: D5277350
fbshipit-source-id: 682e0c5f9f4ce94a8f5bd089905b0f8268bd2210
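A hedged sketch of driving the CPU path, assuming it is exposed as Parallelize_CPU with builder functions mirroring the existing GPU entry point; the blob names, the simple SGD update, and the devices argument are illustrative assumptions:

```python
from caffe2.python import brew, data_parallel_model, model_helper

model = model_helper.ModelHelper(name="cpu_dpm_example")

def add_input(model):
    # Real trainers would add readers here; a ConstantFill keeps the sketch runnable.
    model.param_init_net.ConstantFill([], "data", shape=[8, 16], value=1.0)

def add_forward(model, loss_scale):
    pred = brew.fc(model, "data", "pred", dim_in=16, dim_out=1)
    loss = model.net.AveragedLoss(pred, "loss")
    return [loss]

def add_param_update(model):
    lr = model.param_init_net.ConstantFill([], "lr", shape=[1], value=-0.01)
    one = model.param_init_net.ConstantFill([], "one", shape=[1], value=1.0)
    for param in model.GetParams():
        grad = model.param_to_grad[param]
        model.net.WeightedSum([param, one, grad, lr], param)

data_parallel_model.Parallelize_CPU(
    model,
    input_builder_fun=add_input,
    forward_pass_builder_fun=add_forward,
    param_update_builder_fun=add_param_update,
    devices=[0, 1],  # "devices" here are just numbered CPU workers
)
```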
Summary: Don't want to assert, since it can sometimes be useful to create models that are never run (for example, in unit tests).
Reviewed By: pietern
Differential Revision: D5258905
fbshipit-source-id: f1beee0605bfef235ed0f23f7e78259109720254
Summary:
Make it easier for users by having ExtractPredictorNet also return the list of blobs that must be saved/exported to run the predictor net. Added a test for ExtractPredictorNet.
Codemod.
Reviewed By: asaadaldien
Differential Revision: D5176097
fbshipit-source-id: b1af42132459487b8d94fcdde0e4c514da608243
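A hedged sketch of the behavior described above; the keyword names and exact return shape of ExtractPredictorNet are assumptions, with the added return value being the list of blobs (e.g. fc_w, fc_b) that must be exported to run the predictor:

```python
from caffe2.python import brew, model_helper

train_model = model_helper.ModelHelper(name="trainer")
fc = brew.fc(train_model, "data", "fc", dim_in=16, dim_out=4)
train_model.net.Softmax(fc, "softmax")

predict_net, export_blobs = model_helper.ExtractPredictorNet(
    train_model.net.Proto(),
    input_blobs=["data"],
    output_blobs=["softmax"],
)
```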
Summary:
This diff is the first step in the effort to refactor all parameters. As a first step, I'm merging the concepts of params and computed_params, which are going
to be based on tags instead (in the first version it still uses the old data structs to store all the BlobReferences).
Renaming computed_params to non-trainable/non-backprop params should be done in some other diff.
Reviewed By: salexspb
Differential Revision: D5171159
fbshipit-source-id: 68031ca779f053fb266a7c4a2e5b482a3bd9c832
Summary:
This diff is the first step in the effort to refactor all parameters. As a
first step, I'm merging the concepts of params and computed_params, which are going
to be based on tags instead (in the first version it still uses the old data
structs to store all the BlobReferences).
Renaming computed_params to non-trainable/non-backprop params should be done in
some other diff.
Reviewed By: salexspb
Differential Revision: D5119830
fbshipit-source-id: 2001090a37346eb12abbb234e13e727c288eb8a7
Summary:
Adds support for generating and training pfp16 models. Added an SGD optimizer for multi-precision trainers and a new callback to data_parallel_model to help multi-precision models keep their different copies of parameters in sync during training.
Closes https://github.com/caffe2/caffe2/pull/697
Differential Revision: D5159712
Pulled By: salexspb
fbshipit-source-id: 60a889494d2e2f4df1d720331e19f638c5eb95cc