mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00


Shen Li b7b6b612a7 Fix C++ data parallel (#20910)	2019-06-06 11:57:31 -07:00

Summary: Fixes #19540

CC nmerrill67

C++ data parallel was using `Module.clone()` to create module replicas on every destination device. However, `clone()` does not set up gradient edges that point from the replicas to the original module. As a result, gradients are not aggregated into the original module. This commit fixes the problem by manually setting gradient edges from every parameter X in every replica to the same parameter X in the original module.

## Failed Attempt

Initially I tried implementing what we did in `replicate.py`, which

1. creates module replicas,
2. uses the Python `Broadcast` autograd function to broadcast every parameter in the original module to all destination devices, and
3. assigns the broadcast result params to the module replicas' `_parameters` dict.

This works in Python because the derived module's member field params (e.g., `Linear.weight`) and the base module's `_parameters` (e.g., `Linear._parameters['weight']`) reference the same parameter instance. Assigning to one of them applies to both.

However, in C++, even though I can modify Module's `parameters_` values and gradient edges to point to the broadcast source, I cannot touch the weight and bias member fields in Linear, because replicate cannot (and should not) add special-case handlers for every different module. (See `Linear` [.h](https://github.com/pytorch/pytorch/blob/master/torch/csrc/api/include/torch/nn/modules/linear.h), [.cpp](https://github.com/pytorch/pytorch/blob/master/torch/csrc/api/src/nn/modules/linear.cpp)) Although the two initially point to the same `TensorImpl` instance, after assigning to `Module.parameters_['weight']`, that entry will differ from `Linear.weight`.

## Solution Options

gchanan and I had several discussions on this issue and came up with two solutions to this problem.

### Option One [implemented in this PR]

Replicate the module in two steps:

1. Call `Module.clone()` to create a module replica on every destination device.
2. Manually set gradient edges from every parameter in every replica to the same parameter in the original module.

* Pro: Does not need to change any existing module, and is relatively easy to implement.
* Con: It is a little hackish.

### Option Two

Implement a `Replicatable` class (similar to `Cloneable`), and make it a friend class of `Module`. For more details see `Note [Replicating Modules]` in the code change.

* Pro: Maybe this aligns more with our existing approach implemented in `Cloneable`?
* Con: Requires changes to every existing module.

I am inclined to go with option one, because `replicate` will only be used for data parallel. I feel it is overkill to change all existing module implementations because of a data parallel requirement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/20910

Differential Revision: D15556426

Pulled By: mrshenli

fbshipit-source-id: aa836290ec657b32742e2bea80bd0ac2404ef3b0
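The handle divergence described under "Failed Attempt" can be illustrated without libtorch. The following is a minimal sketch, assuming a toy model in which `std::shared_ptr` stands in for a tensor handle; `ToyImpl`, `ToyTensor`, `ToyLinear`, and the helper functions are hypothetical names invented for this illustration and are not part of the PyTorch C++ API.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Toy stand-in for a tensor's underlying storage; NOT the real TensorImpl.
struct ToyImpl {
  double value;
};

// Toy stand-in for a tensor handle: copying the handle shares one ToyImpl.
using ToyTensor = std::shared_ptr<ToyImpl>;

// Hypothetical module holding a named registry plus a member field, loosely
// mirroring the parameters_ / Linear.weight pair discussed in the commit.
struct ToyLinear {
  ToyTensor weight;
  std::map<std::string, ToyTensor> parameters;

  ToyLinear() : weight(std::make_shared<ToyImpl>(ToyImpl{1.0})) {
    parameters["weight"] = weight;  // both handles now share one ToyImpl
  }
};

// True if the registry entry and the member field share the same object.
bool shares_impl(const ToyLinear& m) {
  return m.parameters.at("weight") == m.weight;
}

// Rebinds only the registry entry, as a generic replicate step could do
// without knowing about module-specific member fields.
void rebind_registry_entry(ToyLinear& m, ToyTensor replacement) {
  m.parameters["weight"] = std::move(replacement);
}

// Walks through the scenario; the asserts document the expected behaviour.
void demo_divergence() {
  ToyLinear m;
  assert(shares_impl(m));  // initially the same underlying object

  rebind_registry_entry(m, std::make_shared<ToyImpl>(ToyImpl{2.0}));

  assert(!shares_impl(m));           // the handles have diverged
  assert(m.weight->value == 1.0);    // the member field still sees the old object
  assert(m.parameters.at("weight")->value == 2.0);
}
```

In this toy model, reassigning the registry entry rebinds one handle and leaves the member field untouched, which is the reason the commit gives for dropping the `replicate.py`-style approach in C++.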
any.cpp	Fix Windows build and test in CI (#11716 )	2018-11-13 16:35:54 -08:00
CMakeLists.txt	Unify libtorch and libcaffe2 (#17783 )	2019-05-10 09:50:53 -07:00
dataloader.cpp	Refactor ChunkDataReader API + fix missing headers (#19485 )	2019-05-08 22:20:19 -07:00
expanding-array.cpp	Rewrite C++ API tests in gtest (#11953 )	2018-09-21 21:28:16 -07:00
init_baseline.h	Kaiming Initialization (#14718 )	2019-02-15 14:58:22 -08:00
init_baseline.py	Kaiming Initialization (#14718 )	2019-02-15 14:58:22 -08:00
init.cpp	Fix torch::nn::init::orthogonal_ with CNNs (#18915 )	2019-04-09 10:39:15 -07:00
integration.cpp	Move isnan to C++ (#15722 )	2019-01-08 10:42:33 -08:00
jit.cpp	improve error message on inferred type (#21058 )	2019-05-30 10:50:34 -07:00
memory.cpp	Hide c10::optional and nullopt in torch namespace (#12927 )	2018-10-26 00:08:04 -07:00
misc.cpp	Kaiming Initialization (#14718 )	2019-02-15 14:58:22 -08:00
module.cpp	Apply modernize-use-override - 2/2	2019-02-13 21:01:28 -08:00
modules.cpp	Rename BatchNorm running_variance to running_var (#17371 )	2019-02-22 08:00:25 -08:00
optim_baseline.h	Use torch:: instead of at:: in all C++ APIs (#13523 )	2018-11-06 14:32:25 -08:00
optim_baseline.py	Use torch:: instead of at:: in all C++ APIs (#13523 )	2018-11-06 14:32:25 -08:00
optim.cpp	Replace cursors with OrderedDict (#13427 )	2018-11-07 11:10:05 -08:00
ordered_dict.cpp	Replace cursors with OrderedDict (#13427 )	2018-11-07 11:10:05 -08:00
parallel.cpp	Fix C++ data parallel (#20910 )	2019-06-06 11:57:31 -07:00
README.md	Rewrite C++ API tests in gtest (#11953 )	2018-09-21 21:28:16 -07:00
rnn.cpp	Pretty printing of C++ modules (#15326 )	2018-12-19 21:55:49 -08:00
sequential.cpp	Include named_any.h in modules.h (#21437 )	2019-06-06 09:57:33 -07:00
serialize.cpp	Ignore `nn::Functional` submodules in `nn::Module` serialization (#19740 )	2019-04-26 12:47:23 -07:00
static.cpp	Make call operator on module holder call forward (#15831 )	2019-01-14 14:40:33 -08:00
support.h	Use torch:: instead of at:: in all C++ APIs (#13523 )	2018-11-06 14:32:25 -08:00
tensor_cuda.cpp	push magma init into lazyInitCUDA (#18527 )	2019-04-03 12:47:34 -07:00
tensor_options_cuda.cpp	Add ScalarType argument to Type::options() (#19270 )	2019-04-21 21:16:07 -07:00
tensor_options.cpp	Add ScalarType argument to Type::options() (#19270 )	2019-04-21 21:16:07 -07:00
tensor.cpp	Rename _local_scalar to item() (#13676 )	2018-12-04 13:19:26 -08:00
torch_include.cpp	Add get/set_num_interop_threads into torch.h include (#20659 )	2019-05-20 00:34:59 -07:00

README.md

C++ Frontend Tests

In this folder live the tests for PyTorch's C++ Frontend. They use the GoogleTest test framework.

CUDA Tests

To make a test runnable only on platforms with CUDA, you should suffix your test with _CUDA, e.g.

TEST(MyTestSuite, MyTestCase_CUDA) { }

To make it runnable only on platforms with at least two CUDA devices, suffix it with _MultiCUDA instead of _CUDA, e.g.

TEST(MyTestSuite, MyTestCase_MultiCUDA) { }

There is logic in main.cpp that detects the availability and number of CUDA devices and supplies the appropriate negative filters to GoogleTest.
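As a rough illustration of how such negative filters can be assembled, here is a minimal sketch that depends on neither GoogleTest nor libtorch. It assumes the GoogleTest filter syntax, in which everything after the first `-` is a `:`-separated list of excluded patterns. The helper names `add_negative_pattern` and `filter_for_devices` are hypothetical and are not taken from main.cpp.

```cpp
#include <cassert>
#include <string>

// Appends an exclusion pattern to a GoogleTest-style filter string.
// The first exclusion is introduced by '-'; later ones are joined with ':'.
std::string add_negative_pattern(std::string filter, const std::string& pattern) {
  filter += (filter.find('-') == std::string::npos) ? '-' : ':';
  filter += pattern;
  return filter;
}

// Hypothetical helper: picks exclusions from the number of CUDA devices.
// The real test runner would query the device count from the library.
std::string filter_for_devices(const std::string& base_filter, int cuda_device_count) {
  if (cuda_device_count == 0) {
    // No CUDA at all: skip both single-device and multi-device tests.
    return add_negative_pattern(base_filter, "*_CUDA:*_MultiCUDA");
  }
  if (cuda_device_count < 2) {
    // One device: only the multi-device tests must be skipped.
    return add_negative_pattern(base_filter, "*_MultiCUDA");
  }
  return base_filter;  // two or more devices: run everything
}

// Small usage walk-through; the asserts document the expected strings.
void demo_filters() {
  assert(filter_for_devices("*", 0) == "*-*_CUDA:*_MultiCUDA");
  assert(filter_for_devices("*", 1) == "*-*_MultiCUDA");
  assert(filter_for_devices("*", 2) == "*");
}
```

Because a filter supplied by the user may already contain exclusions, the sketch appends with `:` in that case instead of adding a second `-`.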

Integration Tests

Integration tests use the MNIST dataset. You must download it by running the following command from the PyTorch root folder:

$ python tools/download_mnist.py -d test/cpp/api/mnist

The required paths will be referenced as test/cpp/api/mnist/... in the test code, so you must run the integration tests from the PyTorch root folder.