pytorch/test/cpp/api
Shen Li b7b6b612a7 Fix C++ data parallel (#20910)
Summary:
Fixes #19540

CC nmerrill67

C++ data parallel was using Module.clone() to create module replicas on every destination device. However, clone() does not set up gradient edges to point from replicas to the original module. As a result, the gradient will not be aggregated into the original module. This commit fixes the the problem by manually setting gradient edges from every parameter X in every replica to the same parameter X in the original module.

## Failed Attempt

Initially I tried implementing what we did in `replicate.py`, which
1. create module replicas
2. use Python `Broadcast` autograd function to broadcast every parameter in the original module to all destination devices.
3. assign the broadcast result params to module replicas' `_parameters` dict.

This works in Python because derived module member field params (e.g., `Linear.weight`) and base module `_parameters` (e.g., `Linear._parameters['weight']`) are referencing the same parameter instance. Assigning one of them will apply to both. However, in C++, even though I can modify Module's `parameters_ `values and gradient edges to point to the broadcast source,  I cannot touch the weight and bias member fields in Linear, because replicate cannot (and should not) add special-case handlers to every different module. (See `Linear` [.h](https://github.com/pytorch/pytorch/blob/master/torch/csrc/api/include/torch/nn/modules/linear.h), [.cpp](https://github.com/pytorch/pytorch/blob/master/torch/csrc/api/src/nn/modules/linear.cpp)) Although they initially point to the same `TensorImpl` instance, after assigning to `Module.parameters_['weight']`, it will be different from `Linear.weight`.

## Solution Options

gchanan and I had several discussions on this issue and figured two solutions to this problem.

### Option One [implemented in this PR]

Replicate the module in two steps:
1. call `Module.clone()` to create a module replica on every destination device.
2. manually setting gradient edges from every parameter in every replica to the same parameter in the original module.

* Pro: Does not need to change any existing module, and relatively easier to implement
* Con: It is a little hackish.

### Options Two

Implement a `Replicatable` class (similar to `Cloneable`), and make it a friend class of `Module`. For more details see `Note [Replicating Modules]` in the code change.

* Pro: Maybe this aligns more with our existing approach implemented in `Cloneable`?
* Con: Require changes to every existing module.

I am inclined to go with option one, because `replicate` will only be used on data parallel. I feel it is too big an overkill if we have to change all existing module implementations due to a data parallel requirement.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20910

Differential Revision: D15556426

Pulled By: mrshenli

fbshipit-source-id: aa836290ec657b32742e2bea80bd0ac2404ef3b0
2019-06-06 11:57:31 -07:00
..
any.cpp Fix Windows build and test in CI (#11716) 2018-11-13 16:35:54 -08:00
CMakeLists.txt Unify libtorch and libcaffe2 (#17783) 2019-05-10 09:50:53 -07:00
dataloader.cpp Refactor ChunkDataReader API + fix missing headers (#19485) 2019-05-08 22:20:19 -07:00
expanding-array.cpp Rewrite C++ API tests in gtest (#11953) 2018-09-21 21:28:16 -07:00
init_baseline.h Kaiming Initialization (#14718) 2019-02-15 14:58:22 -08:00
init_baseline.py Kaiming Initialization (#14718) 2019-02-15 14:58:22 -08:00
init.cpp Fix torch::nn::init::orthogonal_ with CNNs (#18915) 2019-04-09 10:39:15 -07:00
integration.cpp Move isnan to C++ (#15722) 2019-01-08 10:42:33 -08:00
jit.cpp improve error message on inferred type (#21058) 2019-05-30 10:50:34 -07:00
memory.cpp Hide c10::optional and nullopt in torch namespace (#12927) 2018-10-26 00:08:04 -07:00
misc.cpp Kaiming Initialization (#14718) 2019-02-15 14:58:22 -08:00
module.cpp Apply modernize-use-override - 2/2 2019-02-13 21:01:28 -08:00
modules.cpp Rename BatchNorm running_variance to running_var (#17371) 2019-02-22 08:00:25 -08:00
optim_baseline.h Use torch:: instead of at:: in all C++ APIs (#13523) 2018-11-06 14:32:25 -08:00
optim_baseline.py Use torch:: instead of at:: in all C++ APIs (#13523) 2018-11-06 14:32:25 -08:00
optim.cpp Replace cursors with OrderedDict (#13427) 2018-11-07 11:10:05 -08:00
ordered_dict.cpp Replace cursors with OrderedDict (#13427) 2018-11-07 11:10:05 -08:00
parallel.cpp Fix C++ data parallel (#20910) 2019-06-06 11:57:31 -07:00
README.md Rewrite C++ API tests in gtest (#11953) 2018-09-21 21:28:16 -07:00
rnn.cpp Pretty printing of C++ modules (#15326) 2018-12-19 21:55:49 -08:00
sequential.cpp Include named_any.h in modules.h (#21437) 2019-06-06 09:57:33 -07:00
serialize.cpp Ignore nn::Functional submodules in nn::Module serialization (#19740) 2019-04-26 12:47:23 -07:00
static.cpp Make call operator on module holder call forward (#15831) 2019-01-14 14:40:33 -08:00
support.h Use torch:: instead of at:: in all C++ APIs (#13523) 2018-11-06 14:32:25 -08:00
tensor_cuda.cpp push magma init into lazyInitCUDA (#18527) 2019-04-03 12:47:34 -07:00
tensor_options_cuda.cpp Add ScalarType argument to Type::options() (#19270) 2019-04-21 21:16:07 -07:00
tensor_options.cpp Add ScalarType argument to Type::options() (#19270) 2019-04-21 21:16:07 -07:00
tensor.cpp Rename _local_scalar to item() (#13676) 2018-12-04 13:19:26 -08:00
torch_include.cpp Add get/set_num_interop_threads into torch.h include (#20659) 2019-05-20 00:34:59 -07:00

C++ Frontend Tests

In this folder live the tests for PyTorch's C++ Frontend. They use the GoogleTest test framework.

CUDA Tests

To make a test runnable only on platforms with CUDA, you should suffix your test with _CUDA, e.g.

TEST(MyTestSuite, MyTestCase_CUDA) { }

To make it runnable only on platforms with at least two CUDA machines, suffix it with _MultiCUDA instead of _CUDA, e.g.

TEST(MyTestSuite, MyTestCase_MultiCUDA) { }

There is logic in main.cpp that detects the availability and number of CUDA devices and supplies the appropriate negative filters to GoogleTest.

Integration Tests

Integration tests use the MNIST dataset. You must download it by running the following command from the PyTorch root folder:

$ python tools/download_mnist.py -d test/cpp/api/mnist

The required paths will be referenced as test/cpp/api/mnist/... in the test code, so you must run the integration tests from the PyTorch root folder.