Summary:
Split the Caffe2 memory-based model into two parts:
- Dimension reduction MLP
- DNN with concatenation of memory and obj feature
Currently only a simple mean is implemented.
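A rough structural sketch of the two parts (purely illustrative; blob names and dimensions are hypothetical):
```
from caffe2.python import brew, model_helper

# Hypothetical sketch of the two-part structure; names/dims are made up.
model = model_helper.ModelHelper(name="memory_model_sketch")

# Part 1: dimension-reduction MLP on the memory feature.
mem = brew.fc(model, "memory", "mem_reduced", dim_in=512, dim_out=64)
mem = brew.relu(model, mem, mem)

# Only simple mean pooling of the memory is implemented at this point; after
# pooling, the memory is concatenated with the (assumed 128-dim) obj feature.
concat, _ = model.net.Concat(
    [mem, "obj_feature"], ["concat", "concat_split_info"], axis=1)

# Part 2: DNN over the concatenation of memory and obj feature.
out = brew.fc(model, concat, "out", dim_in=64 + 128, dim_out=1)
```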
Differential Revision: D4866825
fbshipit-source-id: d2f6813402513ec9af30dbe29a50593e2d3cdb3b
Summary:
A recent diff introduced a duplicate parameter to the model, which would hurt performance and also affect correctness (duplicate momentum updates, for example). We unfortunately had no checks for duplicate params outside of data_parallel_model, which fortunately brought this to our attention.
But it is better to have a Validate() function in model_helper and call it before adding gradient ops and querying for parameters. Added it to the brew_test calls as well.
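A minimal sketch of the intended call pattern (model contents hypothetical):
```
from caffe2.python import brew, model_helper

model = model_helper.ModelHelper(name="validate_sketch")
brew.fc(model, "data", "fc1", dim_in=4, dim_out=4)
brew.fc(model, "fc1", "fc2", dim_in=4, dim_out=4)

# Fails loudly if any parameter was registered twice; call it before adding
# gradient ops, where a duplicate param would otherwise get duplicate
# momentum updates.
model.Validate()
```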
Reviewed By: kennyhorror
Differential Revision: D5163458
fbshipit-source-id: 35692e8bfcc359d4e8bc73e6f2358659f6e45ceb
Summary:
This diff is the first step in the effort to refactor all parameters. As a
first step, I'm merging the concepts of params and computed_params, which is
going to be based on tags instead (in the first version it still uses the old
data structs to store all the BlobReferences).
Renaming computed_params to non-trainable/non-backprop params should be done in
some other diff.
Reviewed By: salexspb
Differential Revision: D5119830
fbshipit-source-id: 2001090a37346eb12abbb234e13e727c288eb8a7
Summary:
It's causing problems inside docker containers:
`InvalidArgument: Insufficient bytes of entropy to draw requested array. shape=(5, 9, 10, 5), dtype=float32. Can you reduce the size or dimensions of the array? What about using a smaller dtype? If slow test runs and minimisation are acceptable, you could increase settings().buffer_size from 8192 to at least 18432000.`
Closes https://github.com/caffe2/caffe2/pull/707
Differential Revision: D5162621
Pulled By: Yangqing
fbshipit-source-id: 55544210961cbc80828dca2cbeba6a5ace8cf8d1
Summary:
This warning becomes an error with https://github.com/numpy/numpy/pull/6271 (`>=1.12.0`).
```
caffe2/python/operator_test/tile_op_test.py::TestTile::test_tilewinput
/opt/caffe2/caffe2/python/operator_test/tile_op_test.py:100: VisibleDeprecationWarning: converting an array with ndim > 0 to an index will result in an error in the future
dims[axis] = tiles
/usr/lib/python2.7/dist-packages/numpy/lib/shape_base.py:873: VisibleDeprecationWarning: converting an array with ndim > 0 to an index will result in an error in the future
return c.reshape(shape_out)
```
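The offending pattern puts a 1-element array where numpy expects a plain scalar. A sketch of the class of fix (illustrative values, not the exact diff):
```
import numpy as np

data = np.random.rand(2, 3).astype(np.float32)
dims = [1] * data.ndim
axis = 1
tiles = np.asarray([4])      # hypothesis can hand back a 1-element array

# np.tile converts each repeat count to an index internally; an ndim > 0
# entry triggers the VisibleDeprecationWarning above, so coerce to a scalar.
dims[axis] = tiles.item()
tiled = np.tile(data, dims)  # shape (2, 12)
```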
Closes https://github.com/caffe2/caffe2/pull/710
Differential Revision: D5160776
Pulled By: Yangqing
fbshipit-source-id: b264e0e389de5817a289db878c15e655f9fa2f09
Summary:
Adds support for generating and training pfp16 models. Adds an SGD optimizer for multi-precision trainers and a new callback to data_parallel_model to help multi-precision models keep their different copies of parameters in sync during training.
Closes https://github.com/caffe2/caffe2/pull/697
Differential Revision: D5159712
Pulled By: salexspb
fbshipit-source-id: 60a889494d2e2f4df1d720331e19f638c5eb95cc
Summary:
Currently we can get into broken situations when some nodes complete detectChanges() faster than others, so only some of the nodes start the next iteration of training. This is an inconsistent state. To prevent this from happening, each node now sets a "re-rendezvous flag" that is allreduced after each iteration. Once all nodes agree, the re-rendezvous is done.
Also noticed that min_shards=1 does not work, because data_parallel_model assumed num_shards > 1 whenever rendezvous is not None. Fixed that.
Reviewed By: andrewwdye
Differential Revision: D5156282
fbshipit-source-id: f2ccbd8ad13ed37f7813ff8ad1080d963d0d17e3
Summary: Add information about the offending param when assertion fires.
Reviewed By: kennyhorror
Differential Revision: D5153625
fbshipit-source-id: 9f5a02bf64ccbdef9d93d346f79e589dfe3ec5be
Summary: Fix an issue in the case where a parameter is not created in param_init_net or net, and we fall back to looking at which op outputs the gradient in order to determine the device. This did not work if the gradient was a GradientSlice.
Reviewed By: harouwu
Differential Revision: D5153102
fbshipit-source-id: 20eae660ea32e5a9ea484bf93c04c8f8c71a51ed
Summary: If ConstantFill (or another fill op) is used in CUDAContext with input_as_shape, the code crashes: it expects the shape to be in CUDAContext but accesses the array in host code. We could fix this by copying the values from the CUDA tensor, but it is probably best to enforce that the shape input is in CPU context. This is what this diff does.
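A rough sketch of the enforced usage, assuming a CUDA build (details illustrative):
```
import numpy as np
from caffe2.proto import caffe2_pb2
from caffe2.python import core, workspace

# The shape tensor must live in CPU context even though the fill runs on GPU.
with core.DeviceScope(core.DeviceOption(caffe2_pb2.CPU)):
    workspace.FeedBlob("shape", np.array([2, 3], dtype=np.int64))

with core.DeviceScope(core.DeviceOption(caffe2_pb2.CUDA, 0)):
    workspace.RunOperatorOnce(core.CreateOperator(
        "ConstantFill", ["shape"], ["out"],
        input_as_shape=1, value=1.0))
```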
Differential Revision: D5152766
fbshipit-source-id: 0629a189bd1d800c0b7c9dbc324b78d279efac0b
Summary:
Bug repro is in a test. In general, gradient accumulation was not happening
when len(ys) >= 2 (ys is the list of blobs we compute gradients from) and some
blob in the net was both in the ys list and also received a gradient
propagated from another element of ys.
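A minimal sketch of the triggering pattern (hypothetical net, not the actual repro test):
```
from caffe2.python import core

net = core.Net("accum_sketch")
x = net.AddExternalInput("x")
y1 = net.Mul([x, x], "y1")
y2 = net.Add([y1, x], "y2")   # y1 also feeds another output

# y1 is in ys *and* receives a gradient propagated from y2; before this fix
# the two contributions were not accumulated.
grad_map = net.AddGradientOperators([y1, y2])
```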
Reviewed By: akyrola
Differential Revision: D5121695
fbshipit-source-id: 282d88f2f4f6e27dadae311964f40246a2739130
Summary: This requirement looks a bit too restrictive. Let's remove it.
Reviewed By: volkhin
Differential Revision: D5150968
fbshipit-source-id: 9e38574edc6542c5ce3c7f25a01afe8f5ff9b507
Summary:
Fixes some performance issues when `broadcast_computed_params=True` is passed to Parallelize_GPU. Enabled via the same `use_nccl` flag as AllReduce
Closes https://github.com/caffe2/caffe2/pull/630
Differential Revision: D5149828
Pulled By: akyrola
fbshipit-source-id: 12c9714c7fa078811f1cde61c8523dca8f7f968f
Summary: These return views in Python 3, which would not do anything in a lot of the usages currently present in Caffe2. This diff simply removes (almost) all usages of these two in Caffe2 and subprojects in favor of comprehensions, which are also easier to read and understand.
Reviewed By: akyrola
Differential Revision: D5142049
fbshipit-source-id: e800631d2df7d0823fed698cae46c486038007dc
Summary:
Looking at one segfault at exit (https://our.intern.facebook.com/intern/chronos/jobinstance/?jobinstanceid=911625597&smc=chronos_gp_admin_client&log_type=stderr&offset=0&pretty_logs=false) and its coredump, the only thing I can see is that a FreeBlob() operator is called concurrently while a cudaMemcpyAsync (on thread 1) is crashing. FreeBlobOp is only called in data_workers _stop() (via utils.ResetBlobs()), and the only code that could run a cudaMemcpyAsync at that time is the fetcher thread of data_workers that is enqueuing blobs.
Here are the stacks: P57455299
This is clearly a bug since we should only clear the scratch blobs after all threads are terminated, which happens at wait_for_finish().
I am not 100% sure this fixes all the segfaults, but at least this one was most likely caused by this.
Reviewed By: andrewwdye
Differential Revision: D5146278
fbshipit-source-id: ae00796706bfc4fee6823caf6529b62ab20c1cd3
Summary:
This diff does two things:
- add support for optimizers to data_parallel_model. The user can supply optimizer_builder_fun instead of param_update_builder_fun. The latter is called for each GPU separately with the proper namescope and devicescope, while the optimizer builder is called only once and adds optimizers to the whole model (see the sketch below).
- use MomentumSGDUpdate instead of MomentumSGD + WeightedSum. This brings major perf benefits.
Changes resnet50 trainer to use optimizer.
This relies on D5133652
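A rough usage sketch of the new path, assuming a 2-GPU CUDA build (builder bodies are hypothetical and abbreviated):
```
from caffe2.python import brew, data_parallel_model, model_helper, optimizer

model = model_helper.ModelHelper(name="dpm_optimizer_sketch")

def add_input(model):
    # hypothetical reader; a real trainer would add a DBReader/ImageInput here
    pass

def create_net(model, loss_scale):
    fc = brew.fc(model, "data", "fc", dim_in=16, dim_out=1)
    loss = model.net.SquaredL2Distance([fc, "label"], "loss")
    return [loss]

def add_optimizer(model):
    # Called once for the whole model, not once per GPU like
    # param_update_builder_fun; momentum now goes through MomentumSGDUpdate.
    optimizer.build_sgd(model, base_learning_rate=0.1, momentum=0.9,
                        policy="fixed")

data_parallel_model.Parallelize_GPU(
    model,
    input_builder_fun=add_input,
    forward_pass_builder_fun=create_net,
    optimizer_builder_fun=add_optimizer,
    devices=[0, 1],
)
```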
Reviewed By: dzhulgakov
Differential Revision: D5142973
fbshipit-source-id: 98e1114f5fae6c657314b3296841ae2dad0dc0e2
Summary:
I'll let y'all decide how you want to fix this (probably need a persistent curand buffer). Here's a test to verify the fix.
Closes https://github.com/caffe2/caffe2/pull/495
Differential Revision: D5148815
Pulled By: akyrola
fbshipit-source-id: e80dabe65230ddd32340f2d872cd8786ac960bf8
Summary:
hankun is using the optimizer but has a mixed set of GPU and CPU operators. Currently this won't work with the optimizer, since it adds optimizers for all parameters in the current device scope. But we can actually infer the device a param belongs to by looking at the device option in the param_init_net.
Added a test as well.
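An illustrative sketch of the mixed-device case (names and dims hypothetical):
```
from caffe2.proto import caffe2_pb2
from caffe2.python import brew, core, model_helper, optimizer

model = model_helper.ModelHelper(name="mixed_device_sketch")

with core.DeviceScope(core.DeviceOption(caffe2_pb2.CUDA, 0)):
    fc_gpu = brew.fc(model, "data_gpu", "fc_gpu", dim_in=8, dim_out=1)
    loss_gpu = model.net.SquaredL2Distance([fc_gpu, "label_gpu"], "loss_gpu")

with core.DeviceScope(core.DeviceOption(caffe2_pb2.CPU)):
    fc_cpu = brew.fc(model, "data_cpu", "fc_cpu", dim_in=8, dim_out=1)
    loss_cpu = model.net.SquaredL2Distance([fc_cpu, "label_cpu"], "loss_cpu")

model.AddGradientOperators([loss_gpu, loss_cpu])

# The optimizer now infers each param's device from param_init_net rather than
# from the current device scope, so one build call covers both sets of params.
optimizer.build_sgd(model, base_learning_rate=0.1)
```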
Reviewed By: salexspb
Differential Revision: D5133652
fbshipit-source-id: ad8689d75ac1f5c78981bae1b6978fe91e40ef0f
Summary:
See discussion at https://github.com/caffe2/caffe2/pull/633#issuecomment-303536902
Tested with a TitanX (Pascal) and a TitanZ (Kepler) with this access pattern.
```
Checking GPU(s) for support of peer to peer memory access...
> Peer access from TITAN X (Pascal) (GPU0) -> GeForce GTX TITAN Z (GPU1) : No
> Peer access from TITAN X (Pascal) (GPU0) -> GeForce GTX TITAN Z (GPU2) : No
> Peer access from GeForce GTX TITAN Z (GPU1) -> TITAN X (Pascal) (GPU0) : No
> Peer access from GeForce GTX TITAN Z (GPU1) -> GeForce GTX TITAN Z (GPU2) : Yes
> Peer access from GeForce GTX TITAN Z (GPU2) -> TITAN X (Pascal) (GPU0) : No
> Peer access from GeForce GTX TITAN Z (GPU2) -> GeForce GTX TITAN Z (GPU1) : Yes
```
All combinations pass:
* `0,1`
* `0,2`
* `1,2`
* `0,1,2`
Closes https://github.com/caffe2/caffe2/pull/659
Differential Revision: D5148779
Pulled By: akyrola
fbshipit-source-id: 6263edfe8b36623983f1946b5c3f4a3fef415a45
Summary:
Failure mode:
```
- 7 passing examples, 0 failing examples, 0 invalid examples
- Typical runtimes: 12-14987 ms
- Stopped because settings.timeout=60
```
After this change:
```
- 5 passing examples, 0 failing examples, 0 invalid examples
- Typical runtimes: 12-15475 ms
- Stopped because settings.max_examples=5
```
Obviously, the `DYNAMIC_PROGRAMMING` tests are the troublemakers. An alternate solution would be to make separate tests for the two assignment algorithms (one fast, one slow).
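For reference, the knob involved is hypothesis's `max_examples` setting; a toy sketch of the change (illustrative, not the exact diff):
```
from hypothesis import given, settings
import hypothesis.strategies as st

# Capping the example count keeps slow cases (e.g. the DYNAMIC_PROGRAMMING
# ones) from hitting the suite-wide timeout.
@given(n=st.integers(min_value=1, max_value=8))
@settings(max_examples=5)
def test_capped_examples(n):
    assert 1 <= n <= 8
```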
Closes https://github.com/caffe2/caffe2/pull/676
Differential Revision: D5147363
Pulled By: akyrola
fbshipit-source-id: 85d9f8198e53c10de2a8d6645e2b0eb7953c96e0
Summary: This diff is one step towards enabling the Python 3 build by making it more diligent in its handling of strings.
Reviewed By: salexspb
Differential Revision: D4893083
fbshipit-source-id: 28b8adf3280e8d1f0a7dc9b0fee5ad53f2fada57
Summary: Refactored SoftmaxWithLoss by removing the code for spatial=1 mode and created a new op SpatialSoftmaxWithLoss that has the spatial mode implemented.
Reviewed By: viswanathgs
Differential Revision: D5104120
fbshipit-source-id: 8ab999e32c916b2a39a670a7b2a3365401535f24
Summary:
Makes the benchmark a bit hacky, but it's a benchmark after all :)
Specifically, ports the functionality of a proper BenchmarkNet run from ads_benchmarks so that we can see training net perf.
Also adds a --report_interval parameter to print stats more often when running in hogwild mode.
kdub0 - hopefully, if you have time, you can integrate it properly with Flow's workflow
harouwu - shouldn't conflict too much with your current diff
Reviewed By: rayleichen
Differential Revision: D5125183
fbshipit-source-id: 9c6f1663bc85e26d6609f0f2f23aa280731939db
Summary:
To make the optimizer for sparse gradients work with CUDA, we need UnsortedSegmentSum and Mean implemented for CUDA. Unique was already implemented by harouwu.
Pretty straightforward implementations, should be fast enough -- and I don't know a faster way anyway.
Added some tests as well.
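For reference, the op semantics these CUDA kernels implement (CPU shown; shapes illustrative):
```
import numpy as np
from caffe2.python import core, workspace

workspace.FeedBlob(
    "data", np.array([[1., 2.], [3., 4.], [5., 6.]], dtype=np.float32))
workspace.FeedBlob("segment_ids", np.array([0, 0, 1], dtype=np.int32))
workspace.RunOperatorOnce(core.CreateOperator(
    "UnsortedSegmentSum", ["data", "segment_ids"], ["sums"]))
print(workspace.FetchBlob("sums"))   # [[4. 6.] [5. 6.]]
```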
Reviewed By: asaadaldien
Differential Revision: D5124548
fbshipit-source-id: 63ae72f45fc2f07470603f7b2de12f34635dbb3d
Summary:
This is going to unblock Nvidia in their work on adding fp16
support to Caffe2. I discussed this with kennyhorror before to make
sure this fits into his work on parameter sharing.
Reviewed By: kennyhorror
Differential Revision: D5127797
fbshipit-source-id: 4db155d320b1862570c23b77c4252bdacbf2296f
Summary:
If there are 2 SparseToDense layers densifying the same IdList feature, we
might end up exporting invalid input for the prediction in the input specs.
This diff changes the behavior to use an Alias to a new blob instead of
passing things through directly.
Reviewed By: dzhulgakov
Differential Revision: D5093754
fbshipit-source-id: ef4fa4ac3722331d6e72716bd0c6363b3a629cf7
Summary: Currently, using two-tower models with cosine distance results in bad calibration. Adding a bias to the output of the cosine term solves the problem.
Reviewed By: xianjiec
Differential Revision: D5132606
fbshipit-source-id: eb4fa75acf908db89954eeee67627b4a00572f61
Summary: A memory leak happens when new BlobReferences are constantly added to the _scratch_blobs set.
Reviewed By: panshen1
Differential Revision: D5134945
fbshipit-source-id: 3ce4d482153bb89de065f20cd91411178085caad
Summary:
This was hardcoded to 4 before but should be made
configurable. It can be kept low for big MLPs and set higher for convnets.
Reviewed By: akyrola
Differential Revision: D5126138
fbshipit-source-id: 713ee8bbeb243b7de1479808fd6398d397e0b49a
Summary:
Implement SizeOp, which returns the number of elements in the input
tensor.
The output is a 1D tensor that contains the number of elements.
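A quick usage sketch:
```
import numpy as np
from caffe2.python import core, workspace

workspace.FeedBlob("X", np.zeros((2, 3, 4), dtype=np.float32))
workspace.RunOperatorOnce(core.CreateOperator("Size", ["X"], ["num_elements"]))
print(workspace.FetchBlob("num_elements"))   # [24]
```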
Reviewed By: akyrola
Differential Revision: D5101061
fbshipit-source-id: d1c56053b6f3b41c65ac574dd748482775d1ea0d
Summary: In some cases (for example, when include_tags option is used) output_schema contains blobs that aren't produced by the generated net. In this case we want to filter them from output_schema as well.
Differential Revision: D5120115
fbshipit-source-id: f98ea3f747589390b039d1e1987becec3980634c
Summary:
D5116828 changed how in-place ops were handled in memonger and fixed a crash in NeuralMT. However, it still produced an incorrect memongerization: an op with one in-place input-output but another non-in-place output would still be handled incorrectly, as the other output's branch would not be followed properly.
This is fixed by removing the whole in-place op special handling. It is not actually needed anymore; it was left over from an older version of memonger that used a topological sort of the ops.
Reviewed By: asaadaldien
Differential Revision: D5128142
fbshipit-source-id: b551b0faebdde410e6bd7516958c63cf610cc065
Summary: When two or more blobs are gathered by the same indices blob in a data parallel model, we used to concatenate multiple times and re-write to the same indices blob. This leads to illegal memory access at times because the gradientslice indices blob is longer than its corresponding gradientslice values blob. This diff adds a check in order to avoid this.
Reviewed By: akyrola
Differential Revision: D5116817
fbshipit-source-id: 1c086d092eb6d48926d600f9408f578f5ddc41c7
Summary: The gradient test for the tile op was flaky because I had made the dimensions too large. This caused push-blocking errors. Also, I noticed my test_grad_tile was incorrect.
Reviewed By: asaadaldien
Differential Revision: D5126476
fbshipit-source-id: ae9ce5d9041648d7a4535fc88d4013e669bd6f02