Commit Graph

28 Commits

Baichuan Yuan
dca97b4394 Weighted decay with frequency (count-based) (#60382)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60382

Instead of setting the weight decay `w` uniformly for all ids, for each row `i` of the sparse embedding table the effective weight decay `w_i` becomes `w * freq_i`, where `freq_i = halflife / counter_i` lies in `[log(2), halflife]`. The counter comes from `rowwise_counter`, defined as `counter_i = 1 + exp(-iter_delta * rho) * counter_i`.
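
A minimal NumPy sketch of the per-row scaling described above (helper names are mine, not the operator's API; the actual logic lives in the caffe2 row-wise adagrad / rowwise_counter operators):

```python
import numpy as np

def update_counter(counter, iter_delta, rho):
    # counter_i = 1 + exp(-iter_delta * rho) * counter_i
    return 1.0 + np.exp(-iter_delta * rho) * counter

def per_row_weight_decay(w, counter, halflife):
    # Effective decay for row i: w_i = w * freq_i, with freq_i = halflife / counter_i.
    freq = halflife / counter
    return w * freq
```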

Test Plan:
buck test //caffe2/caffe2/python/operator_test:adagrad_test -- test_row_wise_sparse_adagrad

buck test caffe2/caffe2/fb/dper/layer_models/tests/split_1:sparse_nn_test_weight_decay

Reviewed By: 0x10cxR1

Differential Revision: D25581030

fbshipit-source-id: 54b3831b20516c76c559b13d8deb809e2ee3b446
2021-06-21 18:46:35 -07:00
Natalia Gimelshein
db5e5781ad replace all remaining occurrences of deadline=1000 to prevent test flakiness
Summary: Per title

Test Plan: Fixes existing tests

Reviewed By: robieta

Differential Revision: D28690296

fbshipit-source-id: d7b5b5065517373b75d501872814c89b24ec8cfc
2021-05-25 15:55:30 -07:00
Natalia Gimelshein
056287aec4 turn off deadline for adagrad test
Summary: Tests are frequently failing with "exceeded the deadline of 1000.00ms"; we expect this to happen, so remove the deadline
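
For reference, a minimal sketch of disabling a hypothesis deadline (plain hypothesis API; the test name and strategy below are made up):

```python
import hypothesis.strategies as st
from hypothesis import given, settings

@settings(deadline=None)  # disable the per-example time limit that was being exceeded
@given(n=st.integers(min_value=1, max_value=1000))
def test_adagrad_like_update(n):
    assert n > 0
```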

Test Plan: N/A: Fix breakages

Reviewed By: robieta

Differential Revision: D28581051

fbshipit-source-id: 4825ada9af151fa5d57c45c549138c15ba613705
2021-05-20 13:47:02 -07:00
Bugra Akyildiz
27c7158166 Remove __future__ imports for legacy Python2 supports (#45033)
Summary:
There is a tool called `2to3` whose `future` fixer can be targeted specifically to remove these; the `caffe2` directory has the most of these redundant imports:

```2to3 -f future -w caffe2```
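
For context, this is the kind of Python 2 compatibility header the fixer strips from each file (a typical example, not an exhaustive list):

```python
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
```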

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45033

Reviewed By: seemethere

Differential Revision: D23808648

Pulled By: bugra

fbshipit-source-id: 38971900f0fe43ab44a9168e57f2307580d36a38
2020-09-23 17:57:02 -07:00
Christopher Whelan
5cd0f5e8ec [PyFI] Update hypothesis and switch from tp2 (#41645)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41645

Pull Request resolved: https://github.com/facebookresearch/pytext/pull/1405

Test Plan: buck test

Reviewed By: thatch

Differential Revision: D20323893

fbshipit-source-id: 54665d589568c4198e96a27f0ed8e5b41df7b86b
2020-08-08 12:13:04 -07:00
Taiqing Wang
8cb1f2f9dc implement L2 regularization for Adagrad in caffe2 and dper (#37705)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37705

Pull Request resolved: https://github.com/pytorch/pytorch/pull/37372

Posted note: [Regularizing SparseNN Against Over-fitting](https://fb.workplace.com/notes/taiqing-wang/regularizing-sparsenn-against-over-fitting/220306075902708/)

**Problem formulation**

L(w) = J(w) + lambda/2 * ||w||^2
J(w) is the empirical loss, and ||w||^2 is the squared L2 norm of the parameters, a.k.a. the L2 regularizer.

dL(w)/dw_i = dJ(w)/dw_i + lambda * w_i
dL(w)/dw_i is the gradient of L(w) w.r.t. w_i.

To implement the L2 regularizer, the gradient of J(w) w.r.t. w_i is augmented with lambda * w_i; lambda is referred to as weight decay in this implementation.
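
A minimal NumPy sketch of a dense Adagrad step with this weight decay applied (illustrative only, not the caffe2 operator itself):

```python
import numpy as np

def adagrad_step_with_weight_decay(w, grad, h, lr, weight_decay=0.0, epsilon=1e-8):
    # Gradient of J(w) augmented with lambda * w (lambda == weight_decay).
    g = grad + weight_decay * w
    h = h + g * g                       # accumulate squared gradients
    w = w - lr * g / (np.sqrt(h) + epsilon)
    return w, h
```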

**Code changes**
* In the initialization method of AdagradOptimizer, a new input argument, weight_decay, is added.
* In the _run function of AdagradOptimizer, the weight decay will be skipped for 1d bias vectors.
* In the parameter update functions of Adagrad, the gradient is updated by weight_decay * w_i. The default value for weight_decay is zero.

Test Plan:
`
buck build caffe2/caffe2/fb/dper/layer_models/tests/split_1:sparse_nn_test_weight_decay
`

`
./buck-out/gen/caffe2/caffe2/fb/dper/layer_models/tests/split_1/sparse_nn_test_weight_decay#binary.par
`

Reviewed By: jspark1105

Differential Revision: D21258652

fbshipit-source-id: d2366ddcd736a03205a2d16f914703b16d9fce8f
2020-05-03 10:42:49 -07:00
Jongsoo Park
c7448aa13c remove unused parameters in optimizer tests (#18084)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18084

The data_strategy parameter was not used in some of the unit tests for optimizers

Reviewed By: hyuen

Differential Revision: D14487830

fbshipit-source-id: d757cd06aa2965f4c0570a4a18ba090b98820ef4
2019-03-15 18:06:15 -07:00
Jongsoo Park
8bd9465b79 make momentum non negative in adagrad test (#18009)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18009

momentum should be initialized with non-negative values

Reviewed By: hyuen

Differential Revision: D14450841

fbshipit-source-id: 5bbbd11645db9e6f2dc42b26a00ff3caf378c59f
2019-03-14 03:15:07 -07:00
rohithkrn
8c72217817 Enable boolean_mask, adadelta, adagrad fp16 on ROCm (#17235)
Summary:
- Fix bugs and indentation in the adadelta and adagrad tests to enable fp16
- Enable boolean_mask fp16 on ROCm
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17235

Differential Revision: D14240828

Pulled By: bddppq

fbshipit-source-id: ab6e8f38aa7afb83b4b879f2f4cf2277c643198f
2019-02-27 10:07:36 -08:00
Jongsoo Park
cff509e2b1 share code between adagrad and rowwise adagrad tests (#14692)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14692

Remove some code duplication

Reviewed By: chocjy

Differential Revision: D13296731

fbshipit-source-id: 5924e037ca64fc4b89234be922bc5ca47fb8bd32
2018-12-10 22:10:39 -08:00
Will Feng
4c06f1f2bb CircleCI: enable all flaky tests (#13356)
Summary:
A few Caffe2 tests are currently disabled in the `py2-gcc4.8-ubuntu14.04` test job because they are known to be flaky. https://github.com/pytorch/pytorch/pull/13055 likely fixed the flakiness, and this PR tests that.

Fixes https://github.com/pytorch/pytorch/issues/12395.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13356

Differential Revision: D12858206

Pulled By: yf225

fbshipit-source-id: 491c9c4a5c48ac1b791fdc9d78acf66091e80457
2018-10-31 09:34:49 -07:00
Will Feng
cdead5ace1 Enable CircleCI for Linux jobs (#12389)
Summary:
Changes in this PR:
1. The intermediate Docker image is shared from the build stage to the test stage through ECR, in order to fix the flaky Caffe2 CUDA tests.
2. There are ~7 Caffe2 operator tests that are only flaky in `caffe2_py2_gcc4_8_ubuntu14_04_test` on CPU. Disabling those tests on that config only, which is okay to do because we are still running those tests in other test jobs.

After this PR is merged, CircleCI will be running on master automatically, and will be running on PRs if the author rebased their PR onto the newest master (which we will ask all the authors to do when we switch off Jenkins for Linux).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12389

Differential Revision: D10224267

Pulled By: yf225

fbshipit-source-id: dd1a90a425c3d13b870d3d328cb301eee2e6e2cd
2018-10-08 17:09:37 -07:00
Ansha Yu
98aebed88e Refactor tests part 1 (#11350)
Summary:
Followup to [the serialized test framework](https://github.com/pytorch/pytorch/pull/10594)

Round 1 for refactoring tests, starting alphabetically. I added some functionality, so I wanted to send out some of these initial changes sooner.

I'm skipping all tests that don't explicitly call assertReferenceChecks. Some tests directly call np.allclose, and others are simply TestCase (rather than HypothesisTestCase).

1. Start alphabetically producing serialized outputs for test functions, annotating those we want to include with `serialized_test_util.given`. So far I've only added one test per operator, but this already does seem to add quite a few tests.
2. Add functionality to allow us to generate outputs using pytest by adding pytest argument options. This allows us to skip adding a `__main__` function to quite a few tests.
3. Catch any exceptions generating the gradient operator and skip serializing/reading it, since certain operators don't have gradients.
4. Add functionality to better handle jagged array inputs, which numpy doesn't handle very well. We simply explicitly do the conversion to dtype=object.
5. Make only one file per test function, rather than 4, to reduce the number of files in the github repo.

I also noticed that there is some hypothesis handling that makes `serialized_test_util.given` not compatible with adding more hypothesis decorators on top. For example, there are tests that do
```
@settings(...)
@given(...)
def test_my_stuff(...):
    ...
```
But there is a hypothesis handler that explicitly checks that `given` is called below `settings`, so we cannot refactor this to `serialized_test_util.given`. I've just avoided decorating these kinds of tests for now, I hope that's alright.
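
For reference, a runnable sketch of the decorator order that hypothesis expects with the plain decorators (swapping `serialized_test_util.given` in on top of `settings` is exactly what the handler rejects):

```python
import hypothesis.strategies as st
from hypothesis import given, settings

@settings(max_examples=5)   # settings applied above given, the order the handler expects
@given(x=st.integers())
def test_my_stuff(x):
    assert isinstance(x, int)
```
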
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11350

Reviewed By: houseroad

Differential Revision: D9693857

Pulled By: ajyu

fbshipit-source-id: a9b4279afbe51c90cf2025c5ac6b2db2111f4af7
2018-09-18 10:42:10 -07:00
Will Feng
c9e66351a7 Port all PyTorch and Caffe2 jobs to CircleCI (#11264)
Summary:
This PR adds all PyTorch and Caffe2 job configs to CircleCI.

Steps for the CircleCI mini-trial:
- [ ] Make sure this PR passes Jenkins CI and fbcode internal tests
- [x] Approve this PR
- [ ] Ask CircleCI to turn up the number of build machines
- [ ] Land this PR so that the new `.circleci/config.yml` will take effect

Several Caffe2 tests are flaky on CircleCI machines and hence skipped when running on CircleCI. A proper fix for them will be worked on after a successful mini-trial.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11264

Differential Revision: D9656793

Pulled By: yf225

fbshipit-source-id: 7832e90018f3dff7651489c04a179d6742168fe1
2018-09-05 16:28:11 -07:00
Bram Wasti
82b981e4db Update from facebook 1ee4edd286a3 (#8040)
* Adding instance weight to batch distill loss

as title

* add bfloat 16-31

added bfloat 16-31 and their respective unit tests

* [CUDA9] Upgrade - fbcode

CUDA9 upgrade diff D5654023 has been out for a while thanks to Pieter. But as time goes on it's becoming quite hard to rebase, because of the symlinks and auto-generated build/config files in tp2. Break D5654023 into two diffs, one touching tp2 config files, and another one touching fbcode TARGETS file (adding nvcc flag). These two should be a bit easier to rebase (for detailed procedure see "Test Plan").

This diff can only be committed if:
1. CUDA 9 rpm is rolled out fleet-wide (TBD)
2. NVidia driver 390.40 is rolled out fleet-wide (done)
3. Upgrade CUDA 9.1, cudnn 7.1, nccl 2.1 (done)
4. Make sure all dependents are built (done)
5. Test all C2 operators, PyTorch (see test plan)

* Share intermediate int32 buffer across Conv ops

Adding a known type

* [C2 fix] infer function for ensure_cpu_output_op

this adds the missing device function for ensure_cpu_output_op

* [int8] Add blob serializer/deserializer for Int8TensorCPU

To export to logfiledb

* [nomnigraph] Add try catch block to optimization passes in predictor

This will catch failures that happen in the optimization pass.

* Caffe2: avoid static initialization order fiasco for CAFFE_ENFORCE

CAFFE_ENFORCE uses a stack trace fetcher, which is currently a
global static variable. If CAFFE_ENFORCE is used at static initialization
time, this is a SIOF (static initialization order fiasco). Recently CAFFE_ENFORCE
was added into init function registration, so we started to see this.

A Meyers singleton is going to provide safety here. If the stack trace
fetcher was not registered yet, it will just use a dummy one.

* NUMA support in SparseNN CPU benchmark

Adding support for NUMA in SparseNN CPU benchmark

* [mobile-roofline] Add logging needed for roofline model

This should be all that's needed

* Let the operators use the same input if the operators are not chained

Otherwise, we would have to change the input data dims.

* fix null-pointer-use UBSAN errors in in reshape_op.h

* revert previous fix on input blob name

as title

* Adding flag to let MineHardNegative automatically extract single value from dict

The model exporter requires the output of the model to be a struct. This makes it convenient to use those models directly in MineHardNegative by allowing automatic extraction of the single element of the dict, which is a common use case.

* Reverting change that broke internal tests back to OSS compatible state
2018-06-01 17:41:09 -04:00
bddppq
f94ae3ba1d
Update from facebook (#7696)
* Fix handling of empty batches in SumReduceDimsOp

As titled

* Deferrable async_scheduling finishRun fix

Proper order of finishing run operations in deferrable_async_scheduling net

* Simplify exception handling in async_scheduling

Simplify exception handling, no need to busy wait, thread that processes the
last task can finish the run

* [C2]worker_coordinator_memorize_worker_ids

As titled. This is related to T28689868, where the number of blobs we want to create is equal to the number of worker ids

* Add unit test for nets with no type set

* Ignore total length argument in symbolic_pad_packed_sequence

1- There was a mistake in the code: total_length was added to the wrong symbolic function (pack_padded_sequence) instead of (pad_packed_sequence).
2- There is no need to throw an exception if total_length is given, since it is only used to enable data_parallel training on multiple GPUs and has nothing to do with ONNX export, so just ignore it. https://fburl.com/tk4gciqp

* Add support for MKLDNN to async_scheduling

Just add MKLDNN as a possible CPU option to async_scheduling's pool function

* [AuFL][ensemble] support branch output for prediction

This diff supports using predictions from different branches and thus enables model ensembling (not fully independent).

* Fix a bug in add_loss in layer_model_helper

As titled.

* Support lradaption for adam

1.lr adaption operator
2.apply to dense adam

* Perf tweaks for async_scheduling

Restore single pool option + remove unnecessary (no-ops) calls

* add quantization to SparseSimdAdagradOp

add a bunch of quantization signatures to SparseSimdAdagradOp, implementations to come next

* [sr] [codemod] Change all SR callsites to use new API

@allow-large-files

This diff refactors all callsites of SR to use the slightly changed API introduced in the diff below. Really what this means is that you need to include the correct header. Also if you were using `ClientFactory::newFactory` you need to not prefix it with `ClientFactory::`.

```
cd ~/fbsource/fbcode
find ./ -type f -exec sed -i -e 's:#include "servicerouter/client/cpp2/ClientFactory.h":#include "servicerouter/client/cpp2/ServiceRouter.h":' -e 's:#include <servicerouter/client/cpp2/ClientFactory.h>:#include <servicerouter/client/cpp2/ServiceRouter.h>:' -e 's/ClientFactory::newFactory(/newFactory(/g' {} \;
```

Also manually fixed spots that couldn't be done automatically (or broke because they depended on transitive includes).

* Back out "Fix handling of empty batches in SumReduceDimsOp"

Original commit changeset: 282da1730cc2 This commit is blocking the
Github->fbcode sync, which really needs to get merged ASAP. D7881937 which this
diff depends on will be reverted in the sync D7990948 which causes this to
break. The sync diff cannot be patched with this reversion because it must be
landed against base revision 5c8c099 , and D7881937 must not be included in the
sync diff because it is breaking GPU tests that are not available in sandcastle
: https://ci.pytorch.org/jenkins/job/caffe2-builds/job/py2-cuda8.0-cudnn6-ubuntu16.04-test/3638/console
for one example.

* Add the flow to support operator benchmark

1) generate model with the operator 2) upload to everstore 3) generate model spec into json file 4) start running the benchmark

* [tum][gpu] Connect DPM trainer with flow and unit tests

This diff:
- Fix some small bugs for Yiming's recent changes to parallelizer, so it suits real use cases.
- Add correct tags to the TUM code, so we can do data parallel transform
- pass extra info at instantiation.
- add unit test for using DPM in TUM model

After this diff, we can do simple box, multi-gpu fully-sync trainer for TUM in Fblearner workflow, but may still need to do speed benchmarking.

* w/o normalized lradaption for adam dense only

The previous lr adaption includes a normalization step when performing the dot product operation. This is not exactly the same as what is proposed in the paper. I add normalization as an option. Without it, the operator performs exactly what the paper proposes. With the option, we add the normalization step.

* [fb] Use SharedPromise in DeferrableAsyncSchedulingNet

This code simplifies DeferrableAsyncSchedulingNet by removing a condition
variable, plus small fixes.

* [tum] implement cuda sparseLengthsMean and LengthsMean

as title

* Adding an optional parameter to allow use of protobufs in InferShapesAndTypes function.

Adding an optional parameter to allow use of protobufs in InferShapesAndTypes function.

* Move feature_to_index to FeatureSpec.feature_to_index

move feature_to_index to FeatureSpec.feature_to_index to avoid overriding other fields

* [Caffe2] Rename bytes_moved to bytes_written

Just a rename in preparation for supporting bytes_read.

* [c2] fix ReduceFrontSumOp for empty case by setting 0

otherwise, it may use the results from the last iteration when the batch is empty.

* [Caffe2] [Int8] Improve Intel CPU performance

* [Easy] Improve PrependDim op logging

as titled

* DBFileReader expand db_path using os.path.expanduser(..)

Since there are a lot of possible use cases of `DBFileReader` reading from a user home path, like `~/local/sample.db`, I want to save people the trouble of calling `os.path.expanduser(db_path)` themselves.

* [Caffe2] Add bytes_read to cost structure

We're adding analytical read bytes to cost functions.  This extends the structure accordingly for all CostInference defined operators.
Additionally, some small bug fixes were performed:
1) Cost functions now extract type information of operands instead of assuming float

* Fix sleef on aarch64 for hhvm

@bypass-lint

Rename flag

* Remove duplicated part in caffe2/ideep/operators/conv_op.cc

should be a sync error

* Rename test helper function test_adagrad_sparse_helper to adagrad_sparse_test_helper to avoid confusing pytest
2018-05-19 23:10:48 -07:00
Lu Fang
664fe34e0a
[Caffe2][fbcode=>GH sync] Update from facebook 4323b18ce13c (#7116)
* [fix] Re-enable events in RNN ops

We earlier added event disabling in RNN ops because back then we didn't use
events; with current use cases this is no longer true
(https://fburl.com/8vd0lp8y)

* use ops with cuda impl

* Revert D7729695: [caffe2][fix] Re-enable events in RNN ops

This reverts commit 4b215c7496fb724656ff4c776933a15bdbbcde5e

@bypass-lint

An infra SEV is better than not reverting this diff.
If you copy this password, see you in SEV Review!
@cause_a_sev_many_files

* [observer] Clean up observer_config.h

#accept2ship

* [1/n] Refactor dataio_test.py

Replace code duplication with a common function

* Add barrier net that runs before training nets

Add a synchronization barrier net that is run before training nets. With this net, shards that are faster will wait for other shards before they start training. This reduces the chance of the faster shards timing out during GLOO AllReduce.

Removed the explicit data_parallel_model.py synchronize call in the holmes workflow. A similar change in the speech/asr_training workflow will come in another diff.

* Support the dnnlowp backend in caffe2_benchmark

This is for SHARE operator latency evaluation

* Migrate integral_image_op to main caffe2

migrate integral_image_op(GPU version) given by https://fburl.com/yvqezigi
to caffe2/caffe2/operators and implement its CPU version. Write up a test
using the hypothesis_test mechanism

* [pos_disc, fbcode] Implement unjoined lr loss

As explained in https://our.intern.facebook.com/intern/wiki/Model_Based_Calibration/, when the dataset is a joined data set, where labels might change later, we need to use the unjoined logloss.

The implementation is almost the same as in Sigrid (https://fburl.com/1trngsls), where
    loss = y (log(p) - log(1-p)) + (1-y)(log(1-p)) = xy - (1-y)x - (1-y)log(1+exp(-x))

For x < 0, to ensure stability and avoid overflow, we reformulate the above exp as
    loss = xy - (1-y)x + (1-y)x - (1-y)log(1+exp(x)) = xy - (1-y)log(1+exp(x))

Then the final expression becomes
    loss = xy + (y - 1) x (x >= 0) - (1 - y) log(1 + exp(x - 2 x (x >= 0)))

where y is the true label, x is the dot product and p = logistic(x).

This kind of implementation is aligned with the current implementation of the original cross entropy in
https://phabricator.intern.facebook.com/diffusion/FBS/browse/master/fbcode/caffe2/caffe2/operators/cross_entropy_op.cc;0bae3b5d0f825897c5e0dd0ff10f489d7271bf25$7-13
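
A NumPy sketch of the final stable expression above (illustration only, not the caffe2 operator; x and y are float arrays):

```python
import numpy as np

def unjoined_lr_loss(x, y):
    # loss = xy + (y - 1) * x * (x >= 0) - (1 - y) * log(1 + exp(x - 2 * x * (x >= 0)))
    pos = (x >= 0).astype(x.dtype)
    return x * y + (y - 1.0) * x * pos - (1.0 - y) * np.log1p(np.exp(x - 2.0 * x * pos))
```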

* Keep the array to fix the conflict

* [C2] Compute Adagrad effective LR

The AdagradWithLR op outputs an extra blob which contains the average effective learning rate across all weights in this blob.

* Open-source extractMetaNetDef & runGlobalInitialization, add new Predictor constructor from db file, and add run_map_outputs

1. Open-source extractMetaNetDef and runGlobalInitialization, for use in
2. new Predictor constructor from db file.
3. Add new run function that returns outputs as TensorMap

* Disable eigen cpu

Disable eigen cpu in transpose and reduce

* Introduce request_only/object_only property of ModelLayer

by default this is False

* A simple TC Caffe2 benchmark

We can run the tuner, get MappingOptions and then use them to
compare against cuBLAS

currently broken due to LLVM issues. How to run:

hg checkout eec1ab31b59c03b8deded1c755a9abaf8c45be01
add D7401202
add D7434625
add D7506031
add D7540728

buck run @mode/dev-nosan tc/tc/benchmarks_python:caffe2_benchmark

* Move Caffe2 feature_maps_ops to open source

Need feature maps operators in open source project facebookresearch/BlueWhale

* Manually fix the conflicts in channel shuffle op

* Fix the inconsistency between different gh and fbcode

* Skip Adagrad GPU Test (Because some gpu implementation is missing)

* Fix another test to make sure it won't run on gpu when implementation is not available yet
2018-05-01 20:49:00 -07:00
Orion Reblitz-Richardson
1d5780d42c Remove Apache headers from source.
* The LICENSE file contains the details, so they are removed from individual source files.
2018-03-27 13:10:18 -07:00
Pieter Noordhuis
d4db1b90a1 Resuppress adagrad health checks
Summary:
Commit 479e4ce5 didn't end up stopping the health checks from firing, and
they are likely still caused by the remaining `assume` calls.
Closes https://github.com/caffe2/caffe2/pull/1625

Differential Revision: D6573036

Pulled By: pietern

fbshipit-source-id: eeb21bdd61dca0a632eb1ba9e529177ac2569bfd
2017-12-14 16:34:41 -08:00
Dong Li
c16a21b67d removed the device_type assumption in adagrad_test
Summary: The "assume" statement in adagrad_test leads to a health check failure. Here we remove it by checking dc == hu.gpu_do.

Reviewed By: pietern

Differential Revision: D6513314

fbshipit-source-id: 4caf2d938e5f5935a95cca8abd99185182223d63
2017-12-13 03:35:51 -08:00
Pieter Noordhuis
fca617c62f Suppress hypothesis health check in adagrad_test.py
Summary:
PR #1536 suppressed test_sparse_adagrad but test_row_wise_sparse_adagrad also filters too many examples. Suppress health checks for this test as well.
Closes https://github.com/caffe2/caffe2/pull/1599
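
A minimal sketch of what suppressing a hypothesis health check looks like (plain hypothesis API; the test name and body below are stand-ins):

```python
import hypothesis.strategies as st
from hypothesis import HealthCheck, given, settings

@settings(suppress_health_check=[HealthCheck.filter_too_much])
@given(block_size=st.integers(min_value=1, max_value=16))
def test_row_wise_sparse_adagrad_smoke(block_size):
    assert block_size >= 1
```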

Differential Revision: D6530850

Pulled By: pietern

fbshipit-source-id: c73f30d2e104565421e3e381b1cf66185edc833e
2017-12-10 11:47:15 -08:00
Pieter Noordhuis
6f218cef25 Suppress hypothesis health check in adagrad_test.py
Summary:
With some test seeds this warning starts firing.

This should be addressed in a better way, by not generating as many invalid examples.
Closes https://github.com/caffe2/caffe2/pull/1536

Reviewed By: bddppq

Differential Revision: D6437138

Pulled By: pietern

fbshipit-source-id: c619d928a585e3d887f686db5d98f841af10c56b
2017-11-29 11:35:04 -08:00
Dong Li
3bfabb4d5f support float16 input for operator SparseAdagrad
Summary:
Implemented a new CUDA class for the SparseAdagrad operator. The param and moment inputs can now be float or float16.
The functions for mixed-precision add/mult/store are defined in a separate header file ("caffe2/core/float16_util.h") for reuse.

Reviewed By: azzolini

Differential Revision: D5880200

fbshipit-source-id: dca227f38629a03a9d771f42efe2c0b673075c4d
2017-10-30 19:32:30 -07:00
Yangqing Jia
8286ce1e3a Re-license to Apache
Summary: Closes https://github.com/caffe2/caffe2/pull/1260

Differential Revision: D5906739

Pulled By: Yangqing

fbshipit-source-id: e482ba9ba60b5337d9165f28f7ec68d4518a0902
2017-09-28 16:22:00 -07:00
Frank Jiang
0a5ee1e806 Implemented RowWiseSparseAdagrad operator that only keeps one moment term per embedding
Summary: Implemented a version of SparseAdagrad that only keeps track of an averaged sum-of-squared-gradients term for each row of the parameter tensor, rather than a sum-of-squared-gradients term for each individual parameter.
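
A NumPy sketch of the row-wise idea (one accumulator per embedding row instead of one per element; the names and signature are illustrative, not the operator's):

```python
import numpy as np

def row_wise_sparse_adagrad(param, moment, indices, grad, lr, epsilon=1e-8):
    # moment holds one averaged squared-gradient accumulator per row of param.
    for i, row in enumerate(indices):
        moment[row] += np.mean(grad[i] ** 2)
        param[row] -= lr * grad[i] / (np.sqrt(moment[row]) + epsilon)
    return param, moment
```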

Differential Revision: D5881918

fbshipit-source-id: bd96ccf25554b457baaaca9309fc8048adbb37f7
2017-09-26 13:34:44 -07:00
Dmytro Dzhulgakov
cf1ce29631 Fix GPU SparseAdaGrad with empty tensors
Summary: CUDA doesn't like 0-sized grids :)

Reviewed By: Yangqing

Differential Revision: D5495805

fbshipit-source-id: 6819513024978ee6bb70a39b25d23ced06465750
2017-07-25 23:50:54 -07:00
Luke Yeager
a47652379f Fix SparseAdagrad for indices.ndim>1
Summary:
Same fix as https://github.com/caffe2/caffe2/pull/249, but for SparseAdagrad.

Also update the tests for both ops to test this functionality.
Closes https://github.com/caffe2/caffe2/pull/675

Differential Revision: D5148750

Pulled By: akyrola

fbshipit-source-id: d30b722429bc547fd53400c1a29e4ee9e2e6ed18
2017-05-30 12:02:18 -07:00
Luke Yeager
8bd0522c20 Add tests and GPU impls for sparse optimizers
Summary:
These GPU paths are probably even buggier than the CPU paths for sparse gradients with duplicate indices. Both paths cause multiple momentum updates in a single iteration, but only the GPU path is non-deterministic. Depending on how we decide to address the issues on the CPU path, pooyadavoodi has a good idea for how to match dense behavior with the sparse GPU ops.
Closes https://github.com/caffe2/caffe2/pull/254

Reviewed By: bwasti

Differential Revision: D4871680

Pulled By: dzhulgakov

fbshipit-source-id: 220be57a0f699a22ea85ed4f7022d92d362d06b3
2017-04-13 11:07:40 -07:00