Commit Graph

52 Commits

Author SHA1 Message Date
Wei Zhang
e0e334793c Revert D7219461: Mark full sync data parallel ops with rules
This reverts commit 79c56ec5859e25c7caec7bb6b79e80dd19307c64
2018-03-20 13:34:22 -07:00
Wei Zhang
9edbafe0de Mark full sync data parallel ops with rules
Instead of using hard-coded rules or relying on gpu_strategy to mark full sync data parallel ops, we need generic rules that are applicable to both the single-machine and distributed settings.
2018-03-20 13:34:22 -07:00
Kittipat Virochsiri
72f2cd8bcc Making preproc_output_schema explicit
Make it easier to plug in intermediate steps between preprocessing & trainer by maintaining a stable schema.

I also fixed enqueue() so that we can pass in the same blob in multiple locations without causing data corruption.
2018-03-20 13:34:22 -07:00
sf-wind
602a09dde7 Update caffe2 from facebook 4f527ef46abf (#2234)
* [GanH]: two_task_discriminator

as titled

and adding label smoothing

* [Dper2] Simplified UI options needed for blob magnitude visualization

* [GanH]: fix tags

as titled

* Added type and shape inference for GatherRange operator

This helps with type / shape inference when using this operator in layers.
It's also just a nice-to-have in general.

* Demonstrate Caffe2 exception handling with StoreHandlerTimeoutError in Python

We'd like to catch and recover from certain Caffe2 net exceptions. Use this diff to demonstrate a pattern of registering a pybind exception mapping and catching it in Python using caffe2::StoreHandlerTimeoutException.

* Bind Gloo IoException to IoError in Python

Allow peer failure handling and recovery using an exception based mechanism. This diff registers gloo::IoException with pybind.

* [GanH]: add label smoothing to softmax with loss

as titled

* [C2] Enable LARS in Adagrad and hook it to DPER

* [DPER] Don't pass LayerModelHelper in create_trainer_nodes

Since we're planning to get rid of it eventually and I want access to a
NetDef-only interface ASAP, I'm removing all references to LMH where we
don't really need them.

* fix bugs in LambdaRankNdcgOp

the loss and gradient in LambdaRankNdcgOp are incorrect: the loss should be the negative log of the probabilities instead of the log.

* Restrict thread pool on iOS to only big cores

Historically, iPhones exposed only one type of core, and the Caffe2 thread pool used all of them.
However, the iPhone 8/iPhone X exposes 2 big + 4 LITTLE cores. As our thread pool doesn't support work stealing or other forms of load balancing, fast cores end up waiting for the slow ones, so it may be better to restrict execution to only the 2 fast cores, like we do on Android.

* Remove SparseLength Sum/WeightedSum/Mean operators with fp16 engine

* make clang happy and get fewer warnings

* [Personalization] Support add_output_schema() in layer_model_helper

Problem:
Currently the output_schema of sparse_nn can only be set once. https://fburl.com/efth5zer.

Solution:
For flexibility, we want to add fields to output_schema incrementally.

Plan:
Wrap the change of `model._output_schema` into a new function `add_output_schema()` for adding additional output_schema.

Callsite:
add_output_schema() should then be called at https://fburl.com/efth5zer instead.

Reference:
The newly added `add_output_schema()` will be similar to `add_loss()` in https://fburl.com/t2ii8njh
2018-03-12 12:22:59 -07:00
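
The add_output_schema() plan in the commit above is easiest to see in code. Below is a minimal, hypothetical sketch of incrementally extending the output schema, assuming schema.Struct fields can be joined; the class name and the exact joining behavior are illustrative, not the actual Caffe2 implementation.

  # Hypothetical sketch of the add_output_schema() idea above: instead of
  # assigning model._output_schema exactly once, new fields are merged in
  # incrementally. The Struct-joining detail is an assumption for illustration.
  import numpy as np
  from caffe2.python import schema

  class ModelWithIncrementalOutputSchema(object):
      def __init__(self):
          self._output_schema = schema.Struct()

      def add_output_schema(self, name, value):
          # Append a new (name, value) field instead of overwriting the schema.
          self._output_schema = self._output_schema + schema.Struct((name, value))

  model = ModelWithIncrementalOutputSchema()
  model.add_output_schema('prediction', schema.Scalar((np.float32, (1,))))
  model.add_output_schema('loss', schema.Scalar(np.float32))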
Jiyan Yang
f4b1e8b334 [Dper2] Add NetModifier abstraction and support for plotting the norm of blobs (#2201) 2018-03-08 13:41:32 -08:00
Dmytro Dzhulgakov
16ba087b64 [oncall]fix unittest dper/layer_models/tests:utils_test
as titled -- fixes breakage from the offending diff D7091725, which added debug_info to the operator proto
2018-03-06 00:33:11 -08:00
Yan Zhu
0a66c76a4c detailed error output for parameter sharing
Reviewed By: xianjiec

Differential Revision: D6986239

fbshipit-source-id: 5b8bb06ea2383ce64318b5322bda7a58469f3eb0
2018-02-14 11:10:51 -08:00
Yan Shang
fd28e0fa29 Add bool function to return whether a model contains loss
Summary:
Add a function that returns true if the model contains a loss and returns
false if it doesn't.

Reviewed By: kittipatv

Differential Revision: D6982444

fbshipit-source-id: 1f63b7a1eaa3077841a0ad5d8d854b471d0aa84c
2018-02-13 16:38:36 -08:00
Kittipat Virochsiri
83c494787d Allow adding to trainer_extra_schema
Summary: Sometimes we need to add some extra schema later

Reviewed By: sunnieshang

Differential Revision: D6951849

fbshipit-source-id: 564eb88f9250eae24869fd10ba3426e00a18af33
2018-02-13 14:40:36 -08:00
Lin Yang
3acce3e4a7 assert global_constant name as string
Reviewed By: kennyhorror

Differential Revision: D6895157

fbshipit-source-id: 9844ab6176d22c6d05a5a0f83b731f734ef9853d
2018-02-04 01:02:30 -08:00
Xiaolong Wang
f8575f6d68 Breakdown Dispatcher
Summary: dispatch by Ngram breakdown

Differential Revision: D6794082

fbshipit-source-id: 7f6e8fa3a0abe0dc6d0d466c95e8c4fc865e3abb
2018-01-26 17:47:54 -08:00
Dániel Simig
2dd79eb53a Visualize distribution of activation functions
Summary:
This is a first attempt at completing bootcamp task T24449916. This diff contains 3 major changes:
1) Change LayerModelHelper to allow for exposing the output and parameters of any layer to metrics
2) Added a runner that allows metrics to draw arbitrary plots to a matplotlib axes object
3) Implement a metric that aggregates distributions of values in a blob over the training, and try this out in a notebook

Reviewed By: kennyhorror

Differential Revision: D6671273

fbshipit-source-id: b8961837395e89c957edbf5c7c862bdb845ccf4b
2018-01-23 10:36:40 -08:00
Xue Feng
dda33ca53a enable setting model initialization seed
Summary: This diff enables setting the model initialization seed, instead of a random seed, when reproducible results are desired.

Reviewed By: xianjiec

Differential Revision: D6642971

fbshipit-source-id: 387b1ee2ecef4f8f66570c882498fb97d7007e17
2018-01-11 14:04:03 -08:00
Yan Shang
41bb662d96 add dense regularization
Reviewed By: xianjiec

Differential Revision: D5617571

fbshipit-source-id: 875d7c8753bdb3b6847d5e3f47ad8568cdf172f8
2018-01-08 13:03:17 -08:00
Xiaolong Wang
7315a19bc9 add maybe_add_global_constant
Summary:
In layer model helper, add a method `maybe_add_global_constant` to ensure
that when two global constants are added with the same name, we check if they
are actually the same (by initializer) and only add it once.

Reviewed By: kennyhorror

Differential Revision: D6537532

fbshipit-source-id: 37aa3860a2e40d81161ccdea0c50a316248be2e2
2017-12-18 22:14:00 -08:00
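
A minimal sketch of the maybe_add_global_constant() behavior described in the commit above, assuming the model tracks the initializers of previously added constants; the attribute names (_global_constant_initializers, global_constants) and the error on mismatch are illustrative assumptions, not the actual LayerModelHelper code.

  def maybe_add_global_constant(model, name, initializer):
      # Hypothetical: look up a previously registered initializer for this name.
      existing = model._global_constant_initializers.get(name)
      if existing is not None:
          # The same name is only allowed when the initializer is actually the
          # same, in which case the constant was already added once and is reused.
          if existing != initializer:
              raise ValueError(
                  'global constant %s already exists with a different initializer' % name)
          return model.global_constants[name]
      model._global_constant_initializers[name] = initializer
      return model.add_global_constant(name, initializer=initializer)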
Liang Xiong
fc0c8c2316 minor refactoring in dper
Summary: Small changes as I was reading through the dper code base. All of them are nits, but they somewhat helped me understand things.

Reviewed By: xianjiec

Differential Revision: D6389380

fbshipit-source-id: 3412052e4fcba199c6ffc84c6f7ae11bf8ff6ee9
2017-11-21 18:12:49 -08:00
Yan Shang
24e83acbb9 Enable sampling in evaluation
Reviewed By: chocjy

Differential Revision: D6119768

fbshipit-source-id: c8447326008392df70ab10b04f84223cf6d882b1
2017-11-16 14:03:51 -08:00
Jiyan Yang
ee3baa2ed4 Add shape checks and print more info in parameter sharing
Summary: As titled.

Reviewed By: kittipatv

Differential Revision: D6145747

fbshipit-source-id: 39a212bb6bebbbf3164cade2f95db22ddb2d2c87
2017-10-27 01:22:06 -07:00
Dmytro Dzhulgakov
2972a6ca02 Revert D6026557: [caffe2][PR] Fix "No handlers could be found for logger"
Summary:
This reverts commit 95c634872ac02be721257169e38c8fead04cd66b

bypass-lint

Differential Revision: D6026557

fbshipit-source-id: 663c28583ce3b01070ff5449115ed7e222f71776
2017-10-12 20:21:52 -07:00
Luke Yeager
75bece6ede Fix "No handlers could be found for logger"
Summary: Closes https://github.com/caffe2/caffe2/pull/1316

Differential Revision: D6026557

Pulled By: Yangqing

fbshipit-source-id: 95c634872ac02be721257169e38c8fead04cd66b
2017-10-10 22:32:13 -07:00
Yangqing Jia
8286ce1e3a Re-license to Apache
Summary: Closes https://github.com/caffe2/caffe2/pull/1260

Differential Revision: D5906739

Pulled By: Yangqing

fbshipit-source-id: e482ba9ba60b5337d9165f28f7ec68d4518a0902
2017-09-28 16:22:00 -07:00
Aapo Kyrola
2108d1c250 Add unit-tests for fb-specific models
Reviewed By: azzolini

Differential Revision: D5895367

fbshipit-source-id: e7a7cdb272cdcdd7495efe9a6203750d1e6d6c48
2017-09-26 21:17:51 -07:00
Kittipat Virochsiri
1b059f4c98 Add option to ignore parameter initialization
Summary: When parameter sharing is used, the model may not own the parameters. Emptying out the initializer ensures that the shared model doesn't overwrite the initialization.

Reviewed By: chocjy

Differential Revision: D5870362

fbshipit-source-id: f8587b84c3a13f331a3251973e8206563939606a
2017-09-20 12:03:22 -07:00
Kittipat Virochsiri
7d2b2cae19 Remove OFFLINE_TRAINING from global constant
Summary: This is not a very generic constant

Reviewed By: volkhin

Differential Revision: D5870378

fbshipit-source-id: 59509bb48cecb52ba4a3f26b290855374547fe7e
2017-09-20 12:03:21 -07:00
Jiyan Yang
a8695178aa Adding parameter sharing API to Dper2
Summary:
To achieve this, I modified the blob naming scheme defined in a layer.
Before, it was scope/fc_w and scope/fc_w_auto_0 (if there is another fc
within the same scope).
Now I change it to scope/fc/w and scope/fc_auto_0/w.
That is, we rely on the uniqueness of the scoped layer name to define
names for blobs.

I also overwrote the create_param method in LayerModelHelper to let it
use the resolved name for blobs given the parameter sharing context.

There are some details, such as making the initializer more structured,
that I still need to finalize.

Reviewed By: kennyhorror

Differential Revision: D5435132

fbshipit-source-id: a0525f5ea0977e255dd5ea765b38913f5951d455
2017-08-03 00:33:18 -07:00
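
For illustration, the naming change described in the commit above can be summarized with a small, purely expository helper (not the Caffe2 implementation): blob names are now derived from the unique scoped layer name.

  def param_blob_name(namescope, layer_name, param_name):
      # New scheme: '<scope>/<layer>/<param>', so a second fc in the same scope
      # becomes 'scope/fc_auto_0/w' rather than the old 'scope/fc_w_auto_0'.
      return '{}/{}/{}'.format(namescope, layer_name, param_name)

  assert param_blob_name('scope', 'fc', 'w') == 'scope/fc/w'
  assert param_blob_name('scope', 'fc_auto_0', 'w') == 'scope/fc_auto_0/w'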
Jacqueline Xu
3cc03568da Fixing error message for layer model helper
Summary: Minor fix for the error message in the layer model helper file

Reviewed By: chocjy

Differential Revision: D5440768

fbshipit-source-id: df47bfe68a0caa750f0d3c8def28a5585e465ee0
2017-07-18 09:52:45 -07:00
Honghao Wei
b68adec7bb adding model loss logic
Summary: Add the API model.add_loss(), which allows adding losses such as optimization and regularization terms. See the change in sparse_nn.py, in which 'model.loss = loss' is changed to 'model.add_loss(loss)'.

Reviewed By: xianjiec

Differential Revision: D5399056

fbshipit-source-id: 13b2ced4b75d129a5ee4a9b0e989606c04d2ca8b
2017-07-14 16:25:23 -07:00
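
A minimal sketch of the add_loss() accumulation pattern described in the commit above, assuming later calls add further loss terms to the one already stored; the class and the scalar stand-ins are illustrative, not the LayerModelHelper implementation.

  class ModelWithLoss(object):
      def __init__(self):
          self._loss = None

      def add_loss(self, loss):
          # The first call stores the loss; later calls accumulate additional
          # terms (e.g. a task loss plus a regularization loss).
          if self._loss is None:
              self._loss = loss
          else:
              self._loss = self._loss + loss

      @property
      def loss(self):
          return self._loss

  model = ModelWithLoss()
  model.add_loss(0.35)  # task loss (scalar stand-in for a loss blob)
  model.add_loss(0.01)  # regularization term
  print(model.loss)     # ~0.36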
Thomas Dudziak
5355634dac Dict fixes/improvements and unittest targets for Python 3 in caffe2 core
Summary: As title

Reviewed By: salexspb

Differential Revision: D5316104

fbshipit-source-id: aee43819d817842e5ce6ba3d045a55b1a2491c30
2017-06-29 17:05:41 -07:00
Yiming Wu
1fce3eac4e single trainer hybrid device
Summary:
First attempt at single-trainer hybrid-device training for sparsenn

Comparison results with CPU training:
https://our.intern.facebook.com/intern/fblearner/run/compare/?compare_to[0]=20016969&compare_to[1]=19660293&baseline_run=19660293&all_runs[0]=20016969&all_runs[1]=19660293

Reviewed By: dzhulgakov

Differential Revision: D5205723

fbshipit-source-id: 4a024324ac2efc3248dd470d4c533cf2ecec2e92
2017-06-27 22:06:30 -07:00
Bokai Cao
d9087edb07 add rekey in feature_processor
Differential Revision: D5270972

fbshipit-source-id: 8805c0e947f4752d2c575e2a7b8986cd804601dc
2017-06-20 23:19:09 -07:00
Bokai Cao
d2b1cb22a4 rekey layer
Differential Revision: D5210095

fbshipit-source-id: dc66a10d95842e0f10cb53a5afb7ddcc3fcac0de
2017-06-19 18:47:28 -07:00
haracejacob
2ec294a8bb Fix a few typos and grammars in comment
Summary:
Fix a few typos and grammar issues in comments

using language-check, a Python library.
The spell_checker source code is here: https://github.com/17-1-SKKU-OSS/011A/blob/master/spell_checker/spell_checker.py
Here is the text file that indicates what should be fixed: https://github.com/17-1-SKKU-OSS/011A/tree/master/spell_checker/fix/caffe2
Closes https://github.com/caffe2/caffe2/pull/719

Differential Revision: D5165118

Pulled By: aaronmarkham

fbshipit-source-id: 7fb8ef7a99d03cd5fd2f9ebdb01b9865e90fc37b
2017-06-14 18:22:39 -07:00
Huazhong Ning
660dd58022 fix for realtime training.
Reviewed By: kennyhorror

Differential Revision: D5068298

fbshipit-source-id: 0dc3580c9c8123368a3625fb654c6eaf1dc4a950
2017-05-26 23:49:40 -07:00
Yang Yang
769e668faf ttsn model fails to set optimizer for FC layer
Summary:
the FC ModelLayer needs an optimizer; it also seems the catch-all
that sets a default for missing optimizers had a bug

Reviewed By: xianjiec

Differential Revision: D5048302

fbshipit-source-id: cbbf641fb9ee4f4f89c5dbb132f7837ecdbe37a5
2017-05-16 11:26:02 -07:00
Huazhong Ning
942f53b5a6 gradient impact of task layers on shared is configurable
Reviewed By: chocjy

Differential Revision: D4943948

fbshipit-source-id: 2e26dfb30c6893b60985f693a823646ed3d3e0e3
2017-05-11 20:34:04 -07:00
Alisson Gusatti Azzolini
20d8de8d51 Parameter cost estimation job
Summary:
Adds a parameter cost estimation step before the actual training starts. The costs are later used in order to better shard the parameters across instances of the parameter server.

Things I needed to modify:
- A few changes to make ModelLayerHelper picklable
- Add support for stopping a distributed job after a number of stats reporting steps.
- Refactored run_dist_job to support collocating the reader with the trainer even when PS are present.
- Option to disable dense updates (when num_dense_servers=0).

Currently there's a huge overhead posed by having to launch a child workflow. I'll try to address that in a subsequent diff.

This is WIP because the other workflows need to be migrated as well.

I can break this down into smaller diffs if reviewers would prefer it.

Reviewed By: kennyhorror

Differential Revision: D4974752

fbshipit-source-id: 04c336acb2945f8f11324a221ffc6967818c0672
2017-05-09 13:02:24 -07:00
Kittipat Virochsiri
7d6d67119f Allow LayerModelHelper to keep input blobs from schema
Summary: In certain situations, like in D4907916 where we insert an additional step in the middle of a model, it's necessary to keep the blob names constant across model helpers so that the communication schema doesn't break.

Reviewed By: kennyhorror

Differential Revision: D4981527

fbshipit-source-id: 6b8d6d240279dd48f801cfacbaa1d320ba54d694
2017-05-01 21:31:36 -07:00
Yiming Wu
bef6e45f8b rename ModelHelperBase
Summary:
rename ModelHelperBase to ModelHelper.

This is the result of running:

  find . -type f -exec sed -i 's/ModelHelperBase/ModelHelper/g' {} +

We had 19 results when fbgs ModelHelperBase. There are now 20 instances because I added 1 test in model_helpers_test.py

Reviewed By: salexspb

Differential Revision: D4928337

fbshipit-source-id: bc4c12b60b90c167e717de50ea9fe17521e142e3
2017-04-24 15:52:26 -07:00
Huazhong Ning
ad6b53e401 allow to specify output dtypes for functional layers
Summary:
Currently, the functional layer infers the output types and shapes by running the operator once.
But in cases where special input data are needed to run the operator, the inference may fail.
This diff allows the caller to manually specify the output types and shapes when automatic inference may fail.

Reviewed By: kennyhorror

Differential Revision: D4864003

fbshipit-source-id: ba242586ea384f76d745b29a450497135717bdcc
2017-04-18 16:34:52 -07:00
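
The idea in the commit above is that when running the operator once would fail, the caller declares the output record up front. A hedged sketch follows, assuming an output_dtypes-style keyword on the functional-layer call; that keyword and the operator name below are assumptions, not a confirmed signature.

  import numpy as np
  from caffe2.python import schema

  # Declare the output type and shape explicitly instead of relying on
  # auto-inference that runs the operator once on (possibly unsuitable) data.
  explicit_output = schema.Scalar((np.float32, (16,)))

  # Hypothetical functional-layer call passing the explicit output record:
  # out = model.SomeCustomOp(input_record, 1, output_dtypes=explicit_output)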
Jiyan Yang
a7217e6626 Remove unused optimizers
Summary: As desc.

Reviewed By: xianjiec

Differential Revision: D4840482

fbshipit-source-id: bf820154475508ce581d16a45bcd93d026b60f30
2017-04-05 21:18:29 -07:00
Aaron Markham
58f7f2b441 doxygen python block added
Summary: Closes https://github.com/caffe2/caffe2/pull/226

Differential Revision: D4793550

Pulled By: JoelMarcey

fbshipit-source-id: cc33e58186304fa8dcac2ee9115dcc271d785b1e
2017-03-29 06:46:16 -07:00
Andrey Malevich
7cc92b1260 Add eval net for layer_model_helper
Summary:
This diff adds eval nets to the layer model helper. It should be useful for
cases where train/eval nets need some extra input (usually some supervision)
for train/eval, for example various sampled layers.

Differential Revision: D4769453

fbshipit-source-id: 7a8ec7024051eab73b8869ec21e20b5f10fd9acb
2017-03-29 04:03:40 -07:00
Kittipat Virochsiri
6163676ebe Skip optimizer when param doesn't have gradient and optimizer is not set
Summary: Currently, we cannot have layer constants because layer params are required to have a gradient and an optimizer. Global constants don't cut it here because they can only be added once; therefore, a layer that adds any global constant can only be used once.

Differential Revision: D4773212

fbshipit-source-id: 5b60d31f3c1602afb04b61f6d30b8e3e06ed2de3
2017-03-24 22:18:34 -07:00
Xiaolong Wang
8ce34d6c87 Add Calibration
Summary: Add calibration to sparse_nn

Differential Revision: D4735564

fbshipit-source-id: 6baa637cbffcbbd50134a256d622ef8c962fca3b
2017-03-24 14:32:23 -07:00
Andrey Malevich
b599910f3a Use new metric interfaces in trainer workflows.
Summary: This diff is migrating existing DPER workflows to use new metric abstractions in training.

Reviewed By: xianjiec

Differential Revision: D4656576

fbshipit-source-id: 1b3b16b390fc0757480e41df1c4214c11cd76e8a
2017-03-07 12:46:52 -08:00
Qichao Que
2f68632a32 Add SparseNN workflow for feed.
Summary: Add SparseNN workflow for feed. I haven't fully thought about the change needed for ads, as I added a property called 'preproc_output_schema' for LayerModelHelper.

Reviewed By: xianjiec

Differential Revision: D4585796

fbshipit-source-id: 060d08f4beb928e7e7863f2e563f612c358951fb
2017-03-01 11:02:38 -08:00
Andrey Malevich
a3726759c6 Add a way to describe layers in a more ad-hoc manner.
Summary:
This diff is trying to address one of the concerns that Xianjie has had - the requirement to create a layer for every operator and to pass shapes and other info around.

The basic idea of the diff:
1. Try to create a layer with the given name; if it's not available, fall back to an operator with that name (which is expected to have no parameters).
2. For all operators that we're adding through this functional style of creation, try to use C2 Shape/Type inference logic to get the output type. If that fails, just return an untyped record and expect the user to annotate it when it's really needed.

Reviewed By: xianjiec

Differential Revision: D4408771

fbshipit-source-id: aced7487571940d726424269970df0eb62670c39
2017-02-27 23:30:39 -08:00
Xianjie Chen
d0621a2449 NextScopedBlob with well-defined behavior and respect namescope
Summary:
Remove the use of `NextName` in the layer model helper, so that the same function returns a `model_helper` that constructs an identical `Net` when under the same NameScope.

`NextScopedBlob` should only take effect when there is a real name conflict; otherwise it returns a ScopedBlobReference.

This is critical for parameter blobs. In the long run, we need to be able to specify parameter blobs more explicitly (kennyhorror is working on this). This solution works in the short term for, e.g., two-tower sparse nn models.

Reviewed By: kennyhorror

Differential Revision: D4555423

fbshipit-source-id: 2c4b99a61392e5d51aa878f7346466a8f14be187
2017-02-16 17:16:36 -08:00
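
An illustrative, standalone re-implementation of the NextScopedBlob behavior described in the commit above (for exposition only, not the Caffe2 code): return the plain scoped name unless it is already taken, and only then append a de-duplicating suffix.

  def next_scoped_blob(existing_names, namescope, name):
      candidate = '{}/{}'.format(namescope, name)
      if candidate not in existing_names:
          # No conflict: behaves like a plain ScopedBlobReference.
          return candidate
      # Real conflict: append an auto-suffix until the name is unique
      # (the exact suffix format here is an assumption).
      i = 0
      while '{}_auto_{}'.format(candidate, i) in existing_names:
          i += 1
      return '{}_auto_{}'.format(candidate, i)

  assert next_scoped_blob(set(), 'scope', 'fc/w') == 'scope/fc/w'
  assert next_scoped_blob({'scope/fc/w'}, 'scope', 'fc/w') == 'scope/fc/w_auto_0'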
Xianjie Chen
fb7c9108d9 get parameter blobs of a model
Summary: to verify that a model only uses a subset of the parameters of another model (e.g., the model doing training).

Differential Revision: D4557787

fbshipit-source-id: bd8ac96f5e78e05f6f56086db6e6ddcda36c1d37
2017-02-15 16:00:44 -08:00
Andrey Malevich
ec51f887bf Create only one instance of SigridTransform in DPerExample.
Summary:
The DPer example has been creating multiple copies of the transform config in the net
definition until now, which meant I hit the ProtoBuf limit (64MB) for
certain Task requests (especially visible because of the ValidationPipeline
that I was adding).

After this diff we're going to store SigridTransforms in one instance per
machine for training (or 1 instance per reading).

The difference in plan sizes for a simple SparseNN model is ~30 MB (even including the fact that the second model has a validation plan as well).

TODO: Do similar logic for NNPreProc as well (it's also pretty large).

Reviewed By: dzhulgakov

Differential Revision: D4441441

fbshipit-source-id: 4452dd86a4dc49b2c7f5b7642f443aed5720b047
2017-01-22 19:29:16 -08:00