Commit Graph

25 Commits

Author SHA1 Message Date
Yangqing Jia
8286ce1e3a Re-license to Apache
Summary: Closes https://github.com/caffe2/caffe2/pull/1260

Differential Revision: D5906739

Pulled By: Yangqing

fbshipit-source-id: e482ba9ba60b5337d9165f28f7ec68d4518a0902
2017-09-28 16:22:00 -07:00
Alisson Gusatti Azzolini
e3609a0619 Correctly propagate remap_blob across net boundaries
Summary: If a blob is copied from device A to device B in the init_net and is then used as an external_input in the train_net, we want the train_net to use the copy already on device B instead of copying it over and over again.

Reviewed By: akyrola

Differential Revision: D5800870

fbshipit-source-id: d93f44bba80e4ed70eb03183d552496b54a966b5
2017-09-24 21:21:57 -07:00
Lei Chen
14950a9082 Support session in distributed realtime trainer
Summary:
Convert from PlanDef ProtoBuf into python Plan object by recursively creating
Nets and ExecutionSteps.

Also support running Plan object directly in Session.

Reviewed By: azzolini

Differential Revision: D5608393

fbshipit-source-id: c0ae3b6da743a759af6db3b614a5a3935fe0b34c
2017-08-16 10:28:55 -07:00
Yiming Wu
b51e0ec0c2 quick fix inplace blob bug
Summary: Fixes the case where the init net initializes the same blob twice. I made an exception that allows in-place blobs among ops as long as the blob stays on the same device. This should fix the problem in a general way, since most of our training is CPU-only for now.

Reviewed By: dzhulgakov

Differential Revision: D5450564

fbshipit-source-id: 525c4c9a2e5216a70dbd1229da2d9f8a58b89e47
2017-07-23 02:18:16 -07:00
Yiming Wu
1fce3eac4e single trainer hybrid device
Summary:
First try of single trainer hybrid device training for sparsenn

Comparison results with CPU training:
https://our.intern.facebook.com/intern/fblearner/run/compare/?compare_to[0]=20016969&compare_to[1]=19660293&baseline_run=19660293&all_runs[0]=20016969&all_runs[1]=19660293

Reviewed By: dzhulgakov

Differential Revision: D5205723

fbshipit-source-id: 4a024324ac2efc3248dd470d4c533cf2ecec2e92
2017-06-27 22:06:30 -07:00
Alexander Sidorov
c8410859d9 Operator python stacktraces, attempt 2
Summary:
Last time I filled a uuid into OperatorDef, and operator_tracebacks was populated using traceback.extract_stack. There were several issues with this approach:

1. A random field in OperatorDef breaks workflows relying on memoization, i.e. when computation is skipped based on a previously computed result.
2. Adding one more field revealed that RNNs are not forward compatible with respect to new fields there. The prototxt format seems not to allow forward compatibility (thanks jamesr66a for the investigation!). For RNNs we need to switch to a more resilient approach. azzolini's proposed change to OperatorDef / NetDef would allow that by nesting a NetDef directly inside OperatorDef without the need for extra serialization.
3. traceback.extract_stack is very slow when the executable is on a remote filesystem. It does one or more os.stat calls for each frame on the stack. In some cases it added up to 15 extra minutes to model construction.

In this diff I use a different approach which should fix all of the problems above.

1 and 2 are solved by not adding a new field at all. Instead I report the operator's index within the net it runs in. Thanks akyrola and dzhulgakov for the idea. The downside is that manipulating the operator list breaks the logic, and separately created ops are not covered at all.
3. I solved this by operating on raw frames without using the traceback and inspect modules, which end up doing a lot of file system calls. See function extract_stacktrace in core.py, which has additional comments.
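
The raw-frame approach described in point 3 can be sketched roughly as follows. This is a minimal illustration, not the code from core.py: the names `extract_raw_stack` and `create_op` are hypothetical, and it assumes CPython, where `sys._getframe` is available. It reads only attributes already held in memory on each frame object, so it does not look up source lines on disk.

```python
import sys


def extract_raw_stack(skip=1, limit=50):
    """Collect (filename, lineno, function) tuples by walking frame objects.

    Hypothetical helper: it only reads f_code and f_lineno from each frame,
    so no source files are opened or stat-ed while building the result.
    """
    frame = sys._getframe(skip)  # CPython-specific; skip=1 starts at our caller
    entries = []
    while frame is not None and len(entries) < limit:
        code = frame.f_code
        entries.append((code.co_filename, frame.f_lineno, code.co_name))
        frame = frame.f_back  # move one level outwards
    return entries  # innermost caller first


def create_op():
    # Stand-in for operator construction: record where the op was created.
    return extract_raw_stack()


stack = create_op()
```

The entries are ordered from the innermost caller outwards, so the first tuple names the function that created the operator.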

Reviewed By: dzhulgakov

Differential Revision: D5286285

fbshipit-source-id: 626dd0f5f6b8b1d86bd6bf519078b122f43ddcaa
2017-06-25 19:32:58 -07:00
Alexander Sidorov
83e6a0bec8 Revert uuid change to OperatorDef protobuf
Summary:
A few issues:

1. Randomization hurts memoization
2. Even if we make it non-random, we can get key collisions when loading it back.
3. RNNs use prototxt for the step net, and apparently it's not forward compatible like normal protobuf is

I am thinking of a better, less invasive solution now.

Reviewed By: jamesr66a

Differential Revision: D5272118

fbshipit-source-id: ab577fad04fbfc632e1fceffa923377a0d3da1be
2017-06-19 16:47:31 -07:00
Luke Yeager
90a52c3904 Skip TestInferDevice if no GPU support
Summary:
Working towards https://github.com/caffe2/caffe2/pull/817.
```
E           AttributeError: Method CopyCPUToGPU is not a registered operator. Did you mean: []
```
https://travis-ci.org/caffe2/caffe2/jobs/243867951
Closes https://github.com/caffe2/caffe2/pull/818

Differential Revision: D5276735

Pulled By: akyrola

fbshipit-source-id: 35d9df19330ae522037e8a5d721d83dc2e5aa4dc
2017-06-19 12:21:24 -07:00
Luke Yeager
8ef12951e0 Fix for protobuf with unicode_literals
Summary:
Python 2.7, Protobuf 2.6

    >                   op.ClearField('uuid')
    E                   TypeError: field name must be a string

Fix: http://python-future.org/imports.html#should-i-import-unicode-literals

/cc salexspb tomdz
Closes https://github.com/caffe2/caffe2/pull/804

Differential Revision: D5258494

Pulled By: akyrola

fbshipit-source-id: 04c473c1e55bf8caac0bfde7d86171c9f95e71a1
2017-06-15 13:22:57 -07:00
Alexander Sidorov
eebda50b79 Operator python traceback
Summary: This shows a Python Caffe2 user where a failed operator was created. The motivation for not putting this information directly in the protobuf is to avoid making it too verbose and to keep the ability to read the protobuf of a net after a simple print() call.

Reviewed By: jamesr66a

Differential Revision: D5226047

fbshipit-source-id: 7edfe850e05a2ec209577142aa3368664a57a108
2017-06-13 18:50:02 -07:00
Yiming Wu
406d748423 better engineering for core_test.TestInferDevice
Summary: People recently found this test too strict because it matches proto strings. I changed it to compare fields instead, so the test will not complain even if the protobuf changes in the future.

Reviewed By: dzhulgakov

Differential Revision: D5229855

fbshipit-source-id: 54efcd7a0f9e5dbba1ddeb480801abcb859e07bd
2017-06-12 15:23:00 -07:00
Luke Yeager
52ee7697f4 Fixing broken Python tests
Summary:
`brew_test.py` is just plain broken. `core_test.py` doesn't work with pytest. `apmeter_test.py` and `top_k_test.py` don't work for CUDA builds.
Closes https://github.com/caffe2/caffe2/pull/765

Differential Revision: D5211817

Pulled By: Yangqing

fbshipit-source-id: 78ec5af35a3fa870978e4c9590210ade9e3bc5ac
2017-06-08 13:34:46 -07:00
Yiming Wu
4fefff0bbb Auto injecting device copy for single net and several nets
Summary:
This diff attacks the problem where we want to just annotate the device option for operators and let Caffe2 inject the cross-device copy functions for us. This feature would be useful for mixed-device training and multi-device training with several nets, where previously we did the heavy lifting of adding copy functions ourselves.

Ideally, the feature would be used like this:

      # construct your nets first
      core.InjectDeviceCopyAmongNets([train_init, train_net, ...])

My ideas are written in comments. I will update them here as well later.

Reviewed By: dzhulgakov

Differential Revision: D5134103

fbshipit-source-id: 173f7da9d1773d1c50ccdc27f1b5cd3067b04af5
2017-06-07 20:03:18 -07:00
Yiming Wu
8cd208ad6f Infer input and output device from OperatorDef through OperatorSchema
Summary: Infer input and output device from OperatorDef through OperatorSchema. This is inspired by shape inference. With this feature, we can easily analyze device information for all blobs in the net in a generic way. It is really helpful for automatic cross-device execution.

Reviewed By: akyrola, dzhulgakov

Differential Revision: D5161065

fbshipit-source-id: ee656123112171a4ca00f2fb3f6940f32ddf3135
2017-06-05 23:47:33 -07:00
Aapo Kyrola
5e6bd4fbfc Return predict params from ExtractPredictorNet + test
Summary:
Make it easier for users by returning from ExtractPredictorNet the list of blobs that must be saved/exported to run a predictor net. Added a test for ExtractPredictorNet.

Codemod.

Reviewed By: asaadaldien

Differential Revision: D5176097

fbshipit-source-id: b1af42132459487b8d94fcdde0e4c514da608243
2017-06-05 15:34:37 -07:00
Thomas Dudziak
3ccbf23132 String-related fixes for Python 3
Summary: This diff is one step towards enabling python 3 build by making it be more diligent in its handling of strings.

Reviewed By: salexspb

Differential Revision: D4893083

fbshipit-source-id: 28b8adf3280e8d1f0a7dc9b0fee5ad53f2fada57
2017-05-26 16:04:32 -07:00
Aapo Kyrola
711ea1d4ac fix external inputs handling in AppendNet v2
Summary: External inputs must be computed before updating the _ops_output structure, otherwise if the net to be appended outputs the external input, it is not added correctly

Differential Revision: D5013496

fbshipit-source-id: 6a83d0a6f1c63ef8ae7bec4d862c0ac2a690d47b
2017-05-05 21:50:57 -07:00
Krishna Vudata
1f3c7f8080 Handle net.external_inputs correctly in AppendNet
Summary:
When appending net A to net B, an external input of net A should not be added as
an external input of net B if net B is outputting that blob.
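
The rule above can be sketched with plain dictionaries standing in for nets. This is only an illustration under stated assumptions: `append_net` and the dict layout are hypothetical and are not the Caffe2 `AppendNet` implementation. The point is that the blobs already produced by net B are computed before net A's ops are added.

```python
def append_net(net_b, net_a):
    """Append net_a to net_b; both are plain dicts (hypothetical layout)."""
    # Work out what net_b already produces BEFORE its op list is extended.
    produced_by_b = {out for op in net_b["ops"] for out in op["outputs"]}
    known_inputs = set(net_b["external_inputs"])
    for blob in net_a["external_inputs"]:
        # A blob that net_b outputs is not external to the combined net.
        if blob not in produced_by_b and blob not in known_inputs:
            net_b["external_inputs"].append(blob)
            known_inputs.add(blob)
    net_b["ops"].extend(net_a["ops"])
    return net_b


net_b = {
    "external_inputs": ["data"],
    "ops": [{"inputs": ["data"], "outputs": ["fc1"]}],
}
net_a = {
    "external_inputs": ["fc1", "label"],
    "ops": [{"inputs": ["fc1", "label"], "outputs": ["loss"]}],
}
append_net(net_b, net_a)
```

Here `fc1` is produced by net B, so only `label` is added to net B's external inputs.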

Reviewed By: dzhulgakov

Differential Revision: D4975921

fbshipit-source-id: a5c0ada7b96d851e57d345244d322dd93c7be8e4
2017-05-02 11:20:26 -07:00
Xianjie Chen
d0621a2449 NextScopedBlob with well-defined behavior and respect namescope
Summary:
Remove the use of `NextName` in the layer model helper, so that the same function returning a `model_helper` constructs an identical `Net` when called under the same NameScope.

`NextScopedBlob` should only take effect when there is a real name conflict; otherwise it returns a ScopedBlobReference.

This is critical for parameter blobs. In the long run, we need to be able to specify parameter blobs more explicitly (kennyhorror is working on this). This solution works in the short term for, e.g., two-tower sparse NN models.

Reviewed By: kennyhorror

Differential Revision: D4555423

fbshipit-source-id: 2c4b99a61392e5d51aa878f7346466a8f14be187
2017-02-16 17:16:36 -08:00
Aapo Kyrola
dcefc74a0c Shape and Type Inference Part1
Summary:
This is a rather large diff, sorry about that. It includes basic shape and type inference functionality, based on YQ's Schema scaffolding. I added some helper functions to make it easier to write simple translations.

Bigger refactoring was needed for ConvPoolBase so that we could use the shape inference already there in the schema.

I annotated enough operators to be able to infer forward-pass shapes for a basic convnet, and added a test for that. I intend to bootcamp some annotations and annotate enough to handle Resnets fully. Need to think about gradients, and whether they could be annotated in an easier way.

Only shapes are exposed to Python for now; types will follow later. Also, the inference is not yet called anywhere except the unit test.

Also I am not sure if everything is in the best location in the code, but shouldn't be hard to move stuff around.

Reviewed By: dzhulgakov

Differential Revision: D4436818

fbshipit-source-id: eebee5937ccc9ac09c245465302388a1fae6933c
2017-02-02 22:29:22 -08:00
Aapo Kyrola
d38499f727 Optimize BlobIsDefined() + benchmark --> net construction 95 secs to 8.2 secs!
Summary:
I have noticed that constructing the Xray model takes quite a while. To measure this, I wrote a benchmark script that creates a resnet-50 model on 8 gpus. This takes about 95 secs -- which is kind of annoying when you want to quickly debug stuff.

Profiling with Python's cProfile showed that most of the time is spent in net.BlobIsDefined(), which does a linear search over external inputs and operator outputs, so it gets slower and slower with large nets. This can be fully optimized by keeping a separate lookup table of operator inputs and outputs (and external inputs and outputs). It is a bit annoying to maintain this separate data structure, but I set up the unit tests to ensure things work correctly across Clones.

After the optimization, the net construction drops from 95 secs to 8.2 secs!
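
The idea of replacing the linear scan with a lookup table can be sketched on a toy class. This is an assumption-laden illustration: `TinyNet` and its methods are hypothetical and much simpler than the real net class, and only the set of defined blobs is tracked here.

```python
class TinyNet:
    """Toy stand-in for a net; not the Caffe2 Net class."""

    def __init__(self):
        self.ops = []               # list of (inputs, outputs) pairs
        self.external_inputs = []
        self._defined = set()       # lookup table, updated on every mutation

    def add_external_input(self, blob):
        self.external_inputs.append(blob)
        self._defined.add(blob)

    def add_op(self, inputs, outputs):
        self.ops.append((list(inputs), list(outputs)))
        self._defined.update(outputs)

    def blob_is_defined_linear(self, blob):
        # Slow path: scans all external inputs and every op's outputs.
        if blob in self.external_inputs:
            return True
        return any(blob in outputs for _, outputs in self.ops)

    def blob_is_defined(self, blob):
        # Fast path: a single hash lookup.
        return blob in self._defined

    def clone(self):
        # The lookup table must be copied too, or the clone goes stale.
        other = TinyNet()
        other.ops = [(list(i), list(o)) for i, o in self.ops]
        other.external_inputs = list(self.external_inputs)
        other._defined = set(self._defined)
        return other


net = TinyNet()
net.add_external_input("data")
net.add_op(["data"], ["conv1"])
cloned = net.clone()
cloned.add_op(["conv1"], ["relu1"])
```

The cost of the fast path is that every method that mutates the net has to keep the table in sync, which is why the clone behaviour is worth covering in tests.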

Reviewed By: azzolini

Differential Revision: D4288307

fbshipit-source-id: 0bb82c8bde9d86a2702b298f4aa706cba509346e
2016-12-15 12:01:30 -08:00
Pieter Noordhuis
c48551409c Proper error message if passing NoneType value for kwargs
Summary:
I got a weird error about NoneType not being iterable which made me think
it was some error in the C2 core, whereas it was an error in my code.

Reviewed By: Yangqing

Differential Revision: D4192799

fbshipit-source-id: 0122f13e205c1c6a0766545f0ad6296228d3a3d9
2016-11-29 15:18:36 -08:00
Yangqing Jia
05512d1e10 sync
2016-08-10 11:02:15 -07:00
Yangqing Jia
6463eebc7b chunky sync - build scripts to be written
2016-07-21 10:16:42 -07:00
Yangqing Jia
559053d3a8 chunky sync
2016-05-13 14:43:48 -07:00