Commit Graph

27 Commits

Author SHA1 Message Date
Yangqing Jia
7d5f7ed270 Using c10 namespace across caffe2. (#12714)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12714

This is a short change to enable the c10 namespace in caffe2. We did not enable
it before due to gflags global variable confusion, but that should be mostly
cleaned up by now. Right now, the plan on record is that namespace caffe2 and
namespace aten will fully be supersets of namespace c10.

Most of the diff is codemod. The only two non-codemod changes are in caffe2/core/common.h, where

```
using namespace c10;
```

is added, and in Flags.h, where instead of creating aliasing variables in the c10 namespace, we put them directly in the global namespace to match gflags (with the same behavior when building without gflags).

Reviewed By: dzhulgakov

Differential Revision: D10390486

fbshipit-source-id: 5e2df730e28e29a052f513bddc558d9f78a23b9b
2018-10-17 12:57:19 -07:00
Yangqing Jia
38f3d1fc40 move flags to c10 (#12144)
Summary:
Still in flux.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12144

Reviewed By: smessmer

Differential Revision: D10140176

Pulled By: Yangqing

fbshipit-source-id: 1a313abed022039333e3925d19f8b3ef2d95306c
2018-10-04 02:09:56 -07:00
Edward Yang
91797c0672 Replace direct include of caffe2.pb.h with an intermediary header caffe2_pb.h (#10946)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10946

```
codemod -d . --extensions cc,cpp,cu,cuh,h caffe2/proto/caffe2.pb.h caffe2/proto/caffe2_pb.h
```

Reviewed By: houseroad

Differential Revision: D9539945

fbshipit-source-id: 497d04720e8e7e61c05ffe1b23733d0cb774de7e
2018-08-28 11:57:08 -07:00
Keren Zhou
6c6a353a66 Fix speedbenchmark bug (#9770)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9770

Add zero ops to operators that do not have a valid schema

Reviewed By: hlu1

Differential Revision: D8957472

fbshipit-source-id: d8d0a351183e88ace2e050a87c1e1c363af67e33
2018-07-24 17:10:37 -07:00
Orion Reblitz-Richardson
d1bdb3b10a Remove core and util warnings (#8239)
* Fix some signed/unsigned mismatches

* Skip unused result warning

* Explicit fallthrough for murmur hash

* Enable aligned new support to eliminate warning

* Switch to int instead of unsigned in some cases
2018-06-07 09:10:33 -07:00
bddppq
f94ae3ba1d Update from facebook (#7696)
* Fix handling of empty batches in SumReduceDimsOp

As titled

* Deferrable async_scheduling finishRun fix

Proper order of finishing run operations in deferrable_async_scheduling net

* Simplify exception handling in async_scheduling

Simplify exception handling: there is no need to busy-wait, since the thread
that processes the last task can finish the run

* [C2]worker_coordinator_memorize_worker_ids

As titled. This is related to T28689868, where the number of blobs we want to create is equal to the number of worker ids

* Add unit test for nets with no type set

* Ignore total length argument in symbolic_pad_packed_sequence

1- There was a mistake in the code that total_length was added to the wrong symbolic function (pack_padded_sequence) instead of (pad_packed_sequence)
2- No need to throw an exception if total_length is given since it is only used to enable data_parallel training on multi-gpus and doesn't have anything to do with onnx export, so just ignore it. https://fburl.com/tk4gciqp

* Add support for MKLDNN to async_scheduling

Just add MKLDNN as a possible CPU option to async_scheduling's pool function

* [AuFL][ensemble] support branch output for prediction

This diff supports using predictions from different branches and thus enables model ensembling (not fully independent).

* Fix a bug in add_loss in layer_model_helper

As titled.

* Support lradaption for adam

1.lr adaption operator
2.apply to dense adam

* Perf tweaks for async_scheduling

Restore single pool option + remove unnecessary (no-ops) calls

* add quantization to SparseSimdAdagradOp

add a bunch of quantization signatures to SparseSimdAdagradOp, implementations to come next

* [sr] [codemod] Change all SR callsites to use new API

@allow-large-files

This diff refactors all callsites of SR to use the slightly changed API introduced in the diff below. Really what this means is that you need to include the correct header. Also if you were using `ClientFactory::newFactory` you need to not prefix it with `ClientFactory::`.

```
cd ~/fbsource/fbcode
find ./ -type f -exec sed -i -e 's:#include "servicerouter/client/cpp2/ClientFactory.h":#include "servicerouter/client/cpp2/ServiceRouter.h":' -e 's:#include <servicerouter/client/cpp2/ClientFactory.h>:#include <servicerouter/client/cpp2/ServiceRouter.h>:' -e 's/ClientFactory::newFactory(/newFactory(/g' {} \;
```

Also manually fixed spots that couldn't be done automatically (or broke because they depended on transitive includes).

* Back out "Fix handling of empty batches in SumReduceDimsOp"

Original commit changeset: 282da1730cc2. This commit is blocking the
GitHub->fbcode sync, which really needs to get merged ASAP. D7881937, which this
diff depends on, will be reverted in the sync D7990948, which causes this to
break. The sync diff cannot be patched with this reversion because it must be
landed against base revision 5c8c099, and D7881937 must not be included in the
sync diff because it is breaking GPU tests that are not available in sandcastle;
see https://ci.pytorch.org/jenkins/job/caffe2-builds/job/py2-cuda8.0-cudnn6-ubuntu16.04-test/3638/console
for one example.

* Add the flow to support operator benchmark

1) generate model with the operator 2) upload to everstore 3) generate model spec into json file 4) start running the benchmark

* [tum][gpu] Connect DPM trainer with flow and unit tests

This diff:
- Fix some small bugs for Yiming's recent changes to parallelizer, so it suits real use cases.
- Add correct tags to the TUM code, so we can do data parallel transform
- Pass extra info at instantiation.
- add unit test for using DPM in TUM model

After this diff, we can do simple box, multi-gpu fully-sync trainer for TUM in Fblearner workflow, but may still need to do speed benchmarking.

* w/o normalized lradaption for adam dense only

The previous lr adaption includes a normalization step when performing the dot product operation. This is not exactly the same as what is proposed in the paper. I add normalization as an option: without it, the operator performs exactly what the paper proposed; with it, we add the normalization step.

* [fb] Use SharedPromise in DeferrableAsyncSchedulingNet

This code is to simplify DeferrableAsyncSchedulingNet by removing condition
variable + small fixes

* [tum] implement cuda sparseLengthsMean and LengthsMean

as title

* Adding an optional parameter to allow use of protobufs in InferShapesAndTypes function.

Adding an optional parameter to allow use of protobufs in InferShapesAndTypes function.

* Move feature_to_index to FeatureSpec.feature_to_index

move feature_to_index to FeatureSpec.feature_to_index to avoid overriding other fields

* [Caffe2] Rename bytes_moved to bytes_written

Just a rename in preparation for supporting bytes_read.

* [c2] fix ReduceFrontSumOp for empty case by setting 0

Otherwise, it may use the results from the last iteration when the batch is empty.
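
A minimal Python sketch of the failure mode described here, assuming a 2-D input reduced over its first dimension into an output buffer that is reused between iterations. `reduce_front_sum` is a hypothetical stand-in for illustration, not the Caffe2 operator.

```python
def reduce_front_sum(rows, out):
    """Sum `rows` column-wise into the preallocated, reused list `out`."""
    # Zero the buffer first: with an empty batch the loop below never runs,
    # so without this step `out` would still hold the previous iteration's sums.
    for j in range(len(out)):
        out[j] = 0.0
    for row in rows:
        for j, value in enumerate(row):
            out[j] += value
    return out
```

Calling it with a non-empty batch and then with an empty one on the same buffer leaves zeros in the buffer rather than the earlier sums.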

* [Caffe2] [Int8] Improve Intel CPU performance

* [Easy] Improve PrependDim op logging

as titled

* DBFileReader expand db_path using os.path.expanduser(..)

Since there are a lot of possible use cases of `DBFileReader` reading from the user's home path, like `~/local/sample.db`, I want to save people the trouble of calling `os.path.expanduser(db_path)` themselves.
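
As a small illustration of the expansion step only, the sketch below wraps `os.path.expanduser`. The helper name `resolve_db_path` is hypothetical and is not part of `DBFileReader`.

```python
import os


def resolve_db_path(db_path):
    """Expand a leading "~" so callers can pass paths like "~/local/sample.db"."""
    # Paths that do not start with "~" are returned unchanged.
    return os.path.expanduser(db_path)
```

On POSIX systems the expansion uses the `HOME` environment variable when it is set.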

* [Caffe2] Add bytes_read to cost structure

We're adding analytical read bytes to cost functions.  This extends the structure accordingly for all CostInference defined operators.
Additionally, some small bug fixes were performed:
1) Cost functions now extract type information of operands instead of assuming float

* Fix sleef on aarch64 for hhvm

@bypass-lint

Rename flag

* Remove duplicated part in caffe2/ideep/operators/conv_op.cc

Should be a sync error.

* Rename test helper function test_adagrad_sparse_helper to adagrad_sparse_test_helper to avoid confusing pytest
2018-05-19 23:10:48 -07:00
Yinghai Lu
2863d935b9 [Caffe2] Fix of the performance issue of IDEEP (#7503)
* Sketch fix of the performance issue of IDEEP

* Revert CMakefile

* Fix tests

* format

* comments

* Print error

* review comments
2018-05-11 13:43:41 -07:00
Lu Fang
664fe34e0a [Caffe2][fbcode=>GH sync] Update from facebook 4323b18ce13c (#7116)
* [fix] Re-enable events in RNN ops

We earlier disabled events in RNN ops because we didn't use events back
then; with current use cases this is no longer true
(https://fburl.com/8vd0lp8y)

* use ops with cuda impl

* Revert D7729695: [caffe2][fix] Re-enable events in RNN ops

This reverts commit 4b215c7496fb724656ff4c776933a15bdbbcde5e

@bypass-lint

An infra SEV is better than not reverting this diff.
If you copy this password, see you in SEV Review!
@cause_a_sev_many_files

* [observer] Clean up observer_config.h

#accept2ship

* [1/n] Refactor dataio_test.py

Replace code duplication with a common function

* Add barrier net that runs before training nets

Add a synchronization barrier net that is run before training nets.  With this net, shards that are faster will wait for other shards before starting training.  This reduces the chance of the faster shards timing out during GLOO AllReduce.

Removed explicit data_parallel_model.py.synchronize call in holmes workflow.  Similar change in speech/asr_training workflow will come in another diff.

* Support the dnnlowp backend in caffe2_benchmark

This is for SHARE operator latency evaluation

* Migrate integral_image_op to main caffe2

migrate integral_image_op(GPU version) given by https://fburl.com/yvqezigi
to caffe2/caffe2/operators and implement its CPU version. Write up a test
using the hypothesis_test mechanism

* [pos_disc, fbcode] Implement unjoined lr loss

As explained in https://our.intern.facebook.com/intern/wiki/Model_Based_Calibration/, when the dataset is a joined data set, where labels might change later, we need to use unjoined logloss.

The implementation is almost the same as in Sigrid (https://fburl.com/1trngsls), where
    loss = y (log(p) - log(1-p)) + (1-y)(log(1-p)) = xy - (1-y)x - (1-y)log(1+exp(-x))

For x < 0, to ensure stability and avoid overflow, we use log(1+exp(-x)) = -x + log(1+exp(x)) to reformulate the above as
    loss = xy - (1-y)x + (1-y)x - (1-y)log(1+exp(x)) = xy - (1-y)log(1+exp(x))

Then the final expression becomes
    loss = xy + (y - 1) x (x >= 0) - (1 - y) log(1 + exp(x - 2 x (x >= 0)))

where y is the true label, x is the dot product and p = logistic(x).

This kind of implementation is aligned with the current implementation of the original cross entropy in
https://phabricator.intern.facebook.com/diffusion/FBS/browse/master/fbcode/caffe2/caffe2/operators/cross_entropy_op.cc;0bae3b5d0f825897c5e0dd0ff10f489d7271bf25$7-13
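
The final expression can be checked numerically with a short Python sketch. Both function names are hypothetical, the use of `math.log1p` is a choice made here for accuracy, and this is not the Caffe2 operator code. The first function evaluates the final branch-free expression, the second evaluates the direct form for comparison.

```python
import math


def unjoined_lr_loss(x, y):
    """Evaluate xy + (y-1)*x*[x>=0] - (1-y)*log(1+exp(x - 2x*[x>=0]))."""
    nonneg = 1.0 if x >= 0 else 0.0
    # x - 2*x*[x >= 0] equals -|x|, so exp() never receives a positive argument.
    stable_exp = math.exp(x - 2.0 * x * nonneg)
    return x * y + (y - 1.0) * x * nonneg - (1.0 - y) * math.log1p(stable_exp)


def naive_unjoined_lr_loss(x, y):
    """Direct form xy - (1-y)x - (1-y)log(1+exp(-x)); exp(-x) overflows for very negative x."""
    return x * y - (1.0 - y) * x - (1.0 - y) * math.log(1.0 + math.exp(-x))
```

For moderate `x` the two functions agree to rounding error. For a very negative `x` such as -1000, `math.exp(-x)` in the direct form raises `OverflowError`, while the reformulated version stays finite.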

* Keep the array to fix the conflict

* [C2] Compute Adagrad effective LR

The AdagradWithLR op outputs an extra blob which contains the average effective learning rate across all weights in this blob.

* Open-source extractMetaNetDef & runGlobalInitialization, add new Predictor constructor from db file, and add run_map_outputs

1. Open-source extractMetaNetDef and runGlobalInitialization, for use in
2. new Predictor constructor from db file.
3. Add new run function that returns outputs as TensorMap

* Disable eigen cpu

Disable eigen cpu in transpose and reduce

* Introduce request_only/object_only property of ModelLayer

by default this is False

* A simple TC Caffe2 benchmark

We can run the tuner, get MappingOptions and then use them to
compare against cuBLAS

currently broken due to LLVM issues. How to run:

hg checkout eec1ab31b59c03b8deded1c755a9abaf8c45be01
add D7401202
add D7434625
add D7506031
add D7540728

buck run @mode/dev-nosan tc/tc/benchmarks_python:caffe2_benchmark

* Move Caffe2 feature_maps_ops to open source

Need feature maps operators in open source project facebookresearch/BlueWhale

* Manually fix the conflicts in channel shuffle op

* Fix the inconsistency between different gh and fbcode

* Skip Adagrad GPU Test (Because some gpu implementation is missing)

* Fix another test to make sure it won't run on gpu when implementation is not available yet
2018-05-01 20:49:00 -07:00
Orion Reblitz-Richardson
0ac4d19a29 Linter changes. 2018-03-30 21:00:44 -07:00
Orion Reblitz-Richardson
02786a3819 Linter changes. 2018-03-30 21:00:44 -07:00
Orion Reblitz-Richardson
1d5780d42c Remove Apache headers from source.
* LICENSE file contains details, so removing from individual source files.
2018-03-27 13:10:18 -07:00
Marat Dukhan
9123fcc857 Use std::cout instead of LOG(INFO) in TEST_Benchmark implementation
LOG(INFO) can be stripped out at compile-time or disabled at run-time,
but there are hardly any use cases where we want to call TEST_Benchmark,
but don't want to see the result. Additionally, on Android, LOG(INFO)
writes to logcat, which is OK for errors/warnings, but inconvenient
for benchmarking results, as on new phones logcat spawns logs like crazy.
2018-03-20 15:31:03 -04:00
Zhicheng Yan
06f8fc3f49 extend_operator_CostInferenceFunction
Summary:
- Extend SimpleNet::TEST_Benchmark to report extra FLOP, feature map memory, parameter memory at operator-level
- Add cost inference functions for 3D conv, sum, relu, spatial_bn, fc operators.
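
To make the idea of an operator-level cost function concrete, here is a hedged Python sketch for a fully connected layer. It assumes the common convention of counting one multiply-add as two FLOPs and ignores the bias add. The function names are hypothetical, and the formula is not taken from the commit itself.

```python
def fc_flops(batch, in_features, out_features):
    """Approximate FLOPs of a fully connected layer (multiply-add counted as 2)."""
    return 2 * batch * in_features * out_features


def gflops_per_second(flops, millis_per_iter):
    """Convert a FLOP count and a per-iteration time in milliseconds to GFLOP/s."""
    seconds = millis_per_iter * 1e-3
    return flops / seconds / 1e9
```

A benchmark could combine the two by dividing the inferred FLOP count of each operator by its measured time per iteration.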

Reviewed By: sf-wind

Differential Revision: D6909893

fbshipit-source-id: 534492ccf2e15860e86f1e7f759ff338bf57753f
2018-02-09 10:56:29 -08:00
Alexander Sidorov
54f6b18168 Caffe2: Make SimpleNet simple again
Summary:
There is a lot of business logic around various events in
the base net class. SimpleNet doesn't have to handle those (checked
with ilia-cher). Normally there should be no events registered for
simple nets, but we can have some issues where they will be added, so
it's less error-prone to just keep SimpleNet::Run pure. And then we
also avoid extra virtual calls / empty vector iterations.

Reviewed By: ilia-cher

Differential Revision: D6551440

fbshipit-source-id: c97a732a00bb36eed49d35e727156ce94225a08b
2017-12-14 11:20:20 -08:00
Alexander Sidorov
3de8661184 Disable SDT calls for all nets by default
Summary:
We see a non-trivial overhead because of this debugging
code. I talked with Romain and it looks like we can comment this out for
now. We will think about a better way to integrate this kind of
functionality in Caffe2 going forward.

Reviewed By: romain-intel, pietern

Differential Revision: D6551108

fbshipit-source-id: efa3e643b953d33dc5f3d11f88cafdf2730bc4e4
2017-12-13 21:33:08 -08:00
Ilia Cherniavskii
1149b9bbb5 Polling async net executor
Summary:
Implementation of polling async net executor.
Notes:
- New net executor async_polling - schedules CPU and GPU ops asynchronously, uses single polling thread
- Events: update to Caffe2 events to support async CPU events, adding new methods:
 Query() - non-blocking checking of event states: INITIALIZED -> RECORDED -> SUCCESS/FAILED
 ErrorMessage() - when an operation runs asynchronously and fails, calling this on the event gives the error message
- Tasks: using existing DAGNet's algorithm to compute CPU and GPU chains, a separate task for each chain
- Polling: using single thread to query state of events - for CPU tasks atomically queries task state, for GPU task - uses cudaEventQuery; using Event
- Scheduling of CPU ops: using global thread pools
- Scheduling of GPU ops: using GPU thread pool per GPU device
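
The event life cycle and the single polling thread described in these notes can be sketched in Python as follows. This is a toy model under stated assumptions: the class and function names are hypothetical, worker threads stand in for the thread pools, and nothing here is the Caffe2 Event API.

```python
import enum
import threading
import time


class EventState(enum.Enum):
    INITIALIZED = 0
    RECORDED = 1
    SUCCESS = 2
    FAILED = 3


class Event:
    """Toy event whose query() never blocks on the underlying work."""

    def __init__(self):
        self._lock = threading.Lock()
        self._state = EventState.INITIALIZED
        self._error = ""

    def record(self):
        with self._lock:
            self._state = EventState.RECORDED

    def finish(self, error=None):
        with self._lock:
            self._state = EventState.SUCCESS if error is None else EventState.FAILED
            self._error = "" if error is None else error

    def query(self):
        with self._lock:
            return self._state

    def error_message(self):
        with self._lock:
            return self._error


def run_task(event, fn):
    """Run fn on the calling thread and report the outcome through the event."""
    event.record()
    try:
        fn()
    except Exception as exc:
        event.finish(str(exc))
    else:
        event.finish()


def poll_until_done(events):
    """Poll all events from one thread until each is SUCCESS or FAILED."""
    unfinished = (EventState.INITIALIZED, EventState.RECORDED)
    pending = list(events)
    while pending:
        pending = [e for e in pending if e.query() in unfinished]
        time.sleep(0.001)  # avoid a hot spin in this toy version
    return all(e.query() is EventState.SUCCESS for e in events)
```

Each task would be handed to `run_task` on a worker thread, while the caller invokes `poll_until_done` and reads `error_message()` from any event that ended in `FAILED`.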

Reviewed By: dzhulgakov

Differential Revision: D5985110

fbshipit-source-id: a9de7fcbb71d046a3aa1b573072b89a65dfeee8c
2017-11-03 07:27:44 -07:00
Bram Wasti
7d16d320d5 expose observers to python, add multiple observers per observable
Summary: observer framework can now be used in python + a small writeup of how to use it.  this is D6035393 with a fix for ct-scan

Reviewed By: salexspb

Differential Revision: D6066380

fbshipit-source-id: 896c4c580d4387240b81ac2dbbc43db51d4bfeb9
2017-10-16 14:32:56 -07:00
Scott Yost
a7a81351f2 Revert D6035393: [caffe2] expose observers to python, add multiple observers per observable
Summary:
This reverts commit 4563cf0203095fa979bb2160621cd16dd22ff830

bypass-lint

Differential Revision: D6035393

fbshipit-source-id: 090fba774ce433904f7ef769dda75c2fbbf784a8
2017-10-14 21:47:34 -07:00
Bram Wasti
58fe66e337 expose observers to python, add multiple observers per observable
Summary: observer framework can now be used in python + a small writeup of how to use it

Reviewed By: sf-wind

Differential Revision: D6035393

fbshipit-source-id: 4563cf0203095fa979bb2160621cd16dd22ff830
2017-10-14 13:09:29 -07:00
Yangqing Jia
b1508e8e86 Revert D5905002: [caffe2] expose observers to python
Summary:
This reverts commit e40ec24a55e08fb73beea9b4f3b68e71fc66ffb1

bypass-lint

Differential Revision: D5905002

fbshipit-source-id: 4f1b79d9a318978f6b74565f633f34b9701a9d5c
2017-10-10 22:12:00 -07:00
Bram Wasti
63caca89db expose observers to python
Summary: observer framework can now be used in python + a small writeup of how to use it

Reviewed By: salexspb

Differential Revision: D5905002

fbshipit-source-id: e40ec24a55e08fb73beea9b4f3b68e71fc66ffb1
2017-10-10 16:10:41 -07:00
Yangqing Jia
8286ce1e3a Re-license to Apache
Summary: Closes https://github.com/caffe2/caffe2/pull/1260

Differential Revision: D5906739

Pulled By: Yangqing

fbshipit-source-id: e482ba9ba60b5337d9165f28f7ec68d4518a0902
2017-09-28 16:22:00 -07:00
Luke Yeager
f841446fbb Formatting fix for verbose net logging
Summary:
This doesn't look quite right:
`I0915 21:26:03.910737    19 net_simple.cc:24] Creating operator :ConstantFill`
Closes https://github.com/caffe2/caffe2/pull/1218

Differential Revision: D5888865

Pulled By: Yangqing

fbshipit-source-id: 7db5059fd952c200a11fdcf01126e43497565116
2017-09-21 20:13:14 -07:00
Romain Cledat
77ea40c01a Added USDT sample points to simple net
Summary:
This enables opsnoop to work with simple net as opposed
to just dag net

Reviewed By: pietern

Differential Revision: D5721732

fbshipit-source-id: c38d0b51d3b0469ecb2883e7075eeee7acf81d75
2017-09-13 19:10:16 -07:00
Yangqing Jia
95c954abc0 redesigning NetBase's Run() and RunAsync() functionalities
Summary:
Right now, each net implements 2 functions: Run() and RunAsync(). The (loose) abstraction is:

* Run(): run the network in a synchronous way. The call is synchronous.
* RunAsync(): run the network *still synchronously*, but potentially use asynchronous scheduling of the underlying operators.

As one can see, this is highly confusing: RunAsync() is actually a sync call, and the semantics it tries to implement should actually be done by a different net type. For example, DAGNet and AsyncDAGNet both implement the Run() function, and under the hood one uses sync scheduling and one uses async scheduling. Currently, the only user of the RunAsync() function is in SimpleNet::RunAsync(). The only call site is in recurrent_net_op.

Instead, the operator implements the two Run() and RunAsync() functions as follows:

* Run(): run the operator in a synchronous way. aka doing FinishDeviceComputation().
* RunAsync(): run the operator in an asynchronous way if possible (i.e. still sync in CPU, but async in cuda), records the action in the event_, and return immediately.

Semantically, Run() is equal to RunAsync() followed by event().Finish().

As a result, we propose in diff D5812854 to make the network interface similar to the operator interface, and explicitly raise RunAsync() as a first class citizen of the net interface. Specifically:

* Adding a SupportsAsync() function that determines if a net supports async execution or not.
* Run(): run the net in a synchronous way.
* RunAsync(): if SupportsAsync() is false, same as Run(). if SupportsAsync() is true, run the operator in an asynchronous way, with the scheduling algorithm determined by the implementation itself. Then, record all outstanding events in the events_ field, and return immediately.

Semantically, Run() is equal to calling RunAsync() and then calling event.Finish() on all the events. This is actually the implementation: Run() is no longer a virtual function, but RunAsync() is, and all subclasses of NetBase shall implement SupportsAsync() and RunAsync() now.

**Why SupportsAsync()?**

This is a design idea that probably needs iterating. Basically, the idea is that RunAsync() is the main entry for the net execution, and it's actually like RunAsyncIfTheNetSupportsIt().

In theory, Run() is basically a wrapper on top of RunAsync() to reduce code duplication: if a net type does not support RunAsync(), its RunAsync() implementation simply is sync (see e.g. SimpleNet) and the Run() to RunAsync() lowering is a no-op (with the only overhead being a nested function call).

I exposed the SupportsAsync() function just in case some caller wants to explicitly check whether an instantiated net supports async call or not - for example, a caller may want to make sure that it is actually running a net asynchronously, in which case SupportsAsync() is the place to query.
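
The shape of the proposed interface can be sketched in Python. This is an illustration only, not the C++ NetBase code: method names are transliterated to snake_case, `DeferredNet` is a hypothetical async-capable net, and the toy `Event` simply defers work until `finish()` is called instead of running it on another thread or device.

```python
class Event:
    """Toy completion handle; finish() performs the deferred work once."""

    def __init__(self, work):
        self._work = work
        self._done = False

    def finish(self):
        if not self._done:
            self._work()
            self._done = True


class NetBase:
    """run() is a plain wrapper; subclasses implement supports_async() and run_async()."""

    def __init__(self, ops):
        self.ops = ops
        self.events = []

    def supports_async(self):
        raise NotImplementedError

    def run_async(self):
        raise NotImplementedError

    def run(self):
        # run() == run_async() followed by finish() on every outstanding event.
        ok = self.run_async()
        for event in self.events:
            event.finish()
        self.events = []
        return ok


class SimpleNet(NetBase):
    """No async support: run_async() executes every op inline."""

    def supports_async(self):
        return False

    def run_async(self):
        for op in self.ops:
            op()
        return True


class DeferredNet(NetBase):
    """Hypothetical async net: run_async() records events and returns immediately."""

    def supports_async(self):
        return True

    def run_async(self):
        self.events = [Event(op) for op in self.ops]
        return True
```

A caller that must know whether work is still outstanding after `run_async()` returns can check `supports_async()` first. For `SimpleNet`, lowering `run()` to `run_async()` costs only one nested call.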

Reviewed By: dzhulgakov

Differential Revision: D5812854

fbshipit-source-id: 916b38fded0eb14439f340ab254a034ac5a9a465
2017-09-13 00:02:20 -07:00
Bram Wasti
c609f22638 added gflop annotation to TEST_benchmark
Summary: TEST_benchmark will print out gflops if it can infer them

Reviewed By: Maratyszcza

Differential Revision: D5412644

fbshipit-source-id: 3af7bb42cda4684e30db6d8ae5484d441898479c
2017-08-31 14:18:20 -07:00
Yangqing Jia
65112f3865 code cleanup: separate the several net implementations to separate files.
Summary: TSIA.

Reviewed By: harouwu

Differential Revision: D5670906

fbshipit-source-id: 507e789978144341bf696fb20dc11f3c2d55493b
2017-08-21 22:07:48 -07:00