Commit Graph

45 Commits

Author SHA1 Message Date
Nikita Shulga
a9b0a921d5 Disable avoid-non-const-global-variables lint check (#62008)
Summary:
The GoogleTest `TEST` macro is non-compliant with this check, and so is `DEFINE_DISPATCH`.

All changes except the ones to `.clang-tidy` were generated using the following script:
```
for i in `find . -type f -iname "*.c*" -or -iname "*.h"|xargs grep cppcoreguidelines-avoid-non-const-global-variables|cut -f1 -d:|sort|uniq`;  do sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" $i; done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008

Reviewed By: driazati, r-barnes

Differential Revision: D29838584

Pulled By: malfet

fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
2021-07-22 18:04:40 -07:00
Nikita Shulga
4cb534f92e Make PyTorch code-base clang-tidy compliant (#56892)
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os

def get_compiled_files_list():
    # Map the compile_commands.json entries back to repo-relative source paths.
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files

def run_clang_tidy(fname):
    # Run clang-tidy on a single file and commit any NOLINT stubs it generated.
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname, "-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit", "--all", "-m", f"NOLINT stubs for {fname}"])

def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        # Only process tracked files that are actually part of the build.
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)

if __name__ == "__main__":
    main()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892

Reviewed By: H-Huang

Differential Revision: D27991944

Pulled By: malfet

fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
2021-04-28 14:10:25 -07:00
Giuseppe Ottaviano
69bb0e0285 [caffe2] Avoid some double (and triple) lookups in workspace (#53319)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53319

Noticed these in profiles.

Also switch to `unordered_map`.
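
For context, a minimal standalone sketch of the lookup pattern (using a hypothetical blob map, not the actual caffe2 code): a `count()`/`at()` pair hashes and searches the map twice, while a single `find()` reuses the iterator it already obtained.

```cpp
#include <memory>
#include <string>
#include <unordered_map>

struct Blob {};

// Hypothetical blob map, only to illustrate the lookup pattern.
std::unordered_map<std::string, std::unique_ptr<Blob>> blobs;

Blob* GetBlobDoubleLookup(const std::string& name) {
  // Two lookups: one for count(), another for at().
  if (blobs.count(name) == 0) {
    return nullptr;
  }
  return blobs.at(name).get();
}

Blob* GetBlobSingleLookup(const std::string& name) {
  // One lookup: find() returns an iterator that is reused for the access.
  auto it = blobs.find(name);
  return it == blobs.end() ? nullptr : it->second.get();
}
```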

Test Plan: Unit tests.

Reviewed By: swolchok

Differential Revision: D26504408

fbshipit-source-id: 9e14d55909a4af019058b8c27c67ee2348cd02a9
2021-03-04 22:57:02 -08:00
Ilia Cherniavskii
01986e9890 Wait for all op types in SimpleNet (#39493)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39493

Make sure we wait for all op types, including async CPU ops.

Test Plan: CI

Reviewed By: kennyhorror

Differential Revision: D21873540

fbshipit-source-id: 37875cade68e1b3323086833f8d4db79362a68e8
2020-06-11 13:00:34 -07:00
Yangqing Jia
38f3d1fc40 move flags to c10 (#12144)
Summary:
Still in flux.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12144

Reviewed By: smessmer

Differential Revision: D10140176

Pulled By: Yangqing

fbshipit-source-id: 1a313abed022039333e3925d19f8b3ef2d95306c
2018-10-04 02:09:56 -07:00
Edward Yang
91797c0672 Replace direct include of caffe2.pb.h with an intermediary header caffe2_pb.h (#10946)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10946

```
codemod -d . --extensions cc,cpp,cu,cuh,h caffe2/proto/caffe2.pb.h caffe2/proto/caffe2_pb.h
```

Reviewed By: houseroad

Differential Revision: D9539945

fbshipit-source-id: 497d04720e8e7e61c05ffe1b23733d0cb774de7e
2018-08-28 11:57:08 -07:00
Andrei Maximov
432b3adffc Print blob sizes on fatal signal (#10766)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10766

Added a `Workspace::ForEach(...)` API for accessing the global set of
existing Workspace instances. This is used in the signal handler to print blob
info on the thread receiving a fatal signal.
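
A minimal standalone sketch of what a ForEach-style registry can look like (an assumed shape, not the actual caffe2 implementation):

```cpp
#include <functional>
#include <mutex>
#include <unordered_set>

// Every live Workspace registers itself in a global set, so a handler can
// walk all existing instances via ForEach.
class Workspace {
 public:
  Workspace() {
    std::lock_guard<std::mutex> guard(registry_mutex());
    registry().insert(this);
  }
  ~Workspace() {
    std::lock_guard<std::mutex> guard(registry_mutex());
    registry().erase(this);
  }

  // Invokes `fn` on every existing workspace while holding the registry lock.
  static void ForEach(const std::function<void(Workspace*)>& fn) {
    std::lock_guard<std::mutex> guard(registry_mutex());
    for (Workspace* ws : registry()) {
      fn(ws);
    }
  }

 private:
  static std::unordered_set<Workspace*>& registry() {
    static std::unordered_set<Workspace*> set;
    return set;
  }
  static std::mutex& registry_mutex() {
    static std::mutex m;
    return m;
  }
};
```

Taking a lock inside a fatal-signal handler is only best effort, which is acceptable for the debugging-oriented use described above.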

Reviewed By: mraway

Differential Revision: D9147768

fbshipit-source-id: a94d0b5e6c88390a969ef259ecb8790173af01a4
2018-08-23 13:39:55 -07:00
Dmytro Dzhulgakov
7bc87172ea Kill Tensor::shares_data (#10217)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10217

It's only used in debug printing and is not that reliable anyway. If we want to implement it later, we should do it with proper accounting for shared storages.

Reviewed By: jerryzh168

Differential Revision: D9155685

fbshipit-source-id: 48320d41a0c4155645f3ba622ef88730a4567895
2018-08-03 17:40:39 -07:00
Orion Reblitz-Richardson
1d5780d42c Remove Apache headers from source.
* LICENSE file contains details, so removing from individual source files.
2018-03-27 13:10:18 -07:00
Marat Dukhan
224493d9ce NNPACK: Use new bindings and custom thread pool
Summary:
This change should dramatically (~10X) improve the performance of convolution with the NNPACK engine.
Closes https://github.com/caffe2/caffe2/pull/1730

Reviewed By: sf-wind

Differential Revision: D6695895

Pulled By: Maratyszcza

fbshipit-source-id: 26291916811ef4cb819a59aec848c4e23668e568
2018-01-11 10:48:12 -08:00
Ilia Cherniavskii
d28720b90a Backpropagation for While op
Summary: Adds backprop support to the While op and fixes gradient computation for Pow.

Reviewed By: azzolini

Differential Revision: D6456875

fbshipit-source-id: 9f660317ad6f3898ff7d8ce43098f85c3426409b
2017-12-18 16:03:45 -08:00
Alexander Sidorov
bfdd864631 Automatically pretranspose FCs in BlackBoxPredictor
Summary:
Pretransposing FCs seems to offset the losses we get from low batch sizes in AdIndexer. First I confirmed this on local benchmarks (see the previous diff). Then in https://fburl.com/yuo49onj I showed that this change saves 19% of FC time on AdIndexer, which is already $0.4M in capex and over 3 years gives 5x ROI.

We can also reuse this code for later, more efficient gemm implementations. E.g., msmelyan is working on a new fp16 gemm that would cut bandwidth usage 2x; the repacking required by that gemm can reuse the code in this diff.

In this diff I had to take care of memory usage. Here are several possible approaches to the transformation:

1. Perform the transformation on the fly, copying the memory. This is what is done in skinny gemm (FC with engine SKINNY).

Cons: slow first execution, memory is replicated for each thread

2. Copy the weights in the operator constructor. In debug mode, verify on the fly that the hash of the original weights is unchanged.

Cons: memory is still replicated for each thread

3. Copy the weights in the Predictor constructor.

Cons: if we have 2 predictors sharing the same weight blob (via PredictorContainer), we still get 3x the memory, i.e. the original weights plus a copy for each of the two predictors in the container

4. Replace the weights in the Predictor constructor, and maintain a mapping to support weight sharing within a Predictor container.

This is the approach taken in this diff; it solves the issues above and doesn't create any memory overhead.

Cons: the logic becomes more complex and requires a mutex at initialization time
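
A minimal standalone sketch of this sharing pattern (hypothetical names, not the actual BlackBoxPredictor code): the pretransposed copy is cached keyed by the address of the original weight blob, so predictors that share a weight blob also share a single transposed copy, and the mutex is only taken at initialization time.

```cpp
#include <memory>
#include <mutex>
#include <unordered_map>
#include <vector>

struct Matrix {
  std::vector<float> data;
  int rows = 0, cols = 0;
};

Matrix Transpose(const Matrix& w) {
  Matrix t;
  t.rows = w.cols;
  t.cols = w.rows;
  t.data.resize(w.data.size());
  for (int r = 0; r < w.rows; ++r)
    for (int c = 0; c < w.cols; ++c)
      t.data[c * w.rows + r] = w.data[r * w.cols + c];
  return t;
}

// Returns the shared pretransposed copy for `original`, creating it on first use.
std::shared_ptr<const Matrix> GetOrCreatePretransposed(const Matrix* original) {
  static std::mutex mu;
  static std::unordered_map<const Matrix*, std::shared_ptr<const Matrix>> cache;
  std::lock_guard<std::mutex> guard(mu);
  auto& slot = cache[original];
  if (!slot) {
    slot = std::make_shared<const Matrix>(Transpose(*original));
  }
  return slot;
}
```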

Reviewed By: akyrola

Differential Revision: D6214593

fbshipit-source-id: 25da6ba7bfd39fc8f4b578094d3f334c7957490d
2017-11-09 17:35:32 -08:00
Hassan Eslami
5388948b59 CreateLocalBlob for workspace
Summary: Adds the ability to create a local blob in the workspace even if the blob exists in the parent workspace. This is to support cases where a user wants to create a local copy of the blob and hide the blob from the parent workspace.
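
A simplified standalone model of the intended lookup and shadowing semantics (not the actual caffe2 Workspace class): GetBlob falls through to the parent, while CreateLocalBlob always creates an entry in the child, so the parent's blob of the same name is hidden from then on.

```cpp
#include <iostream>
#include <string>
#include <unordered_map>

struct Blob {
  std::string value;
};

class Workspace {
 public:
  explicit Workspace(Workspace* parent = nullptr) : parent_(parent) {}

  // Returns the existing blob (local or inherited from the parent) if any;
  // otherwise creates it locally.
  Blob* CreateBlob(const std::string& name) {
    if (Blob* existing = GetBlob(name)) return existing;
    return &blobs_[name];
  }

  // Always creates (or reuses) a local entry, even if a blob with the same
  // name exists in the parent, hiding the parent's blob from this workspace.
  Blob* CreateLocalBlob(const std::string& name) { return &blobs_[name]; }

  Blob* GetBlob(const std::string& name) {
    auto it = blobs_.find(name);
    if (it != blobs_.end()) return &it->second;
    return parent_ ? parent_->GetBlob(name) : nullptr;
  }

 private:
  Workspace* parent_;
  std::unordered_map<std::string, Blob> blobs_;
};

int main() {
  Workspace parent;
  parent.CreateBlob("w")->value = "parent data";

  Workspace child(&parent);
  std::cout << child.GetBlob("w")->value << "\n";   // "parent data"

  child.CreateLocalBlob("w")->value = "local copy";
  std::cout << child.GetBlob("w")->value << "\n";   // "local copy"
  std::cout << parent.GetBlob("w")->value << "\n";  // still "parent data"
}
```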

Reviewed By: akyrola

Differential Revision: D6194386

fbshipit-source-id: 92c064159ac635ee76c211abc013b72bd8752447
2017-11-01 21:32:47 -07:00
Yangqing Jia
8286ce1e3a Re-license to Apache
Summary: Closes https://github.com/caffe2/caffe2/pull/1260

Differential Revision: D5906739

Pulled By: Yangqing

fbshipit-source-id: e482ba9ba60b5337d9165f28f7ec68d4518a0902
2017-09-28 16:22:00 -07:00
Ilia Cherniavskii
f8f5e79f5f Backpropagation for If operator
Summary:
Adding backward-pass support for the If operator:
 - Implemented the necessary changes to the Do operator and to the generation of the gradient Do operator to properly forward gradient blobs into and out of the subnet
 - Using WorkspaceManager to keep track of workspaces used by Do, in case we need access to local blobs to compute gradients (also important for loop backprop)
 - Updated Workspace to handle blob binding from multiple parent workspaces
 - Implemented generation of the gradient If operator
 - Added a unit test that builds and trains a net with the If control op

Reviewed By: azzolini

Differential Revision: D5745096

fbshipit-source-id: 1023c90a2113716254424d1e50b9e560fe9083e5
2017-09-18 16:17:42 -07:00
Ilia Cherniavskii
67a55b81e3 Forward blobs into workspace
Summary:
Better isolation for workspaces to allow forwarding selected blobs
from parent to child workspace, possibly under new names. Used for proper
isolation of subnets (loops, then/else branches, etc.) from the outer workspace.

Reviewed By: azzolini

Differential Revision: D5681667

fbshipit-source-id: e61a2c7c98ee2abf1f0761905f4bfae47c201c32
2017-08-22 18:45:56 -07:00
Jon Morton
9349dab8a0 Full sync of fbcode to fbobjc/fbandroid
Summary:
running ##xplat/caffe2/fb_sync.sh##.
Also add two new core sources to the BUCK file, and add ##createSharedBuffer## to NNPACKConvOp.

Reviewed By: ajtulloch

Differential Revision: D5373061

fbshipit-source-id: c030b2629d2715e1d2776c98715f57e2650922c9
2017-07-31 17:38:38 -07:00
Jon Morton
9b9df3fbeb Sync mobile codebase changes back to fbcode
Summary: Rather chunky sync of changes made exclusively to mobile codebases back to fbcode.

Reviewed By: ajtulloch

Differential Revision: D5314405

fbshipit-source-id: c4d0a7244468f953eb63288306bc9bc78eb9e1be
2017-07-18 17:54:41 -07:00
Junjie Bai
5881aa0a78 Use shared_ptr to share OperatorDef across threads
Reviewed By: akyrola

Differential Revision: D5434291

fbshipit-source-id: 89f470d1e2dcde36c3273d86565b1952d7682808
2017-07-17 23:49:59 -07:00
Aapo Kyrola
d43b42fb37 allow querying tensor device + tool to validate that all ops have tensors from correct devices (GPUs)
Summary:
A quite common, hard-to-debug performance bug in multi-GPU training has been operators being passed tensors that reside on a different GPU than the one the op runs on. Since we have peer access enabled, this works, but it is just much slower. With the data parallel model (DPM) this problem rarely arises, as it does static analysis of the operators, but it can happen if someone bypasses DPM or uses FeedBlob with incorrect device options.

To make debugging easier, I added a device field to the tensor that stores the device that allocated its memory. In addition, I added a function that goes through operator inputs and outputs and compares each tensor's device to the operator's device. This check is run only after the first iteration, and only with prof_dag.

Also renamed ShapeCall to TensorInfoFun, as it now returns much more info than just the shape.

I think this is a pretty safe diff, but do you find it problematic to add a new field to the tensor?
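
A standalone sketch of such a validation pass (hypothetical types, not the actual caffe2 code): each tensor remembers the device that allocated its memory, and every input and output device is compared against the device the operator is configured to run on.

```cpp
#include <iostream>
#include <string>
#include <vector>

struct Device { int gpu_id = -1; };  // -1 means CPU

struct TensorInfo {
  std::string name;
  Device device;  // recorded at allocation time
};

struct OpInfo {
  std::string type;
  Device device;  // from the operator's DeviceOption
  std::vector<TensorInfo> inputs, outputs;
};

// Warn about any input/output tensor that lives on a different GPU than the op.
void ValidateOpTensorDevices(const OpInfo& op) {
  auto check = [&](const TensorInfo& t, const char* kind) {
    if (t.device.gpu_id != op.device.gpu_id) {
      std::cerr << "Op " << op.type << " on gpu " << op.device.gpu_id
                << " uses " << kind << " blob '" << t.name
                << "' allocated on gpu " << t.device.gpu_id
                << " (peer access makes this work, but slowly)\n";
    }
  };
  for (const auto& t : op.inputs) check(t, "input");
  for (const auto& t : op.outputs) check(t, "output");
}
```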

Reviewed By: dzhulgakov

Differential Revision: D5335505

fbshipit-source-id: 511b6c122dff9a205f43951984868ffd40f7ac30
2017-07-01 09:16:37 -07:00
Alisson Gusatti Azzolini
db1d62caf7 Move RunPlan to a separate file
Summary: This RunPlan is getting complex and confusing. The first step to clean it up is to move it out of workspace.cc to better mark separation of concerns.

Reviewed By: kennyhorror

Differential Revision: D5100721

fbshipit-source-id: 4be0559eba1abb8bb1ddc3818698763c2e014ef2
2017-05-24 11:07:15 -07:00
Aapo Kyrola
c86610b738 special executor class for RecurrentNetworks (just single threaded now)
Summary:
This is a preamble for the "diagonal executor". Instead of creating a Net for each timestep, we have a single executor for the RecurrentNetworkOp that manages ops per timestep.
This will be used if net_type='rnn', so one can still use the old way by choosing a net type of 'simple' or 'dag' (an effective kill switch if this turns out to have issues).

This is done only for the forward model. The gradient op will follow later; it is basically similar, just in reverse order.

Reviewed By: salexspb

Differential Revision: D4979933

fbshipit-source-id: bda77918ec518cb6b29d7021ee036d59eb2dd303
2017-05-01 19:06:25 -07:00
Yangqing Jia
cf317d1106 create_net: explicitly specify if one wants to overwrite the network.
Summary:
This is from a discussion with dzhulgakov: as a step towards revisiting the
core.Net autonaming, we will first guard against accidental overwrites of
existing networks in the workspace.

ajtulloch since we are doing Predictors in mobile, this should be safe right?

azzolini - I assume this would be safe, but would love to get your approval.

akyrola - would this hurt xray?
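
A minimal sketch of the kind of guard being proposed (a hypothetical shape, not the actual caffe2 API): creating a net whose name already exists in the workspace is an error unless the caller explicitly asks to overwrite it.

```cpp
#include <map>
#include <memory>
#include <stdexcept>
#include <string>

struct Net { std::string name; };

class Workspace {
 public:
  // Refuses to silently replace an existing net unless overwrite is requested.
  Net* CreateNet(const std::string& name, bool overwrite = false) {
    auto it = nets_.find(name);
    if (it != nets_.end() && !overwrite) {
      throw std::runtime_error(
          "Net '" + name + "' already exists; pass overwrite=true to replace it.");
    }
    auto net = std::make_unique<Net>();
    net->name = name;
    Net* raw = net.get();
    nets_[name] = std::move(net);
    return raw;
  }

 private:
  std::map<std::string, std::unique_ptr<Net>> nets_;
};
```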

Reviewed By: dzhulgakov

Differential Revision: D4897725

fbshipit-source-id: aa41271927ad6671f07a53b9505283623f8c49e5
2017-04-17 21:46:53 -07:00
Aapo Kyrola
a2065f3c1e report capacity bytes as part of workspace blob stats
Summary: Instead of reporting the total number of elements of a tensor, report the number of bytes. Report the tensor's capacity, not the currently used number of bytes.

Reviewed By: jamesr66a, salexspb

Differential Revision: D4851633

fbshipit-source-id: 464d552f41f1b5f25753b0e7001d299b6dac1966
2017-04-07 19:16:37 -07:00
Aapo Kyrola
ffd298376a option to print tensor shapes at exit
Summary:
Added the Caffe2 command-line option --caffe2_print_blob_sizes_at_exit=1 which, when enabled, prints all tensor sizes in the workspace destructor. Especially handy when using sub-workspaces, e.g. with RNNs. Note that the sizes are numbers of elements, not bytes. The output is designed to be easily copy-pasteable into Excel.

TODO: add sorting

Reviewed By: jamesr66a

Differential Revision: D4844628

fbshipit-source-id: 11608a1710ae5c89bbd741edb506d25496606185
2017-04-06 21:36:04 -07:00
Ou Jin
eeb7279020 compile execution step
Summary:
When the execution step represents something like:
for loop
  execution_step
     net1
  execution_step
     net2
     net3
the preparation cost of the execution step is too high.
This diff moves most of the shared information into the CompiledExecutionStep to save time.

After the change, the benchmark results for the parameter server handler are as follows (be aware that the first two have some variance):
INFO:__main__:==Summary==
INFO:__main__:Time <function case_if at 0x7f7160c32938> 0.0752924203873
INFO:__main__:Time <function case_loop at 0x7f7160c329b0> 0.0677666187286
INFO:__main__:Time <function case_simple_net at 0x7f7160c32a28> 0.0605396509171
INFO:__main__:Time <function case_one_loop at 0x7f7160c32aa0> 0.0611681699753

Before the change:
INFO:__main__:==Summary==
INFO:__main__:Time <function case_if at 0x7f19d079f848> 0.100815701485
INFO:__main__:Time <function case_loop at 0x7f19d079f8c0> 0.0864136457443
INFO:__main__:Time <function case_simple_net at 0x7f19d079f938> 0.0614696979523
INFO:__main__:Time <function case_one_loop at 0x7f19d079f9b0> 0.0598972082138

Reviewed By: azzolini

Differential Revision: D4643926

fbshipit-source-id: 5a4b97230ba778e0ff5cbafc8a216335a191068a
2017-03-08 23:49:41 -08:00
Alisson Gusatti Azzolini
8fa156d082 Improve "reporter net" design
Summary:
Previously a reporter net had several limitations:
 - it needed to be a net, not an execution step
 - only one was allowed per execution step, with a single interval

Now, "reporter nets" become reporter steps, and multiple of them can be specified with different timeouts.

Reviewed By: dzhulgakov

Differential Revision: D4583686

fbshipit-source-id: ad7266e16f96e7829fd24dcc1f165f39e9db573d
2017-02-21 20:17:40 -08:00
Alexander Sidorov
8bff8014b3 print out inputs in lstm test to catch when it is fluky
Summary:
We get flaky LSTM tests on a numerical gradient check. I would like to improve the accuracy of the latter, but first I need an example. After landing this, TestWarden would find a bad input for me.

Reviewed By: urikz

Differential Revision: D4467223

fbshipit-source-id: 68d4bf22af11190f39fa28332c6d99efbb192132
2017-01-25 20:59:21 -08:00
Liang Xiong
1aafeb3565 clean up memory of c2/sigrid predictor
Summary: Trying to optimize c2 predictor memory usage, mainly by removing the unused dbreader and dper metadata.

Differential Revision: D4232595

fbshipit-source-id: dcd7aa7dd09587ec9811a9e5ec725e0c22757665
2016-11-29 15:18:39 -08:00
Jeff Johnson
da7add3da8 Better threadpool sizing heuristics
Summary:
The old heuristic functioned badly on octa-core phones (e.g., the S6). Limiting the number of threads to 4 in the 8-core case seemed to give optimum performance. For 4 cores, 3 threads still seems to yield the best performance, as does 2 threads for 2 cores on iOS phones, though those cores are very different from the typical ARM cores in Android phones.

I figure that at the limit we should restrict ourselves to half the available cores, especially since in a big.LITTLE configuration only half the cores are likely to be big.

I need to get my hands on a deca-core phone or tablet to try out this heuristic, but I certainly figure that this will function better than what we had before (which would have been 9 threads on a 10-core device).
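
A minimal sketch of the sizing rule described above (a hypothetical helper, not the actual caffe2 code): use all cores for very small counts, leave one core free on quad-core parts, and cap at half the cores beyond that.

```cpp
#include <algorithm>
#include <cstddef>

// 2 cores -> 2 threads, 4 cores -> 3 threads, 8 cores -> 4 threads,
// 10 cores -> 5 threads (instead of 9 under the old heuristic).
size_t defaultThreadCount(size_t numCores) {
  if (numCores <= 2) return numCores;
  if (numCores <= 4) return numCores - 1;
  return std::max<size_t>(numCores / 2, 3);
}
```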

Reviewed By: ajtulloch

Differential Revision: D4220341

fbshipit-source-id: 06fa7677789fcdbec03d98bb85a565f1d22099e1
2016-11-29 15:18:37 -08:00
Yangqing Jia
589398950f fbsync at f5a877 2016-11-18 15:41:06 -08:00
Yangqing Jia
238ceab825 fbsync. TODO: check if build files need update. 2016-11-15 00:00:46 -08:00
Yangqing Jia
d1e9215184 fbsync 2016-10-07 13:08:53 -07:00
Yangqing Jia
b23e51d467 chunky sync 2016-09-06 15:55:19 -07:00
Yangqing Jia
05512d1e10 sync 2016-08-10 11:02:15 -07:00
Yangqing Jia
bcea409c82 sync 2016-07-28 15:06:43 -07:00
Yangqing Jia
6463eebc7b chunky sync - build scripts to be written 2016-07-21 10:16:42 -07:00
Yangqing Jia
559053d3a8 chunky sync 2016-05-13 14:43:48 -07:00
Yangqing Jia
4f2530d8ce expose benchmark code to python 2015-12-15 20:42:54 -08:00
Yangqing Jia
73f3daf736 minor bugfix for workspace 2015-12-13 08:37:36 -08:00
Yangqing Jia
a3dcd9250a bugfix 2015-11-10 23:11:05 -08:00
Yangqing Jia
d734ddc196 Adding optional Eigen code. Added a switch USE_SYSTEM_EIGEN in Env. Misc changes. 2015-10-18 16:55:24 -07:00
Yangqing Jia
648d1b101a A consolidation of a couple random weekend work.
(1) various bugfixes.
(2) Tensor is now a class independent from its data type. This allows us
    to write easier type-independent operators.
(3) code convention changes a bit: dtype -> T, Tensor<*Context> -> Tensor* alias.
(4) ParallelNet -> DAGNet to be more consistent with what it does.
(5) Caffe's own flags library instead of gflags.
(6) Caffe's own logging library instead of glog, but glog can be chosen with
    compile-time definition -DCAFFE2_USE_GOOGLE_GLOG. As a result, glog macros
    like CHECK, DCHECK now have prefix CAFFE_, and LOG(*) now becomes
    CAFFE_LOG_*.
(7) an optional protobuf inclusion, which can be chosen with USE_SYSTEM_PROTOBUF
    in build_env.py.
2015-10-11 23:14:06 -07:00
Yangqing Jia
036229c889 [style] Finishing name changes for the rest of the fields in the protobuf. 2015-07-01 18:16:43 -07:00
Yangqing Jia
2ed1077a83 A clean init for Caffe2, removing my earlier hacky
commits.
2015-06-25 16:26:01 -07:00