Summary:
Removes the `cppcoreguidelines-avoid-non-const-global-variables` check and its NOLINT stubs, as the GoogleTest `TEST` macro is non-compliant with the check, as is `DEFINE_DISPATCH`.
All changes except those to `.clang-tidy` were generated using the following script:
```
for i in $(find . -type f \( -iname "*.c*" -or -iname "*.h" \) \
           | xargs grep cppcoreguidelines-avoid-non-const-global-variables \
           | cut -f1 -d: | sort | uniq); do
  sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" "$i"
done
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008
Reviewed By: driazati, r-barnes
Differential Revision: D29838584
Pulled By: malfet
fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os


def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files


def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname, "-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit", "--all", "-m", f"NOLINT stubs for {fname}"])


def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)


if __name__ == "__main__":
    main()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892
Reviewed By: H-Huang
Differential Revision: D27991944
Pulled By: malfet
fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53319
Noticed these in profiles.
Also switch to `unordered_map`.
Test Plan: Unit tests.
Reviewed By: swolchok
Differential Revision: D26504408
fbshipit-source-id: 9e14d55909a4af019058b8c27c67ee2348cd02a9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39493
Make sure we wait for all types, including async CPU ops
Test Plan: CI
Reviewed By: kennyhorror
Differential Revision: D21873540
fbshipit-source-id: 37875cade68e1b3323086833f8d4db79362a68e8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10766
Added a `Workspace::ForEach(...)` API for accessing the global set of
existing Workspace instances. This is used in the signal handler to print blob
info on the thread receiving a fatal signal.
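A minimal usage sketch, assuming the callable receives each live `Workspace*` (the exact signature is an assumption based on current Caffe2 headers):
```cpp
#include <iostream>

#include "caffe2/core/workspace.h"

using caffe2::Workspace;

// Dump the blob names of every workspace that currently exists, e.g. from
// a fatal-signal handler as described above.
void DumpAllWorkspaces() {
  Workspace::ForEach([](Workspace* ws) {
    for (const auto& blob_name : ws->Blobs()) {
      std::cerr << blob_name << "\n";
    }
  });
}
```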
Reviewed By: mraway
Differential Revision: D9147768
fbshipit-source-id: a94d0b5e6c88390a969ef259ecb8790173af01a4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10217
It's only used in debug printing and is not that reliable anyway. If we want to implement it later, we should do it with proper accounting for shared storages.
Reviewed By: jerryzh168
Differential Revision: D9155685
fbshipit-source-id: 48320d41a0c4155645f3ba622ef88730a4567895
Summary: Adds support for backprop to While op, fixes gradient computation for Pow
Reviewed By: azzolini
Differential Revision: D6456875
fbshipit-source-id: 9f660317ad6f3898ff7d8ce43098f85c3426409b
Summary:
Pretransposing FCs seems to offset the losses we get from low
batch sizes in AdIndexer. First I confirmed this on local benchmarks (see
previous diff). Then in https://fburl.com/yuo49onj I showed that this
change saves 19% of FC time on AdIndexer, which is already $0.4M in
cap. exp. and over 3 years gives 5x more ROI.
We can also reuse this code for later, more efficient gemm
implementations. E.g., msmelyan is working on a new fp16 gemm which
would cut bandwidth usage 2x; the code in this diff can be reused for
the repacking required by the new gemm.
In this diff I had to take care of memory usage. Here are several
possible approaches to the transformation:
1. Perform the transposition on the fly, copying the memory. This is what is done in
skinny gemm (FC with engine SKINNY).
Cons: slow first execution, memory is replicated for each thread.
2. Copy the weights in the operator constructor. On the fly, in dbg
mode, verify that the hash of the original weights is unchanged.
Cons: memory is still replicated for each thread.
3. Copy the weights in the Predictor constructor.
Cons: if we have 2 predictors sharing the same weight blob (via
PredictorContainer), we still get 3x more memory, i.e. the original
weights plus one copy for each of the predictors in the container.
4. Replace the weights in the Predictor constructor, taking care of the mapping to
support weight sharing within a PredictorContainer.
This is the approach taken in this diff; it solves the issues above and
doesn't create any memory overhead (see the sketch below).
Cons: the logic became complex and requires a mutex at initialization time.
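A minimal sketch of the idea behind approach 4 (all names here are hypothetical stand-ins, not the actual code in this diff): a process-wide map from the original weight blob to its pretransposed copy, guarded by a mutex, so predictors that share a weight blob also share a single repacked copy.
```cpp
#include <map>
#include <memory>
#include <mutex>

// Hypothetical stand-in for a pretransposed FC weight.
struct PackedWeight { /* repacked weight data */ };

// Hypothetical repacking routine (the actual transposition is elided).
PackedWeight Pretranspose(const void* /*original*/) { return PackedWeight{}; }

// One repacked copy per original weight blob, shared by all predictors that
// reference the same blob (e.g. via PredictorContainer).
std::shared_ptr<PackedWeight> GetOrCreatePacked(const void* original_blob) {
  static std::mutex mu;
  static std::map<const void*, std::weak_ptr<PackedWeight>> cache;
  std::lock_guard<std::mutex> guard(mu); // only contended at init time
  auto& slot = cache[original_blob];
  if (auto existing = slot.lock()) {
    return existing; // another predictor already repacked this weight
  }
  auto packed = std::make_shared<PackedWeight>(Pretranspose(original_blob));
  slot = packed;
  return packed;
}
```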
Reviewed By: akyrola
Differential Revision: D6214593
fbshipit-source-id: 25da6ba7bfd39fc8f4b578094d3f334c7957490d
Summary: Adds the ability to create a local blob in a workspace even if the blob exists in the parent workspace. This supports cases where a user wants to create a local copy of a blob, shadowing the one in the parent workspace.
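A sketch of the resulting behavior (assuming the shared-workspace constructor and `CreateLocalBlob` as they appear in current Caffe2 headers):
```cpp
#include "caffe2/core/workspace.h"

using caffe2::Workspace;

void Example() {
  Workspace parent;
  parent.CreateBlob("w");

  Workspace child(&parent);
  child.GetBlob("w");         // resolves to the parent's blob
  child.CreateLocalBlob("w"); // shadows it with a child-local blob
  // From now on, "w" in the child refers to the local copy; the
  // parent's "w" is untouched.
}
```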
Reviewed By: akyrola
Differential Revision: D6194386
fbshipit-source-id: 92c064159ac635ee76c211abc013b72bd8752447
Summary:
Adding backward pass support for If operator:
- Implemented necessary changes to Do operator and generation of gradient Do operator to properly forward gradient blobs in and out of subnet
- Using WorkspaceManager to keep track of workspaces used by Do, in case we need to have access to local blobs to compute gradients (also important for loop's backprop)
- Update to Workspace to handle blob binding from multiple parent workspaces
- Implemented generation of gradient If operator
- Unit test to build and train a net with If control op
Reviewed By: azzolini
Differential Revision: D5745096
fbshipit-source-id: 1023c90a2113716254424d1e50b9e560fe9083e5
Summary:
Better isolation for workspaces, allowing forwarding of selected blobs
from a parent to a child workspace, possibly under new names. Used for proper
isolation of subnets (loops, then/else branches, etc.) from the outer workspace.
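A sketch of the intended usage (the constructor signature and the direction of the name mapping are assumptions based on the description above):
```cpp
#include <string>
#include <unordered_map>

#include "caffe2/core/workspace.h"

using caffe2::Workspace;

void Example() {
  Workspace outer;
  outer.CreateBlob("outer/x");

  // Forward only "outer/x" into the child, renamed to "x"; all other outer
  // blobs stay invisible to the subnet running in the child workspace.
  std::unordered_map<std::string, std::string> forwarded = {{"x", "outer/x"}};
  Workspace child(&outer, forwarded);
}
```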
Reviewed By: azzolini
Differential Revision: D5681667
fbshipit-source-id: e61a2c7c98ee2abf1f0761905f4bfae47c201c32
Summary:
Running `xplat/caffe2/fb_sync.sh`.
Also adds two new core sources to the BUCK file, and adds `createSharedBuffer` to NNPACKConvOp.
Reviewed By: ajtulloch
Differential Revision: D5373061
fbshipit-source-id: c030b2629d2715e1d2776c98715f57e2650922c9
Summary: Rather chunky sync of changes made exclusively to mobile codebases back to fbcode.
Reviewed By: ajtulloch
Differential Revision: D5314405
fbshipit-source-id: c4d0a7244468f953eb63288306bc9bc78eb9e1be
Summary:
A quite common, hard-to-debug performance bug in multi-GPU training has been operators being passed tensors that reside on a different GPU than the one the op runs on. Since we have peer access enabled, this works, but it is just much slower. With data parallel model this problem arises rarely, as it has static analysis of the operators, but if someone bypasses DPM or uses FeedBlob with incorrect device options, this problem can happen.
To make debugging easier, I added a device field to the tensor that stores the device information recorded when the memory was allocated. In addition, I added a function that goes through operator inputs and outputs and compares each tensor's device to the operator's device. This check is run after the first iteration, and only with prof_dag.
Also renamed ShapeCall to TensorInfoFun, as it now returns much more info than just the shape.
I think this is a pretty safe diff, but do you find it problematic to add a new field to tensor?
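A self-contained sketch of the check (simplified stand-ins, not the actual Caffe2 types):
```cpp
#include <iostream>
#include <string>
#include <vector>

// Simplified stand-ins for the Caffe2 operator/tensor types.
struct TensorInfo {
  std::string name;
  int device_id; // GPU recorded at allocation time by the new device field
};
struct OpInfo {
  std::string type;
  int device_id; // GPU the operator runs on
  std::vector<TensorInfo> inputs, outputs;
};

// Warn whenever an op touches a tensor allocated on a different GPU: with
// peer access enabled this still works, just much slower.
void CheckTensorDevices(const std::vector<OpInfo>& ops) {
  for (const auto& op : ops) {
    auto check = [&op](const std::vector<TensorInfo>& tensors, const char* verb) {
      for (const auto& t : tensors) {
        if (t.device_id != op.device_id) {
          std::cerr << "Op " << op.type << " (GPU " << op.device_id << ") "
                    << verb << " blob " << t.name << " allocated on GPU "
                    << t.device_id << "\n";
        }
      }
    };
    check(op.inputs, "reads");
    check(op.outputs, "writes");
  }
}
```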
Reviewed By: dzhulgakov
Differential Revision: D5335505
fbshipit-source-id: 511b6c122dff9a205f43951984868ffd40f7ac30
Summary: This RunPlan is getting complex and confusing. The first step to clean it up is to move it out of workspace.cc to better mark separation of concerns.
Reviewed By: kennyhorror
Differential Revision: D5100721
fbshipit-source-id: 4be0559eba1abb8bb1ddc3818698763c2e014ef2
Summary:
This is a preamble for the "diagonal executor". Instead of creating a Net for each timestep, we have a single executor for the RecurrentNetworkOp that manages ops per timestep.
This will be used if net_type='rnn', so one can still use the old way by setting a net type of 'simple' or 'dag' (an effective kill switch if there are issues with this); see the sketch below.
Did this only for the forward model. The gradient op will follow later on; it is basically similar, just in reverse order.
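Opting in or out is just a matter of the step net's `type` field; a sketch (assuming the standard `NetDef` proto):
```cpp
#include "caffe2/proto/caffe2.pb.h"

void ConfigureStepNet(caffe2::NetDef* step_net) {
  step_net->set_type("rnn");       // use the new per-timestep executor
  // step_net->set_type("simple"); // kill switch: fall back to the old path
  // step_net->set_type("dag");    // ...or the dag executor
}
```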
Reviewed By: salexspb
Differential Revision: D4979933
fbshipit-source-id: bda77918ec518cb6b29d7021ee036d59eb2dd303
Summary:
This is from a discussion with dzhulgakov: as a step towards revisiting the
core.Net autonaming, we will first guard against accidental overwrites of
existing networks in the workspace.
ajtulloch, since we are doing Predictors on mobile, this should be safe, right?
azzolini - I assume this would be safe, but would love to get your approval.
akyrola - would this hurt xray?
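A sketch of the guarded behavior (the `overwrite` flag is how current Caffe2 exposes this escape hatch; treat the exact signature as an assumption):
```cpp
#include "caffe2/core/workspace.h"

void Example(caffe2::Workspace* ws, const caffe2::NetDef& net_def) {
  ws->CreateNet(net_def);                     // creates the net
  ws->CreateNet(net_def);                     // rejected by the new guard: name already exists
  ws->CreateNet(net_def, /*overwrite=*/true); // explicit overwrite allowed
}
```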
Reviewed By: dzhulgakov
Differential Revision: D4897725
fbshipit-source-id: aa41271927ad6671f07a53b9505283623f8c49e5
Summary: Instead of reporting the total number of elements of a tensor, report the number of bytes. Note that this reports the capacity of the tensor, not the number of bytes currently in use.
Reviewed By: jamesr66a, salexspb
Differential Revision: D4851633
fbshipit-source-id: 464d552f41f1b5f25753b0e7001d299b6dac1966
Summary:
Added the Caffe2 command-line option --caffe2_print_blob_sizes_at_exit=1 which, when enabled, prints all tensor sizes in the workspace destructor. Especially handy when using sub-workspaces, as with RNNs. Note that the sizes are numbers of elements, not bytes. The output is designed to be easily copy-pasteable into Excel.
TODO: add sorting
Reviewed By: jamesr66a
Differential Revision: D4844628
fbshipit-source-id: 11608a1710ae5c89bbd741edb506d25496606185
Summary:
When the execution step represents something like:
```
for loop
  execution_step
    net1
  execution_step
    net2
    net3
```
the preparation cost for the execution step is too high.
This diff moves most of the shared information into the CompiledExecutionStep to save time.
After the change, the benchmark results for the parameter server handler are as follows (be aware that the first two have some variance):
```
INFO:__main__:==Summary==
INFO:__main__:Time <function case_if at 0x7f7160c32938> 0.0752924203873
INFO:__main__:Time <function case_loop at 0x7f7160c329b0> 0.0677666187286
INFO:__main__:Time <function case_simple_net at 0x7f7160c32a28> 0.0605396509171
INFO:__main__:Time <function case_one_loop at 0x7f7160c32aa0> 0.0611681699753
```
Before the change:
```
INFO:__main__:==Summary==
INFO:__main__:Time <function case_if at 0x7f19d079f848> 0.100815701485
INFO:__main__:Time <function case_loop at 0x7f19d079f8c0> 0.0864136457443
INFO:__main__:Time <function case_simple_net at 0x7f19d079f938> 0.0614696979523
INFO:__main__:Time <function case_one_loop at 0x7f19d079f9b0> 0.0598972082138
```
Reviewed By: azzolini
Differential Revision: D4643926
fbshipit-source-id: 5a4b97230ba778e0ff5cbafc8a216335a191068a
Summary:
Previously we had several limitations for a reporter net:
- it needed to be a net, not an execution step
- only one was allowed per execution step, with a single interval
Now, "reporter nets" become reporter steps, and multiple of them can be specified with different timeouts.
Reviewed By: dzhulgakov
Differential Revision: D4583686
fbshipit-source-id: ad7266e16f96e7829fd24dcc1f165f39e9db573d
Summary:
We get flaky LSTM tests on a numerical gradient check. I
would like to improve the accuracy of the latter, but first I need an
example. After landing this, TestWarden would find a bad input for me.
Reviewed By: urikz
Differential Revision: D4467223
fbshipit-source-id: 68d4bf22af11190f39fa28332c6d99efbb192132
Summary:
The old heuristic functioned badly on octa-core phones (e.g., the S6). Limiting the number of threads to 4 in the 8-core case seemed to give optimal performance. For 4 cores, 3 threads still seems to yield the best performance, as do 2 threads for 2 cores on the iOS phones, though those cores are very different from the typical ARM cores in Android phones.
I figure that at the limit we should restrict ourselves to half the cores available, especially since in a big.LITTLE configuration only half the cores are likely to be big.
I need to get my hands on a deca-core phone or tablet to try out this heuristic, but I certainly figure that this will function better than what we had before (which would have been 9 threads on a 10-core device).
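A sketch of the heuristic as described above (illustrative only, not the literal code):
```cpp
// Hand-tuned thread counts for small core counts; at the limit, use half
// the cores, since in big.LITTLE only half are likely to be big.
int ChooseNumThreads(int num_cores) {
  switch (num_cores) {
    case 1:
      return 1;
    case 2:
      return 2; // iOS dual-core phones: both cores
    case 4:
      return 3; // quad-core: 3 threads performs best
    default:
      return num_cores / 2; // e.g. 8 cores -> 4 threads, 10 -> 5
  }
}
```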
Reviewed By: ajtulloch
Differential Revision: D4220341
fbshipit-source-id: 06fa7677789fcdbec03d98bb85a565f1d22099e1
(1) Various bugfixes.
(2) Tensor is now a class independent of its data type. This allows us
to write type-independent operators more easily.
(3) The code convention changes a bit: dtype -> T, Tensor<*Context> -> Tensor* alias.
(4) ParallelNet -> DAGNet, to be more consistent with what it does.
(5) Caffe's own flags library instead of gflags.
(6) Caffe's own logging library instead of glog, but glog can be chosen with the
compile-time definition -DCAFFE2_USE_GOOGLE_GLOG. As a result, glog macros
like CHECK and DCHECK now have the prefix CAFFE_, and LOG(*) becomes
CAFFE_LOG_* (see the sketch after this list).
(7) An optional protobuf inclusion, which can be chosen with USE_SYSTEM_PROTOBUF
in build_env.py.
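A minimal illustration of the renaming in (6); the fallback definitions here are simplified stand-ins for the built-in logging library, not its actual implementation:
```cpp
#ifdef CAFFE2_USE_GOOGLE_GLOG
// With glog chosen at compile time, the CAFFE_ spellings map back to glog.
#include <glog/logging.h>
#define CAFFE_CHECK(cond) CHECK(cond)
#define CAFFE_LOG_INFO LOG(INFO)
#else
// Stand-in definitions so the example is self-contained.
#include <cassert>
#include <iostream>
#define CAFFE_CHECK(cond) assert(cond)
#define CAFFE_LOG_INFO std::cout
#endif

int main() {
  int x = 42;
  CAFFE_CHECK(x > 0);                       // was: CHECK(x > 0)
  CAFFE_LOG_INFO << "value: " << x << "\n"; // was: LOG(INFO) << ...
  return 0;
}
```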