Summary: Rename static tracepoint macros to better describe their targeted usage.
Test Plan:
Same as for D47159249:
Tested the following macros on test scripts with libbpf USDTs:
* `CAFFE_SDT`
* `CAFFE_DISABLE_SDT`
* `CAFFE_SDT_WITH_SEMAPHORE`
Reviewed By: chaekit
Differential Revision: D47727339
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106380
Approved by: https://github.com/chaekit
Summary: Moving the static tracepoint macros header to a location where it can be easily used by various PyTorch components (`c10/util`).
Test Plan:
Same as for D47159249:
Tested the following macros on test scripts with libbpf USDTs:
* `CAFFE_SDT`
* `CAFFE_DISABLE_SDT`
* `CAFFE_SDT_WITH_SEMAPHORE`
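For reference, a minimal sketch of the kind of call site these macros are exercised at (the probe name, arguments, and post-move include path are illustrative assumptions, not taken from this PR):
```
#include <c10/util/static_tracepoint.h>  // assumed post-move location of the header

#include <cstdint>
#include <string>

void RunOp(const std::string& op_type, int64_t iter) {
  // Emits a USDT probe that libbpf/bpftrace tooling can attach to.
  CAFFE_SDT(operator_start, op_type.c_str(), iter);
  // ... actual operator work ...
}
```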
Reviewed By: EDG-GH
Differential Revision: D47636258
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105856
Approved by: https://github.com/EDG-GH, https://github.com/chaekit
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67436
This information is useful for comparing static runtime to c2
Reviewed By: d1jang
Differential Revision: D31991571
fbshipit-source-id: eb83bc4564b05d56fb9a550863eea3f6312f3f6c
Summary:
The GoogleTest `TEST` macro is non-compliant with the `cppcoreguidelines-avoid-non-const-global-variables` check, as is `DEFINE_DISPATCH`.
All changes but the ones to `.clang-tidy` were generated using the following script:
```
# Delete every "// NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)"
# comment from the C/C++ sources and headers that contain it.
for i in $(find . -type f -iname "*.c*" -or -iname "*.h" \
    | xargs grep cppcoreguidelines-avoid-non-const-global-variables \
    | cut -f1 -d: | sort | uniq); do
  sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" "$i"
done
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008
Reviewed By: driazati, r-barnes
Differential Revision: D29838584
Pulled By: malfet
fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os


def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files


def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname, "-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit", "--all", "-m", f"NOLINT stubs for {fname}"])


def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)


if __name__ == "__main__":
    main()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892
Reviewed By: H-Huang
Differential Revision: D27991944
Pulled By: malfet
fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52903
Implement BlackBoxPredictor::BenchmarkIndividualOps so that we can clean up the output tensors properly after each iteration and get more accurate per operator timing.
Add four more metrics to track setup_time, memory_alloc_time, memory_dealloc_time, and output_dealloc_time.
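For illustration, per-phase timing of the kind described here could be accumulated roughly like this (a hedged sketch; the struct and function names are hypothetical and the real BenchmarkIndividualOps API differs):
```
#include <chrono>

// Hypothetical per-iteration accounting for the four metrics named above.
struct BenchMetrics {
  double setup_time_ms = 0;
  double memory_alloc_time_ms = 0;
  double memory_dealloc_time_ms = 0;
  double output_dealloc_time_ms = 0;
};

// Runs fn and returns the elapsed wall-clock time in milliseconds.
template <typename Fn>
double TimedMs(Fn&& fn) {
  const auto start = std::chrono::steady_clock::now();
  fn();
  const auto end = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(end - start).count();
}

// Usage (FreeOutputTensors is a stand-in for the per-iteration output cleanup):
//   metrics.output_dealloc_time_ms += TimedMs([&] { FreeOutputTensors(); });
```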
Reviewed By: ajyu
Differential Revision: D26657473
fbshipit-source-id: 1cf282192b531513b9ee40b37252087818412f81
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39493
Make sure we wait for all types, incl. async cpu ops
Test Plan: CI
Reviewed By: kennyhorror
Differential Revision: D21873540
fbshipit-source-id: 37875cade68e1b3323086833f8d4db79362a68e8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25908
Original commit changeset: f6e961e88c01
device_option propagation is completely broken in Caffe2 for cases where pass-through operators are used. As an example, the Gather operator doesn't have a gradient and passes through its inputs, which results in incorrect detection of the components for sparse parameter aggregation (the component will be empty instead of the real device).
This diff is trying to fix this issue.
The original diff had a problem: Caffe2 does not handle cases where a device option is present but contains only metadata (for example, the one for auto-generated reduction ops in the backward pass). This diff addresses that issue by merging device options during the backward pass.
Test Plan:
1. net_transform is finally working with a Gather + FloatToHalf transformed model instead of failing because of an incorrect number of components.
2. New unit test.
3. Verified that the previously broken benchmark is now passing.
ezyang do you have suggestions what else I should test?
Reviewed By: ezyang
Differential Revision: D17281528
fbshipit-source-id: 4a1bc386f29f6a34fbf8008effde9d4890abebfa
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12714
This is a short change to enable the c10 namespace in caffe2. We did not enable
it before due to gflags global-variable confusion, but that should be mostly
cleaned up now. Right now, the plan on record is that namespace caffe2 and
namespace aten will be full supersets of namespace c10.
Most of the diff is a codemod; the only two non-codemod changes are in caffe2/core/common.h, where
```
using namespace c10;
```
is added, and in Flags.h, where instead of creating aliasing variables in the c10 namespace, we put them directly in the global namespace to match gflags (with the same behavior when gflags is not built in).
Reviewed By: dzhulgakov
Differential Revision: D10390486
fbshipit-source-id: 5e2df730e28e29a052f513bddc558d9f78a23b9b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9770
Add zero ops to operators that do not have a valid schema
Reviewed By: hlu1
Differential Revision: D8957472
fbshipit-source-id: d8d0a351183e88ace2e050a87c1e1c363af67e33
* Fix some signed/unsigned mismatches
* Skip unused result warning
* Explicit fallthrough for murmur hash
* Enable aligned new support to eliminate warning
* Switch to int instead of unsigned in some cases
* Fix handling of empty batches in SumReduceDimsOp
As titled
* Deferrable async_scheduling finishRun fix
Proper order of finishing run operations in deferrable_async_scheduling net
* Simplify exception handling in async_scheduling
Simplify exception handling; there is no need to busy-wait, since the thread that
processes the last task can finish the run
* [C2]worker_coordinator_memorize_worker_ids
As titled. This is related to T28689868, where the number of blobs we want to create is equal to the number of worker ids
* Add unit test for nets with no type set
* Ignore total_length argument in symbolic_pad_packed_sequence
1. There was a mistake in the code: total_length was added to the wrong symbolic function (pack_padded_sequence) instead of (pad_packed_sequence).
2. There is no need to throw an exception if total_length is given, since it is only used to enable data_parallel training on multiple GPUs and doesn't have anything to do with ONNX export, so just ignore it. https://fburl.com/tk4gciqp
* Add support for MKLDNN to async_scheduling
Just add MKLDNN as a possible CPU option to async_scheduling's pool function
* [AuFL][ensemble] support branch output for prediction
This diff supports using predictions from different branches and thus enables model ensembling (not fully independent).
* Fix a bug in add_loss in layer_model_helper
As titled.
* Support lradaption for adam
1. lr adaption operator
2. apply to dense adam
* Perf tweaks for async_scheduling
Restore single pool option + remove unnecessary (no-ops) calls
* add quantization to SparseSimdAdagradOp
add a bunch of quantization signatures to SparseSimdAdagradOp, implementations to come next
* [sr] [codemod] Change all SR callsites to use new API
@allow-large-files
This diff refactors all callsites of SR to use the slightly changed API introduced in the diff below. Really what this means is that you need to include the correct header. Also if you were using `ClientFactory::newFactory` you need to not prefix it with `ClientFactory::`.
```
cd ~/fbsource/fbcode
find ./ -type f -exec sed -i \
  -e 's:#include "servicerouter/client/cpp2/ClientFactory.h":#include "servicerouter/client/cpp2/ServiceRouter.h":' \
  -e 's:#include <servicerouter/client/cpp2/ClientFactory.h>:#include <servicerouter/client/cpp2/ServiceRouter.h>:' \
  -e 's/ClientFactory::newFactory(/newFactory(/g' {} \;
```
Also manually fixed spots that couldn't be done automatically (or broke because they depended on transitive includes).
* Back out "Fix handling of empty batches in SumReduceDimsOp"
Original commit changeset: 282da1730cc2. This commit is blocking the
GitHub->fbcode sync, which really needs to get merged ASAP. D7881937, which this
diff depends on, will be reverted in the sync D7990948, which causes this to
break. The sync diff cannot be patched with this reversion because it must be
landed against base revision 5c8c099, and D7881937 must not be included in the
sync diff because it is breaking GPU tests that are not available in sandcastle:
https://ci.pytorch.org/jenkins/job/caffe2-builds/job/py2-cuda8.0-cudnn6-ubuntu16.04-test/3638/console
is one example.
* Add the flow to support operator benchmark
1) generate a model with the operator, 2) upload it to Everstore, 3) generate the model spec into a JSON file, 4) start running the benchmark
* [tum][gpu] Connect DPM trainer with flow and unit tests
This diff:
- Fix some small bugs for Yiming's recent changes to parallelizer, so it suits real use cases.
- Add correct tags to the TUM code, so we can do data parallel transform
- Pass extra info at instantiation.
- add unit test for using DPM in TUM model
After this diff, we can do a simple-box, multi-GPU, fully-sync trainer for TUM in an FBLearner workflow, but may still need to do speed benchmarking.
* w/o normalized lradaption for adam dense only
The previous lr adaption includes a normalization step when performing the dot product operation. This is not exactly the same as what is proposed in the paper. I added normalization as an option. Without it, the operator performs exactly what the paper proposes; with the option, we add the normalization step.
* [fb] Use SharedPromise in DeferrableAsyncSchedulingNet
This code simplifies DeferrableAsyncSchedulingNet by removing a condition
variable, plus small fixes
* [tum] implement cuda sparseLengthsMean and LengthsMean
as title
* Adding an optional parameter to allow use of protobufs in InferShapesAndTypes function.
* Move feature_to_index to FeatureSpec.feature_to_index
Move feature_to_index to FeatureSpec.feature_to_index to avoid overriding other fields
* [Caffe2] Rename bytes_moved to bytes_written
Just a rename in preparation for supporting bytes_read.
* [c2] fix ReduceFrontSumOp for empty case by setting 0
Otherwise, it may use the results from the last iteration when the batch is empty.
* [Caffe2] [Int8] Improve Intel CPU performance
* [Easy] Improve PrependDim op logging
as titled
* DBFileReader expand db_path using os.path.expanduser(..)
Since there are a lot of possible use cases of `DBFileReader` reading from a user home path, like `~/local/sample.db`, I want to save people the trouble of calling `os.path.expanduser(db_path)` themselves.
* [Caffe2] Add bytes_read to cost structure
We're adding analytical read bytes to cost functions. This extends the structure accordingly for all CostInference defined operators.
Additionally, some small bug fixes were performed:
1) Cost functions now extract type information of operands instead of assuming float
* Fix sleef on aarch64 for hhvm
@bypass-lint
Rename flag
* Remove duplicated part in caffe2/ideep/operators/conv_op.cc
Likely a sync error.
* Rename test helper function test_adagrad_sparse_helper to adagrad_sparse_test_helper to avoid confusing pytest
* [fix] Re-enable events in RNN ops
We earlier added event disabling in RNN ops because back then we didn't use
events; with current use cases this is no longer true
(https://fburl.com/8vd0lp8y)
* Use ops with CUDA impl
* Revert D7729695: [caffe2][fix] Re-enable events in RNN ops
This reverts commit 4b215c7496fb724656ff4c776933a15bdbbcde5e
@bypass-lint
An infra SEV is better than not reverting this diff.
If you copy this password, see you in SEV Review!
@cause_a_sev_many_files
* [observer] Clean up observer_config.h
#accept2ship
* [1/n] Refactor dataio_test.py
Replace code duplication with a common function
* Add barrier net that runs before training nets
Add a synchronize barrier net that is run before training nets. With this net, shards that are faster will wait for other shards before starting training. This reduces the chance of the faster shards timing out during GLOO AllReduce.
Removed the explicit synchronize call in data_parallel_model.py in the holmes workflow. A similar change in the speech/asr_training workflow will come in another diff.
* Support the dnnlowp backend in caffe2_benchmark
This is for SHARE operator latency evaluation
* Migrate integral_image_op to main caffe2
Migrate integral_image_op (GPU version), given by https://fburl.com/yvqezigi,
to caffe2/caffe2/operators and implement its CPU version. Write up a test
using the hypothesis_test mechanism.
* [pos_disc, fbcode] Implement unjoined lr loss
As explained in https://our.intern.facebook.com/intern/wiki/Model_Based_Calibration/, when the dataset is a joined dataset, where labels might change later, we need to use unjoined logloss.
The implementation is almost the same as in Sigrid (https://fburl.com/1trngsls), where
loss = y (log(p) - log(1-p)) + (1-y)(log(1-p)) = xy - (1-y)x - (1-y)log(1+exp(-x))
For x < 0, to ensure stability and avoid overflow, we reformulate the above expression as
loss = xy - (1-y)x + (1-y)x - (1-y)log(1+exp(x)) = xy - (1-y)log(1+exp(x))
Then the final expression becomes
loss = xy + (y - 1) x (x >= 0) - (1 - y) log(1 + exp(x - 2 x (x >= 0)))
where y is the true label, x is the dot product and p = logistic(x).
This kind of implementation is aligned with the current implementation of the original cross entropy in
https://phabricator.intern.facebook.com/diffusion/FBS/browse/master/fbcode/caffe2/caffe2/operators/cross_entropy_op.cc;0bae3b5d0f825897c5e0dd0ff10f489d7271bf25$7-13
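For illustration, a numerically stable implementation of the final expression above could look like the following sketch (not the actual cross_entropy_op kernel):
```
#include <cmath>

// Stable unjoined logistic-regression loss for one example, following the
// reformulated expression above. x is the dot product (logit), y is in {0, 1}.
double UnjoinedLRLoss(double x, double y) {
  const double x_ge_0 = x >= 0.0 ? 1.0 : 0.0;
  return x * y + (y - 1.0) * x * x_ge_0 -
      (1.0 - y) * std::log1p(std::exp(x - 2.0 * x * x_ge_0));
}
```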
* Keep the array to fix the conflict
* [C2] Compute Adagrad effective LR
The AdagradWithLR op outputs an extra blob which contains the average effective learning rate across all the weights in this blob.
* Open-source extractMetaNetDef & runGlobalInitialization, add new Predictor constructor from db file, and add run_map_outputs
1. Open-source extractMetaNetDef and runGlobalInitialization, for use in (2).
2. Add a new Predictor constructor from a db file.
3. Add a new run function that returns outputs as a TensorMap.
* Disable eigen cpu
Disable eigen cpu in transpose and reduce
* Introduce request_only/object_only property of ModelLayer
by default this is False
* A simple TC Caffe2 benchmark
We can run the tuner, get MappingOptions, and then use them to
compare against cuBLAS.
Currently broken due to LLVM issues. How to run:
hg checkout eec1ab31b59c03b8deded1c755a9abaf8c45be01
add D7401202
add D7434625
add D7506031
add D7540728
buck run @mode/dev-nosan tc/tc/benchmarks_python:caffe2_benchmark
* Move Caffe2 feature_maps_ops to open source
Need feature maps operators in open source project facebookresearch/BlueWhale
* Manually fix the conflicts in channel shuffle op
* Fix the inconsistency between GitHub and fbcode
* Skip Adagrad GPU test (because some GPU implementation is missing)
* Fix another test to make sure it won't run on GPU when the implementation is not available yet
LOG(INFO) can be stripped out at compile time or disabled at run time,
but there are hardly any use cases where we want to call TEST_Benchmark
but don't want to see the result. Additionally, on Android, LOG(INFO)
writes to logcat, which is OK for errors/warnings but inconvenient
for benchmarking results, as on new phones logcat spawns logs like crazy.
Summary:
There is a lot of business logic around various events in
the base net class. SimpleNet doesn't have to handle those (checked
with ilia-cher). Normally there should be no events registered for
simple nets, but we can have some issues where they will be added, so
it's less error-prone to just make SimpleNet::Run pure. We then
also avoid extra virtual calls / empty vector iterations.
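For illustration, the "pure" run loop described here amounts to something like the sketch below (names are simplified; this is not the actual SimpleNet code):
```
#include <vector>

struct OperatorBase {
  virtual bool Run() = 0;
  virtual ~OperatorBase() = default;
};

// Run each operator synchronously, with no event bookkeeping at all.
bool SimpleNetRun(const std::vector<OperatorBase*>& ops) {
  for (OperatorBase* op : ops) {
    if (!op->Run()) {
      return false;
    }
  }
  return true;
}
```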
Reviewed By: ilia-cher
Differential Revision: D6551440
fbshipit-source-id: c97a732a00bb36eed49d35e727156ce94225a08b
Summary:
We see a non-trivial overhead because of this debugging
code. I talked with Romain, and it looks like we can comment this out for
now. We will think about a better way to integrate this kind of
functionality into Caffe2 going forward.
Reviewed By: romain-intel, pietern
Differential Revision: D6551108
fbshipit-source-id: efa3e643b953d33dc5f3d11f88cafdf2730bc4e4
Summary:
Implementation of a polling async net executor.
Notes:
- New net executor async_polling - schedules CPU and GPU ops asynchronously, uses a single polling thread
- Events: update to Caffe2 events to support async CPU events, adding new methods:
Query() - non-blocking check of event state: INITIALIZED -> RECORDED -> SUCCESS/FAILED
ErrorMessage() - when an operation runs asynchronously and fails, calling this on the event will give the error message
- Tasks: using the existing DAGNet algorithm to compute CPU and GPU chains, with a separate task for each chain
- Polling: using a single thread to query the state of events - for CPU tasks it atomically queries the task state, for GPU tasks it uses cudaEventQuery via Event
- Scheduling of CPU ops: using global thread pools
- Scheduling of GPU ops: using a GPU thread pool per GPU device
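For illustration, the single-thread polling idea boils down to a loop like the sketch below (simplified; the real executor works over op chains and wraps cudaEventQuery behind the Event abstraction):
```
#include <cstddef>
#include <vector>

enum class EventState { INITIALIZED, RECORDED, SUCCESS, FAILED };

struct Task {
  virtual EventState Query() const = 0;  // non-blocking state check
  virtual ~Task() = default;
};

// Busy-polls every task until all have finished; returns false if any failed.
bool PollUntilDone(const std::vector<Task*>& tasks) {
  std::vector<bool> done(tasks.size(), false);
  std::size_t finished = 0;
  while (finished < tasks.size()) {
    for (std::size_t i = 0; i < tasks.size(); ++i) {
      if (done[i]) {
        continue;
      }
      const EventState s = tasks[i]->Query();
      if (s == EventState::FAILED) {
        return false;
      }
      if (s == EventState::SUCCESS) {
        done[i] = true;
        ++finished;
      }
    }
  }
  return true;
}
```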
Reviewed By: dzhulgakov
Differential Revision: D5985110
fbshipit-source-id: a9de7fcbb71d046a3aa1b573072b89a65dfeee8c
Summary: The observer framework can now be used in Python, plus a small writeup of how to use it. This is D6035393 with a fix for ct-scan.
Reviewed By: salexspb
Differential Revision: D6066380
fbshipit-source-id: 896c4c580d4387240b81ac2dbbc43db51d4bfeb9
Summary: The observer framework can now be used in Python, plus a small writeup of how to use it.
Reviewed By: sf-wind
Differential Revision: D6035393
fbshipit-source-id: 4563cf0203095fa979bb2160621cd16dd22ff830
Summary: The observer framework can now be used in Python, plus a small writeup of how to use it.
Reviewed By: salexspb
Differential Revision: D5905002
fbshipit-source-id: e40ec24a55e08fb73beea9b4f3b68e71fc66ffb1
Summary:
This enables opsnoop to work with the simple net as opposed
to just the DAG net.
Reviewed By: pietern
Differential Revision: D5721732
fbshipit-source-id: c38d0b51d3b0469ecb2883e7075eeee7acf81d75
Summary:
Right now, each net implements 2 functions: Run() and RunAsync(). The (loose) abstraction is:
* Run(): run the network in a synchronous way. The call is synchronous.
* RunAsync(): run the network *still synchronously*, but potentially use asynchronous scheduling of the underlying operators.
As one can see, this is highly confusing: RunAsync() is actually a sync call, and the semantics it tries to implement should actually be done by a different net type. For example, DAGNet and AsyncDAGNet both implement the Run() function, and under the hood one uses sync scheduling and one uses async scheduling. Currently, the only user of the RunAsync() function is in SimpleNet::RunAsync(). The only call site is in recurrent_net_op.
Instead, the operator implements the two Run() and RunAsync() functions as follows:
* Run(): run the operator in a synchronous way. aka doing FinishDeviceComputation().
* RunAsync(): run the operator in an asynchronous way if possible (i.e. still sync on CPU, but async on CUDA), records the action in the event_, and returns immediately.
Semantically, Run() is equal to RunAsync() followed by event().Finish().
As a result, we propose in diff D5812854 to change the network interface to be similar to the operator interface, and explicitly raise RunAsync() as a first-class citizen of the net interface. Specifically, whether a net can run asynchronously is now determined by the net implementation itself:
* Add a SupportsAsync() function that determines whether a net supports async execution or not.
* Run(): run the net in a synchronous way.
* RunAsync(): if SupportsAsync() is false, same as Run(). If SupportsAsync() is true, run the net in an asynchronous way, with the scheduling algorithm determined by the implementation itself. Then record all outstanding events in the events_ field and return immediately.
Semantically, Run() is equal to RunAsync() followed by calling event.Finish() for all the events. This is actually the implementation: Run() is no longer a virtual function, while RunAsync() is; all subclasses of NetBase shall implement SupportsAsync() and RunAsync() now.
**Why SupportsAsync()?**
This is a design idea that probably needs iterating. Basically, the idea is that RunAsync() is the main entry for the net execution, and it's actually like RunAsyncIfTheNetSupportsIt().
In theory, Run() is basically a wrapper on top of RunAsync() to reduce code duplication: if a net type does not support RunAsync(), its RunAsync() implementation is simply synchronous (see e.g. SimpleNet) and the Run()-to-RunAsync() lowering is a no-op (with the only overhead being a nested function call).
I exposed the SupportsAsync() function just in case some caller wants to explicitly check whether an instantiated net supports an async call or not - for example, a caller may want to make sure that it is actually running a net asynchronously, in which case SupportsAsync() is the place to query.
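A rough sketch of the proposed interface, following the description above (member and class names mirror the summary, not the exact NetBase declaration in the codebase):
```
#include <vector>

class Event {
 public:
  virtual void Finish() = 0;  // blocks until the recorded work completes
  virtual ~Event() = default;
};

class NetBase {
 public:
  virtual ~NetBase() = default;
  virtual bool SupportsAsync() = 0;
  // Schedules the work (possibly asynchronously), records outstanding events
  // in events_, and returns immediately.
  virtual bool RunAsync() = 0;
  // No longer virtual: RunAsync() followed by waiting on all recorded events.
  bool Run() {
    if (!RunAsync()) {
      return false;
    }
    for (Event* e : events_) {
      e->Finish();
    }
    return true;
  }

 protected:
  std::vector<Event*> events_;
};
```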
Reviewed By: dzhulgakov
Differential Revision: D5812854
fbshipit-source-id: 916b38fded0eb14439f340ab254a034ac5a9a465
Summary: TEST_Benchmark will print out GFLOPS if it can infer them
Reviewed By: Maratyszcza
Differential Revision: D5412644
fbshipit-source-id: 3af7bb42cda4684e30db6d8ae5484d441898479c