Commit Graph

475 Commits

Author SHA1 Message Date
David Reiss
ad8c0e57ef Add a command-line flag for overriding pthreadpool size (#46781)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46781

Test Plan: Passed it to speed_benchmark_torch and saw perf change.

Reviewed By: iseeyuan

Differential Revision: D24752889

Pulled By: dreiss

fbshipit-source-id: 762981510f271d20f76e33b6e6f361c4a6f48e6c
2020-11-05 21:30:54 -08:00
Tristan Rice
0c9787c758 caffe2: use at::mt19937 instead of std::mt19937 (10x speedup) (#43987)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43987

This replaces the caffe2 CPU random number generator (std::mt19937) with at::mt19937, which is the one currently used in PyTorch. The ATen RNG is 10x faster than the std one and appears to be more robust, given bugs in the std implementation (https://fburl.com/diffusion/uhro7lqb)

For large embedding tables (10GB+) we see UniformFillOp taking upwards of 10 minutes, as we're bottlenecked on the single-threaded RNG. Swapping to at::mt19937 cuts that time to 10% of the current time.
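The bottleneck described above can be sketched as follows. This is an illustrative stand-in, not Caffe2's actual UniformFillOp: the fill loop is templated on the engine type, which is what makes swapping std::mt19937 for a faster engine (such as at::mt19937) a one-line change at the call site.

```cpp
#include <cassert>
#include <random>
#include <vector>

// Hypothetical sketch of a UniformFill-style loop, templated on the RNG
// engine. The loop is single-threaded, so for very large tables the
// per-number cost of the engine dominates the fill time.
template <typename Engine>
void uniform_fill(std::vector<float>& data, float lo, float hi, Engine& eng) {
  std::uniform_real_distribution<float> dist(lo, hi);
  for (auto& v : data) {
    v = dist(eng);  // one engine call per element; engine speed dominates
  }
}
```

With this shape, the commit's change corresponds to instantiating `uniform_fill` with a faster engine type while leaving the loop untouched.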

Test Plan: Ran all relevant tests + CI. This doesn't introduce new features (+ is a core change) so existing tests+CI should be sufficient to catch regressions.

Reviewed By: dzhulgakov

Differential Revision: D23219710

fbshipit-source-id: bd16ed6415b2933e047bcb283a013d47fb395814
2020-10-16 16:08:35 -07:00
Yinghai Lu
a92b49f7c8 [Onnxifi] Don't throw exception when we cannot write out debug files (#45979)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45979

For some reason, we sometimes cannot write out the debug files. This shouldn't block the whole service, so we opt to log an error instead of throwing an exception.

Test Plan: Run the net_runner test at `/` and observe the error being printed out, but the test passes.

Reviewed By: ipiszy

Differential Revision: D24165081

fbshipit-source-id: a4e1d0479d54d741e615e3a00b3003f512394fd4
2020-10-08 00:18:24 -07:00
Michael Suo
18253f4a48 Fix BUILD_CAFFE2 if FBGEMM and NNPACK are not built (#45610)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45610

Also document in the usual places that this option exists.

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D24058199

Pulled By: suo

fbshipit-source-id: 81574fbd042f47587e2c7820c726fac0f68af2a7
2020-10-01 14:58:55 -07:00
Xiang Gao
0a15646e15 CUDA RTX30 series support (#45489)
Summary:
I also opened a PR on cmake upstream: https://gitlab.kitware.com/cmake/cmake/-/merge_requests/5292

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45489

Reviewed By: zhangguanheng66

Differential Revision: D23997844

Pulled By: ezyang

fbshipit-source-id: 4e7443dde9e70632ee429184f0d51cb9aa5a98b5
2020-09-29 18:19:23 -07:00
Lingyi Liu
2d884f2263 Optimize Scale function (#44913)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44913

Pull Request resolved: https://github.com/pytorch/pytorch/pull/18322

Optimize Scale function

i-am-not-moving-c2-to-c10

Test Plan: buck test mode/dbg caffe2/caffe2/python/operator_test:weighted_sum_test

Reviewed By: BIT-silence

Differential Revision: D14575780

fbshipit-source-id: db333a7964581dcaff6e432ff1d6b517ba1a075f
2020-09-18 14:31:33 -07:00
Nikita Shulga
2ae74c0632 Compile less legacy code when BUILD_CAFFE2 is set to False (take 2) (#44453)
Summary:
2nd attempt to land https://github.com/pytorch/pytorch/pull/44079

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44453

Reviewed By: walterddr, seemethere

Differential Revision: D23619528

Pulled By: malfet

fbshipit-source-id: c7c206ebd327dcf3994789bd47008b05ff862fe7
2020-09-11 16:27:47 -07:00
Yangxin Zhong
514f20ea51 Histogram Binning Calibration
Summary:
Adding a calibration module called histogram binning:

Divide the prediction range (e.g., [0, 1]) into B bins. In each bin, use two parameters to store the number of positive examples and the number of examples that fall into this bucket. So we basically have a histogram for the model prediction.

As a result, for each bin we have a statistical value for the real CTR (num_pos / num_example). We use this statistical value as the final calibrated prediction if the pre-calibration prediction falls into the corresponding bin.

In this way, the predictions within each bin should be well-calibrated if we have sufficient examples. That is, we have a fine-grained calibrated model by this calibration module.

Theoretically, this calibration layer can fix any uncalibrated model or prediction if we have sufficient bins and examples. It provides the potential to use any kind of training weight allocation to our training data, without worrying about the calibration issue.
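A minimal sketch of the scheme described above (illustrative only, not the dper3 module; class and member names are invented for the example):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Histogram binning calibration: B bins over [0, 1]; each bin tracks the
// number of positive examples and the total number of examples that fell
// into it. The calibrated prediction is the bin's empirical CTR.
struct HistogramCalibrator {
  explicit HistogramCalibrator(std::size_t bins)
      : pos_(bins, 0.0), total_(bins, 0.0) {}

  std::size_t bin(double p) const {
    std::size_t b = static_cast<std::size_t>(p * pos_.size());
    return b >= pos_.size() ? pos_.size() - 1 : b;  // clamp p == 1.0
  }

  void observe(double prediction, bool label) {
    std::size_t b = bin(prediction);
    total_[b] += 1.0;
    if (label) pos_[b] += 1.0;
  }

  // Empirical CTR of the bin the prediction falls into; falls back to the
  // raw prediction if the bin has no examples yet.
  double calibrate(double prediction) const {
    std::size_t b = bin(prediction);
    return total_[b] > 0.0 ? pos_[b] / total_[b] : prediction;
  }

 private:
  std::vector<double> pos_, total_;
};
```

For example, if four examples with predictions near 0.05 land in the first bin and three of them are positive, any prediction falling into that bin is calibrated to 0.75.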

Test Plan:
buck test dper3/dper3/modules/calibration/tests:calibration_test -- test_histogram_binning_calibration

buck test dper3/dper3_models/ads_ranking/tests:model_paradigm_e2e_tests -- test_sparse_nn_histogram_binning_calibration

All tests passed.

Example workflows:
f215431958

{F326445092}

f215445048

{F326445223}

Reviewed By: chenshouyuan

Differential Revision: D23356450

fbshipit-source-id: c691b66c51ef33908c17575ce12e5bee5fb325ff
2020-09-06 17:11:16 -07:00
Wanchao Liang
d07a36e0c1 Revert D23490149: [pytorch][PR] Compile less legacy code when BUILD_CAFFE2 is set to False
Test Plan: revert-hammer

Differential Revision:
D23490149 (15e99b6ff6)

Original commit changeset: a76382c30d83

fbshipit-source-id: 75057fa9af2c19eb976962552118bf0a99911b38
2020-09-04 22:59:39 -07:00
Nikita Shulga
15e99b6ff6 Compile less legacy code when BUILD_CAFFE2 is set to False (#44079)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44079

Reviewed By: walterddr

Differential Revision: D23490149

Pulled By: malfet

fbshipit-source-id: a76382c30d83127d180ec63ac15093a7297aae53
2020-09-04 20:04:21 -07:00
Hao Lu
39b4701d31 [caffe2][redo] Reimplement RemoveOpsByType with SSA (#41606)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41606

The previous diff (D22220798 (59294fbbb9) and D22220797) was recently reverted (D22492356 (28291d3cf8), D22492355) because of a bug associated with the op AsyncIf. The AsyncIf op has net_defs as args and the SSA rewriting didn't take that into account. It has a special path for the op If, but not for AsyncIf. Several changes I made to fix the bug:
1) Add op AsyncIf to the special path for If op in SSA rewriting
2) clear inputs/outputs of the netdefs that are args in If/AsyncIf ops because they're no longer valid
3) revert renamed inputs/outputs in the arg netdefs that are in the external_outputs in the parent netdef

2) and 3) are existing bugs in the `SsaRewrite` function that were just never exposed before.

The algorithm for `RemoveOpsByType` is the same as in my previous diff D22220798 (59294fbbb9). The only new changes in this diff are in `onnx::SsaRewrite` and a few newly added unit tests.

(Note: this ignores all push blocking failures!)

Reviewed By: yinghai

Differential Revision: D22588652

fbshipit-source-id: ebb68ecd1662ea2bae14d4be8f61a75cd8b7e3e6
2020-07-17 16:06:43 -07:00
David Reiss
b7e044f0e5 Re-apply PyTorch pthreadpool changes
Summary:
This re-applies D21232894 (b9d3869df3) and D22162524, plus updates jni_deps in a few places
to avoid breaking host JNI tests.

Test Plan: `buck test @//fbandroid/mode/server //fbandroid/instrumentation_tests/com/facebook/caffe2:host-test`

Reviewed By: xcheng16

Differential Revision: D22199952

fbshipit-source-id: df13eef39c01738637ae8cf7f581d6ccc88d37d5
2020-06-23 19:26:21 -07:00
Kate Mormysh
92d3182c11 Revert D21232894: Unify PyTorch mobile's threadpool usage.
Test Plan: revert-hammer

Differential Revision:
D21232894 (b9d3869df3)

Original commit changeset: 8b3de86247fb

fbshipit-source-id: e6517cfec08f7dd0f4f8877dab62acf1d65afacd
2020-06-23 17:09:14 -07:00
Ashkan Aliabadi
b9d3869df3 Unify PyTorch mobile's threadpool usage. (#37243)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37243

*** Why ***

As it stands, we have two thread pool solutions concurrently in use in PyTorch mobile: (1) the open source pthreadpool library under third_party, and (2) Caffe2's implementation of pthreadpool under caffe2/utils/threadpool.  Since the primary use-case of the latter has been to act as a drop-in replacement for the third party version so as to enable integration and usage from within NNPACK and QNNPACK, Caffe2's implementation is intentionally written to the exact same interface as the third party version.

The original argument in favor of C2's implementation has been improved performance as a result of using spin locks, as opposed to relinquishing the thread's time slot and putting it to sleep - a less expensive operation up to a point.  That seems to have given C2's implementation the upper hand in performance, hence justifying the added maintenance complexity, until the third party version improved in parallel surpassing the efficiency of C2's implementation as I have verified in benchmarks.  With that advantage gone, there is no reason to continue using C2's implementation in PyTorch mobile either from the perspective of performance or code hygiene.  As a matter of fact, there is considerable performance benefit to be had as a result of using the third party version as it currently stands.

This is a tricky change though, mainly because, in order to avoid potential performance regressions (of which I have witnessed none, but out of an abundance of caution), we have decided to continue using C2's internal implementation whenever building for Caffe2.  Again, this is mainly to avoid potential performance regressions in production C2 use cases, even if doing so results in reduced performance as far as I can tell.

So to summarize, today, and as it currently stands, we are using C2's implementation for (1) NNPACK, (2) PyTorch QNNPACK, and (3) ATen parallel_for on mobile builds, while using the third party version of pthreadpool for XNNPACK, as XNNPACK, unlike NNPACK and QNNPACK, does not provide any build options to link against an external implementation.

The goal of this PR then, is to unify all usage on mobile to the third party implementation both for improved performance and better code hygiene.  This applies to PyTorch's use of NNPACK, QNNPACK, XNNPACK, and mobile's implementation of ATen parallel_for, all getting routed to the exact same third party implementation in this PR.

Considering that NNPACK, QNNPACK, and XNNPACK are not mobile specific, these benefits carry over to non-mobile builds of PyTorch (but not Caffe2) as well.  The implementation of ATen parallel_for on non-mobile builds remains unchanged.

*** How ***

This is where things get tricky.

A good deal of the build system complexity in this PR arises from our desire to maintain C2's implementation intact for C2's use.

pthreadpool is a C library with no concept of namespaces, which means two copies of the library cannot exist in the same binary or symbol collision will occur violating ODR.  This means that somehow, and based on some condition, we must decide on the choice of a pthreadpool implementation.  In practice, this has become more complicated as a result of all the possible combinations that USE_NNPACK, USE_QNNPACK, USE_PYTORCH_QNNPACK, USE_XNNPACK, USE_SYSTEM_XNNPACK, USE_SYSTEM_PTHREADPOOL and other variables can result in.  Having said that, I have done my best in this PR to surgically cut through this complexity in a way that minimizes the side effects, considering the significance of the performance we are leaving on the table, yet, as a result of this combinatorial explosion explained above I cannot guarantee that every single combination will work as expected on the first try.  I am heavily relying on CI to find any issues as local testing can only go so far.

Having said that, this PR provides a simple non mobile-specific C++ thread pool implementation on top of pthreadpool, namely caffe2::PThreadPool that automatically routes to C2's implementation or the third party version depending on the build configuration.  This simplifies the logic at the cost of pushing the complexity to the build scripts.  From there on, this thread pool is used in aten parallel_for, and NNPACK and family, again, routing all usage of threading to C2 or third party pthreadpool depending on the build configuration.

When all is said and done, the layering will look like this:

a) aten::parallel_for, uses
b) caffe2::PThreadPool, which uses
c) pthreadpool C API, which delegates to
    c-1) third_party implementation of pthreadpool if that's what the build has requested, and the rabbit hole ends here.
    c-2) C2's implementation of pthreadpool if that's what the build has requested, which itself delegates to
    c-2-1) caffe2::ThreadPool, and the rabbit hole ends here.

NNPACK, and (PyTorch) QNNPACK directly hook into (c). They never go through (b).
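The layering above can be sketched with a toy stand-in. All names here are illustrative, not the real pthreadpool API: the point is that every caller funnels through one C-style entry point, and only the backend behind it differs per build configuration. The backend below is a plain serial loop; a real backend tiles the range across pool threads with the same contract to the caller.

```cpp
#include <cassert>
#include <cstddef>

// One pthreadpool-style C entry point that aten::parallel_for wrappers and
// NNPACK/QNNPACK-style callers all go through. The callback is invoked once
// per index in [0, range); which implementation sits behind this function
// (third_party pthreadpool vs. C2's) is a build-time decision.
using task_fn = void (*)(void* context, std::size_t index);

void parallelize_1d(task_fn fn, void* context, std::size_t range) {
  // Serial stand-in for the thread pool dispatch.
  for (std::size_t i = 0; i < range; ++i) {
    fn(context, i);
  }
}
```

A caller passes a captureless callback plus a context pointer, exactly as with a C thread pool API, and is indifferent to which backend executes the range.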

Differential Revision: D21232894

Test Plan: Imported from OSS

Reviewed By: dreiss

Pulled By: AshkanAliabadi

fbshipit-source-id: 8b3de86247fbc3a327e811983e082f9d40081354
2020-06-23 16:34:51 -07:00
Xiang Gao
b3fac8af6b Initial support for building on Ampere GPU, CUDA 11, cuDNN 8 (#39277)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39277

This PR contains initial changes that makes PyTorch build with Ampere GPU, CUDA 11, and cuDNN 8.
TF32 related features will not be included in this PR.

Test Plan: Imported from OSS

Differential Revision: D21832814

Pulled By: malfet

fbshipit-source-id: 37f9c6827e0c26ae3e303580f666584230832d06
2020-06-02 10:03:42 -07:00
Xiang Gao
5e2d8745c8 RIP CUDA <9.2: circleci, aten, and caffe2 (#36846)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36846

Test Plan: Imported from OSS

Differential Revision: D21620850

Pulled By: ngimel

fbshipit-source-id: 7ad1676a12f86250f301095ffc6f365a3b370f34
2020-05-18 13:41:05 -07:00
peter
ec7beda822 Use thrust::host_vector instead of std::vector (#38178)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/38024.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38178

Differential Revision: D21502379

Pulled By: ezyang

fbshipit-source-id: 74dd6504c56f4150ed4cef129fd3f32f378c0564
2020-05-11 20:34:04 -07:00
Nikita Shulga
44345ad08c Do not define C10_IOS on Mac (#37283)
Summary:
Because macOS is not iOS
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37283

Test Plan: CI

Differential Revision: D21244398

Pulled By: malfet

fbshipit-source-id: b822e216e83887e2f2961b5c5384eaf749629f61
2020-04-25 13:52:46 -07:00
Richard J. Knight
93cd05b0f4 Fix CMake errors on systems where {Q/X}NNPACK is not supported (#35607)
Summary:
- add a couple of checks for USE_XNNPACK to disable additional
  code paths if XNNPACK is not supported

When passing through the code paths where the platform checks
are made (cmake/Dependencies.cmake:89), if XNNPACK is not
supported, then the var FXDIV_SOURCE_DIR will not be
set. CMake emits the errors when add_directory is called and
FXDIV_SOURCE_DIR is empty.

see: https://github.com/pytorch/pytorch/issues/34606
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35607

Differential Revision: D20895645

Pulled By: seemethere

fbshipit-source-id: 3bd10cf89f0fb6825fdd6e1d52c71ee37c67b953
2020-04-24 12:37:23 -07:00
Dmytro Dzhulgakov
49457a7be7 Logging for ATen op subtype
Summary: ATenOp should go away, but before it does, it's important to understand what's going on inside of it. We already log `arguments`, but it's rather hard to parse in Scuba as it's a list, not a dictionary. Let's extract the operator name explicitly so that grouping works well

Test Plan: unittest

Reviewed By: ngimel

Differential Revision: D21057966

fbshipit-source-id: 86be7cca39055620477a28bd5d8ab29e8edd2ff9
2020-04-19 23:02:50 -07:00
Nikita Shulga
f548946363 Fix out-of-boundary access in caffe2::StartsWith (#36672)
Summary:
`std::mismatch( InputIt1 first1, InputIt1 last1, InputIt2 first2 )` assumes that the container behind the `first2` iterator contains at least `last1 - first1` elements, which is not the case if `prefix` is longer than `str`.
Found while running unit tests on Windows
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36672
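A guarded version of the check can be sketched like this (an illustrative signature, not the exact caffe2::StartsWith): the length test must come first, because the three-iterator `std::mismatch` overload unconditionally reads `prefix.size()` elements starting at `str.begin()`.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Sketch of the fix described above: refuse early when the prefix is longer
// than the string, so std::mismatch never reads past str's end.
bool starts_with(const std::string& str, const std::string& prefix) {
  return prefix.size() <= str.size() &&
         std::mismatch(prefix.begin(), prefix.end(), str.begin()).first ==
             prefix.end();
}
```

Without the size guard, a call like `starts_with("ca", "caffe2")` would walk the comparison past the end of the shorter string.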

Differential Revision: D21049407

Pulled By: malfet

fbshipit-source-id: ad45779d47a0c6898900e0247c920829a2179f62
2020-04-15 20:40:59 -07:00
Dmytro Dzhulgakov
7576cf8d00 [caffe2] Use cpuinfo in perfkernels to simplify build dependency (#36371)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36371

It allows us to drop a circular dependency and remove unknown_symbols in the Buck build.

It'd be good to get rid of GetCpuId altogether in favor of cpuinfo, but it's not really blocking anything

Reviewed By: malfet

Differential Revision: D20958000

fbshipit-source-id: ed17a2a90a51dc1adf9e634af56c85f0689f8f29
2020-04-10 13:26:34 -07:00
Johannes M Dieterich
be125d18dd [ROCm] [ROCm 2.10+] enable fp16 dot in Caffe2 backend (#30432)
Summary:
ROCm 2.10 has an hdot implementation; use it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30432

Differential Revision: D20777482

Pulled By: ezyang

fbshipit-source-id: b4826cc399faa08bd83047375283b17bcd2477eb
2020-04-03 08:01:23 -07:00
Dmytro Dzhulgakov
1f759936f0 Propagate model id used by Predictor to Caffe2 logging
Summary:
Does the same things as D19658565 but for Caffe2 models.

From the investigation https://fb.quip.com/PbgsAEmoJVuf, the model id that the predictor uses and the model id saved inside the model don't match. A common reason is recurring fluent2 jobs, but there are others.

Since the model_id from the predictor is what the rest of the datasets use, it's far more useful imho. I've considered adding both ids, but it'd require additional piping and I don't think it's that useful.

Test Plan: unittests added

Reviewed By: houseroad

Differential Revision: D20630599

fbshipit-source-id: 3e6d0cb0b6f8c8b6ae5935138f55ae7a2ff60653
2020-03-29 23:07:32 -07:00
peter
45c9ed825a Formatting cmake (to lowercase without space for if/elseif/else/endif) (#35521)
Summary:
Running commands:
```bash
shopt -s globstar

sed -e 's/IF (/if(/g' -e 's/IF(/if(/g' -e 's/if (/if(/g' -e 's/ELSE (/else(/g' -e 's/ELSE(/else(/g' -e 's/else (/else(/g' -e 's/ENDif(/endif(/g' -e 's/ELSEif(/elseif(/g' -i CMakeLists.txt
sed -e 's/IF (/if(/g' -e 's/IF(/if(/g' -e 's/if (/if(/g' -e 's/ELSE (/else(/g' -e 's/ELSE(/else(/g' -e 's/else (/else(/g' -e 's/ENDif(/endif(/g' -e 's/ELSEif(/elseif(/g' -i caffe2/**/CMakeLists.txt
sed -e 's/IF (/if(/g' -e 's/IF(/if(/g' -e 's/if (/if(/g' -e 's/ELSE (/else(/g' -e 's/ELSE(/else(/g' -e 's/else (/else(/g' -e 's/ENDif(/endif(/g' -e 's/ELSEif(/elseif(/g' -i torch/**/CMakeLists.txt
sed -e 's/IF (/if(/g' -e 's/IF(/if(/g' -e 's/if (/if(/g' -e 's/ELSE (/else(/g' -e 's/ELSE(/else(/g' -e 's/else (/else(/g' -e 's/ENDif(/endif(/g' -e 's/ELSEif(/elseif(/g' -i c10/**/CMakeLists.txt
sed -e 's/IF (/if(/g' -e 's/IF(/if(/g' -e 's/if (/if(/g' -e 's/ELSE (/else(/g' -e 's/ELSE(/else(/g' -e 's/else (/else(/g' -e 's/ENDif(/endif(/g' -e 's/ELSEif(/elseif(/g' -i cmake/**/*.cmake
sed -e 's/IF (/if(/g' -e 's/IF(/if(/g' -e 's/if (/if(/g' -e 's/ELSE (/else(/g' -e 's/ELSE(/else(/g' -e 's/else (/else(/g' -e 's/ENDif(/endif(/g' -e 's/ELSEif(/elseif(/g' -i cmake/**/*.cmake.in
```
We may further convert all the commands into lowercase according to the following issue: 77543bde41.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35521

Differential Revision: D20704382

Pulled By: malfet

fbshipit-source-id: 42186b9b1660c34428ab7ceb8d3f7a0ced5d2e80
2020-03-27 14:25:17 -07:00
Kevin Matzen
6d8649dc53 [caffe2] fix Transpose2D calls in NHWC<->NCHW (#34625)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34625

These templated function calls are not specifying the template args correctly.  The first arg is the index type, not the array data type.  That means that right now it's using `T` as the index type as well, which will break if we do a template specialization for uint8_t.  If we omit both, it will correctly infer that the index type is `int` and the data type is `T`.
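The pitfall can be illustrated with a simplified transpose whose signature is modeled on, but not copied from, the Caffe2 function: the index type comes first, so explicitly passing one template argument pins the wrong parameter, while passing none lets deduction pick `TIndex = int` from the dimension arguments and `TData` from the pointers.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative transpose with the same template-parameter order described
// above: first the index type, then the data type.
template <typename TIndex, typename TData>
void transpose_2d(TIndex rows, TIndex cols, const TData* src, TData* dst) {
  for (TIndex r = 0; r < rows; ++r) {
    for (TIndex c = 0; c < cols; ++c) {
      dst[c * rows + r] = src[r * cols + c];
    }
  }
}
// Writing transpose_2d<T>(...) would make T the *index* type; calling
// transpose_2d(rows, cols, src, dst) with int dimensions deduces
// TIndex = int and TData = T, which is what the fix relies on.
```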

Reviewed By: BIT-silence

Differential Revision: D20358728

fbshipit-source-id: 8cbd8eeb14bce602c02eb6fce2cc141f0121fa24
2020-03-16 15:18:44 -07:00
Yinghai Lu
79e1305519 [net_runner] Get shape info from qtensors (#34321)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34321

Mostly cosmetic as we can infer the shape anyway. It can remove a lot of the noise in the log though.

Note that weight sharing doesn't work yet. I'll add another diff to address this.

Reviewed By: houseroad

Differential Revision: D20290841

fbshipit-source-id: fe6f9b60d05dbe150af15b5d9d7a69fd902e12cc
2020-03-09 18:34:16 -07:00
Kimish Patel
8269c4f3d3 Added nullptr check for pthreadpool_get_threads_count (#34087)
Summary:
We get a segfault without this when using XNNPACK.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34087

Differential Revision: D20199787

Pulled By: kimishpatel

fbshipit-source-id: d3d274e7bb197461632b21688820cd4c10dcd819
2020-03-04 11:10:53 -08:00
Michael Ranieri
51d969e86a preprocessor cleanup (#33957)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33957

lots of small preprocessor warning cleanup for windows

Test Plan: CI green

Reviewed By: malfet, albanD

Differential Revision: D20153582

fbshipit-source-id: 18fd61c466fd1f55ededdae4448b3009a9cedc04
2020-03-02 13:37:19 -08:00
Kimish Patel
0e52627358 Fixing pthreadpool symbol conflict issue. (#33869)
Summary:
Mainly renames C2's pthread_create, the only conflicting symbol referred to internally in NNPACK, to pthread_create_c2.
Removed 2 other conflicting symbols that are not used internally at all.
Pointed XNNPACK to the original repo instead of the fork.

Copy-pasted the new interface and implementation to
caffe2/utils/threadpool, so that for internal builds we compile against
this.

When threadpool is unified this will be removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33869

Differential Revision: D20140580

Pulled By: kimishpatel

fbshipit-source-id: de70df0af9c7d6bc065e85ede0e1c4dd6a9e6be3
2020-02-28 21:23:18 -08:00
Igor Sugak
23846d5a38 [caffe2] use Clang identification macro in various places (#33574)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33574

Sprinkle the Clang identification macro over the various places that would otherwise cause build errors when Clang is used to drive the CUDA compilation.

Note: `__clang__` is defined when either Clang is used as host compiler by NVCC or when Clang drives the compilation. `__CUDA__` is defined only for the latter case.

Test Plan:
```lang=bash
buck build mode/opt -c fbcode.cuda_use_clang=true //fblearner/flow/projects/dper:workflow
buck build mode/opt //fblearner/flow/projects/dper:workflow
```

Reviewed By: BIT-silence

Differential Revision: D20007440

fbshipit-source-id: 53caa70695b99461a3910d41dc71a9f6d0728a75
2020-02-20 15:16:11 -08:00
Igor Sugak
108fc78395 [caffe2] fix invalid % escape in inline assembly strings (#33554)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33554

NVCC/GCC accept the existing syntax, but Clang does not; it requires a proper escape. Here `%laneid` is one of the many registers that CUDA's pseudo-asm provides [1]. Using the extra `%` doesn't change the semantics, as PTX expects the `%laneid` value after it's processed by the asm tool.

1. https://docs.nvidia.com/cuda/parallel-thread-execution/index.html

Test Plan:
```lang=bash
buck build mode/opt -c fbcode.cuda_use_clang=true //fblearner/flow/projects/dper:workflow
buck build mode/opt //fblearner/flow/projects/dper:workflow
```

Reviewed By: bddppq

Differential Revision: D20003621

fbshipit-source-id: 8e550e55a3455925e7bd92c6df3e504b5d38c2dc
2020-02-20 14:31:52 -08:00
Hao Lu
81394581a3 [Caffe2][ThreadPool] Make sure numThreads does not exceed the number of big cores (#33523)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33523

When using `ThreadPool::setNumThreads` to set the number of threads, it should not exceed the number of big cores. Otherwise, the performance could degrade significantly.
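The guard amounts to a clamp on the requested thread count. This sketch is illustrative and not Caffe2's actual ThreadPool code; how the big-core count is discovered (e.g. from the SoC topology) is outside the sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Illustrative guard: never run more pool workers than there are big cores,
// since scheduling workers onto little cores degrades performance.
std::size_t clamp_num_threads(std::size_t requested, std::size_t big_core_count) {
  return std::min(requested, big_core_count);
}
```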

Test Plan:
```
cd ~/fbsource/xplat
buck test caffe2:caffe2_testAndroid
```

Reviewed By: dreiss

Differential Revision: D19779267

fbshipit-source-id: 4e980e8a0ccc2f37e1c8ed16e2f4651d72924dbd
2020-02-19 18:24:24 -08:00
Brian Wignall
f326045b37 Fix typos, via a Levenshtein-type corrector (#31523)
Summary:
Should be non-semantic.

Uses https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines to find likely typos, with https://github.com/bwignall/typochecker to help automate the checking.

Uses an updated version of the tool used in https://github.com/pytorch/pytorch/pull/30606 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31523

Differential Revision: D19216749

Pulled By: mrshenli

fbshipit-source-id: 7fd489cb9a77cd7e4950c1046f925d57524960ea
2020-01-17 16:03:19 -08:00
James Donald
84dfa96f62 Fix -Wundef warning in conversions.h
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31911

Test Plan:
* CI builds including GPU and OSS-build tests
* The `defined(__HIP_DEVICE_COMPILE__)` instance a few lines below is proof that this is a define/undef flag, not a define01 flag

Reviewed By: hlu1

Differential Revision: D19296560

fbshipit-source-id: 1c45069aec534b0bf4a87751a74680675c985e06
2020-01-08 08:39:37 -08:00
Sebastian Messmer
643ca5def2 Replace c10::guts::stuff with std::stuff (#30915)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30915

Since we now have C++14, we don't need these c10::guts helpers anymore
ghstack-source-id: 95777609

Test Plan: waitforsandcastle

Differential Revision: D18869639

fbshipit-source-id: 97716f932297c64c6e814410ac47b444c33d4e2e
2019-12-16 13:57:19 -08:00
Ivan Kobzarev
ca8cb3241a Expose setNumThreads to android api (#31205)
Summary:
PR https://github.com/pytorch/pytorch/pull/31033 was unlanded due to a macOS build failure:
https://app.circleci.com/jobs/github/pytorch/pytorch/3916388

In this PR, `setNumThreads` is Android-only and has been moved to the separate class `org.pytorch.PytorchAndroid` as a static function, which is a better fit since it has a global effect
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31205

Reviewed By: dreiss

Differential Revision: D18977250

Pulled By: IvanKobzarev

fbshipit-source-id: 4995859808af498c82933c4db52bd7c7dfae90e5
2019-12-12 18:57:27 -08:00
Michael Suo
c0bcfd0445 Revert D18923167: Expose setNumThreads to android api
Test Plan: revert-hammer

Differential Revision:
D18923167

Original commit changeset: 8d98c2edbff4

fbshipit-source-id: 7db37cff298c511d0dd9eb373811c769e4a73be9
2019-12-12 09:23:58 -08:00
Ivan Kobzarev
6225443009 Expose setNumThreads to android api (#31033)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31033

Intention:
There are requests from users to control number of threads from android side:
https://discuss.pytorch.org/t/android-pytorch-forward-method-running-in-a-separate-thread-slow-down-ui-thread/63516/2
https://discuss.pytorch.org/t/threading-of-model-pytorch-android/62490/2

At the moment `setNumThreads` is placed in `org.pytorch.Module`, but this method changes the global thread pool size; in the future we will move it to a separate class mirroring the Python binding structure, which has torch.set_num_threads()

Test Plan: Imported from OSS

Differential Revision: D18923167

Pulled By: IvanKobzarev

fbshipit-source-id: 8d98c2edbff42e9b673509672dce3f2dd03a923e
2019-12-11 14:20:14 -08:00
Tao Xu
b730d04ed2 Fix deadlock issues in ThreadPool (#29885)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29885

### Summary

Currently, we have a deadlock issue on iOS when running Resnet50. The problem happens when a task running in the ThreadPool calls `getNumThread()`, which tries to acquire the same mutex, thus causing the deadlock. The fix is simply to remove the guard for `_numThreads`, as it's not likely to change after initialization.
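The pattern and the fix can be sketched as follows. The class and member names are illustrative, not the actual Caffe2 ThreadPool: the key point is that the thread-count getter reads a value that is fixed after construction without taking the pool mutex, so a task invoked while the pool lock is held can call it safely.

```cpp
#include <cassert>
#include <mutex>

// Sketch of the deadlock scenario with the fix applied: getNumThreads() no
// longer locks. With the old mutex guard, a task executed under the pool
// lock that queried the thread count would self-deadlock on a
// non-recursive mutex.
class ThreadPoolSketch {
 public:
  explicit ThreadPoolSketch(int n) : numThreads_(n) {}

  int getNumThreads() const {
    return numThreads_;  // fixed after init; no lock needed
  }

  template <typename F>
  void run(F&& task) {
    std::lock_guard<std::mutex> guard(mutex_);  // held while the task runs
    task(*this);  // the task may call getNumThreads() without re-locking
  }

 private:
  std::mutex mutex_;
  const int numThreads_;
};
```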

### Test Plan

1. Generate a Resnet50 model using trace_model.py
2. Run `ios/TestApp/bootstrap.sh` to do the benchmark

cc shoumikhin AshkanAliabadi

Test Plan: Imported from OSS

Differential Revision: D18533505

Pulled By: xta0

fbshipit-source-id: 2a069d20b59833ec8b02ff05515c3739a85a15de
2019-11-15 19:27:52 -08:00
Elliott Clark
ad58045af9 Remove LOG(INFO) from math_cpu.cc (#27001)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27001

This unconditional log line spams the logs enough that it's a drag on CPU and will eventually fill up the logs.

Test Plan: Allow unit test and automated testing to give feedback.

Reviewed By: jspark1105

Differential Revision: D17638140

fbshipit-source-id: 4e8a44bda31327ba7e797f7579a9e3bf866eef7e
2019-09-27 16:37:49 -07:00
Jongsoo Park
d5490c662e batch size 0 tests in BatchMatMul ops (#26874)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26874

Add batch_size == 0 tests for the BatchMatMul DNNLOWP operator.

Test Plan: CI

Reviewed By: jianyuh

Differential Revision: D17596117

fbshipit-source-id: 029e29e6c2bd7894d83dac46e8ce8484cc92b1c0
2019-09-26 16:08:39 -07:00
Jiakai Liu
6d0b004574 rename caffe2::mobile_threadpool to caffe2::mobile_pthreadpool
Summary:
Rename the old mobile_threadpool() API and replace it with a new version that
returns caffe2::ThreadPool instead of pthreadpool_t.

Test Plan: - builds

Differential Revision: D17543413

Pulled By: ljk53

fbshipit-source-id: a3effd24e8ce9d677a2a04ebe6b6e1582e6f0a65
2019-09-24 22:27:35 -07:00
Jiakai Liu
67c530851c get rid of protobuf dependencies (#25650)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25650

This PR removes protobuf dependencies from mobile build altogether:
- caffe2/proto: protobuf files, including caffe2.proto and torch.proto;
- caffe2 components that depend on caffe2.proto, including most parts of caffe2/core and caffe2/utils;
- libprotobuf / libprotobuf-lite dependencies;
- protobuf compiler;
- some utility classes, e.g.: netdef_converter.cpp;
- introduce a macro to disable third_party/onnx which depends on protobuf;

Test Plan:
- builds;
- link with demo app to make sure it can load and run a model in pickle format;

Differential Revision: D17183548

Pulled By: ljk53

fbshipit-source-id: fe60b48674f29c4a9b58fd1cf8ece44191491531
2019-09-06 08:48:20 -07:00
Jiakai Liu
a3d0abf729 move GetDimFromOrderString to caffe2/core/types.h (#25671)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25671

To decouple string_utils.h from types.h and the protobuf headers.
Logically, GetDimFromOrderString seems to be more similar to
StringToStorageOrder compared to other string_utils functions.

Test Plan: - Will check all internal/external CI jobs.

Reviewed By: yinghai

Differential Revision: D17191912

Pulled By: ljk53

fbshipit-source-id: fe555feef27bfd74c92b6297c12fb668252ca9ff
2019-09-05 04:32:04 -07:00
iotamudelta
4fe857187c switch to rocThrust for thrust/cub APIs (#25620)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25620

Pull Request resolved: https://github.com/pytorch/pytorch/pull/25602

Enable rocThrust with hipCUB and rocPRIM for ROCm. They are the ROCm implementations of the thrust and cub APIs and replace the older hip-thrust and cub-hip packages going forward. ROCm 2.5 is the first release to contain the new packages as an option, as of 2.6 they will be the only available option.

Add hipification rules to correctly hipify thrust::cuda to thrust::hip and cub:: to hipcub:: going forward. Add hipification rules to hipify specific cub headers to the general hipcub header.

Infrastructure work to correctly find, include and link against the new packages. Add the macro definition to choose the HIP backend to Thrust.

Since include chains are now a little different from CUDA's Thrust, add includes for functionality used where applicable.

Skip four tests that fail with the new rocThrust for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21864

Reviewed By: xw285cornell

Differential Revision: D16940768

Pulled By: bddppq

fbshipit-source-id: 3dba8a8f1763dd23d89eb0dd26d1db109973dbe5
2019-09-03 22:16:30 -07:00
Zachary DeVito
6a48a5b65c Fix more warnings
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/24291

Test Plan: Imported from OSS

Differential Revision: D16795898

Pulled By: zdevito

fbshipit-source-id: cbd5f2dd4e3bbd361909ae13c243561899568ad0
2019-08-14 17:47:54 -07:00
Supriya Rao
40db964455 Add support for using caffe2::ThreadPool in pytorch mobile QNNPACK. (#23658)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23658

**How things work for caffe2:**
Caffe2 Ops -> NNPACK/QNNPACK -> pthreadpool_compute_1/2/3/4d_tiled -> pthreadpool_compute_1d (caffe2 shim) -> caffe2::ThreadPool

**Before this PR:**
Pytorch Ops -> NNPACK/QNNPACK -> pthreadpool_compute_1/2/3/4d_tiled -> pthreadpool_compute_1d (third_party implementation without mobile optimization)

caffe2::ThreadPool is optimized for mobile. This change leverages that logic for pytorch mobile as a temporary solution to improve pytorch mobile perf. It is guarded by the C10_MOBILE macro.
For server builds we return nullptr.

**Plan for next steps:**
- Implement a mobile version of "at::parallel_for" which uses caffe2::ThreadPool internally, so all ATen/TH multithreading usage is mobile-optimized.
- Refactor QNNPACK and/or pthreadpool to explicitly use the "at::parallel_for" primitive, replacing pthreadpool_compute_1d for PyTorch.
- After QNNPACK is refactored, we will delete the mobile_threadpool() API.

ghstack-source-id: 88073396

Reviewed By: dreiss

Differential Revision: D16594020

fbshipit-source-id: 9f94600756d5f86d24a12a2fd7df3eebd0994f1d
2019-08-12 18:14:15 -07:00
Gregory Chanan
2f03205c65 Support torch::tensor and at::tensor with bool and BFloat16 dtypes.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/23337

Test Plan: Imported from OSS

Differential Revision: D16467979

Pulled By: gchanan

fbshipit-source-id: 2e6ad431c47a61c917d501390d14c55b788958ab
2019-08-09 12:36:35 -07:00
Hong Xu
513c4291c5 Suppress implicit-fallthrough warning on g++ >= 7 in caffe2/utils/math_cpu.cc (#24053)
Summary:
These implicit fallthroughs trigger the following warning on g++ >= 7 because g++ cannot see the implicit `abort` call inside `LOG(FATAL)`. We suppress the warning by adding explicit `return`s.

    /home/hong/wsrc/pytorch/caffe2/utils/math_cpu.cc: In function
    void caffe2::math::GemmEx(CBLAS_TRANSPOSE, CBLAS_TRANSPOSE, int, int, int,
    T, const T*, int, const T*, int, T, T*, int, Context*)
    [with T = float; Context = caffe2::CPUContext; Engine = caffe2::DefaultEngine]:
    /home/hong/wsrc/pytorch/c10/util/logging_is_not_google_glog.h:98:10:
    warning: this statement may fall through [-Wimplicit-fallthrough=]
       ::c10::MessageLogger((char*)__FILE__, __LINE__, n).stream()
              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    /home/hong/wsrc/pytorch/caffe2/utils/math_cpu.cc:179:11:
    note: in expansion of macro LOG
               LOG(FATAL) << "Unexpected CBLAS_TRANSPOSE for trans_B";
               ^
    /home/hong/wsrc/pytorch/caffe2/utils/math_cpu.cc:182:5: note: here
         case CblasTrans: {
         ^~~~
    In file included from /home/hong/wsrc/pytorch/c10/util/Logging.h:28:0,
                     from /home/hong/wsrc/pytorch/caffe2/core/logging.h:2,
                     from /home/hong/wsrc/pytorch/caffe2/core/types.h:9,
                     from /home/hong/wsrc/pytorch/caffe2/utils/math.h:17,
                     from /home/hong/wsrc/pytorch/caffe2/utils/math_cpu.cc:14:
    /home/hong/wsrc/pytorch/c10/util/logging_is_not_google_glog.h:98:10:
    warning: this statement may fall through [-Wimplicit-fallthrough=]
       ::c10::MessageLogger((char*)__FILE__, __LINE__, n).stream()
              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    /home/hong/wsrc/pytorch/caffe2/utils/math_cpu.cc:202:11:
    note: in expansion of macro LOG
               LOG(FATAL) << "Unexpected CBLAS_TRANSPOSE for trans_B";
               ^
    /home/hong/wsrc/pytorch/caffe2/utils/math_cpu.cc:205:5: note: here
         default:
         ^~~~~~~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24053

Differential Revision: D16732530

Pulled By: ezyang

fbshipit-source-id: 90373879f25b52efca5bf151c7ed58d6ad19d925
2019-08-09 09:17:23 -07:00