Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43987
This replaces the caffe2 CPU random number generator (std::mt19937) with at::mt19937, the one currently used in PyTorch. The ATen RNG is 10x faster than the std one and appears to be more robust given bugs in the std implementation (https://fburl.com/diffusion/uhro7lqb)
For large embedding tables (10GB+) we see UniformFillOp taking upwards of 10 minutes because we're bottlenecked on the single-threaded RNG. Swapping to at::mt19937 cuts that time to roughly 10% of its current value.
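As a rough illustration of the swap (a minimal sketch assuming only the public engine interfaces; the actual change lives in caffe2's fill/math code paths, not in standalone helpers like these):
```lang=cpp
#include <cstdint>
#include <random>
#include <ATen/core/MT19937RNGEngine.h>

// Both engines are Mersenne Twisters producing 32-bit outputs, so swapping
// one for the other at a call site looks roughly like this.
void fill_uniform_std(float* data, int64_t n, uint64_t seed) {
  std::mt19937 gen(static_cast<uint32_t>(seed));
  for (int64_t i = 0; i < n; ++i) {
    data[i] = gen() * (1.0f / 4294967296.0f);  // map [0, 2^32) to [0, 1)
  }
}

void fill_uniform_aten(float* data, int64_t n, uint64_t seed) {
  at::mt19937 gen(seed);  // ATen engine: same algorithm, markedly faster in practice
  for (int64_t i = 0; i < n; ++i) {
    data[i] = gen() * (1.0f / 4294967296.0f);
  }
}
```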
Test Plan: Ran all relevant tests plus CI. This doesn't introduce new features (and is a core change), so the existing tests and CI should be sufficient to catch regressions.
Reviewed By: dzhulgakov
Differential Revision: D23219710
fbshipit-source-id: bd16ed6415b2933e047bcb283a013d47fb395814
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45979
For some reason, we sometimes cannot write out the debug files. This shouldn't block the whole service, so we opt to log an error instead of throwing one.
Test Plan: Run the net_runner test at `/` and observe the error being printed out while the test still passes.
Reviewed By: ipiszy
Differential Revision: D24165081
fbshipit-source-id: a4e1d0479d54d741e615e3a00b3003f512394fd4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45610
Also document this option in the usual documentation places.
Test Plan: Imported from OSS
Reviewed By: gmagogsfm
Differential Revision: D24058199
Pulled By: suo
fbshipit-source-id: 81574fbd042f47587e2c7820c726fac0f68af2a7
Summary:
Adding a calibration module called histogram binning:
Divide the prediction range (e.g., [0, 1]) into B bins. In each bin, use two parameters to store the number of positive examples and the total number of examples that fall into the bucket, so we essentially have a histogram of the model's predictions.
As a result, for each bin we have an empirical estimate of the real CTR (num_pos / num_example). We use this value as the final calibrated prediction whenever the pre-calibration prediction falls into the corresponding bin.
In this way, the predictions within each bin should be well calibrated if we have sufficient examples; that is, this calibration module gives us a fine-grained calibrated model.
Theoretically, this calibration layer can fix any uncalibrated model or prediction if we have sufficient bins and examples. It makes it possible to apply any kind of weight allocation to the training data without worrying about calibration issues.
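As a rough, self-contained sketch of the idea (illustrative only; the dper3 module's actual parameterization, smoothing, and fallback behavior are not reproduced here):
```lang=cpp
#include <vector>

// B bins over [0, 1]; each bin tracks (num_pos, num_example). At inference, a
// prediction is replaced by the empirical positive rate of the bin it falls
// into, falling back to the raw prediction for empty bins.
struct HistogramBinningCalibration {
  explicit HistogramBinningCalibration(int num_bins)
      : pos_(num_bins, 0.0), total_(num_bins, 0.0) {}

  int bin_of(double prediction) const {
    const int num_bins = static_cast<int>(pos_.size());
    int b = static_cast<int>(prediction * num_bins);
    if (b < 0) b = 0;
    if (b >= num_bins) b = num_bins - 1;
    return b;
  }

  void observe(double prediction, bool label) {
    const int b = bin_of(prediction);
    total_[b] += 1.0;
    if (label) pos_[b] += 1.0;
  }

  double calibrate(double prediction) const {
    const int b = bin_of(prediction);
    return total_[b] > 0.0 ? pos_[b] / total_[b] : prediction;
  }

  std::vector<double> pos_, total_;
};
```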
Test Plan:
buck test dper3/dper3/modules/calibration/tests:calibration_test -- test_histogram_binning_calibration
buck test dper3/dper3_models/ads_ranking/tests:model_paradigm_e2e_tests -- test_sparse_nn_histogram_binning_calibration
All tests passed.
Example workflows:
f215431958
{F326445092}
f215445048
{F326445223}
Reviewed By: chenshouyuan
Differential Revision: D23356450
fbshipit-source-id: c691b66c51ef33908c17575ce12e5bee5fb325ff
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41606
The previous diff (D22220798 (59294fbbb9) and D22220797) was recently reverted (D22492356 (28291d3cf8), D22492355) because of a bug associated with the AsyncIf op. The AsyncIf op has NetDefs as args, and the SSA rewriting didn't take that into account: it has a special path for the If op, but not for AsyncIf. I made several changes to fix the bug:
1) Add the AsyncIf op to the special path used for the If op in SSA rewriting.
2) Clear the inputs/outputs of the NetDefs that are args in If/AsyncIf ops, because they're no longer valid.
3) Revert renamed inputs/outputs in the arg NetDefs that appear in the external_outputs of the parent NetDef.
Items 2) and 3) fix existing bugs in the `SsaRewrite` function that were just never exposed before.
The algorithm for `RemoveOpsByType` is the same as in my previous diff D22220798 (59294fbbb9). The only new changes in this diff are in `onnx::SsaRewrite` and a few newly added unit tests.
(Note: this ignores all push blocking failures!)
Reviewed By: yinghai
Differential Revision: D22588652
fbshipit-source-id: ebb68ecd1662ea2bae14d4be8f61a75cd8b7e3e6
Summary:
This re-applies D21232894 (b9d3869df3) and D22162524, plus updates jni_deps in a few places
to avoid breaking host JNI tests.
Test Plan: `buck test @//fbandroid/mode/server //fbandroid/instrumentation_tests/com/facebook/caffe2:host-test`
Reviewed By: xcheng16
Differential Revision: D22199952
fbshipit-source-id: df13eef39c01738637ae8cf7f581d6ccc88d37d5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37243
*** Why ***
As it stands, we have two thread pool solutions concurrently in use in PyTorch mobile: (1) the open source pthreadpool library under third_party, and (2) Caffe2's implementation of pthreadpool under caffe2/utils/threadpool. Since the primary use-case of the latter has been to act as a drop-in replacement for the third party version so as to enable integration and usage from within NNPACK and QNNPACK, Caffe2's implementation is intentionally written to the exact same interface as the third party version.
The original argument in favor of C2's implementation has been improved performance as a result of using spin locks, as opposed to relinquishing the thread's time slot and putting it to sleep - a less expensive operation up to a point. That seems to have given C2's implementation the upper hand in performance, hence justifying the added maintenance complexity, until the third party version improved in parallel surpassing the efficiency of C2's implementation as I have verified in benchmarks. With that advantage gone, there is no reason to continue using C2's implementation in PyTorch mobile either from the perspective of performance or code hygiene. As a matter of fact, there is considerable performance benefit to be had as a result of using the third party version as it currently stands.
This is a tricky change though: to guard against potential performance regressions (of which I have witnessed none, but out of an abundance of caution), we have decided to keep using C2's internal implementation whenever building for Caffe2. Again, this is mainly to avoid potential performance regressions in production C2 use cases, even if, as far as I can tell, doing so results in reduced performance.
So to summarize, today, and as it currently stands, we are using C2's implementation for (1) NNPACK, (2) PyTorch QNNPACK, and (3) ATen parallel_for on mobile builds, while using the third party version of pthreadpool for XNNPACK, since XNNPACK does not provide any build options to link against an external implementation the way NNPACK and QNNPACK do.
The goal of this PR, then, is to unify all usage on mobile onto the third party implementation, both for improved performance and better code hygiene. This applies to PyTorch's use of NNPACK, QNNPACK, XNNPACK, and mobile's implementation of ATen parallel_for, all of which get routed to the exact same third party implementation in this PR.
Considering that NNPACK, QNNPACK, and XNNPACK are not mobile specific, these benefits carry over to non-mobile builds of PyTorch (but not Caffe2) as well. The implementation of ATen parallel_for on non-mobile builds remains unchanged.
*** How ***
This is where things get tricky.
A good deal of the build system complexity in this PR arises from our desire to maintain C2's implementation intact for C2's use.
pthreadpool is a C library with no concept of namespaces, which means two copies of the library cannot exist in the same binary or a symbol collision will occur, violating the ODR. This means that somehow, and based on some condition, we must decide on the choice of a pthreadpool implementation. In practice, this has become more complicated as a result of all the possible combinations that USE_NNPACK, USE_QNNPACK, USE_PYTORCH_QNNPACK, USE_XNNPACK, USE_SYSTEM_XNNPACK, USE_SYSTEM_PTHREADPOOL and other variables can result in. Having said that, I have done my best in this PR to surgically cut through this complexity in a way that minimizes the side effects, considering the significance of the performance we are leaving on the table; yet, as a result of the combinatorial explosion explained above, I cannot guarantee that every single combination will work as expected on the first try. I am heavily relying on CI to find any issues, as local testing can only go so far.
Having said that, this PR provides a simple, non-mobile-specific C++ thread pool implementation on top of pthreadpool, namely caffe2::PThreadPool, that automatically routes to C2's implementation or the third party version depending on the build configuration. This simplifies the logic at the cost of pushing the complexity to the build scripts. From there on, this thread pool is used in ATen parallel_for and in NNPACK and family, again routing all threading to the C2 or third party pthreadpool depending on the build configuration.
When all is said and done, the layering will look like this:
a) aten::parallel_for, uses
b) caffe2::PThreadPool, which uses
c) pthreadpool C API, which delegates to
c-1) third_party implementation of pthreadpool if that's what the build has requested, and the rabbit hole ends here.
c-2) C2's implementation of pthreadpool if that's what the build has requested, which itself delegates to
c-2-1) caffe2::ThreadPool, and the rabbit hole ends here.
NNPACK and (PyTorch) QNNPACK hook directly into (c); they never go through (b).
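Callers are unaffected by this unification; for example, an aten::parallel_for call such as the sketch below keeps working as-is, and only the pthreadpool implementation servicing it underneath changes with the build configuration:
```lang=cpp
#include <vector>
#include <ATen/Parallel.h>

// The caller only sees (a); on mobile builds the work is dispatched through
// caffe2::PThreadPool (b) into the pthreadpool C API (c) as described above.
void scale(std::vector<float>& v, float alpha) {
  at::parallel_for(0, v.size(), /*grain_size=*/2048, [&](int64_t begin, int64_t end) {
    for (int64_t i = begin; i < end; ++i) {
      v[i] *= alpha;
    }
  });
}
```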
Differential Revision: D21232894
Test Plan: Imported from OSS
Reviewed By: dreiss
Pulled By: AshkanAliabadi
fbshipit-source-id: 8b3de86247fbc3a327e811983e082f9d40081354
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39277
This PR contains initial changes that makes PyTorch build with Ampere GPU, CUDA 11, and cuDNN 8.
TF32 related features will not be included in this PR.
Test Plan: Imported from OSS
Differential Revision: D21832814
Pulled By: malfet
fbshipit-source-id: 37f9c6827e0c26ae3e303580f666584230832d06
Summary:
Because MacOS is not iOS
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37283
Test Plan: CI
Differential Revision: D21244398
Pulled By: malfet
fbshipit-source-id: b822e216e83887e2f2961b5c5384eaf749629f61
Summary:
- Add a couple of checks for USE_XNNPACK to disable additional code paths if XNNPACK is not supported.
When passing through the code paths where the platform checks are made (cmake/Dependencies.cmake:89), if XNNPACK is not supported, then the variable FXDIV_SOURCE_DIR will not be set. CMake emits errors when add_subdirectory is called with an empty FXDIV_SOURCE_DIR.
see: https://github.com/pytorch/pytorch/issues/34606
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35607
Differential Revision: D20895645
Pulled By: seemethere
fbshipit-source-id: 3bd10cf89f0fb6825fdd6e1d52c71ee37c67b953
Summary: ATenOp should go away, but before it does it's important to understand what's going on inside of it. We already log `arguments`, but it's rather hard to parse in Scuba since it's a list, not a dictionary. Let's extract the operator name explicitly so that grouping works well.
Test Plan: unittest
Reviewed By: ngimel
Differential Revision: D21057966
fbshipit-source-id: 86be7cca39055620477a28bd5d8ab29e8edd2ff9
Summary:
`std::mismatch(InputIt1 first1, InputIt1 last1, InputIt2 first2)` assumes that the container behind the `first2` iterator contains at least `last1 - first1` elements, which is not the case if `prefix` is longer than `str`.
Found while running unit tests on Windows
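A sketch of the hazard and the guarded fix (illustrative helper names, not the exact function in the tree):
```lang=cpp
#include <algorithm>
#include <string>

// Unsafe: std::mismatch(first1, last1, first2) reads up to last1 - first1
// elements from the second range, so a prefix longer than str walks off the
// end of str.
bool starts_with_unsafe(const std::string& str, const std::string& prefix) {
  return std::mismatch(prefix.begin(), prefix.end(), str.begin()).first == prefix.end();
}

// Safe: bail out early when the prefix cannot possibly fit.
bool starts_with(const std::string& str, const std::string& prefix) {
  return prefix.size() <= str.size() &&
      std::mismatch(prefix.begin(), prefix.end(), str.begin()).first == prefix.end();
}
```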
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36672
Differential Revision: D21049407
Pulled By: malfet
fbshipit-source-id: ad45779d47a0c6898900e0247c920829a2179f62
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36371
It allows us to drop a circular dependency and remove unknown_symbols in the Buck build.
It'd be good to get rid of GetCpuId altogether in favor of cpuinfo, but it's not really blocking anything.
Reviewed By: malfet
Differential Revision: D20958000
fbshipit-source-id: ed17a2a90a51dc1adf9e634af56c85f0689f8f29
Summary:
Does the same things as D19658565 but for Caffe2 models.
From the investigation in https://fb.quip.com/PbgsAEmoJVuf, the model id that predictor uses and the model id saved inside the model don't match. A common reason is recurring fluent2 jobs, but there are others.
Since the model_id from predictor is what the rest of the datasets use, it's way more useful imho. I've considered adding both ids, but it'd require additional piping and I don't think it's that useful.
Test Plan: unittests added
Reviewed By: houseroad
Differential Revision: D20630599
fbshipit-source-id: 3e6d0cb0b6f8c8b6ae5935138f55ae7a2ff60653
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34625
These templated function calls are not specifying the template args correctly. The first arg is the index type, not the array data type. That means, right now it's using `T` as the index type as well, which will break if we do a template specialization for uint8_t. If we omit both, it will correctly infer that the index type is `int` and the data type is `T`.
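The shape of the problem, using hypothetical names rather than the actual caffe2 call sites:
```lang=cpp
#include <cstdint>

// Hypothetical signature mirroring the pattern described above: the first
// template parameter is the index type, the second is the array data type.
template <typename IndexType, typename DataType>
void GatherRows(const DataType* src, const IndexType* idx, int64_t n, DataType* dst) {
  for (int64_t i = 0; i < n; ++i) {
    dst[i] = src[idx[i]];
  }
}

template <typename T>
void caller(const T* src, const int* idx, int64_t n, T* dst) {
  // Wrong: GatherRows<T>(src, idx, n, dst) binds T to IndexType, which breaks
  // once the data type gets a uint8_t specialization.
  // Right: omit both arguments and let deduction infer IndexType = int,
  // DataType = T.
  GatherRows(src, idx, n, dst);
}
```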
Reviewed By: BIT-silence
Differential Revision: D20358728
fbshipit-source-id: 8cbd8eeb14bce602c02eb6fce2cc141f0121fa24
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34321
Mostly cosmetic as we can infer the shape anyway. It can remove a lot of the noise in the log though.
Note that weight sharing doesn't work yet. I'll add another diff to address this.
Reviewed By: houseroad
Differential Revision: D20290841
fbshipit-source-id: fe6f9b60d05dbe150af15b5d9d7a69fd902e12cc
Summary:
We get a segfault without this when using XNNPACK.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34087
Differential Revision: D20199787
Pulled By: kimishpatel
fbshipit-source-id: d3d274e7bb197461632b21688820cd4c10dcd819
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33957
Lots of small preprocessor warning cleanups for Windows.
Test Plan: CI green
Reviewed By: malfet, albanD
Differential Revision: D20153582
fbshipit-source-id: 18fd61c466fd1f55ededdae4448b3009a9cedc04
Summary:
Mainly renames C2's pthread_create, the only conflicting symbol referenced internally by NNPACK, to pthread_create_c2.
Removed 2 other conflicting symbols that are not used internally at all.
Pointing XNNPACK to the original repo instead of the fork.
Copy-pasted the new interface and implementation to caffe2/utils/threadpool, so that internal builds compile against this.
When the threadpool is unified, this will be removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33869
Differential Revision: D20140580
Pulled By: kimishpatel
fbshipit-source-id: de70df0af9c7d6bc065e85ede0e1c4dd6a9e6be3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33574
Sprinkle the Clang identification macro over the places that would otherwise cause build errors when Clang is used to drive the CUDA compilation.
Note: `__clang__` is defined both when Clang is used as the host compiler by NVCC and when Clang drives the compilation itself. `__CUDA__` is defined only in the latter case.
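A minimal sketch of the guard pattern being sprinkled in (the exact sites and workarounds vary):
```lang=cpp
// __clang__ alone cannot distinguish "Clang as NVCC's host compiler" from
// "Clang driving the CUDA compilation"; only the latter also defines __CUDA__.
#if defined(__clang__) && defined(__CUDA__)
  // Workarounds specific to Clang compiling the CUDA code itself.
#else
  // NVCC path (possibly with Clang as the host compiler).
#endif
```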
Test Plan:
```lang=bash
buck build mode/opt -c fbcode.cuda_use_clang=true //fblearner/flow/projects/dper:workflow
buck build mode/opt //fblearner/flow/projects/dper:workflow
```
Reviewed By: BIT-silence
Differential Revision: D20007440
fbshipit-source-id: 53caa70695b99461a3910d41dc71a9f6d0728a75
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33554
NVCC/GCC accept the existing syntax, but Clang does not; it requires a proper escape. Here `%laneid` is one of the many special registers that CUDA's pseudo-asm provides [1]. Using the extra `%` doesn't change the semantics, as PTX still sees `%laneid` after the string is processed by the asm tool.
1. https://docs.nvidia.com/cuda/parallel-thread-execution/index.html
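A minimal sketch of the escaped form in device code (the actual call site differs):
```lang=cpp
// Inside a __device__ function: read the lane index via inline PTX.
// NVCC/GCC accept a single '%', but Clang requires it to be escaped as '%%'
// inside the asm template; PTX still receives '%laneid'.
__device__ unsigned int lane_id() {
  unsigned int laneid;
  asm volatile("mov.u32 %0, %%laneid;" : "=r"(laneid));
  return laneid;
}
```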
Test Plan:
```lang=bash
buck build mode/opt -c fbcode.cuda_use_clang=true //fblearner/flow/projects/dper:workflow
buck build mode/opt //fblearner/flow/projects/dper:workflow
```
Reviewed By: bddppq
Differential Revision: D20003621
fbshipit-source-id: 8e550e55a3455925e7bd92c6df3e504b5d38c2dc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33523
When using `ThreadPool::setNumThreads` to set the number of threads, it should not exceed the number of big cores. Otherwise, the performance could degrade significantly.
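The intended policy, as a hedged sketch (a free function with illustrative names; the real logic lives inside caffe2::ThreadPool):
```lang=cpp
#include <algorithm>
#include <cstddef>
#include <iostream>

// On big.LITTLE SoCs, scheduling pool threads beyond the big cluster pushes
// work onto slow cores; cap the requested count at the number of big cores.
size_t clampToBigCores(size_t requested, size_t num_big_cores) {
  if (requested > num_big_cores) {
    std::cerr << "Requested " << requested << " threads; capping at "
              << num_big_cores << " big cores\n";
  }
  return std::min(requested, num_big_cores);
}
```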
Test Plan:
```
cd ~/fbsource/xplat
buck test caffe2:caffe2_testAndroid
```
Reviewed By: dreiss
Differential Revision: D19779267
fbshipit-source-id: 4e980e8a0ccc2f37e1c8ed16e2f4651d72924dbd
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31911
Test Plan:
* CI builds including GPU and OSS-build tests
* The `defined(__HIP_DEVICE_COMPILE__)` instance a few lines below is proof that this is a define/undef flag, not a define01 flag
Reviewed By: hlu1
Differential Revision: D19296560
fbshipit-source-id: 1c45069aec534b0bf4a87751a74680675c985e06
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30915
Since we now have C++14, we don't need these c10::guts helpers anymore
ghstack-source-id: 95777609
Test Plan: waitforsandcastle
Differential Revision: D18869639
fbshipit-source-id: 97716f932297c64c6e814410ac47b444c33d4e2e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29885
### Summary
Currently, we have a deadlock issue on iOS when running Resnet50. The problem happens when a task running in the ThreadPool calls `getNumThread()`, which tries to acquire the same mutex, thus causing the deadlock. The fix is to simply remove the guard for `_numThreads`, as it's not likely to change after initialization.
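A minimal sketch of the deadlock pattern (illustrative toy code, not the actual caffe2::ThreadPool implementation):
```lang=cpp
#include <functional>
#include <mutex>

// run() holds the pool mutex while executing the task; if the task calls
// getNumThreads(), which takes the same (non-recursive) mutex, the thread
// deadlocks on itself.
struct ToyThreadPool {
  size_t getNumThreads() {
    std::lock_guard<std::mutex> guard(mutex_);  // blocks forever if called from run()
    return num_threads_;
  }

  void run(const std::function<void()>& task) {
    std::lock_guard<std::mutex> guard(mutex_);
    task();  // task -> getNumThreads() -> same mutex -> deadlock
  }

  std::mutex mutex_;
  size_t num_threads_ = 1;
};
```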
### Test Plan
1. Generate a Resnet50 model using trace_model.py
2. Run `ios/TestApp/bootstrap.sh` to do the benchmark
cc shoumikhin AshkanAliabadi
Test Plan: Imported from OSS
Differential Revision: D18533505
Pulled By: xta0
fbshipit-source-id: 2a069d20b59833ec8b02ff05515c3739a85a15de
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27001
This unconditional log line spams the logs enough that it's a drag on CPU and will eventually fill up the logs.
Test Plan: Allow unit test and automated testing to give feedback.
Reviewed By: jspark1105
Differential Revision: D17638140
fbshipit-source-id: 4e8a44bda31327ba7e797f7579a9e3bf866eef7e
Summary:
Rename the old mobile_threadpool() API and replace it with a new version that returns caffe2::ThreadPool instead of pthreadpool_t.
Test Plan: - builds
Differential Revision: D17543413
Pulled By: ljk53
fbshipit-source-id: a3effd24e8ce9d677a2a04ebe6b6e1582e6f0a65
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25650
This PR removes protobuf dependencies from mobile build altogether:
- caffe2/proto: protobuf files, including caffe2.proto and torch.proto;
- caffe2 components that depend on caffe2.proto, including most part of
caffe2/core, caffe2/utils;
- libprotobuf / libprotobuf-lite dependencies;
- protobuf compiler;
- some utility classes, e.g. netdef_converter.cpp;
- introduce a macro to disable third_party/onnx which depends on protobuf;
Test Plan:
- builds;
- link with demo app to make sure it can load and run a model in pickle format;
Differential Revision: D17183548
Pulled By: ljk53
fbshipit-source-id: fe60b48674f29c4a9b58fd1cf8ece44191491531
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25671
To decouple string_utils.h from types.h and protobuf headers.
Logically GetDimFromOrderString seems to be more similiar to
StringToStorageOrder comparing to other string_utils functions.
Test Plan: - Will check all internal/external CI jobs.
Reviewed By: yinghai
Differential Revision: D17191912
Pulled By: ljk53
fbshipit-source-id: fe555feef27bfd74c92b6297c12fb668252ca9ff
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25620
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25602
Enable rocThrust with hipCUB and rocPRIM for ROCm. They are the ROCm implementations of the thrust and cub APIs and replace the older hip-thrust and cub-hip packages going forward. ROCm 2.5 is the first release to contain the new packages as an option; as of 2.6 they will be the only available option.
Add hipification rules to correctly hipify thrust::cuda to thrust::hip and cub:: to hipcub:: going forward. Add hipification rules to hipify specific cub headers to the general hipcub header.
Infrastructure work to correctly find, include and link against the new packages. Add the macro definition to choose the HIP backend to Thrust.
Since include chains are now a little different from CUDA's Thrust, add includes for functionality used where applicable.
Skip four tests that fail with the new rocThrust for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21864
Reviewed By: xw285cornell
Differential Revision: D16940768
Pulled By: bddppq
fbshipit-source-id: 3dba8a8f1763dd23d89eb0dd26d1db109973dbe5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23658
**How things work for caffe2:**
Caffe2 Ops -> NNPACK/QNNPACK -> pthreadpool_compute_1/2/3/4d_tiled -> pthreadpool_compute_1d (caffe2 shim) -> caffe2::ThreadPool
**Before this PR:**
Pytorch Ops -> NNPACK/QNNPACK -> pthreadpool_compute_1/2/3/4d_tiled -> pthreadpool_compute_1d (third_party implementation without mobile optimization)
caffe2::ThreadPool is optimized for mobile. This change leverages that logic for pytorch mobile as a temporary solution to improve pytorch mobile perf. It is guarded by the C10_MOBILE macro.
For server side we return nullptr.
**Plan for next steps:**
Implement a mobile version of "at::parallel_for" which uses caffe2::ThreadPool internally so all ATen/TH multithreading usage is mobile optimized.
Refactor QNNPACK and/or pthreadpool to explicitly use the "at::parallel_for" primitive to replace pthreadpool_compute_1d for PyTorch.
After QNNPACK is refactored, we will delete the mobile_threadpool() API.
ghstack-source-id: 88073396
Reviewed By: dreiss
Differential Revision: D16594020
fbshipit-source-id: 9f94600756d5f86d24a12a2fd7df3eebd0994f1d
Summary:
These implicit fallthroughs lead to the following warning on g++ 7, because g++ cannot recognize the implicit `abort` call inside `LOG(FATAL)`. We suppress the warning by adding explicit `return`s; a sketch of the fix follows the warning output below.
```
/home/hong/wsrc/pytorch/caffe2/utils/math_cpu.cc: In function void caffe2::math::GemmEx(CBLAS_TRANSPOSE, CBLAS_TRANSPOSE, int, int, int, T, const T*, int, const T*, int, T, T*, int, Context*) [with T = float; Context = caffe2::CPUContext; Engine = caffe2::DefaultEngine]:
/home/hong/wsrc/pytorch/c10/util/logging_is_not_google_glog.h:98:10: warning: this statement may fall through [-Wimplicit-fallthrough=]
   ::c10::MessageLogger((char*)__FILE__, __LINE__, n).stream()
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/hong/wsrc/pytorch/caffe2/utils/math_cpu.cc:179:11: note: in expansion of macro LOG
   LOG(FATAL) << "Unexpected CBLAS_TRANSPOSE for trans_B";
   ^
/home/hong/wsrc/pytorch/caffe2/utils/math_cpu.cc:182:5: note: here
   case CblasTrans: {
   ^~~~
In file included from /home/hong/wsrc/pytorch/c10/util/Logging.h:28:0,
                 from /home/hong/wsrc/pytorch/caffe2/core/logging.h:2,
                 from /home/hong/wsrc/pytorch/caffe2/core/types.h:9,
                 from /home/hong/wsrc/pytorch/caffe2/utils/math.h:17,
                 from /home/hong/wsrc/pytorch/caffe2/utils/math_cpu.cc:14:
/home/hong/wsrc/pytorch/c10/util/logging_is_not_google_glog.h:98:10: warning: this statement may fall through [-Wimplicit-fallthrough=]
   ::c10::MessageLogger((char*)__FILE__, __LINE__, n).stream()
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/hong/wsrc/pytorch/caffe2/utils/math_cpu.cc:202:11: note: in expansion of macro LOG
   LOG(FATAL) << "Unexpected CBLAS_TRANSPOSE for trans_B";
   ^
/home/hong/wsrc/pytorch/caffe2/utils/math_cpu.cc:205:5: note: here
   default:
   ^~~~~~~
```
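A sketch of the fix pattern (not the actual GemmEx body; the CBLAS enums and LOG come from the surrounding cblas and c10 headers):
```lang=cpp
// g++ 7 does not know that LOG(FATAL) aborts, so a case that ends with it
// looks like an implicit fallthrough into the next case. The explicit
// (unreachable) returns after LOG(FATAL) silence -Wimplicit-fallthrough
// without changing behavior.
void GemmSketch(CBLAS_TRANSPOSE trans_A, CBLAS_TRANSPOSE trans_B) {
  switch (trans_A) {
    case CblasNoTrans: {
      switch (trans_B) {
        case CblasNoTrans:
          // ... N/N GEMM path ...
          return;
        case CblasTrans:
          // ... N/T GEMM path ...
          return;
        default:
          LOG(FATAL) << "Unexpected CBLAS_TRANSPOSE for trans_B";
          return;  // unreachable; suppresses the fallthrough warning
      }
    }  // without the return above, g++ warns this can fall through ...
    case CblasTrans: {  // ... into here (the "note: here" in the output above)
      // ... T/* GEMM paths ...
      return;
    }
    default:
      LOG(FATAL) << "Unexpected CBLAS_TRANSPOSE for trans_A";
      return;
  }
}
```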
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24053
Differential Revision: D16732530
Pulled By: ezyang
fbshipit-source-id: 90373879f25b52efca5bf151c7ed58d6ad19d925