Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46356
Adds the flag `-Werror=cast-function-type` to ensure we don't allow
any invalid function-pointer casts (e.g., PyCFunction casts).
For more details see: https://github.com/pytorch/pytorch/issues/45419
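As a hedged illustration (not code from this PR), this is the kind of function-pointer cast the flag now rejects:
```
#include <Python.h>

// A three-argument METH_KEYWORDS handler.
static PyObject* fn(PyObject* self, PyObject* args, PyObject* kwargs) {
  Py_RETURN_NONE;
}

static PyMethodDef methods[] = {
    // error: cast between incompatible function types [-Werror=cast-function-type]
    {"fn", (PyCFunction)fn, METH_VARARGS | METH_KEYWORDS, nullptr},
    {nullptr, nullptr, 0, nullptr},
};
```
The sanctioned workaround is to cast through `void (*)(void)` first, which GCC treats as an explicit opt-out of this warning.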
ghstack-source-id: 114632980
Test Plan: waitforbuildbot
Reviewed By: albanD
Differential Revision: D24319759
fbshipit-source-id: 26ce4650c220e8e9dd3550245f214c7e6c21a5dc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46383
The old `USE_METAL` macro is actually used by Caffe2, so here we introduce a new macro to enable Metal in PyTorch.
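A hedged sketch of the resulting split; the commit does not spell out the new macro's name here, so `USE_PYTORCH_METAL` is an assumption:
```
#ifdef USE_METAL
// Caffe2's Metal path, unchanged by this diff.
#endif

#ifdef USE_PYTORCH_METAL
// New PyTorch Metal path, gated independently of Caffe2.
#endif
```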
ghstack-source-id: 114499392
Test Plan:
- Circle CI
- The Person Segmentation model works
Reviewed By: linbinyu
Differential Revision: D24322018
fbshipit-source-id: 4e5548afba426b49f314366d89b18ba0c7e745ca
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46116
Ideally I would just use one of the existing preprocessor flags such as `FBCODE_CAFFE2`, but this implies a whole bunch of other things elsewhere, so it is not really a solution for ovrsource.
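The internal kill switch is the `NVALGRIND` define (see the Test Plan below); a hedged sketch of the guard pattern it switches off, with an illustrative call site that is not copied from this diff:
```
#ifndef NVALGRIND
#include <valgrind/valgrind.h>
#endif

bool running_under_valgrind() {
#ifndef NVALGRIND
  return RUNNING_ON_VALGRIND;  // Valgrind client request, compiled in by default
#else
  return false;                // hooks compiled out with -DNVALGRIND
#endif
}
```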
Test Plan: CI green, we are able to disable it internally with `-DNVALGRIND`
Reviewed By: malfet
Differential Revision: D24227360
fbshipit-source-id: 24a3b393cf46d6a16acca0a9ec52610d4bb8704f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46112
### Summary
This PR adds support for running TorchScript models on iOS GPUs via Metal (inference only). The feature is currently in a prototype state; API changes are expected. The tutorial and documentation will be added once it goes to beta.
allow-large-files
- User API
```
auto module = torch::jit::load(model);
module.eval();
at::Tensor input = at::ones({1,3,224,224}, at::ScalarType::Float).metal();
auto output = module.forward({input}).toTensor().cpu();
```
- Supported Models
- Person Segmentation v106 (FB Internal)
- Mobilenetv2
- Supported Operators
- aten::conv2d
- aten::addmm
- aten::add.Tensor
- aten::sub.Tensor
- aten::mul.Tensor
- aten::relu
- aten::hardtanh
- aten::hardtanh_
- aten::sigmoid
- aten::max_pool2d
- aten::adaptive_avg_pool2d
- aten::reshape
- aten::t
- aten::view
- aten::log_softmax.int
- aten::upsample_nearest2d.vec
- Supported Devices
- Apple A9 and above
- iOS 10.2 and above
- CMake scripts
- `IOS_ARCH=arm64 ./scripts/build_ios.sh -DUSE_METAL=ON`
### Test Plan
- Circle CI
ghstack-source-id: 114155638
Test Plan:
1. Sandcastle CI
2. Circle CI
Reviewed By: dreiss
Differential Revision: D23236555
fbshipit-source-id: 98ffc48b837e308bc678c37a9a5fd8ae72d11625
Summary:
CentOS 8 on AArch64 has the vld1_* intrinsics but lacks the vst1q_f32_x2 one.
This patch checks for it and handles it separately from the vld1_* ones.
Fixes https://github.com/pytorch/pytorch/issues/44198
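A hedged sketch of the separate handling; the feature-detection macro name is an assumption:
```
#include <arm_neon.h>

// Where the toolchain provides vld1_*/vld1q_* but not vst1q_f32_x2 (as on
// CentOS 8 AArch64), emulate the missing intrinsic with two plain stores.
#ifdef MISSING_VST1Q_F32_X2
inline void vst1q_f32_x2(float* ptr, float32x4x2_t v) {
  vst1q_f32(ptr, v.val[0]);
  vst1q_f32(ptr + 4, v.val[1]);
}
#endif
```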
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44199
Reviewed By: seemethere
Differential Revision: D23641273
Pulled By: malfet
fbshipit-source-id: c2053c8e0427705eaeeeb82ec030925bff22623a
Summary:
According to the [documentation](https://github.com/pytorch/pytorch/blob/master/tools/setup_helpers/cmake.py#L265), only options whose names start with `BUILD_` / `USE_` / `CMAKE_` in `CMakeLists.txt` can be imported from environment variables.
---
This diff was originally intended to enable C++ source coverage with CircleCI and codecov.io, but we will finish that in the future; the related information is in the diff history. The original procedure was:
Based on [this pull request](1bda5e480c), life becomes much easier this time.
1. In `build.sh`:
- Enable the coverage build option for C++
- `apt-get install lcov`
2. In `test.sh`:
- Run `lcov`
3. In `pytorch-job-specs.yml`:
- Copy coverage.info to the `test/` folder and upload it to codecov.io
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43999
Test Plan: Test on GitHub
Reviewed By: malfet
Differential Revision: D23464656
Pulled By: scintiller
fbshipit-source-id: b2365691f04681d25ba5c00293fbcafe8e8e0745
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43564
Static dispatch was originally introduced for mobile selective build.
Since we have added selective build support for dynamic dispatch and
tested it in FB production for months, we can deprecate static dispatch
to reduce the complexity of the codebase.
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D23324452
Pulled By: ljk53
fbshipit-source-id: d2970257616a8c6337f90249076fca1ae93090c7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43154
Adds the build flag `BUILD_MOBILE_AUTOGRAD` which toggles whether autograd files should be included for a PyTorch mobile build (default off).
ghstack-source-id: 110369406
Test Plan: CI
Reviewed By: ljk53
Differential Revision: D23061913
fbshipit-source-id: bc3d6683ab17f158990d83e4fae0a011d5adeca1
Summary:
Fixes https://github.com/pytorch/pytorch/issues/39968
tested with `TORCH_CUDA_ARCH_LIST='3.5 5.2 6.0 6.1 7.0 7.5 8.0+PTX'`; before this PR the build was failing, and with this PR it succeeds.
With `TORCH_CUDA_ARCH_LIST='7.0 7.5 8.0+PTX'`, `libtorch_cuda.so` with symbols changes from 2.9GB -> 2.2GB
cc: ptrblck mcarilli jjsjann123
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43074
Reviewed By: mrshenli
Differential Revision: D23176095
Pulled By: malfet
fbshipit-source-id: 7b3e6d049fc080e519f21e80df05ef68e7bea57e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42837
Originally we use
```
list(APPEND CMAKE_C_FLAGS -fprofile-instr-generate -fcoverage-mapping)
list(APPEND CMAKE_CXX_FLAGS -fprofile-instr-generate -fcoverage-mapping)
```
But when compiling the project on Mac with coverage on, it fails with the error:
`clang: error: no input files`
`/bin/sh: -fprofile-instr-generate: command not found`
`/bin/sh: -fcoverage-mapping: command not found`
The reason is that `list(APPEND ...)` joins elements with `;`: if we do `list(APPEND foo a)` and then `list(APPEND foo b)`, `foo` becomes `a;b` -- with the extra `;`. Since `CMAKE_CXX_FLAGS` is already defined earlier in the `CMakeLists.txt`, we can only use `set(...)` here.
After changing it to
```
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -fprofile-instr-generate -fcoverage-mapping")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fprofile-instr-generate -fcoverage-mapping")
```
Tested successfully on a local Mac machine.
Test Plan: Test locally on mac machine
Reviewed By: malfet
Differential Revision: D23043057
fbshipit-source-id: ff6f4891b35b7f005861ee2f8e4c550c997fe961
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40179
- Pass -Wno-psabi to suppress the GCC warning "The ABI for passing
parameters with 64-byte alignment has changed in GCC 4.6"
- Fix use of deprecated data() accessor (and minor optimization: hoist
accessor out of loop)
- Undeprecate NetDef.num_workers; no one is serious about fixing these
- Suppress warnings about deprecated pthreadpool types
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D22234138
Pulled By: ezyang
fbshipit-source-id: 6a1601b6d7551a7e6487a44ae65b19acdcb7b849
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39341
This PR introduces a NEON backend for the Vec256 class for the float datatype.
For now only AArch64 is enabled, due to a few issues with enabling it on
32-bit ARM (aarch32).
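For context, a hedged sketch of the kind of generic Vec256 code that now lowers to NEON on AArch64; the function itself is illustrative, not from this PR:
```
#include <ATen/cpu/vec256/vec256.h>

using Vec = at::vec256::Vec256<float>;

// The same portable loop maps to NEON loads/adds/stores on aarch64.
void add_arrays(const float* a, const float* b, float* out, int64_t n) {
  int64_t i = 0;
  for (; i + Vec::size() <= n; i += Vec::size()) {
    auto va = Vec::loadu(a + i);
    auto vb = Vec::loadu(b + i);
    (va + vb).store(out + i);
  }
  for (; i < n; ++i) out[i] = a[i] + b[i];  // scalar tail
}
```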
Test Plan:
vec256_test
Imported from OSS
Differential Revision: D21822399
fbshipit-source-id: 3851c4336d93d1c359c85b38cf19904f82bc7b8d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40059
This benchmark is added specifically for mobile, to check whether the compiler
is auto-vectorizing, and thus whether the NEON backend for Vec256 offers no
advantage for the add op.
Test Plan:
CI
Imported from OSS
Differential Revision: D22055146
fbshipit-source-id: 43ba6c4ae57c6f05d84887c2750ce21ae1b0f0b5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41103
Adds a CLANG_CODE_COVERAGE option to the CMakeLists. If the option is ON, the compile flags needed for code coverage are added.
Test Plan:
Cloned the PyTorch source locally, applied these changes, and built with `CLANG_CODE_COVERAGE ON` and `BUILD_TESTS ON`. Ran a manual test; code coverage report attached.
{F243609020}
Reviewed By: malfet
Differential Revision: D22422513
fbshipit-source-id: 27a31395c31b5b5f4b72523954722771d8f61080
Summary:
This re-applies D21232894 (b9d3869df3) and D22162524, plus updates jni_deps in a few places
to avoid breaking host JNI tests.
Test Plan: `buck test @//fbandroid/mode/server //fbandroid/instrumentation_tests/com/facebook/caffe2:host-test`
Reviewed By: xcheng16
Differential Revision: D22199952
fbshipit-source-id: df13eef39c01738637ae8cf7f581d6ccc88d37d5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37243
*** Why ***
As it stands, we have two thread pool solutions concurrently in use in PyTorch mobile: (1) the open source pthreadpool library under third_party, and (2) Caffe2's implementation of pthreadpool under caffe2/utils/threadpool. Since the primary use-case of the latter has been to act as a drop-in replacement for the third party version so as to enable integration and usage from within NNPACK and QNNPACK, Caffe2's implementation is intentionally written to the exact same interface as the third party version.
The original argument in favor of C2's implementation has been improved performance as a result of using spin locks, as opposed to relinquishing the thread's time slot and putting it to sleep - a less expensive operation up to a point. That seems to have given C2's implementation the upper hand in performance, hence justifying the added maintenance complexity, until the third party version improved in parallel surpassing the efficiency of C2's implementation as I have verified in benchmarks. With that advantage gone, there is no reason to continue using C2's implementation in PyTorch mobile either from the perspective of performance or code hygiene. As a matter of fact, there is considerable performance benefit to be had as a result of using the third party version as it currently stands.
This is a tricky change though, mainly because, in order to avoid potential performance regressions (of which I have witnessed none, but out of an abundance of caution), we have decided to continue using C2's internal implementation whenever building for Caffe2. Again, this is mainly to avoid potential performance regressions in production C2 use cases, even if doing so results in reduced performance as far as I can tell.
So to summarize, today, and as it currently stands, we are using C2's implementation for (1) NNPACK, (2) PyTorch QNNPACK, and (3) ATen parallel_for on mobile builds, while using the third party version of pthreadpool for XNNPACK as XNNPACK does not provide any build options to link against an external implementation unlike NNPACK and QNNPACK do.
The goal of this PR then, is to unify all usage on mobile to the third party implementation both for improved performance and better code hygiene. This applies to PyTorch's use of NNPACK, QNNPACK, XNNPACK, and mobile's implementation of ATen parallel_for, all getting routed to the
exact same third party implementation in this PR.
Considering that NNPACK, QNNPACK, and XNNPACK are not mobile specific, these benefits carry over to non-mobile builds of PyTorch (but not Caffe2) as well. The implementation of ATen parallel_for on non-mobile builds remains unchanged.
*** How ***
This is where things get tricky.
A good deal of the build system complexity in this PR arises from our desire to maintain C2's implementation intact for C2's use.
pthreadpool is a C library with no concept of namespaces, which means two copies of the library cannot exist in the same binary, or symbol collisions will occur, violating the ODR. This means that somehow, and based on some condition, we must decide on the choice of a pthreadpool implementation. In practice, this has become more complicated as a result of all the possible combinations that USE_NNPACK, USE_QNNPACK, USE_PYTORCH_QNNPACK, USE_XNNPACK, USE_SYSTEM_XNNPACK, USE_SYSTEM_PTHREADPOOL and other variables can result in. Having said that, I have done my best in this PR to surgically cut through this complexity in a way that minimizes the side effects, considering the significance of the performance we are leaving on the table; yet, as a result of the combinatorial explosion explained above, I cannot guarantee that every single combination will work as expected on the first try. I am heavily relying on CI to find any issues, as local testing can only go so far.
Having said that, this PR provides a simple non mobile-specific C++ thread pool implementation on top of pthreadpool, namely caffe2::PThreadPool that automatically routes to C2's implementation or the third party version depending on the build configuration. This simplifies the logic at the cost of pushing the complexity to the build scripts. From there on, this thread pool is used in aten parallel_for, and NNPACK and family, again, routing all usage of threading to C2 or third party pthreadpool depending on the build configuration.
When it is all said and done, the layering will look like this:
a) aten::parallel_for, uses
b) caffe2::PThreadPool, which uses
c) pthreadpool C API, which delegates to
c-1) third_party implementation of pthreadpool if that's what the build has requested, and the rabbit hole ends here.
c-2) C2's implementation of pthreadpool if that's what the build has requested, which itself delegates to
c-2-1) caffe2::ThreadPool, and the rabbit hole ends here.
NNPACK, and (PyTorch) QNNPACK directly hook into (c). They never go through (b).
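To make the layering concrete, a hedged usage sketch (illustrative code, not from this PR): a call like the one below enters at (a) and, on mobile, now bottoms out in the third party pthreadpool:
```
#include <ATen/Parallel.h>

void scale(float* data, int64_t n, float alpha) {
  // (a) aten::parallel_for splits [0, n) into chunks and dispatches them to
  // (b) caffe2::PThreadPool, which drives (c) the pthreadpool C API.
  at::parallel_for(0, n, /*grain_size=*/2048, [&](int64_t begin, int64_t end) {
    for (int64_t i = begin; i < end; ++i) {
      data[i] *= alpha;
    }
  });
}
```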
Differential Revision: D21232894
Test Plan: Imported from OSS
Reviewed By: dreiss
Pulled By: AshkanAliabadi
fbshipit-source-id: 8b3de86247fbc3a327e811983e082f9d40081354
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39584
Removing `-DNO_EXPORT` for the non-custom build, to be able to link against the C10/ATen API.
The custom build stays the same, as its main goal is minimum binary size, and exported API functions would increase it.
Additional changes:
1. aten/src/ATen/DynamicLibrary.cpp uses libdl; if we need this functionality we will need to link the result with libdl, but for now this functionality is disabled for mobile.
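A hedged sketch of why the define matters for symbol visibility; the macro wiring below is illustrative (`MY_EXPORT` is hypothetical), not copied from c10's Export.h:
```
// With something like -DNO_EXPORT in effect, an export macro collapses to
// nothing, so in a -fvisibility=hidden build the symbol stays hidden from
// apps linking against the library.
#if defined(NO_EXPORT)
#define MY_EXPORT
#else
#define MY_EXPORT __attribute__((__visibility__("default")))
#endif

MY_EXPORT int answer() {  // visible to linking apps only when exported
  return 42;
}
```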
Test Plan: Imported from OSS
Differential Revision: D22111600
Pulled By: IvanKobzarev
fbshipit-source-id: d730201c55f543c959a596b34be532aecee6b9ab
Summary:
Switch off `/Z7` so that we don't generate debug info in Release and MinSizeRel builds; this should give smaller static libraries and object files and faster build times.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39703
Differential Revision: D21960684
Pulled By: ezyang
fbshipit-source-id: 909a237a138183591d667885b13fc311470eed65
Summary:
According to
<https://gitlab.kitware.com/cmake/cmake/-/blob/master/Modules/Compiler/MSVC-C.cmake>,
the option simply has no effect for MSVC as of today. It is better not to impose
such an if condition, as it is a bit misleading (the current code makes it look like we have compatibility issues with MSVC's C11 support), and it is better to
leave the judgment of MSVC C support to the CMake devs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39304
Differential Revision: D21846032
Pulled By: malfet
fbshipit-source-id: 962e5721da3d7b9be4117b42bdc35df426b7da7b
Summary:
This PR contains the initial version of Vulkan (GPU) Backend integration.
The primary target environment is Android, but the desktop build is also supported.
## CMake
Introducing three cmake options:
USE_VULKAN:
The main switch; if it is OFF, the other options have no effect.
USE_VULKAN_WRAPPER:
ON - Vulkan is loaded at runtime as "libvulkan.so" using libdl, with every function call wrapped in vulkan_wrapper.h.
OFF - links against libvulkan.so directly.
USE_VULKAN_SHADERC_RUNTIME:
ON - the shader compilation library is linked, and shaders are compiled at runtime.
OFF - shaders are precompiled, and the shader compilation library is not included.
## Codegen
Shader codegen starts in cmake/VulkanCodegen.cmake, which calls `aten/src/ATen/native/vulkan/gen_glsl.py` or `aten/src/ATen/native/vulkan/gen_spv.py` to embed the shader source or SPIR-V bytecode in the binary:
If `USE_VULKAN_SHADERC_RUNTIME` is ON:
the shader source is included as `glsl.h`/`glsl.cpp`, to be compiled at runtime.
If `USE_VULKAN_SHADERC_RUNTIME` is OFF:
the precompiled SPIR-V bytecode is included as a uint32_t array in spv.h/spv.cpp.
All codegen results are placed in the build directory.
## Build dependencies
cmake/Dependencies.cmake
If the target platform is Android, the Vulkan library, headers, and Vulkan wrapper are taken from the ANDROID_NDK.
The desktop build requires the VULKAN_SDK environment variable, and all Vulkan dependencies are taken from it
(the desktop build has been tested only on Linux).
## Pytorch integration:
Adding 'Vulkan' as a new Backend, DispatchKey, and DeviceType.
We are using Strided layout without supporting strides at the moment, but we plan to support them in the future.
Using OpaqueTensorImpl, where the OpaqueHandle is a copyable VulkanTensor;
more details are in the comments in `aten/src/ATen/native/vulkan/Vulkan.h`.
Main code location: `aten/src/ATen/native/vulkan`
`aten/src/ATen/native/vulkan/VulkanAten.cpp` - the connection between ATen and the Vulkan API (Vulkan.h); converts at::Tensor to VulkanTensor.
`aten/src/ATen/native/vulkan/Vulkan.h` - the Vulkan API that contains the VulkanTensor representation and functions to work with it. The plan is to expose it so clients can write their own Vulkan ops.
`aten/src/ATen/native/vulkan/VulkanOps.cpp` - implementations of the Vulkan operations, using the Vulkan.h API.
## GLSL shaders
Located in `aten/src/ATen/native/vulkan/glsl` as *.glsl files.
All shaders use Vulkan specialized constants for workgroup sizes with ids 1, 2, 3
## Supported operations
Supported at this point:
conv2d no-groups
conv2d depthwise
addmm
upsample nearest 2d
clamp
hardtanh
## Testing
`aten/src/ATen/test/vulkan_test.cpp` - contains tests for:
- copy from CPU to Vulkan and back
- all supported operations
Desktop builds are supported, and testing can be done on a desktop with a Vulkan-capable GPU, or with an installed software implementation of Vulkan such as https://github.com/google/swiftshader
## Vulkan execution
The initial implementation is trivial and waits on every operator's execution.
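For symmetry with the Metal snippet earlier in this log, a hedged usage sketch; the `.vulkan()` device-transfer method is an assumption about the user-facing API:
```
auto module = torch::jit::load(model_path);
module.eval();
at::Tensor input = at::ones({1, 3, 224, 224}, at::kFloat).vulkan();
auto output = module.forward({input}).toTensor().cpu();
```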
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36491
Differential Revision: D21696709
Pulled By: IvanKobzarev
fbshipit-source-id: da3e5a770b1a1995e9465d7e81963e7de56217fa
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38565
Also note this turns on "-Wno-unused-local-typedefs" because we are using dispatch macros for error checking.
Test Plan: Imported from OSS
Differential Revision: D21598478
Pulled By: gchanan
fbshipit-source-id: 28f9ad01bd678df0601a10d0daf3ed31c47c4ab2
Summary:
Right now it is an unused alias of the `torch_library` interface library
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38408
Differential Revision: D21598250
Pulled By: malfet
fbshipit-source-id: ec9a2446b94e7ea68298831212005c2c80bbc95c
Summary:
Fixes https://github.com/pytorch/pytorch/issues/26304
Test procedure:
With ninja:
[x] Build a clean checkout
[x] Build again. Result: Only 10 libraries are (needlessly) linked again, the extra delay on a 24-core machine is <10s.
[x] Build for the third time. Result: Virtually instantaneous, with no extra rebuilding.
[x] Modify DispatchTable.h. Build again. Result: `.cu` files are rebuilt, as well as many `.cpp` files
[x] Build for the fifth time. Result: Virtually instantaneous, with no extra rebuilding.
[x] Touch one of the `.depend` files. Build again. Result: Only 10 libraries are (needlessly) linked again, the extra delay on a 24-core machine is <10s.
Without ninja:
[x] Build a clean checkout
[x] Build again. Result: There is some unnecessary rebuilding. But it was also happening before this change.
[x] Build for the third time. Result: Virtually instantaneous, with no extra rebuilding.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37661
Differential Revision: D21434624
Pulled By: ezyang
fbshipit-source-id: 379d2315486b8bb5972c184f9b8da8e00d38c338
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37721
Even though we disabled caffe2 test configs in Python, the BUILD_TEST
option was still building caffe2 test cpp binaries and various CI
configurations were running them (since they just run every binary in
`torch/test`).
This PR adds a caffe2-specific BUILD_TEST option (BUILD_CAFFE2_TEST),
which defaults to OFF, and gates the compilation of caffe2 test cpp
binaries under it.
Test Plan: Imported from OSS
Differential Revision: D21369541
Pulled By: suo
fbshipit-source-id: 669cff70c5b53f016e8e016bcb3a99bf3617e1f9
Summary:
This is useful for Linux distributions when the ABI/API of libtorch changes.
The default SOVERSION is set to
"${TORCH_VERSION_MAJOR}.${TORCH_VERSION_MINOR}".
ezyang
But if the release strategy of pytorch/caffe2 involves avoiding breaking API/ABI changes to libtorch for minor/patch releases, then we can set `TORCH_SOVERSION` to simply `TORCH_VERSION_MAJOR`. Please confirm that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37502
Differential Revision: D21303565
Pulled By: ezyang
fbshipit-source-id: 798f5ec7fc5f0431ff1a7f9e8e5d3a0d3b25bb22
Summary:
We should not rely on asynchronous exceptions. Catching only C++ exceptions is more sensible, and may give a boost in both space (1163 MB -> 1073 MB, 0.92x) and performance (51m -> 49m, 0.96x).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37235
Differential Revision: D21256918
Pulled By: ezyang
fbshipit-source-id: 572ee96f2e4c48ad13f83409e4e113483b3a457a
Summary:
These options are disabled by default and are intended for Linux distro
developers. When the existing shortcut option USE_SYSTEM_LIBS is toggled,
these new options are enabled as well.
Additionally, when USE_SYSTEM_LIBS is toggled, setup.py should
no longer check the existence of git submodules.
ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37277
Differential Revision: D21256999
Pulled By: ezyang
fbshipit-source-id: 84f97d008db5a5e41a289cb7bce94906de3c52cf