# Summary
## PR Dependencies
I don't use ghstack :( this is a PR where it would have been helpful. That being said, I am going to peel off some PRs to make reviewing this easier:
- [x] Separate build flags for Flash and MemEff: #107985
### Description
This pull request updates the version of _scaled_dot_product_flash_attention from version 1 to version 2. The changes are based on the flash attention code originally authored by @tridao
### Changes Made
The majority of the changes in this pull request involve:
- Copying over the flash_attention sources.
- Updating header files.
- Removing padding and slicing code from within the flash_attention kernel and relocating it to the composite implicit region of the SDPA. This was needed to make the kernel functional and appease autograd.
- Introducing a simple kernel generator to generate different instantiations of the forward and backward flash templates.
- Adding conditional compilation (ifdef) to prevent building when nvcc is invoked with gencode < sm80.
- Introducing a separate dependent option for mem_eff_attention, as flash_attention v2 lacks support for Windows and cannot be built for sm50 generation codes.
- Modifying build.sh to reduce parallelization on sm86 runners and to lower the maximum parallelization on the manywheel builds. This adjustment was made to address out-of-memory issues during the compilation of FlashAttentionV2 sources.
- Adding/Updating tests.
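As a rough illustration of the kernel-generator idea in the list above, the sketch below enumerates combinations of direction, dtype, and head dimension and renders one explicit template instantiation per generated source file. The dtype list, head-dim list, template text, function names, and file naming are all assumptions made for illustration; they are not copied from `generate_kernels.py`.

```python
import itertools
from dataclasses import dataclass

# Hypothetical parameter space; the real generator may cover different values.
DTYPES = {"fp16": "cutlass::half_t", "bf16": "cutlass::bfloat16_t"}
HEAD_DIMS = [32, 64, 128]
DIRECTIONS = ["fwd", "bwd"]

# Hypothetical template: one explicit instantiation per generated file.
TEMPLATE = """// Auto-generated; do not edit.
#include "flash_{direction}_launch_template.h"

template void run_mha_{direction}_<{cpp_type}, {head_dim}>(Flash_{direction}_params &params, cudaStream_t stream);
"""


@dataclass(frozen=True)
class Kernel:
    direction: str
    dtype: str
    head_dim: int

    @property
    def filename(self) -> str:
        # Illustrative naming scheme only.
        return f"flash_{self.direction}_hdim{self.head_dim}_{self.dtype}_sm80.cu"

    def render(self) -> str:
        return TEMPLATE.format(
            direction=self.direction,
            cpp_type=DTYPES[self.dtype],
            head_dim=self.head_dim,
        )


def all_kernels():
    """Yield one Kernel per (direction, dtype, head_dim) combination."""
    for direction, dtype, head_dim in itertools.product(DIRECTIONS, DTYPES, HEAD_DIMS):
        yield Kernel(direction, dtype, head_dim)
```

Keeping each instantiation in its own translation unit lets the build system compile them independently, which is one plausible reason to generate files rather than instantiate everything in a single source.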
### Notes for Reviewers
This is not a fun review, and I apologize in advance.
Most of the files-changed are in the flash_attn/ folder. The only files of interest here IMO:
- aten/src/ATen/native/transformers/cuda/flash_attn/flash_api.cpp
- aten/src/ATen/native/transformers/cuda/flash_attn/kernels/generate_kernels.py ( this has been incorporated upstream to flash-attention github)
There are a number of files all related to avoiding OOMs in CI/CD. These are typically shell scripts.
### Follow up items
- Include the updates from e07aa036db and 9e5e8bc91e | https://github.com/pytorch/pytorch/issues/108108
### Work Items
- [x] I don't think Windows will be supported for 3.1.0 - Need to update cmake
- [x] Let multi_query/attention pass through and test | UPDATE: I have the fast path implemented here: https://github.com/pytorch/pytorch/pull/106730 but since this will require changes to semantics of math to call repeat_interleave, I think this should be done as a followup.
- [x] Had to drop cutlass back to 3.0.0 to get it to compile. Need to figure out how to upgrade to 3.1.0 and later. Spoke with Tri and he is going to be taking a look. Note: compiling with clang currently errors for the cute headers.
- [x] Update test exercise above codepath
- [x] Still need to disable on seq_len % 128 != 0 for backward( Tri beat me to it a4f148b6ab)
- [x] Add determinism warning to BWD, Tri got to this one as well: 1c41d2b
- [x] Update dispatcher to universally prefer FlashV2
- [x] Update tests to exercise new head_dims
- [x] Move the head_dim padding from kernel to top level composite implicit function in order to make it purely functional
- [x] Create template generator script
- [x] Initial cmake support for building kernels/ folder
- [x] Replay CudaGraph changes
### Results
#### Forward only
The TFLOPs reported here were measured on an underclocked A100.

#### Forward+Backward
Ran a sweep; for large compute-bound sizes we see a ~2x performance increase for forward+backward.
<img width="1684" alt="Screenshot 2023-07-20 at 3 47 47 PM" src="https://github.com/pytorch/pytorch/assets/32754868/fdd26e07-0077-4878-a417-f3a418b6fb3b">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105602
Approved by: https://github.com/huydhn, https://github.com/cpuhrsch
Summary:
This stack of PRs integrates cuSPARSELt into PyTorch.
This PR adds support for cuSPARSELt into the build process.
It adds in a new flag, USE_CUSPARSELT that defaults to false.
When USE_CUSPARSELT=1 is specified, the user can also specify
CUSPARSELT_ROOT, which defines the path to the library.
Compiling pytorch with cusparselt support can be done as follows:
```
USE_CUSPARSELT=1 \
CUSPARSELT_ROOT=/path/to/cusparselt \
python setup.py develop
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103700
Approved by: https://github.com/albanD
This PR enables `-Winconsistent-missing-destructor-override` and `-Winconsistent-missing-override`
and fixes violations.
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 47e904e</samp>
This pull request updates the code of various classes and operators in the `caffe2` and `aten` subdirectories to use the `override` specifier instead of the `virtual` keyword for destructors and other virtual functions that override a base class function. This improves the code readability, quality, and consistency with C++ best practices. It also modifies the `./CMakeLists.txt` file to enable warnings for these specifiers, but disable errors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104032
Approved by: https://github.com/malfet
To avoid nvcc segfaults, compile without `--source-in-ptx` option on CUDA-12.1+
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 984e4b2</samp>
> _Sing, O Muse, of the daring deeds of PyTorch, the swift and fiery_
> _framework that harnesses the power of CUDA, the blazing tool of Nvidia._
> _How they faced a mighty challenge when CUDA, the ever-shifting,_
> _released a new version, twelve point one, that broke their code and caused them grief._
Fixes https://github.com/pytorch/pytorch/issues/102372
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102756
Approved by: https://github.com/atalman
Use `append_cxx_flag_if_supported` to determine whether or not `-Werror` is supported
Do not suppress deprecation warnings if glog is not used/installed; as the check is currently written, it suppresses deprecations even if `glog` is not installed.
Similarly, do not suppress deprecations on MacOS simply because we are compiling with protobuf.
Fix deprecation warnings in:
- MPS by replacing `MTLResourceOptionCPUCacheModeDefault`->`MTLResourceCPUCacheModeDefaultCache`
- In GTests by replacing `TYPED_TEST_CASE`->`TYPED_TEST_SUITE`
- In `codegen/onednn/interface.cpp`, by passing `Stack` by reference rather than by pointer.
Do not guard calls to `append_cxx_flag_if_supported` with `if(CLANG)` or `if(GCC)`.
Fix some deprecated calls in `Metal` and hide the more complex exceptions under `C10_CLANG_DIAGNOSTIC_IGNORE`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97584
Approved by: https://github.com/kit1980
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at b07152e</samp>
This pull request refactors the CMake configuration to enable the `USE_FLASH_ATTENTION` feature for the `torch_cuda` target only, using a target-specific macro. This avoids conflicts with other libraries that also use this feature, such as fairseq.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97579
Approved by: https://github.com/kit1980
remove unused CAFFE2_VERSION macros
Summary:
Nothing reads these and they are completely subsumed by TORCH_VERSION.
Getting rid of these will be helpful for build unification, since they
are also not used internally.
Test Plan: Rely on CI.
Reviewers: sahanp
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97337
Approved by: https://github.com/malfet
This PR does two things:
1. It moves some Windows warning suppression from various CMake files into the main CMakeLists.txt, following the conventions of gcc and clang.
2. It fixes some Windows warnings in the source code. Most importantly, it fixes lots of dll warnings by adjusting C10_API to TORCH_API or TORCH_PYTHON_API. There are still some dll warnings because some TORCH_API functions are actually built as part of libtorch_python
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94927
Approved by: https://github.com/malfet
Currently there is a potential conflict for `GLIBCXX_USE_CXX11_ABI` configuration if users don't explicitly set this variable.
In `caffe2/CMakeLists.txt`, if the variable is not set, an `abi checker` will be used to retrieve the ABI configuration from compiler.
https://github.com/pytorch/pytorch/blob/master/caffe2/CMakeLists.txt#L1165-L1183
However, in `torch/csrc/Module.cpp`, if the variable is not set, it will be set to `0`. The conflict happens when the default ABI of the compiler is `1`.
https://github.com/pytorch/pytorch/blob/master/torch/csrc/Module.cpp#L1612
This PR eliminates this uncertainty and potential conflict.
The ABI will be checked and set in `CMakeLists.txt`, and the value will be passed to `caffe2/CMakeLists.txt`. Meanwhile, in case `caffe2/CMakeLists.txt` is directly invoked from a `cmake` command, the original GLIBC check logic is kept in this file.
If users don't explicitly assign a value to `GLIBCXX_USE_CXX11_ABI`, the `abi checker` will be executed and will set the value accordingly. If the `abi checker` fails to compile or execute, the value will be set to `0`. If users explicitly assign a value, then the provided value will be used.
Moreover, if `GLIBCXX_USE_CXX11_ABI` is set to `0`, the `-DGLIBCXX_USE_CXX11_ABI=0` flag won't be appended to `CMAKE_CXX_FLAGS`. Thus, whether to use ABI=0 or ABI=1 fully depends on the compiler's default configuration. This could cause an issue where, even if users explicitly set `GLIBCXX_USE_CXX11_ABI` to `0`, the compiler still builds the binaries with ABI=1.
https://github.com/pytorch/pytorch/blob/master/CMakeLists.txt#L44-L51
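The precedence described above (explicit user value first, then the ABI checker, then a fallback of `0` if the checker fails) can be sketched in a few lines. This is only a model of the decision order, not the CMake logic itself; `resolve_cxx11_abi` and `run_abi_checker` are hypothetical names introduced for illustration.

```python
from typing import Callable, Optional


def resolve_cxx11_abi(
    user_value: Optional[str],
    run_abi_checker: Callable[[], int],
) -> int:
    """Return the ABI value (0 or 1) following the precedence described above."""
    # 1. An explicitly provided value always wins.
    if user_value is not None:
        return int(user_value)
    # 2. Otherwise ask the (hypothetical) checker for the compiler default.
    try:
        return int(run_abi_checker())
    except Exception:
        # 3. Checker failed to compile or execute: fall back to 0.
        return 0
```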
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94306
Approved by: https://github.com/malfet
This PR is the first step towards refactoring the build for nvfuser so that the codegen becomes a standalone library.
Contents inside this PR:
1. nvfuser code base has been moved to `./nvfuser`, from `./torch/csrc/jit/codegen/cuda/`, except for registration code for integration (interface.h/interface.cpp)
2. the build system is split so that nvfuser generates its own `.so` files. Currently there are:
- `libnvfuser_codegen.so`, which contains the integration, codegen and runtime system of nvfuser
- `nvfuser.so`, which is nvfuser's python API via pybind. Python frontend is now exposed via `nvfuser._C.XXX` instead of `torch._C._nvfuser`
3. nvfuser cpp tests are currently compiled into `nvfuser_tests`
4. cmake is refactored so that:
- nvfuser now has its own `CMakeLists.txt`, which is under `torch/csrc/jit/codegen/cuda/`.
- nvfuser backend code is not compiled inside `libtorch_cuda_xxx` any more
- nvfuser is added as a subdirectory under `./CMakeLists.txt` at the very end after torch is built.
- since nvfuser has dependency on torch, the registration of nvfuser at runtime is done via dlopen (`at::DynamicLibrary`). This avoids circular dependency in cmake, which will be a nightmare to handle. For details, look at `torch/csrc/jit/codegen/cuda/interface.cpp::LoadingNvfuserLibrary`
Future work that's scoped in following PR:
- Currently since nvfuser codegen has dependency on torch, we need to refactor that out so we can move nvfuser into a submodule and not rely on dlopen to load the library. @malfet
- Since we moved nvfuser into a cmake build, we effectively disabled bazel build for nvfuser. This could impact internal workload at Meta, so we need to put support back. cc'ing @vors
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89621
Approved by: https://github.com/davidberard98
# Summary
Add support for fused attention kernels (FlashAttention and memory-efficient attention) on Windows. Previously we could not do this because the fixes required C++17, but we have since updated the PyTorch standard.
This PR:
- Changes invocations of unsigned long to the fixed width integer type
- Adds in the #define FP16_SWITCH(COND, ...) which has been added to the flash_attention main branch
- Changes some macros used within the mem-efficient attention code in order to work around the VA_ARG discrepancy between clang/gcc and msvc. An alternative would be setting the global flag Zc:preprocessor
- Selectively applies /Zc:lambda to only the mem-efficient sources since applying this globally caused quantization files to not compile
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91909
Approved by: https://github.com/cpuhrsch
I'm at a loss to explain why this happens, but not setting the manifest file explicitly in the linker fixes it.
### Testing locally
* With `/MANIFESTFILE:bin\torch_python.dll.manifest`
```
C:\PROGRA~2\MICROS~2\2019\BUILDT~1\VC\Tools\MSVC\1428~1.293\bin\Hostx64\x64\link.exe /nologo @CMakeFiles\torch_python.rsp /out:bin\torch_python.dll /implib:lib\torch_python.lib /pdb:bin\torch_python.pdb /dll /version:0.0 /machine:x64 /ignore:4049 /ignore:4217 /ignore:4099 /INCREMENTAL:NO /NODEFAULTLIB:LIBCMT.LIB -WHOLEARCHIVE:C:/actions-runner/_work/pytorch/pytorch/build/lib/onnx.lib /MANIFEST /MANIFESTFILE:bin\torch_python.dll.manifest
LINK : fatal error LNK1000: Internal error during CImplib::EmitImportThunk
```
* Works fine without the flag
```
C:\PROGRA~2\MICROS~2\2019\BUILDT~1\VC\Tools\MSVC\1428~1.293\bin\Hostx64\x64\link.exe /nologo @CMakeFiles\torch_python.rsp /out:bin\torch_python.dll /implib:lib\torch_python.lib /pdb:bin\torch_python.pdb /dll /version:0.0 /machine:x64 /ignore:4049 /ignore:4217 /ignore:4099 /INCREMENTAL:NO /NODEFAULTLIB:LIBCMT.LIB -WHOLEARCHIVE:C:/actions-runner/_work/pytorch/pytorch/build/lib/onnx.lib /MANIFEST
```
In both cases, the `/MANIFEST` flag is set, so the manifest file is there. In the latter case, the filename is derived by appending the `.manifest` suffix to `bin\torch_python.dll`, so it is still correctly `bin\torch_python.dll.manifest`. Weird.
```
C:\actions-runner\_work\pytorch\pytorch>ls -la build/bin/torch_*
-rwxr-xr-x 1 runneruser 197121 246796288 Jan 11 04:30 build/bin/torch_cpu.dll
-rw-r--r-- 1 runneruser 197121 381 Jan 11 04:26 build/bin/torch_cpu.dll.manifest
-rwxr-xr-x 1 runneruser 197121 9728 Jan 11 03:55 build/bin/torch_global_deps.dll
-rw-r--r-- 1 runneruser 197121 381 Jan 11 03:55 build/bin/torch_global_deps.dll.manifest
-rwxr-xr-x 1 runneruser 197121 11746816 Jan 11 04:31 build/bin/torch_python.dll
-rw-r--r-- 1 runneruser 197121 381 Jan 11 04:30 build/bin/torch_python.dll.manifest
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91988
Approved by: https://github.com/malfet, https://github.com/Blackhex, https://github.com/ZainRizvi
The main changes are:
1. Remove outdated checks for old compiler versions because they can't support C++17.
2. Remove outdated CMake checks because it now requires 3.18.
3. Remove outdated CUDA checks because we are moving to CUDA 11.
Almost all changes are in CMake files for easy auditing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90599
Approved by: https://github.com/soumith
## Job
Test running on most CI jobs.
## Test binary
* `test_main.cpp`: entry for gtest
* `test_operator_registration.cpp`: test cases for gtest
## Helper sources
* `operator_registry.h/cpp`: simple operator registry for testing purpose.
* `Evalue.h`: a boxed data type that wraps ATen types, for testing purpose.
* `selected_operators.yaml`: operators Executorch cares about so far; we should cover all of them.
## Templates
* `NativeFunctions.h`: for generating headers for native functions. (not compiled in the test, since we will be using `libtorch`)
* `RegisterCodegenUnboxedKernels.cpp`: for registering boxed operators.
* `Functions.h`: for declaring operator C++ APIs. Generated `Functions.h` merely wraps `ATen/Functions.h`.
## Build files
* `CMakeLists.txt`: generate code to register ops.
* `build.sh`: driver file, to be called by CI job.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89596
Approved by: https://github.com/ezyang
Set `cmake.dir` to `/usr/local` in `.circleci/scripts/build_android_gradle.sh `
Prep change for raising the compiler standard to C++17: cmake-3.18 is the first version to support the CUDA17 language standard
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89570
Approved by: https://github.com/atalman
Adds `/FS` option to `CMAKE_CXX_FLAGS` and `CMAKE_CUDA_FLAGS`.
So far I've encountered this kind of error:
```
C:\Users\MyUser\AppData\Local\Temp\tmpxft_00004728_00000000-7_cuda.cudafe1.cpp: fatal error C1041: cannot open program database 'C:\Projects\pytorch\build\third_party\gloo\gloo\CMakeFiles\gloo_cuda.dir\vc140.pdb'; if multiple CL.EXE write to the same .PDB file, please use /FS
```
when building with VS 2022.
cc @peterjc123 @mszhanyi @skyline75489 @nbcsm
Related issues:
- https://github.com/pytorch/pytorch/issues/87691
- https://github.com/pytorch/pytorch/issues/39989
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88084
Approved by: https://github.com/ezyang
We link with cuDNN and cuBLAS dynamically for all configs anyway: the statically linked cuDNN is a different library than the dynamically linked one, it increases the default memory footprint, etc. Also, libtorch_cuda, even if compiled for all GPU architectures, no longer approaches the 2Gb binary size limit, so BUILD_SPLIT_CUDA can go away.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87502
Approved by: https://github.com/atalman
- `vector<T>({0})` would give you the vector(size, ...) ctor and produce an empty vector of T, along with the scalar-init warning
- `vector<T>({T(0)})` would give you the vector of a single T(0) as you might have intended, and bypasses the warning/error
- the warning can easily be missed but can have serious consequences, so make it an error
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86911
Approved by: https://github.com/albanD
Move functorch/functorch into `functorch` folder
- Add functorch/CMakeLists.txt that adds the `functorch` native python extension
- Modify `setup.py` to package pytorch and functorch together into a single wheel
- Modify `functorch.__version__` to be equal to that of `torch.__version__`
- Add dummy `functorch/setup.py` file for the projects that still want to build it
Differential Revision: [D39058811](https://our.internmc.facebook.com/intern/diff/D39058811)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83464
Approved by: https://github.com/zou3519
# Summary:
- I added a new submodule, Cutlass, pointing to the 2.10 release. The inclusion of flash_attention code should be gated by the flag USE_FLASH_ATTENTION. This defaults to off, so flash is not built anywhere. This is done on purpose since we don't have A100 machines to compile and test on.
- Only looked at CMake; did not attempt bazel or buck yet.
- I included the mha_fwd from flash_attention that has been refactored to use cutlass 2.10. There is currently no backwards kernel on this branch. That would be a good follow up.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81434
Approved by: https://github.com/cpuhrsch
Summary: Currently, build_mobile.sh doesn't allow lite interpreter builds or tracing based selective builds. build_mobile.sh is used for host builds of PyTorch for Mobile deployment.
Additionally, certain flags such as `USE_BLAS` were not being respected as they should be. This change addresses that as well.
Test Plan: Build using:
```
cat /tmp/selected_ops.yaml
- aten::add
- aten::sub
```
```
BUILD_PYTORCH_MOBILE_WITH_HOST_TOOLCHAIN=1 USE_LIGHTWEIGHT_DISPATCH=0 BUILD_LITE_INTERPRETER=1 SELECTED_OP_LIST=/tmp/selected_ops.yaml ./scripts/build_mobile.sh
```
```
cat /tmp/main.cpp
int main() {
auto m = torch::jit::_load_for_mobile("/tmp/path_to_model.ptl");
auto res = m.forward({});
return 0;
}
```
Test using:
```
g++ /tmp/main.cpp -L build_mobile/lib/ -I build_mobile/install/include/ -lpthread -lc10 -ltorch_cpu -ltorch -lXNNPACK -lpytorch_qnnpack -lcpuinfo -lclog -lpthreadpool -lgloo -lkineto -lfmt -ldl -lc10
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84647
Approved by: https://github.com/JacobSzwejbka, https://github.com/cccclai
We're no longer building Caffe2 mobile as part of our CI, and it adds a lot of clutter to our make files. Any lingering internal dependencies will use the buck build and so won't be affected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84338
Approved by: https://github.com/dreiss
And use it throughout the CMakeLists and rectify `IF(APPLE)`/`IF(GNU_CXX_VERSION VERSION_GREATER A.B)` and so on
Also, add `target_compile_options_if_supported` and use it in `Dependencies.cmake` as well as in the test's `CMakeLists.txt`
Delete `-Wno-unknown-warning-option` to test that the conditions are indeed working as expected
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82883
Approved by: https://github.com/seemethere
RocksDB 7 starts to use C++17 in header.
We should make this configurable, in case user needs higher std version.
List of files to change is found by `git grep 'CMAKE_[^_]*_STANDARD'`.
Doc string is from CMake code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75519
Approved by: https://github.com/malfet
Fixes: #80489
Test using cuda 11.3 manywheel binary:
```
import torch
print(torch.__version__)
print(torch._C._PYBIND11_BUILD_ABI)
```
Output
```
1.13.0.dev20220707+cu113
_cxxabi1011
```
Functorch test torch : 1.13.0.dev20220707+cu113, functorch with cu102
```
import torch
print(torch.__version__)
print(torch._C._PYBIND11_BUILD_ABI)
from functorch import vmap
x = torch.randn(2, 3, 5)
vmap(lambda x: x, out_dims=3)(x)
```
Output
```
1.13.0.dev20220707+cu113
_cxxabi1011
/home/atalman/temp/testc1.py:5: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:73.)
x = torch.randn(2, 3, 5)
Traceback (most recent call last):
File "/home/atalman/temp/testc1.py", line 6, in <module>
vmap(lambda x: x, out_dims=3)(x)
File "/home/atalman/conda/lib/python3.9/site-packages/functorch/_src/vmap.py", line 361, in wrapped
return _flat_vmap(
File "/home/atalman/conda/lib/python3.9/site-packages/functorch/_src/vmap.py", line 488, in _flat_vmap
return _unwrap_batched(batched_outputs, out_dims, vmap_level, batch_size, func)
File "/home/atalman/conda/lib/python3.9/site-packages/functorch/_src/vmap.py", line 165, in _unwrap_batched
flat_outputs = [
File "/home/atalman/conda/lib/python3.9/site-packages/functorch/_src/vmap.py", line 166, in <listcomp>
_remove_batch_dim(batched_output, vmap_level, batch_size, out_dim)
IndexError: Dimension out of range (expected to be in range of [-3, 2], but got 3)
```
Related Builder PR: https://github.com/pytorch/builder/pull/1083
Test PR: https://github.com/pytorch/pytorch/pull/81232
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81058
Approved by: https://github.com/zou3519, https://github.com/malfet
Summary:
This diff integrates UCC process group as a native component of Pytorch Distributed core. It is based on the existing torch-ucc (https://github.com/facebookresearch/torch_ucc) as the wrapper for UCC collective communication library.
The environment and cmake variables are named to mirror those of the existing process groups such as NCCL and Gloo. Specifically,
- USE_UCC: enables UCC PG. This defaults to OFF, so there is no breakage of existing builds that do not have UCX/UCC external libraries.
- USE_SYSTEM_UCC: uses external UCX and UCC shared libraries that are set accordingly with UCX_HOME and UCC_HOME.
Currently, this diff only supports USE_SYSTEM_UCC=ON, i.e., requiring users to specify external libraries for UCX and UCC. In subsequent diffs, we will add UCX and UCC repos as third-party dependencies in pytorch/third-party.
Test Plan:
Passed Torch-UCC tests that invoke UCC process group. For example:
$ sh test/start_test.sh test/torch_allreduce_test.py --backend gloo --use-cuda
...
Test allreduce: succeeded
Differential Revision: D36973688
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79918
Approved by: https://github.com/kwen2501, https://github.com/kingchc
This PR relands #80584, but instead of adding suppression in CMakeLists.txt suppresses it directly in `llvm_codegen.cpp` and just for a single header.
In general, it's better to avoid `set_target_properties` pattern for suppressing warnings, as it makes build brittle and hard to debug/understand
Test plan: wait for `ciflow/binaries_wheel` to finish
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81012
Approved by: https://github.com/huydhn, https://github.com/kit1980
`xcrun --sdk macosx --show-sdk-version` works with either CommandLineTools or Xcode, but `xcodebuild -sdk macosx -version SDKVersion` works only if full Xcode is installed, which is not necessary to build PyTorch
The above commands yield the same output when Xcode is installed:
```
% xcodebuild -sdk macosx -version SDKVersion
12.3
% xcrun --sdk macosx --show-sdk-version
12.3
```
But first one fails if Xcode is missing:
```
% xcodebuild -sdk macosx -version SDKVersion
xcode-select: error: tool 'xcodebuild' requires Xcode, but active developer directory '/Library/Developer/CommandLineTools' is a command line tools instance
% xcrun --sdk macosx --show-sdk-version
12.3
```
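For scripts that need the SDK version outside of CMake, the same preference for `xcrun` can be expressed as a small helper. This is a minimal sketch, assuming that returning `None` is acceptable when `xcrun` is missing or fails; `macos_sdk_version` is a hypothetical name and is not part of the PyTorch build.

```python
import shutil
import subprocess
from typing import Optional


def macos_sdk_version() -> Optional[str]:
    """Return the macOS SDK version reported by xcrun, or None if unavailable."""
    # xcrun works with either CommandLineTools or a full Xcode install.
    if shutil.which("xcrun") is None:
        return None
    result = subprocess.run(
        ["xcrun", "--sdk", "macosx", "--show-sdk-version"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None
```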
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79430
Approved by: https://github.com/albanD
Re-landing #68111/#74596
## Description
v0.5 PR of this [RFC](https://github.com/pytorch/pytorch/issues/49444).
On the basis of #50256, the below improvements are included:
* The [v0.5 release branch](https://github.com/oneapi-src/oneDNN/releases/tag/graph-v0.5) of the oneDNN Graph API is used
* The fuser now works with the profiling graph executor. We have inserted type check nodes to guard the profiled tensor properties.
### User API:
The optimization pass is disabled by default. Users could enable it by:
```
torch.jit.enable_onednn_fusion(True)
```
`torch.jit.freeze` should be used after tracing (recommended) or scripting a model.
### Performance:
[pytorch/benchmark](https://github.com/pytorch/benchmark) tool is used to compare the performance:
* SkyLake 8180 (1 socket of 28 cores):

* SkyLake 8180 (single thread):

* By mapping hardswish to oneDNN Graph, it’s 8% faster than PyTorch JIT (NNC + OFI)
** We expect performance gain after mapping transpose, contiguous & view to oneDNN graph ops
### Directory structure of the integration code
Fuser-related code is placed under:
```
torch/csrc/jit/codegen/onednn/
```
Optimization pass registration is done in:
```
torch/csrc/jit/passes/onednn_graph_fuser.h
```
CMake for the integration code is in:
```
caffe2/CMakeLists.txt
cmake/public/mkldnn.cmake
cmake/Modules/FindMKLDNN.cmake
```
## Limitations
* In this PR, we only support Pytorch-oneDNN-Graph integration on Linux platform. Support on Windows and MacOS will be enabled as a next step.
* We have only optimized the inference use-case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76622
Approved by: https://github.com/eellison
This functionality does not seem to be used
and there are some requests to update dependency.
Add `third_party` to torch_cpu include directories if compiling with
Caffe2 support, as `caffe2/quantization/server/conv_dnnlowp_op.cc` depends on `third_party/fbgemm/src/RefImplementations.h`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75394
Approved by: https://github.com/janeyx99, https://github.com/seemethere
This is enabled on some of our internal builds, is a common source
of fbcode only errors and apparently we are relatively clean on it.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74996
Approved by: https://github.com/malfet
Summary:
This is the very first step for the UCC-NCCL integration. This PR lets `ProcessGroupNCCL` load the `torch_ucc.so` if the user specifies an environmental variable `TORCH_UCC_LIBRARY_PATH`. If this environment variable is not specified by the user, then there will be no visible change.
In the future, we may want to make PyTorch smart enough to automatically detect the `torch_ucc.so` in the user's system, but before doing that, I believe we should first make sure that `ProcessGroupUCC` is very well tested.
Note that in this PR, `ProcessGroupNCCL` just loads the library but will not use it. I am trying to make PRs small, so the usage of `torch_ucc.so` will be submitted in later PRs.
This PR requires the change in https://github.com/facebookresearch/torch_ucc/pull/56, otherwise `torch_ucc.so` cannot be successfully loaded. But this PR can be landed separately without waiting for https://github.com/facebookresearch/torch_ucc/pull/56 because, in PyTorch's unit tests, UCC is never used or tested.
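A minimal Python sketch of the opt-in loading behaviour described above, assuming nothing beyond the environment variable name given in this summary. The real change lives in the C++ `ProcessGroupNCCL`; the helper name `maybe_load_optional_library` and the use of `ctypes` are illustrative stand-ins, not the actual implementation.

```python
import ctypes
import os


def maybe_load_optional_library(env_var="TORCH_UCC_LIBRARY_PATH"):
    """Load a shared library only when the user opted in via env_var.

    Hypothetical helper: returns None (no visible change) when the
    variable is unset or empty, otherwise a handle to the loaded library.
    """
    path = os.environ.get(env_var)
    if not path:
        return None
    # ctypes.CDLL raises OSError if the library cannot be loaded.
    return ctypes.CDLL(path)
```

With the variable unset the helper does nothing, which mirrors the "no visible change" default; a bad path surfaces as an `OSError` rather than being silently ignored.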
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69552
Reviewed By: mruberry
Differential Revision: D34675212
Pulled By: jiayisuse
fbshipit-source-id: a3d1fb98340dbe3a931af555423863efd381f1ae
(cherry picked from commit 3778b6fabe70c26b5a65e6ddec641d2ef9113cd1)
Summary:
Also enables bazel build to run lazy codegen. Bazel (oss) build feeds off the same filelists as cmake/buck (build_variables.bzl), so enabling it is easier than keeping it disabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74111
Test Plan: Run CI and verify test_lazy_ops is running via OSS cmake builds
Reviewed By: bdhirsh
Differential Revision: D34772403
fbshipit-source-id: 8a63f58b9536e6ac1be530667932176ef2549496
(cherry picked from commit e807ffb1918853d10b924fdc24f85ee5b1a39021)
Per https://github.com/pytorch/pytorch/issues/57744 statically linked CUPTI
causes exception handling to break on certain compiler configurations, likely
because CUPTI comes with incompatible libstdc++ symbols. Rather than pray that
something reasonable happens, use the safer configuration (dynamic linking) by
default and give a warning if the user inverts the setting.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74009
Approved by: https://github.com/malfet
Summary:
RFC: https://github.com/pytorch/rfcs/pull/40
This PR (re)introduces python codegen for unboxing wrappers. Given an entry of `native_functions.yaml` the codegen should be able to generate the corresponding C++ code to convert ivalues from the stack to their proper types. To trigger the codegen, run
```
tools/jit/gen_unboxing.py -d cg/torch/share/ATen
```
Merged changes on CI test. In https://github.com/pytorch/pytorch/issues/71782 I added an e2e test for static dispatch + codegen unboxing. The test exports a mobile model of mobilenetv2, then loads and runs it on a new binary for lite interpreter: `test/mobile/custom_build/lite_predictor.cpp`.
## Lite predictor build specifics
1. Codegen: `gen.py` generates `RegisterCPU.cpp` and `RegisterSchema.cpp`. Now with this PR, once `static_dispatch` mode is enabled, `gen.py` will not generate `TORCH_LIBRARY` API calls in those cpp files, hence avoids interaction with the dispatcher. Once `USE_LIGHTWEIGHT_DISPATCH` is turned on, `cmake/Codegen.cmake` calls `gen_unboxing.py` which generates `UnboxingFunctions.h`, `UnboxingFunctions_[0-4].cpp` and `RegisterCodegenUnboxedKernels_[0-4].cpp`.
2. Build: `USE_LIGHTWEIGHT_DISPATCH` adds generated sources into `all_cpu_cpp` in `aten/src/ATen/CMakeLists.txt`. All other files remain unchanged. In reality all the `Operators_[0-4].cpp` are not necessary but we can rely on linker to strip them off.
## Current CI job test coverage update
Created a new CI job `linux-xenial-py3-clang5-mobile-lightweight-dispatch-build` that enables the following build options:
* `USE_LIGHTWEIGHT_DISPATCH=1`
* `BUILD_LITE_INTERPRETER=1`
* `STATIC_DISPATCH_BACKEND=CPU`
This job triggers `test/mobile/lightweight_dispatch/build.sh` and builds `libtorch`. Then the script runs C++ tests written in `test_lightweight_dispatch.cpp` and `test_codegen_unboxing.cpp`. Recent commits added tests to cover as many C++ argument types as possible: in `build.sh` we installed the PyTorch Python API so that we can export test models in `tests_setup.py`. Then we run the C++ test binary to run these models on the lightweight dispatch enabled runtime.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69881
Reviewed By: iseeyuan
Differential Revision: D33692299
Pulled By: larryliu0820
fbshipit-source-id: 211e59f2364100703359b4a3d2ab48ca5155a023
(cherry picked from commit 58e1c9a25e3d1b5b656282cf3ac2f548d98d530b)
Summary:
When enabled, it will generate the `torch_cuda_linalg` library, which depends on cuSOLVER and MAGMA and registers dynamic bindings to it from LinearAlgebraStubs
Avoid symbol clashes that can result in infinite recursion by moving all symbols in the library to its own namespace.
Add checks that should prevent calling self in recursion to `LinearAlgebraStubs.cpp`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73447
Reviewed By: albanD
Differential Revision: D34538827
Pulled By: malfet
fbshipit-source-id: f2535b471d3524768a84b2e169b6aa24c26c03bf
(cherry picked from commit 4ec24b079c861c1122f0fa86e280b977c3c2f7ac)
Summary:
Fixes : https://github.com/pytorch/pytorch/issues/73377
We migrated to CUDA 11.3 as the default toolkit in 1.9, so it's time to stop the older builds (especially considering the forward-compatibility guarantee across CUDA 11.x drivers).
Hence we are removing CUDA 11.1 support. We should also clean up old CUDA-related code from our builder and pytorch repos, making the scripts a little cleaner.
We have code that references CUDA 9.2, 10.1, 11.0, 11.1 and 11.2, and none of these are currently used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73514
Reviewed By: janeyx99
Differential Revision: D34551989
Pulled By: atalman
fbshipit-source-id: 9ceaaa9b25ad49689986f4b29a26d20370d9d011
(cherry picked from commit fe109c62daf429e9053c03f6e374568ba23cd041)
Summary:
- Target Sha1: ae108ef49aa5623b896fc93d4298c49d1750d9ba
- Make USE_XNNPACK a dependent option on cmake minimum version 3.12
- Print USE_XNNPACK under cmake options summary, and print the
availability from collect_env.py
- Skip XNNPACK based tests when XNNPACK is not available
- Add SkipIfNoXNNPACK wrapper to skip tests
- Update cmake version for xenial-py3.7-gcc5.4 image to 3.12.4
- This is required for the backwards compatibility test.
The PyTorch op schema is XNNPACK dependent. See,
aten/src/ATen/native/xnnpack/RegisterOpContextClass.cpp for
example. The nightly version is assumed to have USE_XNNPACK=ON,
so with this change we ensure that the test build can also
have XNNPACK.
- HACK: skipping test_xnnpack_integration tests on ROCM
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72642
Reviewed By: kimishpatel
Differential Revision: D34456794
Pulled By: digantdesai
fbshipit-source-id: 85dbfe0211de7846d8a84321b14fdb061cd6c037
(cherry picked from commit 6cf48e7b64d6979962d701b5d493998262cc8bfa)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72306
When enabled, it will generate the `torch_cuda_linalg` library, which depends on cuSOLVER and MAGMA and registers dynamic bindings to it from LinearAlgebraStubs
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D33992795
Pulled By: malfet
fbshipit-source-id: d1fa351a320659b29754997c20d754e69bfe36c0
(cherry picked from commit d5d6c69a988b9454538ecd28674206da2541de17)
Summary:
We noticed that on M1 Macs Transformer network profiles are dominated by scalar `exp` and `erff` functions (for softmax and GELU).
The NEON `Vectorized<float>` implementation does not use SLEEF functions in order to compile on mobile platforms. However, SLEEF is already compiled on macOS ARM64 and is safe to use there. This change adds another implementation of `Vectorized<float>` that uses SLEEF functions. This implementation is only used on macOS ARM64.
This change speeds up e.g. prediction of spaCy transformer models by 20% on M1 Macs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70354
Reviewed By: albanD
Differential Revision: D33659540
Pulled By: kimishpatel
fbshipit-source-id: b8f02a61321873fc60778190a005c466c7d0cc0c
(cherry picked from commit 71286a207c)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68247
This splits `Functions.h`, `Operators.h`, `NativeFunctions.h` and
`NativeMetaFunctions.h` into separate headers per operator base name.
With `at::sum` as an example, we can include:
```cpp
<ATen/core/sum.h> // Like Functions.h
<ATen/core/sum_ops.h> // Like Operators.h
<ATen/core/sum_native.h> // Like NativeFunctions.h
<ATen/core/sum_meta.h> // Like NativeMetaFunctions.h
```
The umbrella headers are still being generated, but all they do is
include from the `ATen/ops` folder.
Further, `TensorBody.h` now only includes the operators that have
method variants. Which means files that only include `Tensor.h` don't
need to be rebuilt when you modify function-only operators. Currently
there are about 680 operators that don't have method variants, so this
is potentially a significant win for incremental builds.
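As a rough illustration of the method-variant distinction above, here is a small Python sketch that partitions operator entries by whether they declare a method variant. It assumes `native_functions.yaml`-style entries where `variants` is a comma-separated string that defaults to `function` when omitted; the sample entries and helper names are hypothetical and this is not the actual codegen.

```python
def has_method_variant(entry):
    """Return True if an operator entry declares a method variant.

    Assumption: `variants` is a comma-separated string and defaults to
    "function" when the key is missing.
    """
    variants = entry.get("variants", "function")
    return "method" in {v.strip() for v in variants.split(",")}


def split_by_method_variant(entries):
    """Partition entries into (has a method variant, function-only)."""
    with_method = [e["func"] for e in entries if has_method_variant(e)]
    function_only = [e["func"] for e in entries if not has_method_variant(e)]
    return with_method, function_only


# Hypothetical entries, for illustration only.
sample_entries = [
    {"func": "op_a(Tensor self) -> Tensor", "variants": "function, method"},
    {"func": "op_b(Tensor self) -> Tensor"},
    {"func": "op_c(Tensor self) -> Tensor", "variants": "method"},
]
with_method, function_only = split_by_method_variant(sample_entries)
```

Only the operators in `with_method` would need to appear in a `TensorBody.h`-like header; editing a `function_only` operator would then leave files that include just `Tensor.h` untouched.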
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D32596272
Pulled By: albanD
fbshipit-source-id: 447671b2b6adc1364f66ed9717c896dae25fa272
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69710
Namely no range-loop-analysis (which detects when a loop variable cannot be a const reference)
Test Plan: Imported from OSS
Reviewed By: r-barnes
Differential Revision: D32997003
Pulled By: malfet
fbshipit-source-id: dba0e7875e5b667e2cc394c70dd75e2403265918
Summary:
This fixes the `USE_PRECOMPILED_HEADERS` cmake version check which was accidentally inverted, so it was always disabled.
I've also made the precompiled header so it only includes headers used in 95% or more of code, weighted by compile time. This limits it to the standard library, `c10` and a limited subset of `ATen/core`. Crucially, the new pch doesn't depend on `native_functions.yaml` so won't cause as much unnecessary rebuilding.
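The selection rule above (headers used in 95% or more of code, weighted by compile time) can be sketched as follows. This is an assumption-laden illustration, not the script used for this PR: the input format, the function name `select_pch_headers` and the sample timings are all hypothetical.

```python
def select_pch_headers(translation_units, threshold=0.95):
    """Pick headers whose compile-time-weighted usage meets the threshold.

    translation_units maps a source file name to a tuple of
    (compile_seconds, iterable_of_included_headers). Hypothetical format.
    """
    total = sum(seconds for seconds, _ in translation_units.values())
    if total <= 0:
        return []
    weight = {}
    for seconds, headers in translation_units.values():
        # Count each header once per translation unit.
        for header in set(headers):
            weight[header] = weight.get(header, 0.0) + seconds
    return sorted(h for h, w in weight.items() if w / total >= threshold)


# Hypothetical measurements, for illustration only.
units = {
    "a.cpp": (10.0, ["<vector>", "c10/util/Exception.h", "ATen/Functions.h"]),
    "b.cpp": (30.0, ["<vector>", "c10/util/Exception.h"]),
    "c.cpp": (60.0, ["<vector>", "c10/util/Exception.h", "ATen/Functions.h"]),
}
selected = select_pch_headers(units)
```

In this made-up data `ATen/Functions.h` accounts for only 70% of the weighted compile time, so it stays out of the precompiled header.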
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67851
Reviewed By: zou3519
Differential Revision: D32290902
Pulled By: dagitses
fbshipit-source-id: dfc33330028c99b02ff40963926c1f1260d00d00
Summary:
There is a bug in CMake's Ninja generator where files considered inputs to the cmake command couldn't be generated by another build step. The fix was included in CMake 3.13, but 3.10.3 is still sufficient for other cmake generators e.g. makefiles.
For reference, the bug is here https://gitlab.kitware.com/cmake/cmake/-/issues/18584
This is necessary for https://github.com/pytorch/pytorch/issues/68246 but I'm isolating the change here to make testing easier.
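The generator-dependent requirement described above can be written down as a small check. This is a Python sketch of the rule only (the actual check is done in CMake); the function names are hypothetical and the two version numbers are the ones quoted in this summary.

```python
def parse_version(text):
    """Turn a dotted version string such as '3.13.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def minimum_cmake_version(generator):
    """Minimum CMake version per generator, per the summary above."""
    if generator == "Ninja":
        return (3, 13)
    return (3, 10, 3)


def is_supported(generator, cmake_version):
    """Return True if cmake_version is new enough for the given generator."""
    return parse_version(cmake_version) >= minimum_cmake_version(generator)
```

For example, `is_supported("Unix Makefiles", "3.10.3")` holds while `is_supported("Ninja", "3.12.4")` does not.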
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68731
Reviewed By: jbschlosser
Differential Revision: D32604545
Pulled By: malfet
fbshipit-source-id: 9bc0bd8641ba415dd63ce21a05c177e2f1dd9866
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68180
Since we've open sourced the tracing-based selective build, we can deprecate the
op-dependency-graph-based selective build and the static analyzer tool that
produces the dependency graph.
ghstack-source-id: 143108377
Test Plan: CIs
Reviewed By: seemethere
Differential Revision: D32358467
fbshipit-source-id: c61523706b85a49361416da2230ec1b035b8b99c
Summary:
CAFFE2 has been deprecated for a while, but still included in every PyTorch build.
We should stop building it by default, although CI should still validate that caffe2 code is buildable.
Build even fewer dependencies when compiling mobile builds without Caffe2
Introduce `TEST_CAFFE2` in torch.common.utils
Skip `TestQuantizedEmbeddingOps` and `TestJit.test_old_models_bc` if the code is compiled without Caffe2
Should be landed after https://github.com/pytorch/builder/pull/864
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66658
Reviewed By: driazati, seemethere, janeyx99
Differential Revision: D31669156
Pulled By: malfet
fbshipit-source-id: 1cc45e2d402daf913a4685eb9f841cc3863e458d
Summary:
As title, introduce the file `TracerRunner` shared by internal/external tracer and the main function is
```
TracerResult trace_run(const std::string& input_module_path);
```
which basically takes the path to the model file and generates the trace result. The main difference between external tracer and internal tracer is
1. the dependency on `<yaml-cpp/yaml.h>`.
2. the output yaml file from internal tracer includes `model_version` and `model_asset`. These are only needed for internal.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64152
ghstack-source-id: 140692467
Test Plan:
```
./build/bin/model_tracer --model_input_path "/Users/chenlai/Documents/pytorch/tracing/deeplabv3_scripted_with_bundled_input.ptl" --build_yaml_path "/Users/chenlai/Documents/pytorch/tracing/tmp.yaml"
```
```
./fbcode/caffe2/fb/model_tracer/run_model_with_bundled_inputs.sh ~/local/notebooks/prod_models/deeplabv3_scripted_with_bundled_input.ptl
```
have the same operator output
selected_operators.yaml (P460296279)
selected_mobile_ops.h (P460296258)
Reviewed By: dhruvbird
Differential Revision: D30632224
fbshipit-source-id: eb0321dbc0f1fcf6d2e05384695eebb59ac04f8c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65401
Per https://github.com/pytorch/pytorch/issues/57744 statically linked CUPTI
causes exception handling to break on certain compiler configurations, likely
because CUPTI comes with incompatible libstdc++ symbols. Rather than pray that
something reasonable happens, use the safer configuration (dynamic linking) by
default and give a warning if the user inverts the setting.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: gdankel
Differential Revision: D31082208
Pulled By: ezyang
fbshipit-source-id: 14f66af920847e158436b5801c43f3124b109b34
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62445
PyTorch currently uses the old style of compiling CUDA in CMake which is just a
bunch of scripts in `FindCUDA.cmake`. Newer versions support CUDA natively as
a language just like C++ or C.
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D31503350
fbshipit-source-id: 2ee817edc9698531ae1b87eda3ad271ee459fd55
Summary:
Previously https://github.com/pytorch/pytorch/pull/64087 broke the test `binary_macos_wheel_3_7_cpu_build`, because the wheel build is not happy with `model_tracer`. Considering it's a prototype and there is no need to ship model_tracer via wheel at the moment, use the option `TRACING_BASED` for building the tracer. When tracing-based is mature enough, we can ship the tracer binary via wheel eventually.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66267
Original commit changeset: 8ac3d75a52d0
ghstack-source-id: 140122106
Test Plan:
binary_macos_wheel_3_7_cpu_build passes
Reviewed By: dhruvbird
Differential Revision: D31478593
fbshipit-source-id: 726cab1b31c4596f6268b7824eecb20e2e59d161
Summary:
Delete `-Wno-unused-variable` from top level `CMakeLists.txt`
Still suppress those warnings for tests and `torch_python`
Delete a number of unused variables from caffe2 code
Use `(void)var;` to suppress unused variable in range loops
Use `C10_UNUSED` for global constructors and use `constexpr` instead of `static` for global constants
Do not delete `caffe2::OperatorBase::Output` calls as they have side effects
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66041
Reviewed By: ngimel
Differential Revision: D31360142
Pulled By: malfet
fbshipit-source-id: 6fdfb9f91efdc49ca984a2f2a17ee377d28210c8
Summary:
Delete `-Wno-unused-variable` from top level `CMakeLists.txt`
Still suppress those warnings for tests and `torch_python`
Delete a number of unused variables from caffe2 code
Use `(void)var;` to suppress unused variable in range loops
Use `C10_UNUSED` for global constructors and use `constexpr` instead of `static` for global constants
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65954
Reviewed By: ngimel
Differential Revision: D31326599
Pulled By: malfet
fbshipit-source-id: 924155f1257a2ba1896c50512f615e45ca1f61f3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65732
For certain on-device uses, runtime memory comes at a premium. On-device deployments won't use all the available dispatch keys, so it makes sense to keep only the on-device specific ones around for such uses to reduce runtime heap memory allocated.
This change keeps just 10 dispatch keys (the ones that are used on-device), guarded under the `C10_MOBILE_TRIM_DISPATCH_KEYS` macro. It tries to keep the other code paths unaffected and uses `constexpr` for use in the `array` declaration, and simple inline functions to ensure that the compiler is able to optimize these for server builds.
Test Plan:
Build and check mobile models end to end.
```
buck build -c "pt.enable_milan_dispatch_keys_trimming"=1 //xplat/caffe2/fb/lite_predictor:lite_predictor
```
Reviewed By: ezyang
Differential Revision: D31185407
fbshipit-source-id: e954765606373dea6ee9466a851dca7684167b0b
Summary:
Syncing the nvfuser code base from the devel branch. Listing a few of our developments since the last sync:
- Extends support to normalization and reduction kernels.
- Multiple kernel launch for single `CudaFusionGroup`. Hierarchical caching system has been updated to cache graph segmentation.
- profile_ivalue is enabled to convert dynamic scalar into compile time constants, which are required by the codegen. (e.g. reduction axes).
To keep this PR simple and relatively review-free, we stripped most external changes and submitted them as separate PRs, so this gigantic PR is easier to handle.
Internal updates are files located in:
1. updates in nvfuser codegen `torch/csrc/jit/codegen/cuda`
2. added nvfuser specific benchmarks `benchmarks/cpp/nvfuser`
3. nvfuser jit cpp tests `test/cpp/jit/test_gpu.cpp` `test/cpp/jit/test_gpu_shift.cpp` `test/cpp/jit/test_gpu_validator.h`
updates affecting integration:
1. profile_ivalue enabled for nvfuser. related changes are in `torch/csrc/jit/runtime/*`,
2. exposed a few more symbols `aten/src/ATen/core/*` used by codegen
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63745
Reviewed By: saketh-are
Differential Revision: D30752939
Pulled By: malfet
fbshipit-source-id: ce122e80f01bcd3865f5bd3c4dfde660665fd84c
Summary:
The library will no longer link properly on VS 2019 (14.29.30133). To
ensure that engineers building on Windows can use and debug with this
build type, incremental linking needs to be turned off for this build
flag.
Verified that this build type successfully builds, links, and provides
debuggable Python modules on Windows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64892
Reviewed By: jbschlosser
Differential Revision: D30902565
Pulled By: malfet
fbshipit-source-id: e5286a4c6f45c7cbe4cdc1b98560129bd386970b
Summary:
We currently build breakpad from [this fork](https://github.com/driazati/breakpad) to include extra logic to restore signal handlers that were previously present. With some [new additions](https://github.com/google/breakpad/compare/main...driazati:main) this fork now includes a CMake based build, so we can add breakpad as a proper dependency rather than rely on including it in Docker images as a system library which is error prone (we have a bunch of images) and hard to extend to MacOS / Windows. This also includes some changes to the crash handling code to support MacOS / Windows in a similar way to Linux.
```python
import torch
# On Windows this writes crashes to C:\Users\<user>\AppData\pytorch_crashes
# On MacOS/Linux this writes crashes to /tmp/pytorch_crashes
torch.utils._crash_handler.enable_minidumps()
# Easy way to cause a segfault and trigger the handler
torch.bincount(input=torch.tensor([9223372036854775807]))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63186
Reviewed By: malfet, seemethere
Differential Revision: D30318404
Pulled By: driazati
fbshipit-source-id: 0d7daf3701cfaba5451cc529a0730272ab1eb1dc
Summary:
This option was added in https://github.com/pytorch/pytorch/issues/61940 and fits with this section's theme of improving build times.
I've also changed it to a `cmake_dependent_option` instead of `FATAL_ERROR`ing for older CMake versions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62827
Reviewed By: astaff
Differential Revision: D30342102
Pulled By: malfet
fbshipit-source-id: 3095b44b7085aee8a884ec95cba9f8998d4442e7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62419
This diff adds support for cpu only kineto profiler on mobile. Thus
enabling chrome trace generation on mobile. This brings the cpp API for
mobile profiling on par with TorchScript.
This is done via:
1. Utilizing debug handle annotations in KinetoEvent.
2. Adding post processing capability, via callbacks, to
KinetoThreadLocalState
3. Creating a new RAII style profiler, KinetoEdgeCPUProfiler, which can be
used in surrounding scope of model execution. This will write chrome
trace to the location specified in profiler constructor.
Test Plan:
MobileProfiler.ModuleHierarchy
Imported from OSS
Reviewed By: raziel
Differential Revision: D29993660
fbshipit-source-id: 0b44f52f9e9c5f5aff81ebbd9273c254c3c03299
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61940
This adds a `USE_PRECOMPILED_HEADERS` option to the CMake build which
precompiles `ATen.h` and also `CUDAContext.h` for the cuda library.
After making a change in `native_functions.yaml`, this speeds up compilation
time by around 15% on my machine.
Test Plan: Imported from OSS
Reviewed By: heitorschueroff
Differential Revision: D29988775
Pulled By: malfet
fbshipit-source-id: a23c468c958a8b74ebaef052a5b2e5fa3836c64b
Summary:
This PR: (1) enables the use of a system-provided Intel TBB for building PyTorch, (2) removes `tbb::task_scheduler_init` references since it was removed from TBB a while ago, and (3) marks the implementation of `_internal_set_num_threads` with a TODO as it requires a revision that fixes its thread allocation logic.
Tested with `test/run_test`; no new tests are introduced since there are no behavioral changes (removal of `tbb::task_scheduler_init` has no impact on the runtime behavior).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61934
Reviewed By: malfet
Differential Revision: D29805416
Pulled By: cbalioglu
fbshipit-source-id: 22042b428b57b8fede9dfcc83878d679a19561dd
Summary:
To make build behaviour aligned with other third_party/ libraries,
introduce the `USE_SYSTEM_PYBIND11` build option, which is set to OFF by
default, meaning PyTorch will be built with the bundled pybind11 even if
another version is already installed locally.
Fixes https://github.com/pytorch/pytorch/issues/58750
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58951
Reviewed By: driazati
Differential Revision: D28690411
Pulled By: malfet
fbshipit-source-id: e56b5a8f2a23ee1834b2a6d3807f287149decf8c
Summary:
Since v1.7, oneDNN (MKL-DNN) has supported the use of Compute Library
for the Arm architecture to provide optimised convolution primitives
on AArch64.
This change enables the use of Compute Library in the PyTorch build.
Following the approach used to enable the use of CBLAS in MKLDNN,
it is enabled by setting the env vars USE_MKLDNN and USE_MKLDNN_ACL.
The location of the Compute Library build must be set using `ACL_ROOT_DIR`.
This is an extension of the work in https://github.com/pytorch/pytorch/pull/50400
which added support for the oneDNN/MKL-DNN backend on AArch64.
_Note: this assumes that Compute Library has been built and installed at
ACL_ROOT_DIR. Compute library can be downloaded here:
`https://github.com/ARM-software/ComputeLibrary`_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55913
Reviewed By: ailzhang
Differential Revision: D28559516
Pulled By: malfet
fbshipit-source-id: 29d24996097d0a54efc9ab754fb3f0bded290005
Summary:
This PR is step 0 of adding PyTorch convolution bindings using the cuDNN frontend. The cuDNN frontend is the recommended way of using cuDNN v8 API. It is supposed to have faster release cycles, so that, for example, if people find a specific kernel has a bug, they can report it, and that kernel will be blocked in the cuDNN frontend and frameworks could just update that submodule without the need for waiting for a whole cuDNN release.
The work is not complete, and this PR is only step 0.
**What this PR does:**
- Add cudnn-frontend as a submodule.
- Modify cmake to build that submodule.
- Add bindings for convolution forward in `Conv_v8.cpp`, which is disabled by a macro by default.
- Tested manually by enabling the macro and running `test_nn.py`. All tests pass except those mentioned below.
**What this PR doesn't:**
- Only convolution forward, no backward. The backward will use v7 API.
- No 64bit-indexing support for some configurations. This is a known issue of cuDNN, and will be fixed in a later cuDNN version. PyTorch will not implement any workaround for this issue, but instead, v8 API should be disabled on problematic cuDNN versions.
- No test beyond PyTorch's unit tests.
- Not tested for correctness on real models.
- Not benchmarked for performance.
- Benchmark cache is not thread-safe. (This is marked as `FIXME` in the code, and will be fixed in a follow-up PR)
- cuDNN benchmark is not supported.
- There are failing tests, which will be resolved later:
```
FAILED test/test_nn.py::TestNNDeviceTypeCUDA::test_conv_cudnn_nhwc_cuda_float16 - AssertionError: False is not true : Tensors failed to compare as equal!With rtol=0.001 and atol=1e-05, found 32 element(s) (out of 32) whose difference(s) exceeded the margin of error (in...
FAILED test/test_nn.py::TestNNDeviceTypeCUDA::test_conv_cudnn_nhwc_cuda_float32 - AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 32 element(s) (out of 32) whose difference(s) exceeded the margin of error (...
FAILED test/test_nn.py::TestNNDeviceTypeCUDA::test_conv_large_cuda - RuntimeError: CUDNN_BACKEND_OPERATION: cudnnFinalize Failed cudnn_status: 9
FAILED test/test_nn.py::TestNN::test_Conv2d_depthwise_naive_groups_cuda - AssertionError: False is not true : Tensors failed to compare as equal!With rtol=0 and atol=1e-05, found 64 element(s) (out of 64) whose difference(s) exceeded the margin of error (including 0 an...
FAILED test/test_nn.py::TestNN::test_Conv2d_deterministic_cudnn - RuntimeError: not supported yet
FAILED test/test_nn.py::TestNN::test_ConvTranspose2d_groups_cuda_fp32 - RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM
FAILED test/test_nn.py::TestNN::test_ConvTranspose2d_groups_cuda_tf32 - RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM
```
Although this is not a complete implementation of cuDNN v8 API binding, I still want to merge this first. This would allow me to do small and incremental work, for the ease of development and review.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51390
Reviewed By: malfet
Differential Revision: D28513167
Pulled By: ngimel
fbshipit-source-id: 9cc20c9dec5bbbcb1f94ac9e0f59b10c34f62740
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54284
In order to bring mobile deployment, via lite interpreter, to feature
parity with JIT with respect to model level debug information, we must make
model level debug information available to the mobile runtime.
At the moment, model level debug information is stored in SourceRange,
which associates nodes of the graph with where they come from in the original
python source code.
This information is serialized as part of debug_pkl and deserialized
when JIT loads the model and reads the model code.
On lite interpreter, we do not have access to all the functionality of
JIT and hence we cannot load the model in the same way as JIT, by reading
code, constructing the module hierarchy and the graphs corresponding to module
methods etc. Instead, in lite interpreter, only the bytecode corresponding to
the compiled graph, Code, is saved.
Thus in order to annotate OPs in the bytecode with equivalent
SourceRange information we do the following:
1. During model serialization, we create a unique tag for each source
range of the model.
2. Create a map of <SourceRange, tag>
3. During debug_pkl serialization we save tag along with SourceRange, on
top of byte offset.
4. During bytecode generation, the methods of the top module are
lowered. During this process methods are inlined. In the inlined graph,
when the node of a graph is lowered to bytecode, we query node's source
range and look it up against the map.
5. Resulting source range tag is serialized in module_debug_info.
6. During model deserialization, we read all the debug_pkl records in
the archive and create a map of <tag, SourceRange>
7. This map can be used to find source code information.
During mobile runtime:
1. We read all the debug_pkl records and create <tag=debug_handle,
SourceRange> map.
1.1 This map, MobileDebugInfo, is a member of mobile Module.
2. Interpreter catches appropriate exceptions and sets the thread local
debug handle and rethrows the exception.
3. In Function's run method we catch exception and query current debug
handle where the exception happened.
4. Query MobileDebugInfo with debug handle to retrieve source range and
augment error with source range info.
This information is still incomplete as it does not contain entire
callstack.
In the following diffs we will serialize InlinedCallStack directly.
Note that compilation is gated by SYMBOLICATE_MOBILE_DEBUG_HANDLE macro,
so that mobile builds can avoid building MobileDebugInfo, source range
and source range pickler/unpickler. Later we will add a path where, if
building without debug support, the stack trace will contain only debug
handles. They can be symbolicated later.
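The tagging scheme described above can be sketched in Python as follows. This is a simplified model under stated assumptions, not the actual serializer: `SourceRange` is reduced to a hypothetical named tuple, and `assign_tags`, `invert_tags` and `augment_error` are illustrative names.

```python
from collections import namedtuple

# Hypothetical stand-in for the real SourceRange type.
SourceRange = namedtuple("SourceRange", ["filename", "start", "end"])


def assign_tags(source_ranges):
    """Serialization side: give each distinct source range a unique tag."""
    tags = {}
    for source_range in source_ranges:
        if source_range not in tags:
            tags[source_range] = len(tags)
    return tags


def invert_tags(tags):
    """Deserialization side: rebuild the <tag, SourceRange> map."""
    return {tag: source_range for source_range, tag in tags.items()}


def augment_error(message, debug_handle, debug_info):
    """Attach source range info to an error, if the handle is known."""
    source_range = debug_info.get(debug_handle)
    if source_range is None:
        # Without debug info only the raw handle can be reported.
        return f"{message} (debug handle {debug_handle})"
    return (
        f"{message} at {source_range.filename}"
        f"[{source_range.start}:{source_range.end}]"
    )
```

The fallback branch loosely corresponds to the case mentioned above where only debug handles are available and symbolication happens later.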
Test Plan:
Ported a bunch of source range tests from test_jit.py. Added one more test
in test_lite_interpreter.py
Imported from OSS
Reviewed By: raziel
Differential Revision: D27174722
fbshipit-source-id: a7b7c6088ce16dec37e823c7fefa4f0b61047e12
Summary:
This adds some more compiler warnings ignores for everything that happens on a standard CPU build (CUDA builds still have a bunch of warnings so we can't turn on `-Werror` everywhere yet).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56630
Pulled By: driazati
Reviewed By: malfet
Differential Revision: D28005063
fbshipit-source-id: 541ed415eb0470ddf7e08c22c5eb6da9db26e9a0
Summary:
Many model pipelines/workflows don't use MAGMA even though it is included in the build by default. Leaving MAGMA kernels out of the build can save 60+MB of GPU memory when loading `libtorch_cuda.so` (tested on V100, current upstream master).
A current sharp corner of this flag is that toggling it when rebuilding requires `torch/include/THC/THCGeneral.h` to be *manually* deleted by the user, as even running `make clean` or `setup.py` with `--cmake` does not properly regenerate it with the appropriate substitution for `#cmakedefine USE_MAGMA`. Is there a way to force the regeneration of the header during a rebuild?
CC malfet ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55994
Reviewed By: mruberry
Differential Revision: D27766287
Pulled By: malfet
fbshipit-source-id: 93deca57befa0febb9c5b7875ecf0015c547d421
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55835
Now that https://github.com/pytorch/pytorch/pull/55238 has been landed for a
week with no complaints, it seems safe to say FEATURE_TORCH_MOBILE is
always true and we can do some cleanup.
Test Plan: Imported from OSS
Reviewed By: ezyang, walterddr
Differential Revision: D27721284
Pulled By: ailzhang
fbshipit-source-id: 4896bc5f736373d0922cfbe8eed0d16df62f0fa1
Summary:
`ONNX_NAMESPACE` is empty by default if `USE_SYSTEM_ONNX ON`, while it should be equal to `onnx`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54973
Reviewed By: glaringlee
Differential Revision: D27466020
Pulled By: walterddr
fbshipit-source-id: 47cde3604acbda3f45bec5893036b39fd1eb58c9
Summary:
`CMAKE_SYSTEM_PROCESSOR` set to x86_64 (on Linux) or AMD64 (5ec224496b) (on Windows) indicates the build is running on the x86_64 architecture, while `CMAKE_SYSTEM_PROCESSOR` set to aarch64 or arm64 means we are running on an ARMv8+ architecture.
Delete `i[3-6]86` pattern as 32-bit builds are no longer supported
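As a rough illustration of the classification described above, the sketch below maps a `CMAKE_SYSTEM_PROCESSOR`-style string to an architecture family. The helper name `classify_processor` and the returned labels are hypothetical; the actual check is written in CMake, not Python.

```python
def classify_processor(name: str) -> str:
    """Map a CMAKE_SYSTEM_PROCESSOR-style string to an architecture family.

    Hypothetical helper for illustration only; the labels are assumptions.
    """
    normalized = name.strip().lower()
    if normalized in ("x86_64", "amd64"):
        return "x86_64"
    if normalized in ("aarch64", "arm64"):
        return "armv8"
    # 32-bit x86 names such as i386..i686 are intentionally not recognized.
    return "unsupported"


for candidate in ("x86_64", "AMD64", "aarch64", "arm64", "i686"):
    print(candidate, "->", classify_processor(candidate))
```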
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54637
Reviewed By: ezyang
Differential Revision: D27311897
Pulled By: malfet
fbshipit-source-id: 26989fc9b54a96d70c768ab03ca4528506ee7808
Summary:
https://ccache.dev/ is a compiler cache that speeds up subsequent builds. Auto-detecting ccache ensures that it is used on systems where it is available, greatly improving build times for developers. There is no risk in enabling ccache in practice. Please refer to https://ccache.dev/ for a short summary / motivation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49389
Reviewed By: ejguan
Differential Revision: D27169957
Pulled By: malfet
fbshipit-source-id: 673b60bbceb0d323901c8a992a75792c6da9b805
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51419
## Summary
1. Add an option `BUILD_LITE_INTERPRETER` in `caffe2/CMakeLists.txt` and set `OFF` as default.
2. Update 'build_android.sh' with an argument to switch `BUILD_LITE_INTERPRETER`, 'OFF' as default.
3. Add a mini demo app `lite_interpreter_demo` linked with `libtorch` library, which can be used for quick test.
## Test Plan
Built the lite interpreter version of libtorch and tested it with the Image Segmentation demo app ([android version](https://github.com/pytorch/android-demo-app/tree/master/ImageSegmentation)/[ios version](https://github.com/pytorch/ios-demo-app/tree/master/ImageSegmentation))
### Android
1. **Prepare model**: Prepare the lite interpreter version of the model by running the script below to generate the scripted models `deeplabv3_scripted.pt` and `deeplabv3_scripted.ptl`
```
import torch
model = torch.hub.load('pytorch/vision:v0.7.0', 'deeplabv3_resnet50', pretrained=True)
model.eval()
scripted_module = torch.jit.script(model)
# Export full jit version model (not compatible with lite interpreter), leave it here for comparison
scripted_module.save("deeplabv3_scripted.pt")
# Export lite interpreter version model (compatible with lite interpreter)
scripted_module._save_for_lite_interpreter("deeplabv3_scripted.ptl")
```
2. **Build libtorch lite for android**: Build libtorch for android for all 4 android abis (armeabi-v7a, arm64-v8a, x86, x86_64) with `BUILD_LITE_INTERPRETER=1 ./scripts/build_pytorch_android.sh`. This PR is tested on a Pixel 4 emulator with x86, so use the cmd `BUILD_LITE_INTERPRETER=1 ./scripts/build_pytorch_android.sh x86` to specify the abi and save build time. After the build finishes, it will show the library path:
```
...
BUILD SUCCESSFUL in 55s
134 actionable tasks: 22 executed, 112 up-to-date
+ find /Users/chenlai/pytorch/android -type f -name '*aar'
+ xargs ls -lah
-rw-r--r-- 1 chenlai staff 13M Feb 11 11:48 /Users/chenlai/pytorch/android/pytorch_android/build/outputs/aar/pytorch_android-release.aar
-rw-r--r-- 1 chenlai staff 36K Feb 9 16:45 /Users/chenlai/pytorch/android/pytorch_android_torchvision/build/outputs/aar/pytorch_android_torchvision-release.aar
```
3. **Use the PyTorch Android libraries built from source in the ImageSegmentation app**: Create a folder `libs` in the app; the path from the repository root will be `ImageSegmentation/app/libs`. Copy `pytorch_android-release` to the path `ImageSegmentation/app/libs/pytorch_android-release.aar`. Copy `pytorch_android_torchvision` (downloaded from [here](https://oss.sonatype.org/#nexus-search;quick~torchvision_android)) to the path `ImageSegmentation/app/libs/pytorch_android_torchvision.aar`. Update the `dependencies` part of `ImageSegmentation/app/build.gradle` to
```
dependencies {
implementation 'androidx.appcompat:appcompat:1.2.0'
implementation 'androidx.constraintlayout:constraintlayout:2.0.2'
testImplementation 'junit:junit:4.12'
androidTestImplementation 'androidx.test.ext:junit:1.1.2'
androidTestImplementation 'androidx.test.espresso:espresso-core:3.3.0'
implementation(name:'pytorch_android-release', ext:'aar')
implementation(name:'pytorch_android_torchvision', ext:'aar')
implementation 'com.android.support:appcompat-v7:28.0.0'
implementation 'com.facebook.fbjni:fbjni-java-only:0.0.3'
}
```
Update `allprojects` part in `ImageSegmentation/build.gradle` to
```
allprojects {
repositories {
google()
jcenter()
flatDir {
dirs 'libs'
}
}
}
```
4. **Update model loader api**: Update `ImageSegmentation/app/src/main/java/org/pytorch/imagesegmentation/MainActivity.java` by
4.1 Add new import: `import org.pytorch.LiteModuleLoader;`
4.2 Replace the way to load pytorch lite model
```
// mModule = Module.load(MainActivity.assetFilePath(getApplicationContext(), "deeplabv3_scripted.pt"));
mModule = LiteModuleLoader.load(MainActivity.assetFilePath(getApplicationContext(), "deeplabv3_scripted.ptl"));
```
5. **Test app**: Build and run the ImageSegmentation app in Android Studio.

### iOS
1. **Prepare model**: Same as Android.
2. **Build libtorch lite for ios** `BUILD_PYTORCH_MOBILE=1 IOS_PLATFORM=SIMULATOR BUILD_LITE_INTERPRETER=1 ./scripts/build_ios.sh`
3. **Remove Cocoapods from the project**: run `pod deintegrate`
4. **Link ImageSegmentation demo app with the custom built library**:
Open your project in XCode, go to your project Target’s **Build Phases - Link Binaries With Libraries**, click the **+** sign and add all the library files located in `build_ios/install/lib`. Navigate to the project **Build Settings**, set the value **Header Search Paths** to `build_ios/install/include` and **Library Search Paths** to `build_ios/install/lib`.
In the build settings, search for **other linker flags**. Add a custom linker flag below
```
-all_load
```
Finally, disable bitcode for your target by selecting the Build Settings, searching for Enable Bitcode, and set the value to No.
5. **Update library and api**
5.1 Update `TorchModule.mm`
To use the custom built libraries in the project, replace `#import <LibTorch/LibTorch.h>` (in `TorchModule.mm`), which is needed when using LibTorch via Cocoapods, with the code below:
```
//#import <LibTorch/LibTorch.h>
#include "ATen/ATen.h"
#include "caffe2/core/timer.h"
#include "caffe2/utils/string_utils.h"
#include "torch/csrc/autograd/grad_mode.h"
#include "torch/script.h"
#include <torch/csrc/jit/mobile/function.h>
#include <torch/csrc/jit/mobile/import.h>
#include <torch/csrc/jit/mobile/interpreter.h>
#include <torch/csrc/jit/mobile/module.h>
#include <torch/csrc/jit/mobile/observer.h>
```
5.2 Update `ViewController.swift`
```
// if let filePath = Bundle.main.path(forResource:
// "deeplabv3_scripted", ofType: "pt"),
// let module = TorchModule(fileAtPath: filePath) {
// return module
// } else {
// fatalError("Can't find the model file!")
// }
if let filePath = Bundle.main.path(forResource:
"deeplabv3_scripted", ofType: "ptl"),
let module = TorchModule(fileAtPath: filePath) {
return module
} else {
fatalError("Can't find the model file!")
}
```
### Unit test
Add `test/cpp/lite_interpreter`, with one unit test `test_cores.cpp` and a light model `sequence.ptl` to test `_load_for_mobile()`, `bc.find_method()` and `bc.forward()` functions.
### Size:
**With the change:**
Android:
x86: `pytorch_android-release.aar` (**13.8 MB**)
IOS:
`pytorch/build_ios/install/lib` (lib: **66 MB**):
```
(base) chenlai@chenlai-mp lib % ls -lh
total 135016
-rw-r--r-- 1 chenlai staff 3.3M Feb 15 20:45 libXNNPACK.a
-rw-r--r-- 1 chenlai staff 965K Feb 15 20:45 libc10.a
-rw-r--r-- 1 chenlai staff 4.6K Feb 15 20:45 libclog.a
-rw-r--r-- 1 chenlai staff 42K Feb 15 20:45 libcpuinfo.a
-rw-r--r-- 1 chenlai staff 39K Feb 15 20:45 libcpuinfo_internals.a
-rw-r--r-- 1 chenlai staff 1.5M Feb 15 20:45 libeigen_blas.a
-rw-r--r-- 1 chenlai staff 148K Feb 15 20:45 libfmt.a
-rw-r--r-- 1 chenlai staff 44K Feb 15 20:45 libpthreadpool.a
-rw-r--r-- 1 chenlai staff 166K Feb 15 20:45 libpytorch_qnnpack.a
-rw-r--r-- 1 chenlai staff 384B Feb 15 21:19 libtorch.a
-rw-r--r-- 1 chenlai staff **60M** Feb 15 20:47 libtorch_cpu.a
```
`pytorch/build_ios/install`:
```
(base) chenlai@chenlai-mp install % du -sh *
14M include
66M lib
2.8M share
```
**Master (baseline):**
Android:
x86: `pytorch_android-release.aar` (**16.2 MB**)
IOS:
`pytorch/build_ios/install/lib` (lib: **84 MB**):
```
(base) chenlai@chenlai-mp lib % ls -lh
total 172032
-rw-r--r-- 1 chenlai staff 3.3M Feb 17 22:18 libXNNPACK.a
-rw-r--r-- 1 chenlai staff 969K Feb 17 22:18 libc10.a
-rw-r--r-- 1 chenlai staff 4.6K Feb 17 22:18 libclog.a
-rw-r--r-- 1 chenlai staff 42K Feb 17 22:18 libcpuinfo.a
-rw-r--r-- 1 chenlai staff 1.5M Feb 17 22:18 libeigen_blas.a
-rw-r--r-- 1 chenlai staff 44K Feb 17 22:18 libpthreadpool.a
-rw-r--r-- 1 chenlai staff 166K Feb 17 22:18 libpytorch_qnnpack.a
-rw-r--r-- 1 chenlai staff 384B Feb 17 22:19 libtorch.a
-rw-r--r-- 1 chenlai staff 78M Feb 17 22:19 libtorch_cpu.a
```
`pytorch/build_ios/install`:
```
(base) chenlai@chenlai-mp install % du -sh *
14M include
84M lib
2.8M share
```
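To put the two sets of numbers side by side, here is a small sketch that computes the relative size reduction from the figures reported above. The helper name `percent_reduction` is made up for this example; the sizes are copied from the listings.

```python
def percent_reduction(baseline: float, changed: float) -> float:
    """Return the size reduction relative to the baseline, in percent."""
    return (baseline - changed) / baseline * 100.0


# Sizes in MB, copied from the measurements above.
android_aar = percent_reduction(16.2, 13.8)
ios_lib = percent_reduction(84.0, 66.0)

print(f"Android x86 aar: {android_aar:.1f}% smaller")
print(f"iOS lib folder:  {ios_lib:.1f}% smaller")
```

With these inputs the Android x86 aar comes out about 14.8% smaller and the iOS `lib` folder about 21.4% smaller.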
Test Plan: Imported from OSS
Reviewed By: iseeyuan
Differential Revision: D26518778
Pulled By: cccclai
fbshipit-source-id: 4503ffa1f150ecc309ed39fb0549e8bd046a3f9c
Summary:
- Allows the build process to build with MLC enabled if the subrepo folder mlc is in the path and we can link against ML Compute on macOS Big Sur
- To build with MLC enabled you will need to clone the mlc repo inside the pytorch repository.
- We need both this change and https://github.com/pytorch/pytorch/pull/50634 on pytorch/pytorch to enable the `mlc` device.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51326
Reviewed By: glaringlee
Differential Revision: D26533138
Pulled By: malfet
fbshipit-source-id: 0baa06b4eb2d62dbfc0f6fc922096cb0db1cc7d1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51957
This is a simplified version of #51554.
Compared to #51554, this version only supports statically dispatching to
a specific backend. The benefit is that it skips the dispatch key
computation logic and thus has less framework overhead. The downside is that
if input tensors do not match the specified backend it will throw an error
instead of falling back to regular dispatch.
Sample code:
```
Tensor empty(IntArrayRef size, TensorOptions options, c10::optional<MemoryFormat> memory_format) {
return at::cpu::empty(size, options, memory_format);
}
// aten::conj(Tensor(a) self) -> Tensor(a)
Tensor conj(const Tensor & self) {
return at::math::conj(self);
}
// aten::conj.out(Tensor self, *, Tensor(a!) out) -> Tensor(a!)
Tensor & conj_out(Tensor & out, const Tensor & self) {
return at::cpu::conj_out(out, self);
}
// aten::conj.out(Tensor self, *, Tensor(a!) out) -> Tensor(a!)
Tensor & conj_outf(const Tensor & self, Tensor & out) {
return at::cpu::conj_out(out, self);
}
// aten::_conj(Tensor self) -> Tensor
Tensor _conj(const Tensor & self) {
return at::defaultbackend::_conj(self);
}
```
For ops without the specific backend dispatch, it will throw an error:
```
// aten::_use_cudnn_ctc_loss(Tensor log_probs, Tensor targets, int[] input_lengths, int[] target_lengths, int blank) -> bool
bool _use_cudnn_ctc_loss(const Tensor & log_probs, const Tensor & targets, IntArrayRef input_lengths, IntArrayRef target_lengths, int64_t blank) {
TORCH_CHECK(false, "Static dispatch does not support _use_cudnn_ctc_loss for CPU.");
}
```
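The trade-off described above can be mimicked with a toy model in Python. This is only a sketch of the idea: none of the names below (`KERNELS`, `dynamic_dispatch`, `make_static`) are PyTorch APIs, and the "kernels" are placeholder lambdas.

```python
# Toy kernel table keyed by (op, backend); purely illustrative.
KERNELS = {
    ("empty", "cpu"): lambda size: [0.0] * size,
    ("empty", "cuda"): lambda size: [0.0] * size,
}


def dynamic_dispatch(op, input_backend, *args):
    """Pick the kernel at call time based on the backend of the inputs."""
    return KERNELS[(op, input_backend)](*args)


def make_static(op, backend):
    """Bind an op to one backend up front, skipping the per-call lookup."""
    kernel = KERNELS.get((op, backend))

    def call(input_backend, *args):
        if kernel is None:
            raise RuntimeError(f"Static dispatch does not support {op} for {backend}.")
        if input_backend != backend:
            # No fallback to dynamic dispatch: a mismatch is an error.
            raise RuntimeError(f"expected {backend} inputs, got {input_backend}")
        return kernel(*args)

    return call


cpu_empty = make_static("empty", "cpu")
print(cpu_empty("cpu", 3))
```

Calling `cpu_empty("cuda", 3)` raises instead of rerouting to the CUDA kernel, which is the behavior the downside above refers to.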
Differential Revision: D26337857
Test Plan: Imported from OSS
Reviewed By: bhosmer
Pulled By: ljk53
fbshipit-source-id: a8e95799115c349de3c09f04a26b01d21a679364
Summary:
Currently, adding a cross compile build is failing on CI due to a cmake builtin compiler check that does not pass due to cross compiling the host protoc library.
Setting the CMAKE_TRY_COMPILE_TARGET_TYPE flag should fix it. (Based on this [SOF answer](https://stackoverflow.com/questions/53633705/cmake-the-c-compiler-is-not-able-to-compile-a-simple-test-program).)
To test that this works, please run: `CMAKE_OSX_ARCHITECTURES=arm64 USE_MKLDNN=OFF USE_NNPACK=OFF USE_QNNPACK=OFF USE_PYTORCH_QNNPACK=OFF BUILD_TEST=OFF python setup.py install` from a Mac x86_64 machine with Xcode12.3 (anything with MacOS 11 SDK).
Then, you can check that things were compiled for arm by running `lipo -info <file>` for any file in the `build/lib` directory.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50922
Reviewed By: malfet
Differential Revision: D26355054
Pulled By: janeyx99
fbshipit-source-id: 919f3f9bd95d7c7bba6ab3a95428d3ca309f8ead
Summary:
Because of the size of our `libtorch_cuda.so`, linking with other hefty binaries presents a problem where 32bit relocation markers are too small and end up overflowing. This PR attempts to break up `torch_cuda` into `torch_cuda_cu` and `torch_cuda_cpp`.
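For context on the overflow, assuming the relocations in question are signed 32-bit fields: such a field can only encode displacements within roughly ±2 GiB. The sketch below checks whether a displacement fits. `fits_in_rel32` is a hypothetical helper, and the example offsets are made up for illustration rather than taken from an actual link.

```python
INT32_MIN = -(2 ** 31)
INT32_MAX = 2 ** 31 - 1
GIB = 2 ** 30


def fits_in_rel32(displacement: int) -> bool:
    """Return True if a signed displacement is encodable in 32 bits."""
    return INT32_MIN <= displacement <= INT32_MAX


# Illustrative offsets only (assumed values, not measured ones).
print(fits_in_rel32(1 * GIB))  # within the 32-bit signed range
print(fits_in_rel32(3 * GIB))  # too far for a 32-bit signed field
```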
`torch_cuda_cu`: all the files previously in `Caffe2_GPU_SRCS` that are
* pure `.cu` files in `aten`match
* all the BLAS files
* all the THC files, except for THCAllocator.cpp, THCCachingHostAllocator.cpp and THCGeneral.cpp
* all files in `detail`
* LegacyDefinitions.cpp and LegacyTHFunctionsCUDA.cpp
* Register*CUDA.cpp
* CUDAHooks.cpp
* CUDASolver.cpp
* TensorShapeCUDA.cpp
`torch_cuda_cpp`: all other files in `Caffe2_GPU_SRCS`
Accordingly, TORCH_CUDA_API and TORCH_CUDA_BUILD_MAIN_LIB usages are getting split as well to TORCH_CUDA_CU_API and TORCH_CUDA_CPP_API.
To test this locally, you can run `export BUILD_SPLIT_CUDA=ON && python setup.py develop`. In your `build/lib` folder, you should find binaries for both `torch_cuda_cpp` and `torch_cuda_cu`. To see that the SPLIT_CUDA option was toggled, you can grep the Summary of running cmake and make sure `Split CUDA` is ON.
This build option is tested on CI for CUDA 11.1 builds (linux for now, but windows soon).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49050
Reviewed By: walterddr
Differential Revision: D26114310
Pulled By: janeyx99
fbshipit-source-id: 0180f2519abb5a9cdde16a6fb7dd3171cff687a6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51124
Original commit changeset: 1c7133627da2
Test Plan: Test locally with interpreter_test and on CI
Reviewed By: suo
Differential Revision: D26077905
fbshipit-source-id: fae83bf9822d79e9a9b5641bc5191a7f3fdea78d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50458
libinterpreter.so contains a frozen python distribution including
torch-python bindings.
Freezing refers to serializing bytecode of python standard library modules as
well as the torch python library and embedding them in the library code. This
library can then be dlopened multiple times in one process context, each
interpreter having its own python state and GIL. In addition, each python
environment is sealed off from the filesystem and can only import the frozen
modules included in the distribution.
This change relies on newly added frozenpython, a cpython 3.8.6 fork built for this purpose. Frozenpython provides libpython3.8-frozen.a which
contains frozen bytecode and object code for the python standard library.
Building on top of frozen python, the frozen torch-python bindings are added in
this diff, providing each embedded interpreter with a copy of the torch
bindings. Each interpreter is intended to share one instance of libtorch and
the underlying tensor libraries.
Known issues
- Autograd is not expected to work with the embedded interpreter currently, as it manages
its own python interactions and needs to coordinate with the duplicated python
states in each of the interpreters.
- Distributed and cuda stuff is disabled in libinterpreter.so build, needs to be revisited
- __file__ is not supported in the context of embedded python since there are no
files for the underlying library modules.
using __file__
- __version__ is not properly supported in the embedded torch-python, just a
workaround for now
Test Plan: tested locally and on CI with cmake and buck builds running torch::deploy interpreter_test
Reviewed By: ailzhang
Differential Revision: D25850783
fbshipit-source-id: a4656377caff25b73913daae7ae2f88bcab8fd88
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50288
torch::deploy will bundle the objects contained in libtorch-python together with frozenpython into a shared library. Therefore, the libtorch-python objs can't bring with them a dependency on system python.
Buck TARGETS are added throughout the caffe2 tree to make available objects or headers that will be needed by torch::deploy but would have brought unsuitable dependencies if accessed using existing targets.
CMakeLists are modified to separate a torch-python-objs object library which lets torch::deploy compile these objs with the same compile flags as libtorch_python used, but without some of the link-time dependencies such as python.
CudaIPCTypes is moved from libtorch_python to libtorch_cuda because it is really not a python binding, and it statically registers a cuda_ipc_callback which would be duplicated if included in each copy of torch::deploy.
Test Plan: no new functionality, just ensure existing tests continue to pass
Reviewed By: malfet
Differential Revision: D25850785
fbshipit-source-id: b0b81c050cbee04e9de96888f8a09d29238a9db8
Summary:
draft enable fast_nvcc.
* cleaned up some non-standard usages
* added fall-back to wrap_nvcc
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49773
Test Plan:
Configuration to enable fast nvcc:
- install and enable `ccache` but delete `.ccache/` folder before each build.
- `TORCH_CUDA_ARCH_LIST=6.0;6.1;6.2;7.0;7.5`
- Toggling `USE_FAST_NVCC=ON/OFF` cmake config and run `cmake --build` to verify the build time.
Initial statistic for a full compilation:
* `cmake --build . -- -j $(nproc)`:
- fast NVCC
```
real 48m55.706s
user 1559m14.218s
sys 318m41.138s
```
- normal NVCC:
```
real 43m38.723s
user 1470m28.131s
sys 90m46.879s
```
* `cmake --build . -- -j $(nproc/4)`:
- fast NVCC:
```
real 53m44.173s
user 1130m18.323s
sys 71m32.385s
```
- normal NVCC:
```
real 81m53.768s
user 858m45.402s
sys 61m15.539s
```
* Conclusion: fast NVCC doesn't provide much gain when the build is set to use full CPU utilization; in fact it is **even worse** because of thread switching.
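To compare the wall-clock numbers above more directly, the following sketch parses the `real` lines into seconds and prints the fast/normal ratio for each parallelism setting. The helper `parse_real` is hypothetical and only handles the `NNmSS.SSSs` format shown in these measurements.

```python
import re


def parse_real(line: str) -> float:
    """Convert a `time` line such as 'real 48m55.706s' to seconds."""
    match = re.fullmatch(r"real\s+(\d+)m([\d.]+)s", line.strip())
    if match is None:
        raise ValueError(f"unrecognized time line: {line!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# Wall-clock lines copied from the measurements above.
runs = {
    "-j nproc": {"fast": "real 48m55.706s", "normal": "real 43m38.723s"},
    "-j nproc/4": {"fast": "real 53m44.173s", "normal": "real 81m53.768s"},
}

ratios = {}
for jobs, lines in runs.items():
    ratios[jobs] = parse_real(lines["fast"]) / parse_real(lines["normal"])
    print(f"{jobs}: fast/normal wall-clock ratio = {ratios[jobs]:.2f}")
```

A ratio above 1 means fast NVCC took longer than normal NVCC. With these inputs the ratio is above 1 at full parallelism and below 1 at a quarter of the cores.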
Initial statistics for a partial recompile (editing `.cu` files):
* `cmake --build . -- -j $(nproc)`
- fast NVCC:
```
[2021-01-13 18:10:24] [ 86%] Building NVCC (Device) object caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/torch_cuda_generated_BinaryMiscOpsKernels.cu.o
[2021-01-13 18:11:08] [ 86%] Linking CXX shared library ../lib/libtorch_cuda.so
```
- normal NVCC:
```
[2021-01-13 17:35:40] [ 86%] Building NVCC (Device) object caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/torch_cuda_generated_BinaryMiscOpsKernels.cu.o
[2021-01-13 17:38:08] [ 86%] Linking CXX shared library ../lib/libtorch_cuda.so
```
* Conclusion: Effective compilation time for a single `.cu` file modification is reduced from 2min30sec to only 40sec when compiling for multiple architectures. This shows a **4X** speedup using fast NVCC, approaching the theoretical limit of 5X when compiling 5 gencode architectures at the same time.
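As a cross-check on the figures above, the sketch below derives the elapsed time from the log timestamps, taking the gap between the "Building NVCC" line and the "Linking" line as an approximation of the compile time for the edited file. That interpretation is an assumption, and `elapsed_seconds` is a helper invented for this example.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def elapsed_seconds(start: str, end: str) -> float:
    """Seconds between two timestamps in the log format shown above."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


# Timestamps copied from the two log excerpts above.
fast = elapsed_seconds("2021-01-13 18:10:24", "2021-01-13 18:11:08")
normal = elapsed_seconds("2021-01-13 17:35:40", "2021-01-13 17:38:08")

print(f"fast NVCC:   {fast:.0f}s")
print(f"normal NVCC: {normal:.0f}s")
print(f"speedup:     {normal / fast:.2f}x")
```

From the timestamps alone this gives 44s versus 148s, a speedup of roughly 3.4x, which is in the same ballpark as the rounded figures quoted in the conclusion.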
Follow up PRs:
- should have a better fallback mechanism that detects whether a build is supported by fast_nvcc, instead of dry-running and falling back on failure.
- performance measurement instrumentation to measure what's the total compile time vs the parallel tasks critical path time.
- figure out why `-j $(nproc)` gives significant sys overhead (`sys 318m41.138s` vs `sys 90m46.879s`) over normal nvcc, guess this is context switching, but not exactly sure
Reviewed By: malfet
Differential Revision: D25692758
Pulled By: walterddr
fbshipit-source-id: c244d07b9b71f146e972b6b3682ca792b38c4457
Summary:
Since version 1.6, oneDNN has provided limited support for AArch64 builds.
This minor change is to detect an AArch64 CPU and permit the use of
`USE_MKLDNN` in that case.
Build flags for oneDNN are also modified accordingly.
Note: oneDNN on AArch64, by default, will use oneDNN's reference C++ kernels.
These are not optimised for AArch64, but oneDNN v1.7 onwards provides support
for a limited set of primitives based on the Arm Compute Library (ACL).
See: https://github.com/oneapi-src/oneDNN/pull/795
and: https://github.com/oneapi-src/oneDNN/pull/820
for more details. Support for ACL-based oneDNN primitives in PyTorch
will require some further modification.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50400
Reviewed By: izdeby
Differential Revision: D25886589
Pulled By: malfet
fbshipit-source-id: 2c81277a28ad4528c2d2211381e7c6692d952bc1
Summary:
This PR is a step towards enabling cross compilation from x86_64 to arm64.
The following has been added:
1. When cross compilation is detected, compile a local universal fatfile to use as protoc.
2. For the simple compile check in MiscCheck.cmake, make sure to compile the small snippet as a universal binary in order to run the check.
**Test plan:**
Kick off a minimal build on a mac intel machine with the macOS 11 SDK with this command:
```
CMAKE_OSX_ARCHITECTURES=arm64 USE_MKLDNN=OFF USE_QNNPACK=OFF USE_PYTORCH_QNNPACK=OFF BUILD_TEST=OFF USE_NNPACK=OFF python setup.py install
```
(If you run the above command before this change, or without macOS 11 SDK set up, it will fail.)
Then check the platform of the built binaries using this command:
```
lipo -info build/lib/libfmt.a
```
Output:
- Before this PR, running a regular build via `python setup.py install` (instead of using the flags listed above):
```
Non-fat file: build/lib/libfmt.a is architecture: x86_64
```
- Using this PR:
```
Non-fat file: build/lib/libfmt.a is architecture: arm64
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50243
Reviewed By: malfet
Differential Revision: D25849955
Pulled By: janeyx99
fbshipit-source-id: e9853709a7279916f66aa4c4e054dfecced3adb1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49201
This unblocks kineto profiler for 1.8 release.
This PR supersedes https://github.com/pytorch/pytorch/pull/48391
Note: this will somewhat increase the size of linux server binaries, bc
we add libkineto.a and libcupti_static.a:
-rw-r--r-- 1 jenkins jenkins 1107502 Dec 10 21:16 build/lib/libkineto.a
-rw-r--r-- 1 root root 13699658 Nov 13 2019 /usr/local/cuda/lib64/libcupti_static.a
Test Plan:
CI
https://github.com/pytorch/pytorch/pull/48391
Imported from OSS
Reviewed By: ngimel
Differential Revision: D25480770
fbshipit-source-id: 037cd774f5547d9918d6055ef5cc952a54e48e4c
Summary:
Use the `UL` suffix, which is supported by all C99-compatible compilers, instead of `__AARCH64_UINT64_C`, which is a gcc-specific extension.
Before the change, this check would have failed with a bug-free clang compiler with the following errors:
```
$ clang has_vst1.c
has_vst1.c:5:41: warning: implicit declaration of function '__AARCH64_UINT64_C' is invalid in C99 [-Wimplicit-function-declaration]
v.val[0] = vcombine_f32 (vcreate_f32 (__AARCH64_UINT64_C (0)), vcreate_f32 (__AARCH64_UINT64_C (0)));
^
has_vst1.c:5:79: warning: implicit declaration of function '__AARCH64_UINT64_C' is invalid in C99 [-Wimplicit-function-declaration]
v.val[0] = vcombine_f32 (vcreate_f32 (__AARCH64_UINT64_C (0)), vcreate_f32 (__AARCH64_UINT64_C (0)));
^
has_vst1.c:6:41: warning: implicit declaration of function '__AARCH64_UINT64_C' is invalid in C99 [-Wimplicit-function-declaration]
v.val[1] = vcombine_f32 (vcreate_f32 (__AARCH64_UINT64_C (0)), vcreate_f32 (__AARCH64_UINT64_C (0)));
^
has_vst1.c:6:79: warning: implicit declaration of function '__AARCH64_UINT64_C' is invalid in C99 [-Wimplicit-function-declaration]
v.val[1] = vcombine_f32 (vcreate_f32 (__AARCH64_UINT64_C (0)), vcreate_f32 (__AARCH64_UINT64_C (0)));
^
4 warnings generated.
/tmp/has_vst1-b1e162.o: In function `main':
has_vst1.c:(.text+0x30): undefined reference to `__AARCH64_UINT64_C'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49182
Reviewed By: walterddr
Differential Revision: D25471994
Pulled By: malfet
fbshipit-source-id: 0129a6f7aabc46aa117ef719d3a211449cb410f1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46356
Adding the flag `-Werror=cast-function-type` to ensure we don't allow
any invalid casts (ex: PyCFunction casts).
For more details see: https://github.com/pytorch/pytorch/issues/45419
ghstack-source-id: 114632980
Test Plan: waitforbuildbot
Reviewed By: albanD
Differential Revision: D24319759
fbshipit-source-id: 26ce4650c220e8e9dd3550245f214c7e6c21a5dc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46383
The old `USE_METAL` is actually being used by Caffe2. Here we introduce a new macro to enable metal in pytorch.
ghstack-source-id: 114499392
Test Plan:
- Circle CI
- The Person Segmentation model works
Reviewed By: linbinyu
Differential Revision: D24322018
fbshipit-source-id: 4e5548afba426b49f314366d89b18ba0c7e745ca
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46116
Ideally I would just use one of the existing preprocessor flags such as `FBCODE_CAFFE2`, but this implies a whole bunch of other things elsewhere, so it is not really a solution for ovrsource.
Test Plan: CI green, we are able to disable it internally with `-DNVALGRIND`
Reviewed By: malfet
Differential Revision: D24227360
fbshipit-source-id: 24a3b393cf46d6a16acca0a9ec52610d4bb8704f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46112
### Summary
This PR adds the support of running torchscript models on iOS GPU via Metal (Inference only). The feature is currently in prototype state, API changes are expected. The tutorial and the documents will be added once it goes to beta.
allow-large-files
- Users API
```
auto module = torch::jit::load(model);
module.eval();
at::Tensor input = at::ones({1,3,224,224}, at::ScalarType::Float).metal();
auto output = module.forward({input}).toTensor().cpu();
```
- Supported Models
- Person Segmentation v106 (FB Internal)
- Mobilenetv2
- Supported Operators
- aten::conv2d
- aten::addmm
- aten::add.Tensor
- aten::sub.Tensor
- aten::mul.Tensor
- aten::relu
- aten::hardtanh
- aten::hardtanh_
- aten::sigmoid
- aten::max_pool2d
- aten::adaptive_avg_pool2d
- aten::reshape
- aten::t
- aten::view
- aten::log_softmax.int
- aten::upsample_nearest2d.vec
- Supported Devices
- Apple A9 and above
- iOS 10.2 and above
- CMake scripts
- `IOS_ARCH=arm64 ./scripts/build_ios.sh -DUSE_METAL=ON`
### Test Plan
- Circle CI
ghstack-source-id: 114155638
Test Plan:
1. Sandcastle CI
2. Circle CI
Reviewed By: dreiss
Differential Revision: D23236555
fbshipit-source-id: 98ffc48b837e308bc678c37a9a5fd8ae72d11625
Summary:
CentOS 8 on AArch64 has vld1_* intrinsics but lacks vst1q_f32_x2 one.
This patch checks for it and handles it separately from the vld1_* ones.
Fixes https://github.com/pytorch/pytorch/issues/44198
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44199
Reviewed By: seemethere
Differential Revision: D23641273
Pulled By: malfet
fbshipit-source-id: c2053c8e0427705eaeeeb82ec030925bff22623a
Summary:
According to the [documentation](https://github.com/pytorch/pytorch/blob/master/tools/setup_helpers/cmake.py#L265), only options in `CMakeLists.txt` that start with `BUILD_` / `USE_` / `CMAKE_` can be imported from environment variables.
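The prefix rule can be pictured with a minimal sketch. This is not the code in `tools/setup_helpers/cmake.py`; the function `cmake_options_from_env` and the example variable names are made up purely to illustrate the filtering.

```python
ALLOWED_PREFIXES = ("BUILD_", "USE_", "CMAKE_")


def cmake_options_from_env(env: dict) -> dict:
    """Keep only variables whose names start with an allowed prefix."""
    return {key: value for key, value in env.items() if key.startswith(ALLOWED_PREFIXES)}


# Hypothetical environment; the names are chosen only for illustration.
example_env = {
    "USE_CUDA": "0",
    "BUILD_TEST": "OFF",
    "CODE_COVERAGE": "ON",
    "PATH": "/usr/bin",
}

forwarded = cmake_options_from_env(example_env)
print(forwarded)
```

In this toy model a variable named `CODE_COVERAGE` would be silently ignored, while the same option spelled with a `USE_` or `BUILD_` prefix would be picked up.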
---
This diff was originally intended to enable `c++` source coverage with `CircleCI` and `codecov.io`, but we will finish that in the future. You can find the related information in the diff history. The original procedure was:
Based on [this pull request](1bda5e480c), life becomes much easier for this time.
1. In `build.sh`
- Enable coverage build option for c++
- `apt-get install lcov`
2. In `test.sh`
- run `lcov`
3. In `pytorch-job-specs.yml`
- copy coverage.info to `test/` folder and upload it to codecov.io
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43999
Test Plan: Test on github
Reviewed By: malfet
Differential Revision: D23464656
Pulled By: scintiller
fbshipit-source-id: b2365691f04681d25ba5c00293fbcafe8e8e0745
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43564
Static dispatch was originally introduced for mobile selective build.
Since we have added selective build support for dynamic dispatch and
tested it in FB production for months, we can deprecate static dispatch
to reduce the complexity of the codebase.
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D23324452
Pulled By: ljk53
fbshipit-source-id: d2970257616a8c6337f90249076fca1ae93090c7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43154
Adds the build flag `BUILD_MOBILE_AUTOGRAD` which toggles whether autograd files should be included for a PyTorch mobile build (default off).
ghstack-source-id: 110369406
Test Plan: CI
Reviewed By: ljk53
Differential Revision: D23061913
fbshipit-source-id: bc3d6683ab17f158990d83e4fae0a011d5adeca1
Summary:
Fixes https://github.com/pytorch/pytorch/issues/39968
Tested with `TORCH_CUDA_ARCH_LIST='3.5 5.2 6.0 6.1 7.0 7.5 8.0+PTX'`: before this PR the build was failing, and with this PR it succeeds.
With `TORCH_CUDA_ARCH_LIST='7.0 7.5 8.0+PTX'`, `libtorch_cuda.so` with symbols changes from 2.9GB -> 2.2GB
cc: ptrblck mcarilli jjsjann123
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43074
Reviewed By: mrshenli
Differential Revision: D23176095
Pulled By: malfet
fbshipit-source-id: 7b3e6d049fc080e519f21e80df05ef68e7bea57e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42837
Originally we use
```
list(APPEND CMAKE_C_FLAGS -fprofile-instr-generate -fcoverage-mapping)
list(APPEND CMAKE_CXX_FLAGS -fprofile-instr-generate -fcoverage-mapping)
```
But when compiling the project on mac with coverage on, it fails with the error:
`clang: error: no input files
/bin/sh: -fprofile-instr-generate: command not found
/bin/sh: -fcoverage-mapping: command not found`
The reason is that `list(APPEND CMAKE_CXX_FLAGS ...)` joins elements with `;`. That is, if we do `list(APPEND foo a)` and then `list(APPEND foo b)`, `foo` becomes `a;b`. Since `CMAKE_CXX_FLAGS` is already defined earlier in `CMakeLists.txt`, appending introduces a `;` into the flags string, so we have to use `set(...)` here instead.
After changing it to
```
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -fprofile-instr-generate -fcoverage-mapping")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fprofile-instr-generate -fcoverage-mapping")
```
Tested successfully on a local mac machine.
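The difference between the two CMake forms can be mimicked in a few lines of Python. This is only a model of the string semantics: `list_append` and `set_with_space` are invented helpers, and `-O2` is an assumed stand-in for whatever flags were already present.

```python
def list_append(value: str, *items: str) -> str:
    """Mimic list(APPEND VAR ...): elements are joined with ';'."""
    parts = [value] if value else []
    return ";".join(parts + list(items))


def set_with_space(value: str, *items: str) -> str:
    """Mimic set(VAR "${VAR} a b"): plain concatenation with spaces."""
    parts = [value] if value else []
    return " ".join(parts + list(items))


base = "-O2"  # assumed pre-existing flags, for illustration only
flags = ("-fprofile-instr-generate", "-fcoverage-mapping")

print(list_append(base, *flags))
print(set_with_space(base, *flags))
```

The first form yields a `;`-separated string. When that ends up on a shell command line, each `;` ends the command, which matches the `command not found` errors quoted above.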
Test Plan: Test locally on mac machine
Reviewed By: malfet
Differential Revision: D23043057
fbshipit-source-id: ff6f4891b35b7f005861ee2f8e4c550c997fe961
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40179
- Pass no-psabi to shut up GCC about "The ABI for passing
parameters with 64-byte alignment has changed in GCC 4.6"
- Fix use of deprecated data() accessor (and minor optimization: hoist
accessor out of loop)
- Undeprecate NetDef.num_workers, no one is serious about fixing these
- Suppress warnings about deprecated pthreadpool types
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D22234138
Pulled By: ezyang
fbshipit-source-id: 6a1601b6d7551a7e6487a44ae65b19acdcb7b849
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39341
This PR introduces neon backend for vec256 class for float datatype.
For now only aarch64 is enabled due to a few issues with enabling it
on 32-bit ARM (aarch32).
Test Plan:
vec256_test
Imported from OSS
Differential Revision: D21822399
fbshipit-source-id: 3851c4336d93d1c359c85b38cf19904f82bc7b8d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40059
This benchmark is added specifically for mobile to see if the compiler is
autovectorizing, in which case the neon backend for vec256 would give no
advantage for the add op.
Test Plan:
CI
Imported from OSS
Differential Revision: D22055146
fbshipit-source-id: 43ba6c4ae57c6f05d84887c2750ce21ae1b0f0b5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41103
Add a CLANG_CODE_COVERAGE option to CMakeList. If the option is ON, add the compile flags needed for code coverage.
Test Plan:
Cloned the pytorch source code locally, applied these changes and built it with `CLANG_CODE_COVERAGE ON` and `BUILD_TESTS ON`. Ran a manual test and attached the code coverage report.
{F243609020}
Reviewed By: malfet
Differential Revision: D22422513
fbshipit-source-id: 27a31395c31b5b5f4b72523954722771d8f61080
Summary:
This re-applies D21232894 (b9d3869df3) and D22162524, plus updates jni_deps in a few places
to avoid breaking host JNI tests.
Test Plan: `buck test @//fbandroid/mode/server //fbandroid/instrumentation_tests/com/facebook/caffe2:host-test`
Reviewed By: xcheng16
Differential Revision: D22199952
fbshipit-source-id: df13eef39c01738637ae8cf7f581d6ccc88d37d5