Commit Graph

1141 Commits

Author SHA1 Message Date
Jeff Daily
64b543434d [ROCm] update cmake package DIR paths (#77087)
Fixes nightly libtorch builds.  As of ROCm 5.1.x, all *.cmake files are under /opt/rocm/lib/cmake/package instead of /opt/rocm/package/lib/cmake.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77087
Approved by: https://github.com/seemethere
2022-05-10 19:06:51 +00:00
sanchitintel
4ee29d6033 [Reland take-2] Add JIT graph fuser for oneDNN Graph API (v0.5)
Re-landing #68111/#74596

## Description
v0.5 PR of this [RFC](https://github.com/pytorch/pytorch/issues/49444).

On the basis of #50256, the below improvements are included:

 * The [v0.5 release branch](https://github.com/oneapi-src/oneDNN/releases/tag/graph-v0.5) of the oneDNN Graph API is used
 * The fuser now works with the profiling graph executor. We have inserted type check nodes to guard the profiled tensor properties.

 ### User API:
The optimization pass is disabled by default. Users could enable it by:

```
 torch.jit.enable_onednn_fusion(True)
```
`torch.jit.freeze` should be used after tracing (recommended) or scripting a model.
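
A minimal end-to-end inference sketch (the torchvision model and input shape are illustrative):

```
import torch
import torchvision

torch.jit.enable_onednn_fusion(True)  # the pass is disabled by default

model = torchvision.models.resnet50().eval()
example = torch.rand(1, 3, 224, 224)
with torch.no_grad():
    traced = torch.jit.trace(model, example)
    frozen = torch.jit.freeze(traced)  # freeze after tracing, as recommended
    # a couple of warm-up runs let the profiling executor record tensor
    # properties before fusion applies
    frozen(example)
    frozen(example)
```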

 ### Performance:
 [pytorch/benchmark](https://github.com/pytorch/benchmark) tool is used to compare the performance:

 * SkyLake 8180 (1 socket of 28 cores):
   ![image](https://user-images.githubusercontent.com/65992142/151162305-05e44425-a24e-4d5e-94e1-743b40b87a8c.png)
* SkyLake 8180 (single thread):
   ![image](https://user-images.githubusercontent.com/65992142/151162528-69f90b79-d08d-46b8-8775-d80a6ccbce8a.png)
   \* By mapping hardswish to oneDNN Graph, it’s 8% faster than PyTorch JIT (NNC + OFI)
   \*\* We expect performance gain after mapping transpose, contiguous & view to oneDNN graph ops

 ### Directory structure of the integration code
 Fuser-related code is placed under:

 ```
 torch/csrc/jit/codegen/onednn/
 ```

 Optimization pass registration is done in:

 ```
 torch/csrc/jit/passes/onednn_graph_fuser.h
 ```

 CMake for the integration code is in:

 ```
 caffe2/CMakeLists.txt
 cmake/public/mkldnn.cmake
 cmake/Modules/FindMKLDNN.cmake
 ```

 ## Limitations
 * In this PR, we only support the PyTorch-oneDNN Graph integration on the Linux platform. Support on Windows and macOS will be enabled as a next step.
 * We have only optimized the inference use case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76622
Approved by: https://github.com/eellison
2022-05-05 16:57:03 +00:00
Nikita Shulga
8473173c36 Remove breakpad dependency
This functionality does not seem to be used,
and there are some requests to update the dependency.

Add `third_party` to torch_cpu include directories if compiling with
Caffe2 support, as `caffe2/quantization/server/conv_dnnlowp_op.cc` depends on `third_party/fbgemm/src/RefImplementations.h`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75394
Approved by: https://github.com/janeyx99, https://github.com/seemethere
2022-05-03 20:21:55 +00:00
Shabab Ayub
b6bc5b325a [torch deploy] remove torch deploy being added to "torch libraries" (doesn't work)
Summary: This doesn't work because libtorch_deploy.so isn't a thing anymore.

Test Plan: tested in OSS, this does not break the build

Differential Revision: D35919131

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76363
Approved by: https://github.com/PaliC, https://github.com/osalpekar
2022-05-02 18:12:55 +00:00
PyTorch MergeBot
3dcd67a1b3 Revert "[Re-landing 68111] Add JIT graph fuser for oneDNN Graph API (Preview4.1)"
This reverts commit 8b11d81058.

Reverted https://github.com/pytorch/pytorch/pull/74596 on behalf of https://github.com/janeyx99
2022-04-29 15:40:17 +00:00
chunyuan
8b11d81058 [Re-landing 68111] Add JIT graph fuser for oneDNN Graph API (Preview4.1)
Re-landing https://github.com/pytorch/pytorch/pull/68111

## Description
Preview4 PR of this [RFC](https://github.com/pytorch/pytorch/issues/49444).

On the basis of https://github.com/pytorch/pytorch/pull/50256, the below improvements are included:

- The [preview4 release branch](https://github.com/oneapi-src/oneDNN/releases/tag/graph-v0.4.1) of the oneDNN Graph API is used
- The fuser now works with the profiling graph executor. We have inserted type check nodes to guard the profiled tensor properties.

### User API:
The optimization pass is disabled by default. Users could enable it by:
```
torch.jit.enable_onednn_fusion(True)
```

### Performance:
[pytorch/benchmark](https://github.com/pytorch/benchmark) tool is used to compare the performance:
- SkyLake 8180 (1 socket of 28 cores):

  ![image](https://user-images.githubusercontent.com/65992142/151162305-05e44425-a24e-4d5e-94e1-743b40b87a8c.png)

- SkyLake 8180 (single thread):

  ![image](https://user-images.githubusercontent.com/65992142/151162528-69f90b79-d08d-46b8-8775-d80a6ccbce8a.png)
 \* By mapping hardswish to oneDNN Graph, it’s 8% faster than PyTorch JIT (NNC + OFI)
  \** We expect performance gain after mapping transpose, contiguous & view to oneDNN graph ops

### Directory structure of the integration code
Fuser-related code is placed under:
```
torch/csrc/jit/codegen/onednn/
```

Optimization pass registration is done in:
```
torch/csrc/jit/passes/onednn_graph_fuser.h
```

CMake for the integration code is:
```
caffe2/CMakeLists.txt
```

## Limitations

- In this PR, we have only supported the optimization on the Linux platform. Support on Windows and macOS will be enabled as the next step.
- We have only optimized the inference use case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74596
Approved by: https://github.com/malfet
2022-04-29 01:01:33 +00:00
anjali411
b204ad863f Revert "Revert "Allow specifying tags for aten operators in native_functions.yaml""
This reverts commit ea44645c9a.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76456

Approved by: https://github.com/osalpekar
2022-04-28 02:04:57 +00:00
Sam Spilsbury
111b2bf9da [cmake] Use list(APPEND instead of string(APPEND for vulkan codegen args
This was causing the shaders to be incorrectly templated because
both the precision argument and the format argument were being treated
as a single argument by argparse and therefore pasted into shaders
incorrectly. In turn this meant that shaders couldn't be compiled
when the precision or format options were turned on.
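
For illustration, a minimal Python reproduction of the failure mode (the flag names here are hypothetical; the actual shader codegen script and its arguments differ):

```
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--precision", default="")
parser.add_argument("--format", default="")

# list(APPEND ...) yields separate argv entries, which parse correctly:
ok = parser.parse_args(["--precision", "highp", "--format", "rgba"])
print(ok.precision, ok.format)  # -> highp rgba

# string(APPEND ...) glues the options into one argv entry, so the whole
# remainder lands inside a single option value:
bad = parser.parse_args(["--precision", "highp --format rgba"])
print(bad.precision, bad.format)  # -> 'highp --format rgba' and ''
```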

Fixes #76195

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76196
Approved by: https://github.com/dagitses
2022-04-25 17:36:36 +00:00
jason_w
bc9bba9b43 delete ${GEN_VULKAN_FLAGS}
'--vulkan' no longer exists in 'tools/codegen/gen.py'; it was deleted in #46938.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76080
Approved by: https://github.com/albanD
2022-04-25 17:21:43 +00:00
Edward Yang
36420b5e8c Rename tools/codegen to torchgen (#76275)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76275

In preparation for addressing
https://github.com/pytorch/pytorch/issues/73212

Diff was generated with:

```
git mv tools/codegen torchgen
git grep -l 'tools.codegen' | xargs sed -i 's/tools.codegen/torchgen/g'
sed -i "s/\${TOOLS_PATH}\/codegen/\${TORCH_ROOT}\/torchgen/g" caffe2/CMakeLists.txt
```

and manual edits to:

* tools/test/test_gen_backend_stubs.py
* torchgen/build.bzl
* torchgen/gen_backend_stubs.py

aka this diff:

```
 diff --git a/tools/test/test_gen_backend_stubs.py b/tools/test/test_gen_backend_stubs.py
index 3dc26c6d2d..104054575e 100644
 --- a/tools/test/test_gen_backend_stubs.py
+++ b/tools/test/test_gen_backend_stubs.py
@@ -9,7 +9,7 @@ from torchgen.gen_backend_stubs import run
 from torchgen.gen import _GLOBAL_PARSE_NATIVE_YAML_CACHE  # noqa: F401

 path = os.path.dirname(os.path.realpath(__file__))
-gen_backend_stubs_path = os.path.join(path, '../torchgen/gen_backend_stubs.py')
+gen_backend_stubs_path = os.path.join(path, '../../torchgen/gen_backend_stubs.py')

 # gen_backend_stubs.py is an integration point that is called directly by external backends.
 # The tests here are to confirm that badly formed inputs result in reasonable error messages.
 diff --git a/torchgen/build.bzl b/torchgen/build.bzl
index ed04e35a43..d00078a3cf 100644
 --- a/torchgen/build.bzl
+++ b/torchgen/build.bzl
@@ -1,6 +1,6 @@
 def define_targets(rules):
     rules.py_library(
-        name = "codegen",
+        name = "torchgen",
         srcs = rules.glob(["**/*.py"]),
         deps = [
             rules.requirement("PyYAML"),
@@ -11,6 +11,6 @@ def define_targets(rules):

     rules.py_binary(
         name = "gen",
-        srcs = [":codegen"],
+        srcs = [":torchgen"],
         visibility = ["//visibility:public"],
     )
 diff --git a/torchgen/gen_backend_stubs.py b/torchgen/gen_backend_stubs.py
index c1a672a655..beee7a15e0 100644
 --- a/torchgen/gen_backend_stubs.py
+++ b/torchgen/gen_backend_stubs.py
@@ -474,7 +474,7 @@ def run(
 ) -> None:

     # Assumes that this file lives at PYTORCH_ROOT/torchgen/gen_backend_stubs.py
-    pytorch_root = pathlib.Path(__file__).parent.parent.parent.absolute()
+    pytorch_root = pathlib.Path(__file__).parent.parent.absolute()
     template_dir = os.path.join(pytorch_root, "aten/src/ATen/templates")

     def make_file_manager(install_dir: str) -> FileManager:
```

run_all_fbandroid_tests

Test Plan: sandcastle

Reviewed By: albanD, ngimel

Differential Revision: D35770317

fbshipit-source-id: 153ac4a7fef15b1e750812a90bfafdbc8f1ebcdf
(cherry picked from commit c6d485d1d4648fa1c8a4c14c5bf3d8e899b9b4dd)
2022-04-25 01:38:06 +00:00
Priya Ramani
be3ad8c637 [PyTorch][2/4] Support static dispatch with multiple backends (#75605)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75605

Use case: Milan models have multiple backends and need to use static dispatch to save on static initialization time and to hit native functions directly from the unboxed APIs.

This change passes in List[BackendIndex] and adds the ability to generate code for multiple static backends with one or zero kernels.
ghstack-source-id: 154525738

(Note: this ignores all push blocking failures!)

Test Plan:
Builds lite_predictor_flatbuffer with multiple backends

```
buck build --config pt.enable_lightweight_dispatch=1 --config pt.static_dispatch_backend=CPU,QuantizedCPU,CompositeExplicitAutograd //xplat/caffe2/fb/lite_predictor:lite_predictor_flatbuffer
```

Reviewed By: larryliu0820

Differential Revision: D35510644

fbshipit-source-id: f985718ad066f8578b006b4759c4a3bd6caac176
(cherry picked from commit a6999729c8cc26c54b8d5684f6585d6c50d8d913)
2022-04-22 18:35:48 +00:00
Peter Bell
653892e288 Kineto: Don't search for CUPTI in default paths
Should fix #75369

Searching the default system paths may point to different CUDA toolkit
versions, so we should restrict the search to only the paths passed explicitly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76188

Approved by: https://github.com/ezyang
2022-04-22 01:08:55 +00:00
Nikita Shulga
f6c275f55d Remove -Wno-unused-variable from utils.cmake (take 2) (#75538)
Summary:
A [comment](https://github.com/pytorch/pytorch/pull/62445/files#r680132022) claims it was added for consistency with the top-level CMakeLists.txt, but `-Wno-unused-variable` is not mentioned there.

Fix violations in 50+ files that were added in the interim, either by removing unused variables or by decorating the code with `C10_UNUSED` when a local variable is likely used to extend an object's lifetime until the end of the block.

The suppression caused a preventable revert in https://github.com/pytorch/pytorch/pull/72633#issuecomment-1092300787

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75538

Reviewed By: anjali411

Differential Revision: D35747333

Pulled By: malfet

fbshipit-source-id: 3fc5828e44a4c05ba0e89e92613e6ebbdb260626
(cherry picked from commit c179fba21cfa2a0093fad50ccad5a22dd7cff52c)
2022-04-20 17:41:59 +00:00
PyTorch MergeBot
5c56b2286b Revert "Remove -Wno-unused-variable from utils.cmake"
This reverts commit 018cbe1f5c.

Reverted https://github.com/pytorch/pytorch/pull/75538 on behalf of https://github.com/seemethere
2022-04-19 17:19:09 +00:00
Nikita Shulga
018cbe1f5c Remove -Wno-unused-variable from utils.cmake
A [comment](https://github.com/pytorch/pytorch/pull/62445/files#r680132022) claims it was added for consistency with the top-level CMakeLists.txt, but `-Wno-unused-variable` is not mentioned there.

Fix violations in 50+ files that were added in the interim, either by removing unused variables or by decorating the code with `C10_UNUSED` when a local variable is likely used to extend an object's lifetime until the end of the block.

The suppression caused a preventable revert in https://github.com/pytorch/pytorch/pull/72633#issuecomment-1092300787

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75538
Approved by: https://github.com/cpuhrsch
2022-04-19 15:26:55 +00:00
PyTorch MergeBot
d79d9fa283 Revert "Remove breakpad dependency"
This reverts commit 9aa3c7fd83.

Reverted https://github.com/pytorch/pytorch/pull/75394 on behalf of https://github.com/malfet
2022-04-17 17:58:51 +00:00
Nikita Shulga
9aa3c7fd83 Remove breakpad dependency
This functionality does not seem to be used
and there are some requests to update dependency

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75394
Approved by: https://github.com/janeyx99, https://github.com/seemethere
2022-04-17 17:43:45 +00:00
Min Si
42b4d0e934 [caffe2] remove unnecessary RCCL dependency
Summary:
RCCL is required by two components in hipified PyTorch: (1) gloo and (2) the hipified ProcessGroupNCCL.

- For (1) the RCCL dependency is managed in `./third_party/gloo/cmake/Dependencies.cmake` and can be enabled/disabled via `USE_RCCL`.
- For (2) the RCCL dependency is managed via `./cmake/Dependencies.cmake` and can be on/off via `USE_NCCL`.

The additional dependency removed in this commit forced hipified PyTorch to load librccl.so even when USE_RCCL=OFF and USE_NCCL=OFF were set, i.e., when using torch_ucc/ucc for the AMD GPU memory type. This caused conflicts when we use a non-system-default librccl.so (i.e., not in ROCM_PATH) for torch_ucc/ucc.

This commit removes the unnecessary RCCL dependency. This will ensure a cleaner way to use torch_ucc with a user-specified RCCL library.

Test Plan:
## Verify OSS pytorch on an AMD GPU machine (MI100)
```
ROCM_PATH=/opt/rocm-4.5.2
git clone https://github.com/pytorch/pytorch.git
cd pytorch
python3 tools/amd_build/build_amd.py
USE_NCCL=0 USE_RCCL=0 USE_KINETO=0 with-proxy python3 setup.py develop
USE_NCCL=0 USE_RCCL=0 USE_KINETO=0 with-proxy python3 setup.py install
```
log for develop: P492778257
log for install: P492778277

## Verify OSS pytorch + TorchUCC on an AMD GPU machine (MI100)
```
export RCCL_INSTALL_DIR=/opt/rccl-rocm-rel-4.4
git clone https://github.com/facebookresearch/torch_ucc.git
cd torch_ucc
UCX_HOME=$RCCL_INSTALL_DIR UCC_HOME=$RCCL_INSTALL_DIR WITH_CUDA=$ROCM_PATH python setup.py

# run param comm
export HSA_ENABLE_SDMA=0
export LD_LIBRARY_PATH=$RCCL_INSTALL_DIR

cd test
git clone https://github.com/facebookresearch/param
cd ..
/bin/bash ./test/start_test.sh ./test/param/train/comms/pt/comms.py --backend ucc --device cuda --b 4 --e 4M --c 1 --collective all_reduce
```
- log for param comm: P493033836
- Verified librccl.so in `/opt/rccl-rocm-rel-4.4` is used via checking version string in log. "[localbuild]" is added in RCCL source.
```
RCCL version 2.9.9+hip4.4 [localbuild]
```

Differential Revision: D35476911

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75547
Approved by: https://github.com/malfet, https://github.com/jeffdaily
2022-04-12 17:45:08 +00:00
Nikita Shulga
80ea6955af Add cuda-11.3+clang9 build workflow (take 2)
To be able to detect unused captures in GPU code lambdas (as gcc does not support this diagnostic)

Remove unused opts lambda capture in `ProcessGroupMPI.cpp` and `Distributions.cu`

Fix sign-compare in the nvfuser benchmark and ignore signed-unsigned comparisons in nvfuser tests
Fixes https://github.com/pytorch/pytorch/issues/75475 by aliasing CMAKE_CUDA_HOST_COMPILER to C_COMPILER when clang is used
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75293
Approved by: https://github.com/atalman, https://github.com/seemethere
2022-04-11 17:13:01 +00:00
PyTorch MergeBot
8fe43d76d5 Revert "Add cuda-11.3+clang9 build workflow"
This reverts commit 709fcc862e.

Reverted https://github.com/pytorch/pytorch/pull/75293 on behalf of https://github.com/janeyx99
2022-04-11 15:24:59 +00:00
Nikita Shulga
709fcc862e Add cuda-11.3+clang9 build workflow
To be able to detect unused captures in GPU code lambdas (as gcc does not support this diagnostic)

Remove unused opts lambda capture in `ProcessGroupMPI.cpp` and `Distributions.cu`

Fix sign-compare in the nvfuser benchmark and ignore signed-unsigned comparisons in nvfuser tests
Fixes https://github.com/pytorch/pytorch/issues/75475 by aliasing CMAKE_CUDA_HOST_COMPILER to C_COMPILER when clang is used
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75293
Approved by: https://github.com/atalman, https://github.com/seemethere
2022-04-11 14:10:57 +00:00
PyTorch MergeBot
ea44645c9a Revert "Allow specifying tags for aten operators in native_functions.yaml"
This reverts commit 1dab71ab25.

Reverted https://github.com/pytorch/pytorch/pull/72549 on behalf of https://github.com/malfet
2022-03-28 18:04:38 +00:00
anjali411
1dab71ab25 Allow specifying tags for aten operators in native_functions.yaml
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72549

Approved by: https://github.com/ezyang
2022-03-25 21:17:52 +00:00
Xiang Gao
3b29bd00eb Make ProcessGroupNCCL load torch_ucc.so when TORCH_UCC_LIBRARY_PATH is set (#69552)
Summary:
This is the very first step for the UCC-NCCL integration. This PR lets `ProcessGroupNCCL` load `torch_ucc.so` if the user specifies the environment variable `TORCH_UCC_LIBRARY_PATH`. If this environment variable is not specified by the user, then there will be no visible change.

In the future, we may want to make PyTorch smart enough to automatically detect the `torch_ucc.so` in the user's system, but before doing that, I believe we should first make sure that `ProcessGroupUCC` is very well tested.

Note that in this PR, `ProcessGroupNCCL` just loads the library but will not use it. I am trying to make PRs small, so the usage of `torch_ucc.so` will be submitted in later PRs.

This PR requires the change in https://github.com/facebookresearch/torch_ucc/pull/56; otherwise `torch_ucc.so` cannot be successfully loaded. But this PR can be landed separately without waiting for https://github.com/facebookresearch/torch_ucc/pull/56 because, in PyTorch's unit tests, UCC is never used or tested.
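
A minimal usage sketch (the library path is hypothetical; as noted above, in this PR the library is only loaded, not yet used):

```
import os
import torch.distributed as dist

# Point ProcessGroupNCCL at a built torch_ucc.so; with the variable unset,
# behavior is unchanged.
os.environ["TORCH_UCC_LIBRARY_PATH"] = "/path/to/torch_ucc.so"
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(backend="nccl", rank=0, world_size=1)
```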

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69552

Reviewed By: mruberry

Differential Revision: D34675212

Pulled By: jiayisuse

fbshipit-source-id: a3d1fb98340dbe3a931af555423863efd381f1ae
(cherry picked from commit 3778b6fabe70c26b5a65e6ddec641d2ef9113cd1)
2022-03-25 18:19:39 +00:00
Will Constable
3547f20872 Land remaining parts of Torchscript Lazy Tensor backend (#74111)
Summary:
Also enables the Bazel build to run lazy codegen. The Bazel (OSS) build feeds off the same file lists as cmake/buck (build_variables.bzl), so enabling it is easier than keeping it disabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/74111

Test Plan: Run CI and verify test_lazy_ops is running via OSS cmake builds

Reviewed By: bdhirsh

Differential Revision: D34772403

fbshipit-source-id: 8a63f58b9536e6ac1be530667932176ef2549496
(cherry picked from commit e807ffb1918853d10b924fdc24f85ee5b1a39021)
2022-03-22 23:14:03 +00:00
Michael Suo
e5bf87963d Revert D34584878: [pytorch][PR] Add JIT graph fuser for oneDNN Graph API (Preview4)
Test Plan: revert-hammer

Differential Revision:
D34584878 (7dd0823011)

Original commit changeset: ce817aa8cc90

Original Phabricator Diff: D34584878 (7dd0823011)

fbshipit-source-id: a941aaad34f8fe5f0c51f719f9f5c29b811c4d5b
(cherry picked from commit a43262ec7521b1665b02a64d3f279e72ee2344b9)
2022-03-21 23:07:14 +00:00
chunyuan
7dd0823011 Add JIT graph fuser for oneDNN Graph API (Preview4) (#68111)
Summary:
## Description
Preview4 PR of this [RFC](https://github.com/pytorch/pytorch/issues/49444).

On the basis of https://github.com/pytorch/pytorch/pull/50256, the below improvements are included:

- The [preview4 release branch](https://github.com/oneapi-src/oneDNN/releases/tag/graph-v0.4.1) of the oneDNN Graph API is used
- The fuser now works with the profiling graph executor. We have inserted type check nodes to guard the profiled tensor properties.

### User API:
The optimization pass is disabled by default. Users could enable it by:
```
torch.jit.enable_onednn_fusion(True)
```

### Performance:
[pytorch/benchmark](https://github.com/pytorch/benchmark) tool is used to compare the performance:
- SkyLake 8180 (1 socket of 28 cores):

  ![image](https://user-images.githubusercontent.com/65992142/151162305-05e44425-a24e-4d5e-94e1-743b40b87a8c.png)

- SkyLake 8180 (single thread):

  ![image](https://user-images.githubusercontent.com/65992142/151162528-69f90b79-d08d-46b8-8775-d80a6ccbce8a.png)
 \* By mapping hardswish to oneDNN Graph, it’s 8% faster than PyTorch JIT (NNC + OFI)
  \** We expect performance gain after mapping transpose, contiguous & view to oneDNN graph ops

### Directory structure of the integration code
Fuser-related code is placed under:
```
torch/csrc/jit/codegen/onednn/
```

Optimization pass registration is done in:
```
torch/csrc/jit/passes/onednn_graph_fuser.h
```

CMake for the integration code is:
```
caffe2/CMakeLists.txt
```

## Limitations

- In this PR, we have only supported the optimization on the Linux platform. Support on Windows and macOS will be enabled as the next step.
- We have only optimized the inference use case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68111

Reviewed By: eellison

Differential Revision: D34584878

Pulled By: malfet

fbshipit-source-id: ce817aa8cc9052ee9ed930c9cf66be83449e61a4
(cherry picked from commit cd17683aa7d9c0947df45a1ab53627feff795587)
2022-03-21 22:12:19 +00:00
Nikita Shulga
14dcb5a1a0 Fix asmjit compilation with clang-13
Suppresses the `deprecated-copy` and `unused-but-set-variable` warnings; otherwise compilation fails on the implicit default copy constructor:
```
/Users/malfet/git/pytorch/pytorch/third_party/fbgemm/third_party/asmjit/src/asmjit/core/../core/radefs_p.h:174:22: error: definition of implicit copy constructor for 'RARegCount' is deprecated because it has a user-declared copy assignment operator [-Werror,-Wdeprecated-copy]
  inline RARegCount& operator=(const RARegCount& other) noexcept = default;
```

Fixes https://github.com/pytorch/pytorch/issues/74352

Pull Request resolved: https://github.com/pytorch/pytorch/pull/74379
Approved by: https://github.com/seemethere, https://github.com/atalman
2022-03-17 17:09:07 +00:00
Edward Z. Yang
493bbdc4fe Use shared CUPTI by default
Per https://github.com/pytorch/pytorch/issues/57744, statically linked CUPTI
causes exception handling to break on certain compiler configurations, likely
because CUPTI comes with incompatible libstdc++ symbols.  Rather than pray that
something reasonable happens, use the safer configuration (dynamic linking) by
default and give a warning if the user inverts the setting.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/74009

Approved by: https://github.com/malfet
2022-03-16 21:04:12 +00:00
Ashwin Hari
7ed73b2803 CMake option for using static MKL libraries
Fixes #70587

Pull Request resolved: https://github.com/pytorch/pytorch/pull/73069
Approved by: https://github.com/malfet
2022-03-07 19:32:33 +00:00
Mengwei Liu
9ce9803abe [PyTorch] Add codegen unboxing ability (#69881)
Summary:
RFC: https://github.com/pytorch/rfcs/pull/40

This PR (re)introduces Python codegen for unboxing wrappers. Given an entry of `native_functions.yaml`, the codegen should be able to generate the corresponding C++ code to convert ivalues from the stack to their proper types. To trigger the codegen, run
```
tools/jit/gen_unboxing.py -d cg/torch/share/ATen
```

Merged changes on CI test. In https://github.com/pytorch/pytorch/issues/71782 I added an e2e test for static dispatch + codegen unboxing. The test exports a mobile model of MobileNetV2, then loads and runs it on a new binary for the lite interpreter: `test/mobile/custom_build/lite_predictor.cpp`.
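
For reference, exporting such a model from Python looks roughly like this (a sketch; the test's actual export code may differ):

```
import torch
import torchvision

model = torchvision.models.mobilenet_v2().eval()
scripted = torch.jit.script(model)
# emit the mobile (bytecode) format consumed by lite interpreter binaries
# such as test/mobile/custom_build/lite_predictor.cpp
scripted._save_for_lite_interpreter("mobilenetv2.ptl")
```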

## Lite predictor build specifics

1. Codegen: `gen.py` generates `RegisterCPU.cpp` and `RegisterSchema.cpp`. Now with this PR, once `static_dispatch` mode is enabled, `gen.py` will not generate `TORCH_LIBRARY` API calls in those cpp files, hence avoids interaction with the dispatcher. Once `USE_LIGHTWEIGHT_DISPATCH` is turned on, `cmake/Codegen.cmake` calls `gen_unboxing.py` which generates `UnboxingFunctions.h`, `UnboxingFunctions_[0-4].cpp` and `RegisterCodegenUnboxedKernels_[0-4].cpp`.
2. Build: `USE_LIGHTWEIGHT_DISPATCH` adds generated sources into `all_cpu_cpp` in `aten/src/ATen/CMakeLists.txt`. All other files remain unchanged. In reality, the `Operators_[0-4].cpp` files are not necessary, but we can rely on the linker to strip them out.

## Current CI job test coverage update

Created a new CI job `linux-xenial-py3-clang5-mobile-lightweight-dispatch-build` that enables the following build options:
* `USE_LIGHTWEIGHT_DISPATCH=1`
* `BUILD_LITE_INTERPRETER=1`
* `STATIC_DISPATCH_BACKEND=CPU`

This job triggers `test/mobile/lightweight_dispatch/build.sh` and builds `libtorch`. Then the script runs C++ tests written in `test_lightweight_dispatch.cpp` and `test_codegen_unboxing.cpp`. Recent commits added tests to cover as many C++ argument types as possible: in `build.sh` we installed the PyTorch Python API so that we can export test models in `tests_setup.py`. Then we run the C++ test binary to run these models on the lightweight-dispatch-enabled runtime.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69881

Reviewed By: iseeyuan

Differential Revision: D33692299

Pulled By: larryliu0820

fbshipit-source-id: 211e59f2364100703359b4a3d2ab48ca5155a023
(cherry picked from commit 58e1c9a25e3d1b5b656282cf3ac2f548d98d530b)
2022-03-01 23:28:13 +00:00
Andrey Talman
197764b35d Remove cuda 11.1 references (#73514)
Summary:
Fixes : https://github.com/pytorch/pytorch/issues/73377

We've migrated to CUDA 11.3 as the default toolkit in 1.9, so it's time to stop these builds (especially considering the forward-compatibility guarantee across CUDA 11.x drivers).

Hence we are removing CUDA 11.1 support. We should also clean up old CUDA-related code from our builder and pytorch repos, making the scripts a little cleaner.

We have code that references CUDA 9.2, 10.1, 11.0, 11.1, and 11.2, and none of these are currently in use.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/73514

Reviewed By: janeyx99

Differential Revision: D34551989

Pulled By: atalman

fbshipit-source-id: 9ceaaa9b25ad49689986f4b29a26d20370d9d011
(cherry picked from commit fe109c62daf429e9053c03f6e374568ba23cd041)
2022-03-01 16:37:37 +00:00
Shintaro Iwasaki
7dc2cfa249 [c10][rocm] fix __assert_fail() declaration mismatch error (#73040)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73040

This patch fixes a compilation error in PyTorch with ROCm when `NDEBUG` is passed.

## Problem

A forward declaration of `__host__ __device__ __assert_fail()` is used in `c10/macros/Macros.h` for HIP compilation when `NDEBUG` is set. However, HIP has `__device__ __assert_fail()` in `hip/amd_detail/amd_device_functions.h`, causing a function type error.

This issue does not appear in ROCm CI tests since it happens only when `NDEBUG` is passed.

## Solution

[EDIT] After the discussion on GitHub, we chose to entirely disable `CUDA_KERNEL_ASSERT()` for ROCm.

 ---

To solve this compilation error, this patch disables `CUDA_KERNEL_ASSERT()`, which uses `__assert_fail()` when
1. `c10/macros/Macros.h` is included for `*.hip` (precisely speaking, `__HIP__` or `__HIP_ARCH__` is defined), and
2. `NDEBUG` is passed.

Note that there's no impact on default compilation because, without a special compilation flag, those HIP files are compiled without `-DNDEBUG`. And that's why this issue has not been found.

### Justification
[1] We cannot declare one host-and-device function for two separate host and device functions.
```
__device__ int func() {return 0};
__host__ int func() {return 0};
// Compile error (hipcc)
// __device__ __host__ int func();
```
[2] Forward declaration of a correct `__device__` only `__assert_fail()` for `__HIP__` causes the following error:
```
pytorch/c10/util/TypeCast.h:135:7: error: reference to __device__ function '__assert_fail' in __host__ __device__ function
      ERROR_UNSUPPORTED_CAST
      ^
pytorch/c10/util/TypeCast.h:118:32: note: expanded from macro 'ERROR_UNSUPPORTED_CAST'
#define ERROR_UNSUPPORTED_CAST CUDA_KERNEL_ASSERT(false);
                               ^
pytorch/c10/macros/Macros.h:392:5: note: expanded from macro 'CUDA_KERNEL_ASSERT'
    __assert_fail(
```

[3] Maybe there's a way to properly define `__assert_fail()` for HIP + NDEBUG, but this might be too much. Please let me just disable it.

### Technical details

Error
```
pytorch/c10/macros/Macros.h:368:5: error: __host__ __device__ function '__assert_fail' cannot overload __device__ function '__assert_fail'
    __assert_fail(
    ^
/opt/rocm/hip/include/hip/amd_detail/amd_device_functions.h:1173:6: note: previous declaration is here
void __assert_fail(const char *assertion,
```

CUDA definition (9.x) of `__assert_fail()`
```
#elif defined(__GNUC__)
extern __host__ __device__ __cudart_builtin__ void __assert_fail(
  const char *, const char *, unsigned int, const char *)
  __THROW;
```

ROCm definition (the latest version)
```
// 2b59661f3e/include/hip/amd_detail/amd_device_functions.h (L1172-L1177)
extern "C" __device__ __attribute__((noinline)) __attribute__((weak))
void __assert_fail(const char *assertion,
                   const char *file,
                   unsigned int line,
                   const char *function);
```

Test Plan:
CI + reproducer
```
python3 tools/amd_build/build_amd.py
python3 setup.py develop --cmake-only
cmake -DHIP_HIPCC_FLAGS_RELEASE="-DNDEBUG" build
cmake --build build
```

Reviewed By: xw285cornell

Differential Revision: D34310555

fbshipit-source-id: 7542288912590533ced3f20afd2e704b6551991b
(cherry picked from commit 9e52196e36820abe36bf6427cabc7389d3ea6cb5)
2022-03-01 04:35:30 +00:00
Edward Yang
ce7910ba81 ufunc codegen (#65851)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65851

Design doc: https://docs.google.com/document/d/12rtlHnPUpaJ-I52Iob3L0WA3rKRr_OY7fXqeCvn2MVY/edit

First read the design doc to understand the user syntax. In this PR, we have converted add to use ufunc codegen; most of the cpp changes are deleting the preexisting implementations of add, and ufunc/add.h are the new implementations in the ufunc format.

The bulk of this PR is in the new codegen machinery. Here's the order to read the files:

* `tools/codegen/model.py`
  * Some self-explanatory utility classes: `ScalarType`, `DTYPE_CLASSES`
  * New classes for representing ufunc entries in `native_functions.yaml`: `UfuncKey` and `UfuncInnerLoop`, as well as parsing logic for these entries. UfuncKey has some unusual entries (e.g., CPUScalar) that don't show up in the documentation (more on these below).
  * A predicate `is_ufunc_dispatch_key` for testing which dispatch keys should get automatically generated when an operator opts into ufuncs (CPU and CUDA, for now!)
* `tools/codegen/api/types.py`
  * More self-explanatory utility stuff: ScalarTypeToCppMapping mapping ScalarType to CppTypes; Binding.rename for changing the name of a binding (used when we assign constructor variables to member variables inside CUDA functors)
  * New VectorizedCType, representing `at::vec::Vectorized<T>`. This is used inside vectorized CPU codegen.
  * New `scalar_t` and `opmath_t` BaseCppTypes, representing template parameters that we work with when doing codegen inside ufunc kernel loops (e.g., where you previously had Tensor, now you have `scalar_t`)
  * `StructuredImplSignature` represents a `TORCH_IMPL_FUNC` definition, and straightforwardly follows from preexisting `tools.codegen.api.structured`
* `tools/codegen/translate.py` - Yes, we use translate a LOT in this PR. I improved some of the documentation, the only substantive changes are adding two new conversions: given a `scalar_t` or a `const Scalar&`, make it convertible to an `opmath_t`
* `tools/codegen/api/ufunc.py`
  * OK, now we're at the meaty stuff. This file represents the calling conventions of three important concepts in ufunc codegen, which we'll describe shortly. All of these APIs are relatively simple, since there aren't any complicated types by the time you get to kernels.
  * stubs are the DispatchStub trampolines that CPU kernels use to get to their vectorized versions. They drop all Tensor arguments (as they are in TensorIterator) but otherwise match the structured calling convention
  * ufuncs are the inner loop template functions that you wrote in ufunc/add.h which do the actual computation in question. Here, all the Tensors and Scalars have been converted into the computation type (`opmath_t` in CUDA, `scalar_t` in CPU)
  * ufunctors are a CUDA-only concept representing functors that take some of their arguments on a host-side constructor, and the rest in the device-side apply. Once again, Tensors and Scalars are converted into the computation type, `opmath_t`, but for clarity all the functions take `scalar_t` as argument (as this is the type that is most salient at the call site). Because the constructor and apply are code generated separately, `ufunctor_arguments` returns a teeny struct `UfunctorBindings`
* `tools/codegen/dest/ufunc.py` - the workhorse. This gets its own section below.
* `tools/codegen/gen.py` - just calling out to the new dest.ufunc implementation to generate UfuncCPU_add.cpp, UFuncCPUKernel_add.cpp and UfuncCUDA_add.cu files per ufunc operator. Each of these files does what you expect (small file that registers kernel and calls stub; CPU implementation; CUDA implementation). There is a new file manager for UFuncCPUKernel files as these need to get replicated by cmake for vectorization. One little trick to avoid recompilation is we directly replicate code-generated forward declarations in these files, to reduce the number of headers we depend on (this is codegen, we're just doing the preprocessor's job!)
* I'll talk about build system adjustments below.

OK, let's talk about tools/codegen/dest/ufunc.py. This file can be roughly understood in two halves: one for CPU code generation, and the other for CUDA code generation.

**CPU codegen.** Here's roughly what we want to generate:

```
// in UfuncCPU_add.cpp
using add_fn = void (*)(TensorIteratorBase&, const at::Scalar&);
DECLARE_DISPATCH(add_fn, add_stub);
DEFINE_DISPATCH(add_stub);

TORCH_IMPL_FUNC(ufunc_add_CPU)
(const at::Tensor& self, const at::Tensor& other, const at::Scalar& alpha, const at::Tensor& out) {
  add_stub(device_type(), *this, alpha);
}

// in UfuncCPUKernel_add.cpp
void add_kernel(TensorIteratorBase& iter, const at::Scalar& alpha) {
  at::ScalarType st = iter.common_dtype();
  RECORD_KERNEL_FUNCTION_DTYPE("add_stub", st);
  switch (st) {
    AT_PRIVATE_CASE_TYPE("add_stub", at::ScalarType::Bool, bool, [&]() {
      auto _s_alpha = alpha.to<scalar_t>();
      cpu_kernel(iter, [=](scalar_t self, scalar_t other) {
        return ufunc::add(self, other, _s_alpha);
      });
    })

    AT_PRIVATE_CASE_TYPE(
        "add_stub", at::ScalarType::ComplexFloat, c10::complex<float>, [&]() {
          auto _s_alpha = alpha.to<scalar_t>();
          auto _v_alpha = at::vec::Vectorized<scalar_t>(_s_alpha);
          cpu_kernel_vec(
              iter,
              [=](scalar_t self, scalar_t other) {
                return ufunc::add(self, other, _s_alpha);
              },
              [=](at::vec::Vectorized<scalar_t> self,
                  at::vec::Vectorized<scalar_t> other) {
                return ufunc::add(self, other, _v_alpha);
              });
        })

   ...
```

The most interesting change in the generated code is that what previously was an `AT_DISPATCH` macro invocation is now an unrolled loop. This makes it easier to vary behavior per dtype (you can see in this example that the entries for bool and complex float differ) without having to add extra conditionals on top.

Otherwise, to generate this code, we have to hop through several successive API changes:

* In TORCH_IMPL_FUNC(ufunc_add_CPU), go from StructuredImplSignature to StubSignature (call the stub). This is normal argument massaging in the classic translate style.
* In add_kernel, go from StubSignature to UfuncSignature. This is nontrivial, because we must do various conversions outside of the inner kernel loop. These conversions are done by hand, setting up the context appropriately, and then the final ufunc call is done using translate. (BTW, I introduce a new convention here, call on a Signature, for code generating a C++ call, and I think we should try to use this convention elsewhere)

The other piece of nontrivial logic is the reindexing by dtype. This reindexing exists because the native_functions.yaml format is indexed by UfuncKey:

```
   Generic: add (AllAndComplex, BFloat16, Half)
   ScalarOnly: add (Bool)
```

but when we do code generation, we case on dtype first, and then we generate a `cpu_kernel` or `cpu_kernel_vec` call. We also don't care about CUDA code generation (which Generic hits). To do this, we lower these keys into two low-level keys, CPUScalar and CPUVector, which represent the CPU scalar and CPU vectorized ufuncs, respectively (Generic maps to CPUScalar and CPUVector, while ScalarOnly maps to CPUScalar only). Reindexing then gives us:

```
    AllAndComplex:
        CPUScalar: add
        CPUVector: add
    Bool:
        CPUScalar: add
    ...
```

which is a good format for code generation, but too wordy to force native_functions.yaml authors to write. Note that when reindexing, it is possible for there to be a conflicting definition for the same dtype; we just define a precedence order and have one override the other, so that it is easy to specialize on a particular dtype if necessary. Also note that because CPUScalar/CPUVector are part of UfuncKey, technically you can manually specify them in native_functions.yaml, although I don't expect this functionality to be used.
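
A hypothetical Python sketch of this lowering-plus-precedence idea (names and data structures are illustrative, not the actual torchgen code):

```
# Generic lowers to both CPU low-level keys; ScalarOnly only to the scalar one.
LOWERINGS = {
    "Generic": ("CPUScalar", "CPUVector"),
    "ScalarOnly": ("CPUScalar",),
}

def reindex(entries):
    # entries: list of (ufunc_key, ufunc_name, dtypes); later entries
    # overwrite earlier ones, giving the precedence order for conflicts
    table = {}  # dtype -> {low_level_key: ufunc_name}
    for key, name, dtypes in entries:
        for dtype in dtypes:
            for low in LOWERINGS[key]:
                table.setdefault(dtype, {})[low] = name
    return table

print(reindex([
    ("Generic", "add", ["Float", "ComplexFloat"]),
    ("ScalarOnly", "add", ["Bool"]),
]))
# {'Float': {'CPUScalar': 'add', 'CPUVector': 'add'}, ..., 'Bool': {'CPUScalar': 'add'}}
```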

**CUDA codegen.** CUDA code generation has many of the same ideas as CPU codegen, but it needs to know about functors, and stubs are handled slightly differently.  Here is what we want to generate:

```
template <typename scalar_t>
struct CUDAFunctorOnSelf_add {
  using opmath_t = at::opmath_type<scalar_t>;
  opmath_t other_;
  opmath_t alpha_;
  CUDAFunctorOnSelf_add(opmath_t other, opmath_t alpha)
      : other_(other), alpha_(alpha) {}
  __device__ scalar_t operator()(scalar_t self) {
    return ufunc::add(static_cast<opmath_t>(self), other_, alpha_);
  }
};

... two more functors ...

void add_kernel(TensorIteratorBase& iter, const at::Scalar & alpha) {
  TensorIteratorBase& iter = *this;
  at::ScalarType st = iter.common_dtype();
  RECORD_KERNEL_FUNCTION_DTYPE("ufunc_add_CUDA", st);
  switch (st) {
    AT_PRIVATE_CASE_TYPE("ufunc_add_CUDA", at::ScalarType::Bool, bool, [&]() {
      using opmath_t = at::opmath_type<scalar_t>;
      if (false) {
      } else if (iter.is_cpu_scalar(1)) {
        CUDAFunctorOnOther_add<scalar_t> ufunctor(
            iter.scalar_value<opmath_t>(1), (alpha).to<opmath_t>());
        iter.remove_operand(1);
        gpu_kernel(iter, ufunctor);
      } else if (iter.is_cpu_scalar(2)) {
        CUDAFunctorOnSelf_add<scalar_t> ufunctor(
            iter.scalar_value<opmath_t>(2), (alpha).to<opmath_t>());
        iter.remove_operand(2);
        gpu_kernel(iter, ufunctor);
      } else {
        gpu_kernel(iter, CUDAFunctor_add<scalar_t>((alpha).to<opmath_t>()));
      }
    })

   ...

REGISTER_DISPATCH(add_stub, &add_kernel);

TORCH_IMPL_FUNC(ufunc_add_CUDA)
(const at::Tensor& self,
 const at::Tensor& other,
 const at::Scalar& alpha,
 const at::Tensor& out) {
  add_kernel(*this, alpha);
}

```

The functor business is the bulk of the complexity. Like CPU, we decompose the CUDA implementation into three low-level keys: CUDAFunctor (normal; all CUDA kernels will have this) and CUDAFunctorOnOther/CUDAFunctorOnSelf (these are to support Tensor-Scalar specializations when the Scalar lives on CPU). Both Generic and ScalarOnly provide ufuncs for CUDAFunctor, but for us to also lift these into Tensor-Scalar specializations, the operator itself must be eligible for Tensor-Scalar specialization. At the moment, this is hardcoded to be all binary operators, but in the future we can use tags in native_functions.yaml to disambiguate (or perhaps expand codegen to handle n-ary operators).

The reindexing process not only reassociates ufuncs by dtype, but it also works out if Tensor-Scalar specializations are needed and codegens the ufunctors necessary for the level of specialization here (`compute_ufunc_cuda_functors`). Generating the actual kernel (`compute_ufunc_cuda_dtype_body`) just consists of, for each specialization, constructing the functor and then passing it off to `gpu_kernel`. Most of the hard work is in functor generation, where we take care to make sure `operator()` has the correct input and output types (which `gpu_kernel` uses to arrange for memory accesses to the actual CUDA tensor; if you get these types wrong, your kernel will still work, it will just run very slowly!)

There is one big subtlety with CUDA codegen: this won't work:

```
   Generic: add (AllAndComplex, BFloat16, Half)
   ScalarOnly: add_bool (Bool)
```

This is because, even though there are separate Generic/ScalarOnly entries, we only generate a single functor to cover ALL dtypes in this case, and the functor has the ufunc name hardcoded into it. You'll get an error if you try to do this; to fix it, just make sure the ufunc is named the same consistently throughout. In the code, you see this because after testing for the short circuit case (when a user provided the functor themselves), we squash all the generic entries together and assert their ufunc names are the same. Hypothetically, if we generated a separate functor per dtype, we could support differently named ufuncs but... why would you do that to yourself. (One piece of nastiness is that the native_functions.yaml syntax doesn't stop you from shooting yourself in the foot.)

A brief word about CUDA stubs: technically, they are not necessary, as there is no CPU/CPUKernel style split for CUDA kernels (so, if you look, structured impl actually calls add_kernel directly). However, there is some code that still makes use of CUDA stubs (in particular, I use the stub to conveniently reimplement sub in terms of add), so we still register it. This might be worth revisiting at a later point in time.

**Build system changes.** If you are at FB, you should review these changes in fbcode, as there are several changes in files that are not exported to ShipIt.

The build system changes in this patch are substantively complicated by the fact that I have to implement these changes five times:

* OSS cmake build
* OSS Bazel build
* FB fbcode Buck build
* FB xplat Buck build (selective build)
* FB ovrsource Buck build

Due to technical limitations in the xplat Buck build related to selective build, it is required that you list every ufunc header manually (this is done in tools/build_variables.bzl)

The OSS cmake changes are entirely in cmake/Codegen.cmake: there is a new set of files, cpu_vec_generated (corresponding to UfuncCPUKernel files), which is wired up in the same way as the other files. These files are different because they need to get compiled multiple times under different vectorization settings. I adjust the codegen, slightly refactoring the inner loop into its own function so I can use a different base path calculation depending on whether the file is traditional (in the native/cpu folder) or generated (new stuff from this diff).

The Bazel/Buck changes are organized around tools/build_variables.bzl, which contain the canonical list of ufunc headers (aten_ufunc_headers), and tools/ufunc_defs.bzl (added to ShipIt export list in D34465699) which defines a number of functions that compute the generated cpu, cpu kernel and cuda files based on the headers list. For convenience, these functions take a genpattern (a string with a {} for interpolation) which can be used to easily reformat the list of formats in target form, which is commonly needed in the build systems.

The split between build_variables.bzl and ufunc_defs.bzl is required because build_variables.bzl is executed by a conventional Python interpreter as part of the OSS cmake, but we require Skylark features to implement the functions in ufunc_defs.bzl (I did some quick Googling but didn't find a lightweight way to run the Skylark interpreter in open source.)

With these new file lists, the rest of the build changes are mostly inserting references to these files wherever necessary; in particular, cpu kernel files have to be worked into the multiple vectorization build flow (intern_build_aten_ops in OSS Bazel). Most of the subtlety relates to selective build. Selective build requires operator files to be copied per overall selective build; as dhruvbird explains to me, glob expansion happens during the action graph phase, but the selective build handling of TEMPLATE_SOURCE_LIST is referencing the target graph. In other words, we can't use a glob to generate deps for another rule, because we need to copy files from wherever (including generated files) to a staging folder so the rules can pick them up.

It can be somewhat confusing to understand which bzl files are associated with which build. Here are the relevant mappings for files I edited:

* Used by everyone - tools/build_tools.bzl, tools/ufunc_defs.bzl
* OSS Bazel - aten.bzl, BUILD.bazel
* FB fbcode Buck - TARGETS
* FB xplat Buck - BUCK, pt_defs.bzl, pt_template_srcs.bzl
* FB ovrsource Buck - ovrsource_defs.bzl, pt_defs.bzl

Note that pt_defs.bzl is used by both xplat and ovrsource. This leads to the "tiresome" handling for enabled backends, as selective build is CPU only, but ovrsource is CPU and CUDA.

BTW, while I was at it, I beefed up fb/build_arvr.sh to also do a CUDA ovrsource build, which was not triggered previously.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D31306586

Pulled By: ezyang

fbshipit-source-id: 210258ce83f578f79cf91b77bfaeac34945a00c6
(cherry picked from commit d65157b0b894b6701ee062f05a5f57790a06c91c)
2022-03-01 00:33:40 +00:00
Digant Desai
b2054d3025 Prepare for an update to the XNNPACK submodule (#72642)
Summary:
- Target Sha1: ae108ef49aa5623b896fc93d4298c49d1750d9ba
- Make USE_XNNPACK an option dependent on cmake minimum version 3.12
- Print USE_XNNPACK under the cmake options summary, and print the
  availability from collect_env.py
- Skip XNNPACK-based tests when XNNPACK is not available
    - Add a SkipIfNoXNNPACK wrapper to skip tests (see the sketch after this list)
- Update cmake version for xenial-py3.7-gcc5.4 image to 3.12.4
    - This is required for the backwards compatibility test.
      The PyTorch op schema is XNNPACK-dependent. See
      aten/src/ATen/native/xnnpack/RegisterOpContextClass.cpp for
      example. The nightly version is assumed to have USE_XNNPACK=ON,
      so with this change we ensure that the test build can also
      have XNNPACK.
- HACK: skipping test_xnnpack_integration tests on ROCM
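
A sketch of such a wrapper (the real helper lives in the PyTorch test utilities; `torch.backends.xnnpack.enabled` reports whether the build has XNNPACK):

```
import unittest
import torch

def skipIfNoXNNPACK(fn):
    # hypothetical stand-in for the wrapper added by this PR
    return unittest.skipUnless(torch.backends.xnnpack.enabled,
                               "test requires XNNPACK")(fn)

class TestXNNPACKOps(unittest.TestCase):
    @skipIfNoXNNPACK
    def test_linear_path(self):
        ...
```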

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72642

Reviewed By: kimishpatel

Differential Revision: D34456794

Pulled By: digantdesai

fbshipit-source-id: 85dbfe0211de7846d8a84321b14fdb061cd6c037
(cherry picked from commit 6cf48e7b64d6979962d701b5d493998262cc8bfa)
2022-02-25 00:39:15 +00:00
Edward Yang
2321f26fa3 Move vectorized CPU codegen to after ATen codegen (#72869)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72869

The ordering here doesn't really matter, but in a future patch
I will make a change where vectorized CPU codegen does have to
be here, and moving it ahead of time (with no code changes)
will make the later diff cleaner.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D34282229

Pulled By: ezyang

fbshipit-source-id: 3397cb0e062d63cc9853f6248f17c3558013798b
(cherry picked from commit 98c616024969f9df90c7fb09741ed9be7b7a20f1)
2022-02-23 20:33:19 +00:00
Andrey Talman
17b3ba148d Set BLAS_LIBRARIES to ${MKL_LIBRARIES} for MKL case (#72806)
This reverts a [suggestion](https://github.com/pytorch/pytorch/pull/49647#discussion_r677737470) proposed in https://github.com/pytorch/pytorch/pull/49647.

This is somehow sufficient to work around the symptoms of https://github.com/pytorch/pytorch/issues/72653.

I.e., before this change, `BLAS_LIBRARIES` was set to `caffe2::mkl`,
which is an interface library with its link property set as follows:
59dd84cab6/cmake/public/mkl.cmake (L10-L12)
2022-02-16 07:14:27 -08:00
Aaron Enye Shi
8a43aa9538 [Kineto][Bug Fix] Avoid picking up old CUPTI headers (#72761)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72761

By default, CUPTI_INCLUDE_DIR will pick up cupti.h from /usr/include, which is old (from 2017 on AWS) and missing many CUPTI headers. Use NO_DEFAULT_PATH to avoid that and instead search only the list of locations provided.

Test Plan:
Fixes the missing-headers error when building on AWS (avoids the old cupti.h from /usr/include); instead uses cupti.h from cuda/extras/CUPTI/include.
```
In file included from /scratch/aaronshi/pytorch/third_party/kineto/libkineto/src/CuptiRangeProfilerApi.cpp:13:0:
/scratch/aaronshi/pytorch/third_party/kineto/libkineto/src/CuptiRangeProfilerApi.h:12:10: fatal error: cupti_profiler_target.h: No such file or directory
 #include <cupti_profiler_target.h>
          ^~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
```
and
```
/scratch/aaronshi/pytorch/third_party/kineto/libkineto/src/CuptiRangeProfilerApi.cpp:7:10: fatal error: nvperf_host.h: No such file or directory
 #include <nvperf_host.h>
          ^~~~~~~~~~~~~~~
compilation terminated.
```

Reviewed By: briancoutinho

Differential Revision: D34191123

Pulled By: aaronenyeshi

fbshipit-source-id: d84f80308c9939ba8ed504e667847d136a261453
(cherry picked from commit 33368bd93b)
2022-02-15 22:43:03 +00:00
Douglas Lehr
eb4e6ca30c [ROCM] Add ROCM version api within cmake (#69481)
Summary:
In ROCm 5.0 and later, the version of the ROCm platform can be obtained via
an API call instead of reading from a flat file.

If the header file /opt/rocm/include/rocm_version.h exists,
LoadHIP.cmake compiles a source snippet referencing the API and prints out the
ROCm version.

If the file does not exist, LoadHIP.cmake will revert to the previous
approach of looking for the version-dev file.


cc jeffdaily sunway513 jithunnair-amd ROCmSupport KyleCZH

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69481

Reviewed By: seemethere, janeyx99

Differential Revision: D34153435

Pulled By: malfet

fbshipit-source-id: f8c0650d27666d2a3cf47d812807798c47210b37
(cherry picked from commit 6cbb4f7a0c)
2022-02-11 00:15:10 +00:00
Peter Bell
bc1fb7a618 CMake: Limit python include directories to only python libraries (#69085)
Summary:
`include_directories` is old-style CMake, which adds the include path to every file being compiled. This instead makes `python`, `numpy` and `pybind11` into targets that are linked only into `torch_python` and `caffe2_pybind_state`, so Python libraries can't be accidentally included elsewhere.

Resubmit of https://github.com/pytorch/pytorch/issues/65654, Closes https://github.com/pytorch/pytorch/issues/65828

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69085

Reviewed By: anjali411

Differential Revision: D33776456

Pulled By: malfet

fbshipit-source-id: 018b0f6cd5a4f8c9e36df961deff832bc4afd479
(cherry picked from commit 57063107d6)
2022-02-07 21:18:32 +00:00
Andrey Talman
1e7d20eaea Remove forcing CUDNN_STATIC when CAFFE2_STATIC_LINK_CUDA (#72290)
Summary:
Remove forcing CUDNN_STATIC when CAFFE2_STATIC_LINK_CUDA is set
Since we are transitioning to dynamic loading for multiple PyTorch dependencies, and cuDNN is the first step in this transition, we want to stop forcing cuDNN to link statically and instead load it dynamically.

Tested using following workflow:
https://github.com/pytorch/pytorch/actions/runs/1790666862

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72290

Reviewed By: albanD

Differential Revision: D34003793

Pulled By: atalman

fbshipit-source-id: 41bda7ac019a612ee53ceb18d1e372b1bb3cb68e
(cherry picked from commit 4a01940e68)
2022-02-04 14:35:53 +00:00
yanbing-j
4567d5ded4 Upgrade oneDNN to v2.5.2 (#71546)
Summary:
This PR upgrades oneDNN to v2.5.2 and includes build support for it.

v2.4 changes:
- Improved performance for future Intel Xeon Scalable processor (code name Sapphire Rapids). The functionality is disabled by default and should be enabled via CPU dispatcher control.
- Improved binary primitive performance for cases when one of the tensors is broadcasted.
- Improved performance of reduction primitive, reorder, shuffle primitives.
- Improved performance of depthwise convolution forward propagation for processors with Intel AVX-512 support
- Improved performance of forward inner product primitive for the shapes with minibatch equal to 1 for processors with Intel AVX-512 support
- Improved performance of int8 matmul and inner product primitives for processors with Intel AVX2 and Intel DL Boost support

v2.5 changes:
- Improved performance for future Intel Xeon Scalable processors (code name Sapphire Rapids). The functionality is now enabled by default and requires Linux kernel 5.16.
- Improved performance of matmul primitive for processors with Intel AVX-512 support.

v2.5.2 changes:
- Fixed performance regression in binary primitive with broadcast
- Fixed segmentation fault in depthwise convolution primitive for shapes with huge spatial size for processors with Intel AVX-512 support

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71546

Reviewed By: george-qi

Differential Revision: D33827108

Pulled By: VitalyFedyunin

fbshipit-source-id: 8f5a19b331c82af5b0783f081e061e1034a93952
(cherry picked from commit 9705212fe9)
2022-02-01 18:34:58 +00:00
Can Balioglu
1cc824ef59 Fix old GCC ABI check in CMake package config (#72081)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72081

This PR fixes the libstdc++ ABI check in the CMake package configuration file (i.e. `TorchConfig.cmake`). The `_GLIBCXX_USE_CXX11_ABI` flag is a property of `libstdc++`, not of the GNU Compiler Collection. In its current form, C++ libraries built with Clang on Linux fail, since the `torch` CMake target propagates `_GLIBCXX_USE_CXX11_ABI` only when used with gcc.
ghstack-source-id: 148056323

Test Plan: Built a dummy C++ library that depends on libtorch with both gcc and clang on Linux

Reviewed By: malfet

Differential Revision: D33899849

fbshipit-source-id: 3e933b2c7a17d1fba086caa8aaec831223760882
(cherry picked from commit 41d18c64c4)
2022-02-01 13:21:00 +00:00
Peter Bell
847dbb8684 CMake: Clean up unused definitions (#69216)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69216

This cleans up 4 pre-processor defines not used by any code:
- HAVE_GCC_GET_CPUID
- USE_GCC_GET_CPUID
- USE_AVX
- USE_AVX2

`cpuid` isn't used in PyTorch any more, we only use `cpuinfo`.
`USE_AVX*` is also not used; instead, `HAVE_*_CPU_DEFINITIONS` tells
you which `CPU_CAPABILITY` flags are being compiled.

There is also `fbgemm`'s code path adding `third_party` as an include
path, despite `fbgemm` having a dedicated include directory and a
CMake setup that properly includes it.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D33794424

Pulled By: malfet

fbshipit-source-id: 99d504af088818d4a26c2f6ce67ec0d59a5eb703
(cherry picked from commit 2e099d41f0)
2022-01-31 22:49:11 +00:00
Peter Bell
d693739248 CMake: Clean up unused definitions (#69216)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69216

Currently `torch_cpu` has command-line arguments relating to CUDA
libraries, e.g. `-DMAGMA_V2`. This happens because
`include_directories` and `add_definitions` indiscriminately change
the compile commands of all targets.

Instead, creating a proper magma target allows limiting the flags to
just `torch_cuda`.

Test Plan: Imported from OSS

Reviewed By: dagitses

Differential Revision: D33794174

Pulled By: malfet

fbshipit-source-id: 762eabf3b9576bef94e8caa3ed4764c0e2c72b08
(cherry picked from commit f7d127b654)
2022-01-31 22:49:11 +00:00
Peter Bell
5045c18bd1 Error if pocketfft is not found (#67909)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/67842

cc mruberry peterbell10

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67909

Reviewed By: albanD

Differential Revision: D33759534

Pulled By: malfet

fbshipit-source-id: 03548c95fe233b812b303ce9603c20ff9f626c39
(cherry picked from commit 214624e254)
2022-01-31 17:29:48 +00:00
Han Qi
1bc3571078 [pytorch][PR] Add ability for a mobile::Module to save as flatbuffer (#70201)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70201

Included functions:
save_mobile_module -> saves a mobile::Module to flatbuffer
load_mobile_module_from_file -> loads a flatbuffer into mobile::Module
parse_mobile_module -> parses from bytes or deserialized flatbuffer module object
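
A hedged usage sketch of the functions listed above (the header paths and exact signatures are assumptions based on this summary, not taken from the PR):

```cpp
// Header names below are assumptions; see the PR for the real locations.
#include <torch/csrc/jit/mobile/flatbuffer_loader.h>
#include <torch/csrc/jit/mobile/module.h>

// Round-trip a mobile::Module through the flatbuffer format.
void roundtrip(torch::jit::mobile::Module& m) {
  save_mobile_module(m, "model.ff");  // mobile::Module -> flatbuffer file
  torch::jit::mobile::Module reloaded =
      load_mobile_module_from_file("model.ff");  // flatbuffer file -> Module
}
```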

Compared to previous attempts, this diff only adds flatbuffer to the CMake target and leaves the fbcode/xplat ones unchanged.

Test Plan: unittest

Reviewed By: malfet, gmagogsfm

Differential Revision: D33239362

fbshipit-source-id: b9ca36b83d6af2d78cc50b9eb9e2a6fa7fce0763
2022-01-12 16:30:39 -08:00
Andrey Talman
6c4437118b Deprecating Python 3.6 (#70493)
Summary:
Deprecating python 3.6 from documentation and from cmake

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70493

Reviewed By: suo

Differential Revision: D33433118

Pulled By: atalman

fbshipit-source-id: c3adc7b75714efdb5b6acda5d4cddc068fb4a145
2022-01-05 11:46:32 -08:00
Michael Suo
1adb70c6f0 Revert D33409880: [pytorch][PR] Deprecating Python 3.6
Test Plan: revert-hammer

Differential Revision:
D33409880 (d95be99561)

Original commit changeset: 4f9123398960

Original Phabricator Diff: D33409880 (d95be99561)

fbshipit-source-id: 32dc1c3c07ef99a04fab7d0fb742cf4e6c4b718a
2022-01-04 16:37:09 -08:00
Andrey Talman
d95be99561 Deprecating Python 3.6 (#70493)
Summary:
Deprecating python 3.6 from documentation and from cmake

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70493

Reviewed By: malfet

Differential Revision: D33409880

Pulled By: atalman

fbshipit-source-id: 4f912339896096be95b344724a4d9ae88cdf1a8f
2022-01-04 14:41:27 -08:00
linuxone
f64906f470 ibm z14/15 SIMD support (#66407)
Summary:
https://github.com/pytorch/pytorch/issues/66406
Implemented z arch 14/15 vector SIMD additions.
So far, all types besides bfloat have their SIMD implementation.

It has 99% coverage and is currently passing the local tests.
The implementation is concise, and the main SIMD code is a single header file.
It mostly uses template metaprogramming, but a few macros remain, with the intention of not modifying PyTorch much.
Sleef supports z15.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66407

Reviewed By: mrshenli

Differential Revision: D33370163

Pulled By: malfet

fbshipit-source-id: 0e5a57f31b22a718cd2a9ac59753fb468cdda140
2022-01-04 09:40:18 -08:00
Peter Bell
c34aa715fa AT_MKL_SEQUENTIAL and build changes (#70259)
Summary:
Re-land of  https://github.com/pytorch/pytorch/pull/69419

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70259

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D33246757

Pulled By: ngimel

fbshipit-source-id: 738f8558d4cad6752be14108f9931ec3514f6682
2021-12-22 13:52:23 -08:00
Peter Bell
4829dcea09 Codegen: Generate seperate headers per operator (#68247)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68247

This splits `Functions.h`, `Operators.h`, `NativeFunctions.h` and
`NativeMetaFunctions.h` into seperate headers per operator base name.
With `at::sum` as an example, we can include:
```cpp
<ATen/ops/sum.h>         // Like Functions.h
<ATen/ops/sum_ops.h>     // Like Operators.h
<ATen/ops/sum_native.h>  // Like NativeFunctions.h
<ATen/ops/sum_meta.h>    // Like NativeMetaFunctions.h
```

The umbrella headers are still being generated, but all they do is
include from the `ATen/ops` folder.

Further, `TensorBody.h` now only includes the operators that have
method variants. Which means files that only include `Tensor.h` don't
need to be rebuilt when you modify function-only operators. Currently
there are about 680 operators that don't have method variants, so this
is potentially a significant win for incremental builds.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D32596272

Pulled By: albanD

fbshipit-source-id: 447671b2b6adc1364f66ed9717c896dae25fa272
2021-12-14 06:40:08 -08:00
Jithun Nair
8dfdc3df82 [ROCm] Refactor how to specify AMD gpu targets using PYTORCH_ROCM_ARCH (#61706)
Summary:
Remove all hardcoded AMD gfx targets

PyTorch build and Magma build will use rocm_agent_enumerator as
backup if PYTORCH_ROCM_ARCH env var is not defined

PyTorch extensions will use same gfx targets as the PyTorch build,
unless PYTORCH_ROCM_ARCH env var is defined

torch.cuda.get_arch_list() now works for ROCm builds

PyTorch CI dockers will continue to be built for gfx900 and gfx906 for now.

PYTORCH_ROCM_ARCH env var can be a space or semicolon separated list of gfx archs eg. "gfx900 gfx906" or "gfx900;gfx906"
cc jeffdaily sunway513 jithunnair-amd ROCmSupport KyleCZH

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61706

Reviewed By: seemethere

Differential Revision: D32735862

Pulled By: malfet

fbshipit-source-id: 3170e445e738e3ce373203e1e4ae99c84e645d7d
2021-12-13 15:41:40 -08:00
Yanan Cao
17f3179d60 Back out "[pytorch][PR] Add ability for a mobile::Module to save as flatbuffer" (#69796)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69796

(Note: this ignores all push blocking failures!)

Test Plan: External CI + Sandcastle

Reviewed By: zhxchen17

Differential Revision: D33032671

fbshipit-source-id: dbf6690e960e25d6a5f19043cbe792add2acd7ef
2021-12-10 21:29:53 -08:00
Nikita Shulga
e305e4d4d8 Suppress common warnings when building by clang (#69710)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69710

Namely no-range-loop-analysis (which detects when a loop variable cannot be a const reference).

Test Plan: Imported from OSS

Reviewed By: r-barnes

Differential Revision: D32997003

Pulled By: malfet

fbshipit-source-id: dba0e7875e5b667e2cc394c70dd75e2403265918
2021-12-10 16:45:38 -08:00
Han Qi
d3649309e6 [pytorch][PR] Add ability for a mobile::Module to save as flatbuffer (#69306)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69306

Included functions:

save_mobile_module -> saves a mobile::Module to flatbuffer
load_mobile_module_from_file -> loads a flatbuffer into mobile::Module
parse_mobile_module -> parses from bytes or deserialized flatbuffer
Module object

Test Plan: unittests

Reviewed By: gmagogsfm

Differential Revision: D32806835

fbshipit-source-id: 71913c6650e225634f878946bd16960d377a7f57
2021-12-09 14:53:31 -08:00
chunyuan
9ad05f2c3a Upgrade oneDNN to v2.3.3 and package oneDNN Graph API together (#63748)
Summary:
This PR upgrades oneDNN to [v2.3.3](https://github.com/oneapi-src/oneDNN/releases/tag/v2.3.3) and includes [Graph API preview release](https://github.com/oneapi-src/oneDNN/releases/tag/graph-v0.2) in one package.

- oneDNN will be located at `pytorch/third_party/ideep/mkl-dnn/third_party/oneDNN`
- The version of oneDNN will be [v2.3.3](https://github.com/oneapi-src/oneDNN/releases/tag/v2.3.3)
  The main changes on CPU:

  - v2.3
    - Extended primitive cache to improve primitive descriptor creation performance.
    - Improved primitive cache performance in multithreaded configurations.
    - Introduced initial optimizations for bfloat16 compute functionality for future Intel Xeon Scalable processor (code name Sapphire Rapids).
    - Improved performance of binary primitive and binary post-op for cases with broadcast and mixed source and destination formats.
    - Improved performance of reduction primitive
    - Improved performance of depthwise convolution primitive with NHWC activations for training cases
  - v2.3.1
    -  Improved int8 GEMM performance for processors with Intel AVX2 and Intel DL Boost support
    - Fixed integer overflow for inner product implementation on CPUs
    - Fixed out of bounds access in GEMM implementation for Intel SSE 4.1
  - v2.3.2
    - Fixed performance regression in fp32 inner product primitive for processors with Intel AVX512 support
  - v2.3.3
    - Reverted check for memory descriptor stride validity for unit dimensions
    - Fixed memory leak in CPU GEMM implementation

  More changes can be found in https://github.com/oneapi-src/oneDNN/releases.
- The Graph API provides a flexible API for aggressive fusion, and the preview2 supports fusion for FP32 inference. See the [Graph API release branch](https://github.com/oneapi-src/oneDNN/tree/dev-graph-preview2) and [spec](https://spec.oneapi.io/onednn-graph/latest/introduction.html) for more details. A separate PR will be submitted to integrate the oneDNN Graph API into the TorchScript graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63748

Reviewed By: albanD

Differential Revision: D32153889

Pulled By: malfet

fbshipit-source-id: 536071168ffe312d452f75d54f34c336ca3778c1
2021-12-09 13:42:40 -08:00
Sicheng Stephen Jia
bede33e3f5 [vulkan] Add image format qualifier to glsl files (#69330)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69330

 ---

## Context

Previously, our shader files did not declare any [image format qualifiers](https://www.khronos.org/opengl/wiki/Layout_Qualifier_(GLSL)#Image_formats) for image layouts. This causes the SPIR-V modules produced to declare the [StorageImageWriteWithoutFormat](https://www.khronos.org/registry/SPIR-V/specs/unified1/SPIRV.html#_a_id_capability_a_capability) capability, which requires `shaderStorageImageWriteWithoutFormat` to be enabled in [VkPhysicalDeviceFeatures](https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/VkPhysicalDeviceFeatures.html). `shaderStorageImageWriteWithoutFormat` is not available on some devices, causing errors to be reported by the Vulkan validation layer.

## Changes

Vulkan shaders now declare the image format explicitly so that the SPIR-V modules produced are compatible with devices that do not have `shaderStorageImageWriteWithoutFormat` enabled.

Test Plan: Imported from OSS

Reviewed By: beback4u

Differential Revision: D32840909

Pulled By: SS-JIA

fbshipit-source-id: 76e0a0da68b423ebc74ae7e839b9cfaf57d2cd39
2021-12-07 16:23:09 -08:00
Peter Bell
9a7732e852 CMake: Support dynamic codegen outputs (#68246)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68246

Currently the codegen produces a list of output files at CMake
configuration time and the build system has no way of knowing if the
outputs change. So if that happens, you basically need to delete the
build folder and re-run from scratch.

Instead, this generates the output list every time the code generation
is run and changes the output to be a `.cmake` file that gets included
in the main cmake configuration step. That means the build system
knows to re-run cmake automatically if a new output is added. So, for
example you could change the number of shards that `Operators.cpp` is
split into and it all just works transparently to the user.

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D32596268

Pulled By: albanD

fbshipit-source-id: 15e0896aeaead90aed64b9c8fda70cf28fef13a2
2021-12-07 15:58:06 -08:00
Alban Desmaison
00ebbd5ef6 Revert D32010095: [pytorch][PR] Add ability for a mobile::Module to save as flatbuffer
Test Plan: revert-hammer

Differential Revision:
D32010095 (41d35dc201)

Original commit changeset: d763b0557780

fbshipit-source-id: bf746a0389135c9f5f67f00f449435ce08fb5f6d
2021-12-02 06:41:40 -08:00
Han Qi
41d35dc201 Add ability for a mobile::Module to save as flatbuffer (#67351)
Summary:
Included functions:

* save_mobile_module -> saves a mobile::Module to flatbuffer
* load_mobile_module_from_file -> loads a flatbuffer into mobile::Module
* parse_mobile_module -> parses from bytes or deserialized flatbuffer
      Module object

Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67351

Reviewed By: iseeyuan

Differential Revision: D32010095

Pulled By: qihqi

fbshipit-source-id: d763b0557780f7c2661b6485105b045e41a5e8f1
2021-12-01 23:58:15 -08:00
Michael Suo
ad182479b0 [deploy] docs (#69251)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69251

This adds some actual documentation for deploy, which is probably useful
since we told everyone it was experimentally available so they will
probably be looking at what the heck it is.

It also wires up various components of the OSS build to actually work
when used from an external project.

Differential Revision: D32783312

Test Plan: Imported from OSS

Reviewed By: wconstab

Pulled By: suo

fbshipit-source-id: c5c0a1e3f80fa273b5a70c13ba81733cb8d2c8f8
2021-12-01 21:55:18 -08:00
Peter Bell
e534c5efd7 CMake: Include instead of copying cpu kernel files (#67656)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67656

Currently, each cpu kernel file is copied into the build folder 3 times to give them different compilation flags. This changes it to instead generate 3 files that `#include` the original file. The biggest difference is that updating a copied file requires `cmake` to re-run, whereas include dependencies are natively handled by `ninja`.

A side benefit is that included files show up directly in the build dependency graph, whereas `cmake` file copies don't.
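
A sketch of what one generated wrapper might look like under the new scheme (the file name, macro, and paths are illustrative, not taken from the PR):

```cpp
// build/aten/src/ATen/native/cpu/SomeKernel.cpp.AVX2.cpp (generated)
// The wrapper defines the capability macro and then includes the real
// kernel source, so edits to the kernel are picked up through ninja's
// ordinary include-dependency tracking instead of a cmake re-copy.
#define CPU_CAPABILITY_AVX2
#include "aten/src/ATen/native/cpu/SomeKernel.cpp"
```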

Test Plan: Imported from OSS

Reviewed By: dagitses

Differential Revision: D32566108

Pulled By: malfet

fbshipit-source-id: ae75368fede37e7ca03be6ade3d4e4a63479440d
2021-11-30 19:13:53 -08:00
Jiakai Liu
3dc0754c53 [pytorch][mobile] deprecate the LLVM-based static analyzer (#68180)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68180

Since we've open sourced the tracing-based selective build, we can deprecate the
op-dependency-graph-based selective build and the static analyzer tool that
produces the dependency graph.
ghstack-source-id: 143108377

Test Plan: CIs

Reviewed By: seemethere

Differential Revision: D32358467

fbshipit-source-id: c61523706b85a49361416da2230ec1b035b8b99c
2021-11-11 16:37:08 -08:00
Peter Bell
4d601a1c36 codegen: Split up source, header and Declarations.yaml generation (#67497)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67497

This allows more of the code-generation to happen in parallel, whereas
previously all codegen was serialized.

Test Plan: Imported from OSS

Reviewed By: dagitses, mruberry

Differential Revision: D32027250

Pulled By: albanD

fbshipit-source-id: 6407c4c3e25ad15d542aa73da6ded6a309c8eb6a
2021-11-03 13:20:54 -07:00
Gordon Fossum
ea4d983885 Modify "gemm" code to enable access to "sbgemm_" routine in OpenBLAS (#58831)
Summary:
OpenBLAS recently added support for bfloat16 GEMM, so this change has PyTorch call out to OpenBLAS for that, like it does for single and double precision

Our goal is to try to enable PyTorch to make calls to "sbgemm" in OpenBLAS.

We are prepared (if it is your preference) to add fences to the code to limit this change to the Power architecture,
but our first instinct is that anyone on any architecture that enables access to sbgemm in their OpenBLAS library
should be able to use this code. (But again, as we are just starting to modify PyTorch, we respect your guidance!)

(there is no issue number related to this)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58831

Reviewed By: albanD

Differential Revision: D29951900

Pulled By: malfet

fbshipit-source-id: 3d0a4a638ac95b2ff2e9f6d08827772e28d397c3
2021-11-03 08:53:27 -07:00
Robert Blackwell
cee4e8f35d Add FlexiBLAS build support per #64752 (#64815)
Summary:
To enable building torch+dependencies, set WITH_BLAS=flexi BLAS=FlexiBLAS

Fixes https://github.com/pytorch/pytorch/issues/64752

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64815

Reviewed By: jbschlosser

Differential Revision: D31997745

Pulled By: albanD

fbshipit-source-id: db208d59002f5896608a03132616400f09d972aa
2021-10-28 11:28:00 -07:00
Xiang Gao
b8dfb45ac2 Refactor cub namespace handling (#66219)
Summary:
This PR is to update PyTorch with the following cub changes:
- Starting cub 1.13.1, cub requires users to define `CUB_NS_QUALIFIER` if `CUB_NS_PREFIX` is also defined. Besides that, a new mechanism `CUB_WRAPPED_NAMESPACE` is added.

And I do the following change to PyTorch:
- Starting CUDA 11.5, define `CUB_WRAPPED_NAMESPACE` globally as an nvcc flag.
- Fix caffe2 failures caused by the above change.
- Add a `aten/src/ATen/cuda/cub_definitions.cuh` that defines helper macros about feature availability.
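
A minimal sketch of what the wrapped-namespace mechanism buys (assumes CUDA >= 11.5 with cub >= 1.13.1, compiled with nvcc; illustrative, not code from this PR):

```cpp
// Defining CUB_WRAPPED_NAMESPACE before the include places every cub
// symbol under ::at_cuda_detail::cub, so PyTorch's copy of cub cannot
// collide with a second copy linked into the same binary.
#define CUB_WRAPPED_NAMESPACE at_cuda_detail
#include <cub/cub.cuh>

size_t query_temp_storage(const float* d_in, float* d_out, int n) {
  size_t temp_bytes = 0;
  // A first call with null temp storage only computes the required size.
  ::at_cuda_detail::cub::DeviceReduce::Sum(nullptr, temp_bytes, d_in, d_out, n);
  return temp_bytes;
}
```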

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66219

Reviewed By: bdhirsh

Differential Revision: D31626931

Pulled By: ngimel

fbshipit-source-id: 97ebf5ef671ade8bf46d0860edc317f22660f26d
2021-10-25 14:37:09 -07:00
Michael Suo
3ac2c74896 Revert D31082208: Use shared CUPTI by default
Test Plan: revert-hammer

Differential Revision:
D31082208 (8b0eae5aa8)

Original commit changeset: 14f66af92084

fbshipit-source-id: 0faff00832b7f79d476fd1f9f505142a548a76db
2021-10-12 14:37:54 -07:00
Edward Yang
8b0eae5aa8 Use shared CUPTI by default (#65401)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65401

Per https://github.com/pytorch/pytorch/issues/57744 statically linked CUPTI
causes exception handling to break on certain compiler configurations, likely
because CUPTI comes with incompatible libstdc++ symbols.  Rather than pray that
something reasonable happens, use the safer configuration (dynamic linking) by
default and give a warning if the user inverts the setting.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: gdankel

Differential Revision: D31082208

Pulled By: ezyang

fbshipit-source-id: 14f66af920847e158436b5801c43f3124b109b34
2021-10-12 11:01:40 -07:00
Nikita Shulga
c373387709 Update CMake and use native CUDA language support (#62445)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62445

PyTorch currently uses the old style of compiling CUDA in CMake which is just a
bunch of scripts in `FindCUDA.cmake`. Newer versions support CUDA natively as
a language just like C++ or C.

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D31503350

fbshipit-source-id: 2ee817edc9698531ae1b87eda3ad271ee459fd55
2021-10-11 09:05:48 -07:00
Chen Lai
355acfdebc [PyTorch Edge][tracing-based] use operator.yaml to build libtorch library (#66237)
Summary:
https://pxl.cl/1QK3N
Enables using the yaml file from the tracer to build the libtorch library for iOS and Android.

1. Android:
```
SELECTED_OP_LIST=/Users/chenlai/Documents/pytorch/tracing/deeplabv3_scripted_tracing_update.yaml TRACING_BASED=1  ./scripts/build_pytorch_android.sh x86
```
libtorch_lite.so x86: 3 MB (larger than H1, static is ~3.2 MB)

2. iOS
```
SELECTED_OP_LIST=/Users/chenlai/Documents/pytorch/tracing/deeplabv3_scripted_tracing_update.yaml TRACING_BASED=1 BUILD_PYTORCH_MOBILE=1 IOS_PLATFORM=SIMULATOR  ./scripts/build_ios.sh
```
Binary size: 7.6 MB

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66237

ghstack-source-id: 140197164

Reviewed By: dhruvbird

Differential Revision: D31463119

fbshipit-source-id: c3f4eb71bdef1969eab6cb60999fec8547641cbd
2021-10-10 14:07:01 -07:00
Pruthvi Madugundu
085e2f7bdd [ROCm] Changes not to rely on CUDA_VERSION or HIP_VERSION (#65610)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65610

- Replace HIP_PLATFORM_HCC with USE_ROCM
- Dont rely on CUDA_VERSION or HIP_VERSION and use USE_ROCM and ROCM_VERSION.

- In the next PR
   - Will be removing the mapping from CUDA_VERSION to HIP_VERSION and CUDA to HIP in hipify.
   - HIP_PLATFORM_HCC is deprecated, so will add HIP_PLATFORM_AMD to support HIP host code compilation on gcc.

cc jeffdaily sunway513 jithunnair-amd ROCmSupport amathews-amd

Reviewed By: jbschlosser

Differential Revision: D30909053

Pulled By: ezyang

fbshipit-source-id: 224a966ebf1aaec79beccbbd686fdf3d49267e06
2021-09-29 09:55:43 -07:00
Michael Suo
9b40eaaaab Revert D31193205: [pytorch][PR] CMake: Limit python include directories to only python libraries
Test Plan: revert-hammer

Differential Revision:
D31193205 (971c57f1d0)

Original commit changeset: 5c1b554a59d0

fbshipit-source-id: 5719b7df987ded6e7e212749a438db947656df87
2021-09-29 09:49:33 -07:00
Peter Bell
971c57f1d0 CMake: Limit python include directories to only python libraries (#65654)
Summary:
`include_directories` is old-style CMake which adds the include path to every file being compiled. This instead makes python, numpy and pybind11 into targets that only torch_python and caffe2_pybind_state are linked to. So, python libraries can't be accidentally included elsewhere.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65654

Reviewed By: gchanan

Differential Revision: D31193205

Pulled By: malfet

fbshipit-source-id: 5c1b554a59d0e441a701a04ebb62f0032d38b208
2021-09-29 08:09:08 -07:00
Nikita Shulga
399214efd6 Revert D31172530: [pytorch][PR] Enable CUPTI for kineto by default on windows
Test Plan: revert-hammer

Differential Revision:
D31172530 (6b60884f12)

Original commit changeset: 2c69ed0282c5

fbshipit-source-id: 649e040a8c44b0f536a8db397b4325309a285934
2021-09-24 19:18:15 -07:00
Guangyun Han
6b60884f12 Enable CUPTI for kineto by default on windows (#65608)
Summary:
Retry of https://github.com/pytorch/pytorch/pull/62175

See https://github.com/pytorch/pytorch/pull/62175#issuecomment-926411151 for more information.

malfet gdankel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65608

Reviewed By: zou3519

Differential Revision: D31172530

Pulled By: gdankel

fbshipit-source-id: 2c69ed0282c54fa6cdb6e604096d0370e230fd66
2021-09-24 13:00:49 -07:00
Nikita Shulga
bc02255d5e Revert D30721329: [pytorch][PR] Enable CUPTI for kineto by default on windows.
Test Plan: revert-hammer

Differential Revision:
D30721329 (7dbc21bc2b)

Original commit changeset: aa1af47df8cc

fbshipit-source-id: 565d50841e19a45f8798a490aa3aa6b9f69ca404
2021-09-23 22:14:32 -07:00
Guangyun Han
7dbc21bc2b Enable CUPTI for kineto by default on windows. (#62175)
Summary:
It fixes nothing.

For tracking this PR, please refers to https://github.com/pytorch/kineto/issues/356

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62175

Reviewed By: ezyang

Differential Revision: D30721329

Pulled By: gdankel

fbshipit-source-id: aa1af47df8cc1b6f5ba2194447f62b902a6a9c84
2021-09-23 15:13:47 -07:00
jiej
127c9402d0 Revert "Revert D30752939: [pytorch][PR] nvfuser update" (#65137)
Summary:
This reverts commit 03389dc851.

Attempt again for PR: https://github.com/pytorch/pytorch/issues/63745
Fixes the windows build failure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65137

Reviewed By: seemethere, dzhulgakov, heitorschueroff

Differential Revision: D30994556

Pulled By: malfet

fbshipit-source-id: f1925b6c5cc1a1a441a96499667c91e8dfc1b53d
2021-09-22 04:54:51 -07:00
Tao Xu
18fa58c4e9 [CoreML][OSS] Integrate with CMake (#64523)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64523

- Build Pytorch with CoreML delegate - ` USE_PYTORCH_METAL=ON python setup.py install --cmake`
- Build iOS static libs - `IOS_PLATFORM=SIMULATOR USE_COREML_DELEGATE=1  ./scripts/build_ios.sh`
ghstack-source-id: 138324216

Test Plan:
- Test the Helloword example

{F657778559}

Reviewed By: iseeyuan

Differential Revision: D30594041

fbshipit-source-id: 8cece0b2d4b3ef82d3ef4da8c1054919148beb16
2021-09-17 10:32:00 -07:00
Jane Xu
9af6fe991c Remove CUDA 9.2 and older references from our cmake (#65065)
Summary:
Removes old CUDA references in our cuda.cmake

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65065

Reviewed By: malfet

Differential Revision: D30992673

Pulled By: janeyx99

fbshipit-source-id: 85b524089ed57e5acbc71720267cf05e24a8c20a
2021-09-16 12:54:49 -07:00
Eli Uriegas
03389dc851 Revert D30752939: [pytorch][PR] nvfuser update
Test Plan: revert-hammer

Differential Revision:
D30752939 (cfaecaf40b)

Original commit changeset: ce122e80f01b

fbshipit-source-id: 57685df8f9946032a06eff1de8a3d1498500d2d2
2021-09-15 17:38:47 -07:00
jiej
cfaecaf40b nvfuser update (#63745)
Summary:
Syncing the nvfuser code base from the devel branch. Listing a few of our developments since the last sync:

- Extends support to normalization and reduction kernels.
- Multiple kernel launch for single `CudaFusionGroup`. Hierarchical caching system has been updated to cache graph segmentation.
- profile_ivalue is enabled to convert dynamic scalar into compile time constants, which are required by the codegen. (e.g. reduction axes).

To keep this PR simple and relatively review-free, we stripped most external changes and submitted them as separate PRs, so this gigantic PR is easier to handle.

Internal updates are in files located in:
1. updates in nvfuser codegen `torch/csrc/jit/codegen/cuda`
2. added nvfuser specific benchmarks `benchmarks/cpp/nvfuser`
3. nvfuser jit cpp tests `test/cpp/jit/test_gpu.cpp` `test/cpp/jit/test_gpu_shift.cpp` `test/cpp/jit/test_gpu_validator.h`

updates affecting integration:

1. profile_ivalue enabled for nvfuser. related changes are in `torch/csrc/jit/runtime/*`,
2. exposed a few more symbols `aten/src/ATen/core/*` used by codegen

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63745

Reviewed By: saketh-are

Differential Revision: D30752939

Pulled By: malfet

fbshipit-source-id: ce122e80f01bcd3865f5bd3c4dfde660665fd84c
2021-09-15 14:42:55 -07:00
Nick Kreeger
882b67dff4 Drop incremental linking on Windows with REL_WITH_DEB_INFO=1. (#64892)
Summary:
The library will no longer link properly on VS 2019 (14.29.30133). To
ensure that engineers building on Windows can use and debug with this
build type, incremental linking needs to be turned off for this build
flag.

Verified that this build type successfully builds, links, and provides
debuggable Python modules on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64892

Reviewed By: jbschlosser

Differential Revision: D30902565

Pulled By: malfet

fbshipit-source-id: e5286a4c6f45c7cbe4cdc1b98560129bd386970b
2021-09-14 09:44:18 -07:00
Hanton Yang
22d38bd10d [OSS] Enable Metal in PyTorch MacOS nightly builds (#63718)
Summary:
Build on https://github.com/pytorch/pytorch/pull/63825

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63718

Test Plan:
1. Add the `ci/binaries` label to the PR, so the CI will build those nightly builds

2. Make sure the following CI jobs build with the `USE_PYTORCH_METAL_EXPORT` option set to `ON`:
```
ci/circleci: binary_macos_arm64_conda_3_8_cpu_nightly_build
ci/circleci: binary_macos_arm64_conda_3_9_cpu_nightly_build
ci/circleci: binary_macos_arm64_wheel_3_8_cpu_nightly_build
ci/circleci: binary_macos_arm64_wheel_3_9_cpu_nightly_build
ci/circleci: binary_macos_conda_3_6_cpu_nightly_build
ci/circleci: binary_macos_conda_3_7_cpu_nightly_build
ci/circleci: binary_macos_conda_3_8_cpu_nightly_build
ci/circleci: binary_macos_conda_3_9_cpu_nightly_build
ci/circleci: binary_macos_libtorch_3_7_cpu_nightly_build
ci/circleci: binary_macos_wheel_3_6_cpu_nightly_build
ci/circleci: binary_macos_wheel_3_7_cpu_nightly_build
ci/circleci: binary_macos_wheel_3_8_cpu_nightly_build
ci/circleci: binary_macos_wheel_3_9_cpu_nightly_build
```

3. Test `conda` and `wheel` builds locally on the [HelloWorld-Metal](https://github.com/pytorch/ios-demo-app/tree/master/HelloWorld-Metal) demo with [(Prototype) Use iOS GPU in PyTorch](https://pytorch.org/tutorials/prototype/ios_gpu_workflow.html)

(1) conda
```
conda install https://15667941-65600975-gh.circle-artifacts.com/0/Users/distiller/project/final_pkgs/pytorch-1.10.0.dev20210826-py3.8_0.tar.bz2
```
(2) wheel
```
pip3 install https://15598647-65600975-gh.circle-artifacts.com/0/Users/distiller/project/final_pkgs/torch-1.10.0.dev20210824-cp38-none-macosx_10_9_x86_64.whl
```

Reviewed By: xta0

Differential Revision: D30593167

Pulled By: hanton

fbshipit-source-id: 471da204e94b29c11301c857c50501307a5f0785
2021-08-27 09:25:05 -07:00
Peter Bell
e4f44bec27 Fix pocketfft include path in mobile build (#63714)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63714

PocketFFT was disabled for CMake < 3.9, but CMake 3.11 is the first version to support `INCLUDE_DIRECTORIES` as a target property. So updating to CMake 3.10 causes the mobile builds to fail. Instead of limiting the CMake support, this just adds the include directory to the entire target.

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D30498369

Pulled By: malfet

fbshipit-source-id: 83372e29c477c97e7015763b7c29d6d7e456bcef
2021-08-23 17:48:57 -07:00
Nikita Shulga
bec75daa77 Update protobuf to 3.13.1 (#62571)
Summary:
Update bazel to 4.10.0

Update ASAN_SYMBOLIZER_PATH to llvm-7
Suppress `vptr` ubsan violations in `test_jit`
Fix ProtoBuf patching for ONNX which caused Windows builds to crash while attempting to free `std::string` allocated on stack

Fixes https://github.com/pytorch/pytorch/issues/62569

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62571

Reviewed By: walterddr

Differential Revision: D30048685

Pulled By: malfet

fbshipit-source-id: 6462c1bef9c42318551d2cf906bbab41e1d4e1cd
2021-08-19 23:43:55 -07:00
driazati
bd8608cd5c Use CMake for breakpad (#63186)
Summary:
We currently build breakpad from [this fork](https://github.com/driazati/breakpad) to include extra logic to restore signal handlers that were previously present. With some [new additions](https://github.com/google/breakpad/compare/main...driazati:main) this fork now includes a CMake based build, so we can add breakpad as a proper dependency rather than rely on including it in Docker images as a system library which is error prone (we have a bunch of images) and hard to extend to MacOS / Windows. This also includes some changes to the crash handling code to support MacOS / Windows in a similar way to Linux.

```python
import torch

# On Windows this writes crashes to C:\Users\<user>\AppData\pytorch_crashes
# On MacOS/Linux this writes crashes to /tmp/pytorch_crashes
torch.utils._crash_handler.enable_minidumps()

# Easy way to cause a segfault and trigger the handler
torch.bincount(input=torch.tensor([9223372036854775807]))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63186

Reviewed By: malfet, seemethere

Differential Revision: D30318404

Pulled By: driazati

fbshipit-source-id: 0d7daf3701cfaba5451cc529a0730272ab1eb1dc
2021-08-19 10:42:01 -07:00
peterjc123
9bb1371cc2 Disable RDYNAMIC check with MSVC (#62949)
Summary:
When testing with clang-cl, the flag is added even though it is unsupported, which generates a few warnings. Tried a few alternatives, like https://cmake.org/cmake/help/latest/module/CheckLinkerFlag.html, but they just don't work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62949

Reviewed By: zhouzhuojie, driazati

Differential Revision: D30359206

Pulled By: malfet

fbshipit-source-id: 1bd27ad5772fe6757fa8c3a4bddf904f88d70b7b
2021-08-18 11:51:23 -07:00
Nikita Shulga
6e5d065b2b Add pocketfft as submodule (#62841)
Summary:
Using https://github.com/mreineck/pocketfft

Also delete explicit installation of pocketfft during the build as it will be available via submodule

Limit PocketFFT support to cmake-3.10 or newer, as `set_source_files_properties` does not seem to work as expected with cmake-3.5

Partially addresses https://github.com/pytorch/pytorch/issues/62821

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62841

Reviewed By: seemethere

Differential Revision: D30140441

Pulled By: malfet

fbshipit-source-id: d1a1cf1b43375321f5ec5b3d0b538f58082f7825
2021-08-17 15:29:56 -07:00
Kimish Patel
38c185189c [Pytorch Edge] Enable kineto profiler on mobile via EdgeKinetoProfiler (#62419)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62419

This diff adds support for a CPU-only kineto profiler on mobile, thus
enabling chrome trace generation on mobile. This brings the C++ API for
mobile profiling on par with TorchScript.
This is done via:
1. Utilizing debug handle annotations in KinetoEvent.
2. Adding post-processing capability, via callbacks, to
KinetoThreadLocalState.
3. Creating a new RAII-style profiler, KinetoEdgeCPUProfiler, which can be
used in the surrounding scope of model execution. This will write a chrome
trace to the location specified in the profiler constructor.
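
A hedged usage sketch of the RAII profiler described in point 3 (the header path and constructor arguments are assumptions based on this summary):

```cpp
#include <vector>
#include <torch/csrc/jit/mobile/module.h>
#include <torch/csrc/jit/mobile/profiler_edge.h>  // assumed location

void profiled_forward(torch::jit::mobile::Module& m,
                      std::vector<c10::IValue> inputs) {
  // The profiler is active for the enclosing scope; when it goes out of
  // scope it writes a chrome trace to the path passed to the constructor.
  torch::jit::mobile::KinetoEdgeCPUProfiler profiler(m, "/tmp/trace.json");
  m.forward(std::move(inputs));
}
```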

Test Plan:
MobileProfiler.ModuleHierarchy

Imported from OSS

Reviewed By: raziel

Differential Revision: D29993660

fbshipit-source-id: 0b44f52f9e9c5f5aff81ebbd9273c254c3c03299
2021-08-13 21:40:19 -07:00
Pruthvi Madugundu
ab7a472980 [ROCm] Update HIP_VERSION to TORCH_HIP_VERSION (#62786)
Summary:
- HIP_VERSION semantic versioning will change in ROCm 4.3. The changes essentially remove the dependency on HIP_VERSION provided in the hip header to keep code compatible with older and newer versions of ROCm.
- TORCH_HIP_VERSION is derived from HIP_VERSION_MAJOR and HIP_VERSION_MINOR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62786

Reviewed By: bdhirsh

Differential Revision: D30281682

Pulled By: seemethere

fbshipit-source-id: e41e69fb9e13de5ddd1af99ba5bbdcbb7b64b673
2021-08-13 15:00:43 -07:00
Isuru Fernando
b58e04f156 Make sure FindLAPACK finds the same BLAS library (#49647)
Summary:
The BLAS library is found by cmake/Dependencies.cmake, and then the
LAPACK library is found by FindLAPACK.cmake, which in turn calls
FindBLAS.cmake. This means that we are searching for BLAS twice,
and the two searches might find different libraries. By setting a few
variables, this can be avoided.

cc seemethere

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49647

Reviewed By: seemethere, ejguan

Differential Revision: D29943680

Pulled By: malfet

fbshipit-source-id: 3cbc350ea645a1a28dd92c19e5ee7f9eecdeff59
2021-08-02 20:41:00 -07:00
Can Balioglu
7565039ee9 Support system-provided Intel TBB (#61934)
Summary:
This PR: (1) enables the use of a system-provided Intel TBB for building PyTorch, (2) removes `tbb::task_scheduler_init` references, since it was removed from TBB a while ago, and (3) marks the implementation of `_internal_set_num_threads` with a TODO, as it requires a revision that fixes its thread allocation logic.

Tested with `test/run_test`; no new tests are introduced since there are no behavioral changes (removal of `tbb::task_scheduler_init` has no impact on the runtime behavior).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61934

Reviewed By: malfet

Differential Revision: D29805416

Pulled By: cbalioglu

fbshipit-source-id: 22042b428b57b8fede9dfcc83878d679a19561dd
2021-08-02 07:39:00 -07:00
Brian Vaughan
2eef1f27f8 Disable ccache for nccl builds (#62208)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62208

reverts
https://github.com/pytorch/pytorch/pull/55814
which removed a workaround for:
https://github.com/pytorch/pytorch/issues/13362

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D29935472

Pulled By: nairbv

fbshipit-source-id: 7ce9cde1408f17153632036fd128814032739746
2021-07-27 08:07:26 -07:00
Jane Xu
e318058ffe Ignore LNK4099 for debug binary libtorch builds (#62060)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/61979

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62060

Test Plan:
This CI shouldn't break
and https://github.com/pytorch/pytorch/pull/62061

Reviewed By: driazati

Differential Revision: D29877487

Pulled By: janeyx99

fbshipit-source-id: 497f84caab3f9ae609644fd397ad87a6dc8a2a77
2021-07-23 09:31:41 -07:00
imaginary-person
9e53c823b8 Add AVX512 support in ATen & remove AVX support (#61903)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61903

### Remaining Tasks

- [ ] Collate results of benchmarks on two Intel Xeon machines (with & without CUDA, to check if CPU throttling causes issues with GPUs) - make graphs, including Roofline model plots (Intel Advisor can't make them with libgomp, though it can with Intel OpenMP).

### Summary

1. This draft PR produces binaries with 3 types of ATen kernels - default, AVX2, AVX512. Using the environment variable `ATEN_AVX512_256=TRUE` also results in 3 types of kernels, but the compiler can use 32 ymm registers for AVX2, instead of the default 16. ATen kernels for `CPU_CAPABILITY_AVX` have been removed.

2. `nansum` is not using an AVX512 kernel right now, as it has poorer accuracy for Float16 than AVX2 or DEFAULT, whose respective accuracies aren't very good either (#59415). It was more convenient to disable AVX512 dispatch for all dtypes of `nansum` for now.

3. On Windows, ATen Quantized AVX512 kernels are not being used, as quantization tests are flaky. If `--continue-through-failure` is used, then `test_compare_model_outputs_functional_static` fails. But if this test is skipped, `test_compare_model_outputs_conv_static` fails. If both these tests are skipped, then a third one fails. These are hard to debug right now due to not having access to a Windows machine with AVX512 support, so it was more convenient to disable AVX512 dispatch of all ATen Quantized kernels on Windows for now.

4. One test is currently being skipped -
[`test_lstm` in `quantization.bc`](https://github.com/pytorch/pytorch/issues/59098) - It fails only on Cascade Lake machines, irrespective of the `ATEN_CPU_CAPABILITY` used, because FBGEMM uses `AVX512_VNNI` on machines that support it. The value of `reduce_range` should be set to `False` on such machines.

The list of the changes is at https://gist.github.com/imaginary-person/4b4fda660534f0493bf9573d511a878d.

Credits to ezyang for proposing `AVX512_256` - these use AVX2 intrinsics but benefit from 32 registers, instead of the 16 ymm registers that AVX2 uses.
Credits to limo1996 for the initial proposal, and for optimizing `hsub_pd` & `hadd_pd`, which didn't have direct AVX512 equivalents, and are being used in some kernels. He also refactored `vec/functional.h` to remove duplicated code.
Credits to quickwritereader for helping fix 4 failing complex multiplication & division tests.

### Testing
1. `vec_test_all_types` was modified to test basic AVX512 support, as tests already existed for AVX2.
Only one test had to be modified, as it was hardcoded for AVX2.
2.  `pytorch_linux_bionic_py3_8_gcc9_coverage_test1` & `pytorch_linux_bionic_py3_8_gcc9_coverage_test2` are now using `linux.2xlarge` instances, as they support AVX512. They were used for testing AVX512 kernels, as AVX512 kernels are being used by default in both of the CI checks. Windows CI checks had already been using machines with AVX512 support.

### Would the downclocking caused by AVX512 pose an issue?

I think it's important to note that AVX2 causes downclocking as well, and the additional downclocking caused by AVX512 may not hamper performance on some Skylake machines & beyond, because of the double vector-size. I think that [this post with verifiable references is a must-read](https://community.intel.com/t5/Software-Tuning-Performance/Unexpected-power-vs-cores-profile-for-MKL-kernels-on-modern-Xeon/m-p/1133869/highlight/true#M6450). Also, AVX512 would _probably not_ hurt performance on a high-end machine, [but measurements are recommended](https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/). In case it does, `ATEN_AVX512_256=TRUE` can be used for building PyTorch, as AVX2 can then use 32 ymm registers instead of the default 16. [FBGEMM uses `AVX512_256` only on Xeon D processors](https://github.com/pytorch/FBGEMM/pull/209), which are said to have poor AVX512 performance.

This [official data](https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-scalable-spec-update.pdf) is for the Intel Skylake family, and the first link helps understand its significance. Cascade Lake & Ice Lake SP Xeon processors are said to be even better when it comes to AVX512 performance.

Here is the corresponding data for [Cascade Lake](https://cdrdv2.intel.com/v1/dl/getContent/338848) -

![CASCADE LAKE AVX2](https://user-images.githubusercontent.com/76181208/120666172-ffec3f80-c451-11eb-8ea1-8933ccc12a1b.PNG)
![CASCADE LAKE AVX512](https://user-images.githubusercontent.com/76181208/120666190-04b0f380-c452-11eb-9faa-38d233c874c8.PNG)

The corresponding data isn't publicly available for Intel Xeon SP 3rd gen (Ice Lake SP), but [Intel mentioned that the 3rd gen has frequency improvements pertaining to AVX512](https://newsroom.intel.com/wp-content/uploads/sites/11/2021/04/3rd-Gen-Intel-Xeon-Scalable-Platform-Press-Presentation-281884.pdf). Ice Lake SP machines also have 48 KB L1D caches, so that's another reason for AVX512 performance to be better on them.

### Is PyTorch always faster with AVX512?

No, but then PyTorch is not always faster with AVX2 either. Please refer to #60202. The benefit from vectorization is apparent with small tensors that fit in caches or in kernels that are more compute heavy. For instance, AVX512 or AVX2 would yield no benefit for adding two 64 MB tensors, but adding two 1 MB tensors would do well with AVX2, and even more so with AVX512.

It seems that memory-bound computations, such as adding two 64 MB tensors can be slow with vectorization (depending upon the number of threads used), as the effects of downclocking can then be observed.

Original pull request: https://github.com/pytorch/pytorch/pull/56992

Reviewed By: soulitzer

Differential Revision: D29266289

Pulled By: ezyang

fbshipit-source-id: 2d5e8d1c2307252f22423bbc14f136c67c3e6184
2021-07-22 08:51:49 -07:00
Hong Xu
7acb8b71e1 Remove AVX detection code that duplicates FindAVX.cmake (#61748)
Summary:
This PR deletes some code in `MiscCheck.cmake` that perform the exact
same functionality as `FindAVX.cmake`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61748

Reviewed By: ejguan

Differential Revision: D29791282

Pulled By: malfet

fbshipit-source-id: 6595fd1b61c8ae12b821fad8c9a34892dd52d213
2021-07-20 14:34:36 -07:00
Hong Xu
f912889726 Remove unnecessary Ubuntu version checks (#61738)
Summary:
PR https://github.com/pytorch/pytorch/issues/5401 missed another Ubuntu version check in `cmake/MiscCheck.cmake`.

The check for available functions added by https://github.com/pytorch/pytorch/issues/5401 are already present below the code snippet that this PR deletes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61738

Reviewed By: mrshenli

Differential Revision: D29757525

Pulled By: ezyang

fbshipit-source-id: 7f5f9312284973481a8b8a2b9c51cc09774722e9
2021-07-19 13:04:24 -07:00
Tongliang Liao
0afbb9e81e PYTHON_LIBRARY may be set to empty or NOTFOUND. (#61230)
Summary:
Not sure why (maybe from dependencies?), but these variables can certainly break package lookup upon re-entry of CMake.
So instead of checking whether they are defined, we should check whether they hold any meaningful value.

Fixes https://github.com/pytorch/pytorch/issues/59887

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61230

Reviewed By: H-Huang

Differential Revision: D29668766

Pulled By: malfet

fbshipit-source-id: 79a59578740c4434327aff4f9a22eba9c4bf48d1
2021-07-13 07:09:31 -07:00
Luca Wehrstedt
c830db0265 Raise error in CMake for CUDA <9.2 (#61462)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61462

Anything before CUDA 9.2 is not supported (see https://github.com/pytorch/pytorch/pull/36848), and perhaps not even that.
ghstack-source-id: 133312018

Test Plan: CI

Reviewed By: samestep

Differential Revision: D29637251

fbshipit-source-id: 4300169b7298274b2074649342902a34bd2220b5
2021-07-09 11:28:38 -07:00
shmsong
ee2dd35ef4 Resolving native dependency and try_run for cross compile (#59764)
Summary:
This is a PR on the build system that provides support for cross compiling on Jetson platforms.

The major change is:

1. Disable try-runs for cross compiling in `COMPILER_WORKS`, `BLAS`, and `CUDA`, since a try-run cannot execute on a cross-compile setup.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59764

Reviewed By: soulitzer

Differential Revision: D29524363

Pulled By: malfet

fbshipit-source-id: f06d1ad30b704c9a17d77db686c65c0754db07b8
2021-07-09 09:29:21 -07:00
Elton Leander Pinto
7481c6fc02 Bump googletest version to v1.11.0 (#61395)
Summary:
This PR bumps the `googletest` version to v1.11.0.

To facilitate this change, `CAFFE2_ASAN_FLAG` and `CAFFE2_TSAN_FLAG` are divided into corresponding compiler and linker variants. This is required because `googletest v1.11.0` sets the `-Werror` flag. The `-pie` flag is a linker flag, and passing it to a compiler invocation results in a `-Wunused-command-line-argument` warning, which in turn will cause `googletest` to fail to build with ASAN.

Fixes https://github.com/pytorch/pytorch/issues/60865

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61395

Reviewed By: iramazanli

Differential Revision: D29620970

Pulled By: 1ntEgr8

fbshipit-source-id: cdb1d3d12e0fff834c2e62971e42c03f8c3fbf1b
2021-07-08 16:29:17 -07:00
zhouzhuojie
6107cf3750 Add --jobs 0 for git submodule update (#61311)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61311

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61152

Some related docs about `submodule.fetchJobs`
https://git-scm.com/docs/git-config#Documentation/git-config.txt-submodulefetchJobs

```
time git submodule update --init --recursive
________________________________________________________
Executed in  243.20 secs    fish           external
   usr time   49.64 secs  213.00 micros   49.64 secs
   sys time   29.27 secs  795.00 micros   29.27 secs
```

```
time git submodule update --init --recursive --jobs 4
________________________________________________________
Executed in  143.04 secs    fish           external
   usr time   51.06 secs  246.00 micros   51.06 secs
   sys time   30.96 secs  742.00 micros   30.96 secs
```

```
time git submodule update --init --recursive --jobs 8
________________________________________________________
Executed in  124.64 secs    fish           external
   usr time   51.76 secs  264.00 micros   51.76 secs
   sys time   30.49 secs  739.00 micros   30.49 secs

```

```
time git submodule update --init --recursive --jobs 0 # use all online cpus
 ________________________________________________________
Executed in  129.75 secs    fish           external
   usr time   51.64 secs  181.00 micros   51.64 secs
   sys time   31.49 secs  781.00 micros   31.49 secs

```

Test Plan: Imported from OSS

Reviewed By: 1ntEgr8

Differential Revision: D29560875

Pulled By: zhouzhuojie

fbshipit-source-id: 556027dffe744c66428075a8a1bf64683930aaaf
2021-07-07 16:28:18 -07:00
Nikita Shulga
4036820506 Add PocketFFT support (#60976)
Summary:
Needed on platforms that do not have MKL, such as aarch64 and M1.
- Add `AT_POCKETFFT_ENABLED()` to Config.h.in
- Introduce `torch._C.has_spectral`, which is true if PyTorch was compiled with either MKL or PocketFFT
- Modify spectral test to use skipCPUIfNoFFT instead of skipCPUIfNoMKL

Share implementation of `_out` functions as well as fft_fill_with_conjugate_symmetry_stub between MKL and PocketFFT implementations
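
A hedged sketch of how C++ code can gate on the new macro from Config.h.in (mirroring the `torch._C.has_spectral` semantics above; the exact in-tree usage is not shown in this log):

```cpp
#include <ATen/Config.h>

// True when this build of PyTorch has at least one FFT backend.
bool has_spectral_backend() {
#if AT_POCKETFFT_ENABLED() || AT_MKL_ENABLED()
  return true;
#else
  return false;
#endif
}
```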

Fixes https://github.com/pytorch/pytorch/issues/41592

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60976

Reviewed By: walterddr, driazati, janeyx99, samestep

Differential Revision: D29466530

Pulled By: malfet

fbshipit-source-id: ac5edb3d40e7c413267825f92a5e8bc4bb249caf
2021-06-30 16:28:20 -07:00
Peter Bell
4a7d281119 Migrate THAllocator to ATen (#60325)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60325

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D29371715

Pulled By: ngimel

fbshipit-source-id: 78ec8368a48e1a4690d0664a0b02d2a235af98ff
2021-06-24 19:42:14 -07:00
Nikita Shulga
40a7c317bc Run BLAS F2C checks on host architecture (#60703)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/60351

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60703

Reviewed By: driazati

Differential Revision: D29379727

Pulled By: malfet

fbshipit-source-id: dadbb1d39373887f07d59d0a05e093a5d070b016
2021-06-24 18:44:41 -07:00
Peter Bell
31a884987d Remove some TH includes from ATen (#60323)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60323

Test Plan: Imported from OSS

Reviewed By: malfet, anjali411

Differential Revision: D29252862

Pulled By: ngimel

fbshipit-source-id: 9ea13495d382c04dfd52b8dd63314f53b7e83936
2021-06-22 10:55:17 -07:00
Luca Wehrstedt
08ce5eedf5 [reland] Move RPC agents to libtorch (#60170)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60170

Reland of #59939.

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D29193234

fbshipit-source-id: ee2a90d6be961c10f91361512bdd4cadca43dd60
2021-06-18 05:15:09 -07:00
Mike Ruberry
f233274f30 Revert D28875276: Move RPC agents to libtorch
Test Plan: revert-hammer

Differential Revision:
D28875276 (fc50f91929)

Original commit changeset: f2f6970fd74d

fbshipit-source-id: 3c52af652579733ebea8ddfb06576a0ce262bf78
2021-06-17 00:48:58 -07:00
Luca Wehrstedt
fc50f91929 Move RPC agents to libtorch (#59939)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59939

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28875276

fbshipit-source-id: f2f6970fd74de5f112636e78edaa4410c61d8c45
2021-06-15 16:20:53 -07:00
Nikita Shulga
1ea5c19c19 Add USE_WHOLE_CUDNN option (#59744)
Summary:
It is only enabled if USE_STATIC_CUDNN is enabled

Next step after https://github.com/pytorch/pytorch/pull/59721 towards resolving the stripping of fast kernels reported in https://github.com/pytorch/pytorch/issues/50153

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59744

Reviewed By: seemethere, ngimel

Differential Revision: D29007314

Pulled By: malfet

fbshipit-source-id: 7091e299c0c6cc2a8aa82fbf49312cecf3bb861a
2021-06-09 21:12:42 -07:00
Nikita Shulga
8845cbabf0 [CMake] Split caffe2::cudnn into public and private (#59721)
Summary:
This is only important for builds where cuDNN is linked statically into libtorch_cpu.
Before this PR, PyTorch wheels often accidentally contained several partial copies of the cudnn_static library.
Splitting the interface into header-only (cudnn-public) and library+headers (cudnn-private) parts prevents that from happening.
This is a preliminary step towards enabling optional linking of the whole cudnn library to work around the issue reported in https://github.com/pytorch/pytorch/issues/50153

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59721

Reviewed By: ngimel

Differential Revision: D29000967

Pulled By: malfet

fbshipit-source-id: f054df92b265e9494076ab16c247427b39da9336
2021-06-09 13:18:48 -07:00
Jiakai Liu
501320ed81 [pytorch] deprecate default_op_deps.yaml (#59573)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59573

To do mobile selective build, we have several options:
1. static dispatch;
2. dynamic dispatch + static analysis (to create the dependency graph);
3. dynamic dispatch + tracing;

We are developing 3. For open source, we used to only support 1, and
currently we support both 1 and 2.

This file is only used for 2. It was introduced when we deprecated
the static dispatch (1). The motivation was to make sure we have a
low-friction selective build workflow for dynamic dispatch (2).
As the name indicates, it is the *default* dependency graph that users
can try if they don't bother to run the static analyzer themselves.
We have a CI to run the full workflow of 2 on every PR, which creates
the dependency graph on-the-fly instead of using the committed file.

Since the workflow to automatically update the file has been broken
for a while, it started to confuse other pytorch developers as people
are already manually editing it, and it might be broken for some models
already.

We reintroduced the static dispatch recently, so we decide to deprecate
this file now and automatically turn on static dispatch if users run
selective build without providing the static analysis graph.

The tracing-based selective build will be the ultimate solution we'd
like to provide for OSS, but it will take some more effort to polish
and release.

Differential Revision: D28941020

Test Plan: Imported from OSS

Reviewed By: dhruvbird

Pulled By: ljk53

fbshipit-source-id: 9977ab8568e2cc1bdcdecd3d22e29547ef63889e
2021-06-07 19:37:37 -07:00
Jeff Daily
24e27af683 [ROCm] enable kernel asserts (#49624)
Summary:
Addresses missing ROCm feature indicated in https://github.com/pytorch/pytorch/issues/38943.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49624

Reviewed By: agolynski

Differential Revision: D28902459

Pulled By: malfet

fbshipit-source-id: 29c9b552770241a0ec52cd057ea45efc4389d838
2021-06-07 13:43:07 -07:00
Nikita Shulga
63956610a7 Search for static OpenBLAS compiled with OpenMP (#59428)
Summary:
Before this change, only a dynamically linked OpenBLAS compiled with OpenMP
could be found.

Also gets rid of the hardcoded code path for libgfortran.a in FindLAPACK.cmake.

Only affects aarch64 linux builds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59428

Reviewed By: agolynski

Differential Revision: D28891314

Pulled By: malfet

fbshipit-source-id: 5af55a14c85ac66551ad2805c5716bbefe8d55b2
2021-06-04 08:09:21 -07:00
Michael Wootton
e66015dadf Add build support for kineto + rocm (#58401)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/58399

CMake changes to allow kineto to build with rocm support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58401

Reviewed By: mruberry

Differential Revision: D28479807

Pulled By: walterddr

fbshipit-source-id: fc01f05b2a5592ee1d1dbd71d2d4f7aec1bd74f7
2021-06-03 12:15:20 -07:00
Eli Uriegas
eb1adc4c5e cmake: Add USE_GLOO_WITH_OPENSSL to Summary.cmake (#59321)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59321

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D28839370

Pulled By: seemethere

fbshipit-source-id: 0d4b35c05c2b1a78b752088cd16cd6263958e7f6
2021-06-02 11:10:55 -07:00
neginraoof
599f5058cf [ONNX] Update ONNX to rel-1.9 (#55889) (#57080)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57080

ONNX optimizer is removed in ONNX 1.9
This PR removes the ONNX optimizer from the C++ code path and uses a `try-except` block in Python to keep it compatible with both ONNX 1.8 and 1.9.

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D28467330

Pulled By: malfet

fbshipit-source-id: 5e4669dd0537648898e593f9e253da18d6dc7568

Co-authored-by: neginraoof <neginmr@utexas.edu>
Co-authored-by: Nikita Shulga <nshulga@fb.com>
2021-06-02 08:27:17 -07:00
Jeff Daily
ba694520e5 [ROCm] fix JIT codegen (#57400)
Summary:
Fixes upcoming changes that are part of ROCm 4.2 and affect PyTorch JIT.

- ROCM_VERSION macro must be available to both device and host compilation passes.
- Unifies some of CUDA and HIP differences in the code generated.
  - NAN / POS_INFINITY / NEG_INFINITY
  - Do not hipify `extern __shared__` -> `HIP_DYNAMIC_SHARED()` macro [deprecated]
- Differentiates bf16 codegen for HIP.
- Optionally provides missing macros when using hiprtc precompiled header feature.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57400

Reviewed By: ejguan

Differential Revision: D28421065

Pulled By: malfet

fbshipit-source-id: 215f476773c61d8b0d9d148a4e5f5d016f863074
2021-05-27 11:45:07 -07:00
Nikita Shulga
7179e7ea7b [CMake] Prefer third_party/pybind11 by default (#58951)
Summary:
To make build behaviour aligned with other third_party/ libraries,
introduce the `USE_SYSTEM_PYBIND11` build option, which is set to OFF by
default. This means PyTorch will be built with the bundled pybind11 even if
another version is already installed locally.

Fixes https://github.com/pytorch/pytorch/issues/58750

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58951

Reviewed By: driazati

Differential Revision: D28690411

Pulled By: malfet

fbshipit-source-id: e56b5a8f2a23ee1834b2a6d3807f287149decf8c
2021-05-25 15:10:17 -07:00
Nikita Shulga
2dda8d7571 Move cublas dependency after CuDNN (#58287)
Summary:
Library linking order matters during static linking.
Not sure whether it's a bug or a feature, but if cublas is referenced
before CuDNN, it will be partially statically linked into the library,
even if it is not used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58287

Reviewed By: janeyx99

Differential Revision: D28433165

Pulled By: malfet

fbshipit-source-id: 8dffa0533075126dc383428f838f7d048074205c
2021-05-24 09:39:09 -07:00
Gordon Fossum
007fe949aa Adding a new include directory in BLIS search path (#58166)
Summary:
While trying to build PyTorch with BLIS as the backend library,
we found a build issue due to some missing include files.
This was caused by a missing directory in the search path.
This patch adds that path in FindBLIS.cmake.

Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58166

Reviewed By: zou3519

Differential Revision: D28640460

Pulled By: malfet

fbshipit-source-id: d0cd3a680718a0a45788c46a502871b88fbadd52
2021-05-24 08:57:02 -07:00
Nathan John Sircombe
bf00d26deb Enables builds with Compute Library backend for oneDNN (#55913)
Summary:
Since v1.7, oneDNN (MKL-DNN) has supported the use of Compute Library
for the Arm architecture to provide optimised convolution primitives
on AArch64.

This change enables the use of Compute Library in the PyTorch build.
Following the approach used to enable the use of CBLAS in MKLDNN,
it is enabled by setting the env vars USE_MKLDNN and USE_MKLDNN_ACL.
The location of the Compute Library build must be set using `ACL_ROOT_DIR`.

This is an extension of the work in https://github.com/pytorch/pytorch/pull/50400
which added support for the oneDNN/MKL-DNN backend on AArch64.

_Note: this assumes that Compute Library has been built and installed at
ACL_ROOT_DIR. Compute library can be downloaded here:
`https://github.com/ARM-software/ComputeLibrary`_

Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55913

Reviewed By: ailzhang

Differential Revision: D28559516

Pulled By: malfet

fbshipit-source-id: 29d24996097d0a54efc9ab754fb3f0bded290005
2021-05-20 07:43:56 -07:00
Xiang Gao
6c70cbedb6 step 0 of cuDNN v8 convolution API integration (#51390)
Summary:
This PR is step 0 of adding PyTorch convolution bindings using the cuDNN frontend. The cuDNN frontend is the recommended way of using the cuDNN v8 API. It is supposed to have faster release cycles, so that, for example, if people find that a specific kernel has a bug, they can report it, that kernel will be blocked in the cuDNN frontend, and frameworks can simply update the submodule without waiting for a whole cuDNN release.

The work is not complete, and this PR is only step 0.

**What this PR does:**
- Add cudnn-frontend as a submodule.
- Modify cmake to build that submodule.
- Add bindings for convolution forward in `Conv_v8.cpp`, which is disabled by a macro by default.
- Tested manually by enabling the macro and running `test_nn.py`. All tests pass except those mentioned below.

**What this PR doesn't:**
- Only convolution forward, no backward. The backward will use v7 API.
- No 64-bit-indexing support for some configurations. This is a known issue of cuDNN and will be fixed in a later cuDNN version. PyTorch will not implement any workaround for this issue; instead, the v8 API should be disabled on problematic cuDNN versions.
- No test beyond PyTorch's unit tests.
  - Not tested for correctness on real models.
  - Not benchmarked for performance.
- Benchmark cache is not thread-safe. (This is marked as `FIXME` in the code, and will be fixed in a follow-up PR)
- cuDNN benchmark is not supported.
- There are failing tests, which will be resolved later:
  ```
  FAILED test/test_nn.py::TestNNDeviceTypeCUDA::test_conv_cudnn_nhwc_cuda_float16 - AssertionError: False is not true : Tensors failed to compare as equal!With rtol=0.001 and atol=1e-05, found 32 element(s) (out of 32) whose difference(s) exceeded the margin of error (in...
  FAILED test/test_nn.py::TestNNDeviceTypeCUDA::test_conv_cudnn_nhwc_cuda_float32 - AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 32 element(s) (out of 32) whose difference(s) exceeded the margin of error (...
  FAILED test/test_nn.py::TestNNDeviceTypeCUDA::test_conv_large_cuda - RuntimeError: CUDNN_BACKEND_OPERATION: cudnnFinalize Failed cudnn_status: 9
  FAILED test/test_nn.py::TestNN::test_Conv2d_depthwise_naive_groups_cuda - AssertionError: False is not true : Tensors failed to compare as equal!With rtol=0 and atol=1e-05, found 64 element(s) (out of 64) whose difference(s) exceeded the margin of error (including 0 an...
  FAILED test/test_nn.py::TestNN::test_Conv2d_deterministic_cudnn - RuntimeError: not supported yet
  FAILED test/test_nn.py::TestNN::test_ConvTranspose2d_groups_cuda_fp32 - RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM
  FAILED test/test_nn.py::TestNN::test_ConvTranspose2d_groups_cuda_tf32 - RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM
  ```

Although this is not a complete implementation of the cuDNN v8 API binding, I still want to merge this first. This would allow me to do small, incremental work, for ease of development and review.
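
For the version gating mentioned above, the cuDNN version visible to PyTorch can be queried at runtime (a generic check, not part of this PR):

```
import torch
# cuDNN version as an integer (e.g. 8002), or None if cuDNN is unavailable;
# values >= 8000 belong to the v8 API family
print(torch.backends.cudnn.version())
```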

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51390

Reviewed By: malfet

Differential Revision: D28513167

Pulled By: ngimel

fbshipit-source-id: 9cc20c9dec5bbbcb1f94ac9e0f59b10c34f62740
2021-05-19 12:54:09 -07:00
peter
432676599c Stop installing libuv on Windows (#51936)
Summary:
Fixes #{issue number}
gunandrose4u

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51936

Reviewed By: malfet

Differential Revision: D28467662

Pulled By: seemethere

fbshipit-source-id: 28d203ee3af13d6a3158f188c2e889e310ee6010
2021-05-17 08:52:29 -07:00
Ilia Cherniavskii
6997e7bd39 Update Kineto submodule (#58179)
Summary:
Update Kineto submodule, minor api changes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58179

Test Plan: CI

Reviewed By: gdankel

Differential Revision: D28391369

Pulled By: ilia-cher

fbshipit-source-id: 61fbf63d9ec2db66fac203944679e4b99cb0d568
2021-05-13 04:03:04 -07:00
Ivan Kobzarev
111c99cdfd [vulkan] Fix glslc path for desktop build (#56507)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56507

Test Plan: Imported from OSS

Reviewed By: SS-JIA

Differential Revision: D27951058

Pulled By: IvanKobzarev

fbshipit-source-id: 29443b61264bb28ae4982ed9f4c21f1c45f6b519
2021-05-11 14:18:39 -07:00
Ilia Cherniavskii
c714596027 [kineto] Update Kineto submodule, cupti library paths (#57789)
Summary:
Update kineto submodule, improve cupti detection

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57789

Test Plan: CI

Reviewed By: ngimel

Differential Revision: D28297175

Pulled By: ilia-cher

fbshipit-source-id: 5895270fae160097ae8872a592984d0e4a1b187b
2021-05-10 19:15:59 -07:00
Ilia Cherniavskii
65fad0ebd2 Expand Kineto platform support (ci-all) (#56323)
Summary:
Expanding support to all builds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56323

Test Plan: CI

Reviewed By: malfet

Differential Revision: D28171478

Pulled By: ilia-cher

fbshipit-source-id: 16bc752d1be3cbaeda5316f5d8a687ae05a83d22
2021-05-05 15:00:01 -07:00
Nikita Shulga
133d8abbfc Compute nvrtc during libtorch build (#57579)
Summary:
The warning is completely harmless, but it is still nice not to emit it
when it could be computed.

Fixes https://github.com/pytorch/pytorch/issues/53350

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57579

Reviewed By: walterddr

Differential Revision: D28208938

Pulled By: malfet

fbshipit-source-id: 8dcc3f1bff7c5ed2c0157268c3063228d3c445b6
2021-05-04 22:51:24 -07:00
davidriazati@fb.com
264d87985a Use ld.gold by default to link in CI (#57061)
Summary:
This adds an option to CMake to use `ld.gold` to link rather than `ld` (which symlinks to `ld.bfd` on Ubuntu by default). This shouldn't change any functionality, only mildly improve link times during CI builds (it shaves off about 1 minute).

Verify by searching for `ld.gold is available` in [the logs](https://circleci.com/api/v1.1/project/github/pytorch/pytorch/13046834/output/105/0?file=true&allocation-id=608c434338107e5b6cf938a1-0-build%2F7BDA2FF1)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57061

Pulled By: driazati

Reviewed By: janeyx99

Differential Revision: D28123522

fbshipit-source-id: 5a60798ca4785427fd92bbf3b3aa5f63730e9b20
2021-05-03 10:05:36 -07:00
davidriazati@fb.com
c44cbc63cc Ignore more compiler warnings, unify WERROR options (#56630)
Summary:
This adds some more compiler warning ignores for everything that happens on a standard CPU build (CUDA builds still have a bunch of warnings, so we can't turn on `-Werror` everywhere yet).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56630

Pulled By: driazati

Reviewed By: malfet

Differential Revision: D28005063

fbshipit-source-id: 541ed415eb0470ddf7e08c22c5eb6da9db26e9a0
2021-04-29 21:20:29 -07:00
davidriazati@fb.com
4b96fc060b Remove distutils (#57040)
Summary:
[distutils](https://docs.python.org/3/library/distutils.html) is on its way out and will be deprecated-on-import for Python 3.10+ and removed in Python 3.12 (see [PEP 632](https://www.python.org/dev/peps/pep-0632/)). There's no reason for us to keep it around since all the functionality we want from it can be found in `setuptools` / `sysconfig`. `setuptools` includes a copy of most of `distutils` (which is fine to use according to the PEP), that it uses under the hood, so this PR also uses that in some places.
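
For illustration, this is the kind of replacement involved (an assumed example, not the exact diff):

```
import sysconfig

# was: distutils.sysconfig.get_python_inc()
include_dir = sysconfig.get_paths()["include"]
# was: distutils.sysconfig.get_config_var("EXT_SUFFIX")
ext_suffix = sysconfig.get_config_var("EXT_SUFFIX")
print(include_dir, ext_suffix)
```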

Fixes #56527
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57040

Pulled By: driazati

Reviewed By: nikithamalgifb

Differential Revision: D28051356

fbshipit-source-id: 1ca312219032540e755593e50da0c9e23c62d720
2021-04-29 12:10:11 -07:00
davidriazati@fb.com
d1b6383d65 Hide warnings for deprecated quantization APIs (#56291)
Summary:
These have a tracking task to actually fix them, but in the meantime they
should not be clogging up everyone's build output (see #55952).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56291

Pulled By: driazati

Reviewed By: bertmaher

Differential Revision: D27830229

fbshipit-source-id: f1e5d6e9b2c63d4a4ad99a1744a520f8c681c22b
2021-04-19 11:11:33 -07:00
Jeff Daily
e1752ffa04 [reland][ROCm] use hiprtc precompiled header (#55965)
Summary:
Revert "Revert D27449031 (2a7df657fe): [pytorch][PR] [ROCm] use hiprtc precompiled header".  Reland PR https://github.com/pytorch/pytorch/issues/54350.

This reverts commit 204ac21bf1.

The original PR was reverted under suspicion that it was causing CI instability, but it was instead due to a hardware failure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55965

Reviewed By: jbschlosser

Differential Revision: D27755907

Pulled By: malfet

fbshipit-source-id: 75bf0b9d888df3dee62f00a366b1123757e0474e
2021-04-15 15:47:56 -07:00
Jeff Daily
a128938a75 [ROCm] add MAGMA_HOME env var hint to cmake, centos-rocm Dockerfile (#54511)
Summary:
MAGMA_HOME was previously set in the ubuntu-rocm/Dockerfile.  However, this missed CentOS builds as well as any builds that do not use the CI image environments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54511

Reviewed By: jbschlosser

Differential Revision: D27755983

Pulled By: malfet

fbshipit-source-id: 1ffd2cd100f4221c2bb64e6915fa3372ee1f6247
2021-04-15 01:06:44 -07:00
Eddie Yan
81f181567a Add USE_MAGMA build flag (#55994)
Summary:
Many model pipelines/workflows don't use MAGMA even though it is included in the build by default. Leaving MAGMA kernels out of the build can save 60+MB of GPU memory when loading `libtorch_cuda.so` (tested on a V100 against current upstream master).
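
A quick way to check whether a given build actually linked MAGMA (a generic runtime check, not part of this PR):

```
import torch
# True if this PyTorch build was compiled with MAGMA support
print(torch.cuda.has_magma)
```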

A current sharp corner of this flag is that toggling it when rebuilding requires `torch/include/THC/THCGeneral.h` to be *manually* deleted by the user, as even running `make clean` or `setup.py` with `--cmake` does not properly regenerate it with the appropriate substitution for `#cmakedefine USE_MAGMA`. Is there a way to force the regeneration of the header during a rebuild?

CC malfet ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55994

Reviewed By: mruberry

Differential Revision: D27766287

Pulled By: malfet

fbshipit-source-id: 93deca57befa0febb9c5b7875ecf0015c547d421
2021-04-15 00:43:12 -07:00
Eli Uriegas
b98f011cd4 cmake: Enable (s)ccache for nccl builds (#55814)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55814

I don't really know if the original issue is resolved but let's just
check and see if this passes CI so that we can potentially get some
speed up on our builds

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D27715734

Pulled By: seemethere

fbshipit-source-id: a8f90774dfd25b0abf8e57283fe3591a8d8f3c4b
2021-04-13 14:49:25 -07:00
Alexander Golynski
204ac21bf1 Revert D27449031: [pytorch][PR] [ROCm] use hiprtc precompiled header
Test Plan: revert-hammer

Differential Revision:
D27449031 (2a7df657fe)

Original commit changeset: 81a8d7847a47

fbshipit-source-id: b7b970c8ea4110357fba3ad4d52a86fa5641d90c
2021-04-01 06:42:04 -07:00
Jeff Daily
2a7df657fe [ROCm] use hiprtc precompiled header (#54350)
Summary:
HIP's runtime compiler (hiprtc) is adding support for precompiled HIP headers in the ROCm 4.2 release.  Conditionally add support for this feature.  Using this feature will improve the ROCm torch wheel user experience; users will no longer need to install HIP headers separately to use torch JIT features.

The use of this feature is conditionalized on a new ROCM_VERSION macro.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54350

Reviewed By: H-Huang

Differential Revision: D27449031

Pulled By: malfet

fbshipit-source-id: 81a8d7847a47ce2bb253d1ea58740ef66ed154a3
2021-03-31 13:36:50 -07:00
Shruti Ramesh
f1f3c8b0fa Adding PyTorch + DNNL + AMD BLIS path (#54953)
Summary:
These changes provide the user with an additional option to choose the DNNL+BLIS path for PyTorch.

This assumes BLIS has already been downloaded or built from source, with the library file available at $BLIS_HOME/lib/libblis.so and the include files at $BLIS_HOME/include/blis/blis.h and $BLIS_HOME/include/blis/cblas.h.

Export the below variables to build PyTorch with MKLDNN+BLIS and proceed with the regular installation procedure as below:
```
export BLIS_HOME=path-to-BLIS
export PATH=$BLIS_HOME/include/blis:$PATH LD_LIBRARY_PATH=$BLIS_HOME/lib:$LD_LIBRARY_PATH
export BLAS=BLIS USE_MKLDNN_CBLAS=ON WITH_BLAS=blis
python setup.py install
```

A CPU-only Dockerfile to build PyTorch with AMD BLIS is available at docker/cpu-blis/Dockerfile.
Example command line to build using the Dockerfile:
```
sudo DOCKER_BUILDKIT=1 docker build . -t docker-image-repo-name
```
Example command line to run the built docker container:
```
sudo docker run --name container-name -it docker-image-repo-name
```

Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54953

Reviewed By: glaringlee

Differential Revision: D27466799

Pulled By: malfet

fbshipit-source-id: e03bae9561be3a67429df3b1be95a79005c63050
2021-03-31 10:40:25 -07:00
Jeff Daily
1dffbe759b [ROCm] utilize PUBLIC vs PRIVATE linking to avoid incorrect dependencies (#54727)
Summary:
Fixes the build of projects that depend on torch, such as torchaudio.  Otherwise torchaudio will complain that gloo_hip is missing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54727

Reviewed By: H-Huang

Differential Revision: D27361513

Pulled By: ezyang

fbshipit-source-id: 714cc2db23e7adf3e89303e941b78c27625b9460
2021-03-30 19:22:56 -07:00
Sam Estep
5bcbbf5373 Lint trailing newlines (#54737)
Summary:
*Context:* https://github.com/pytorch/pytorch/issues/53406 added a lint for trailing whitespace at the ends of lines. However, in order to pass FB-internal lints, that PR also had to normalize the trailing newlines in four of the files it touched. This PR adds an OSS lint to normalize trailing newlines.

The changes to the following files (made in 54847d0adb9be71be4979cead3d9d4c02160e4cd) are the only manually-written parts of this PR:

- `.github/workflows/lint.yml`
- `mypy-strict.ini`
- `tools/README.md`
- `tools/test/test_trailing_newlines.py`
- `tools/trailing_newlines.py`

I would have liked to make this just a shell one-liner like the other three similar lints, but nothing I could find quite fit the bill. Specifically, all the answers I tried from the following Stack Overflow questions were far too slow (at least a minute and a half to run on this entire repository):

- [How to detect file ends in newline?](https://stackoverflow.com/q/38746)
- [How do I find files that do not end with a newline/linefeed?](https://stackoverflow.com/q/4631068)
- [How to list all files in the Git index without newline at end of file](https://stackoverflow.com/q/27624800)
- [Linux - check if there is an empty line at the end of a file [duplicate]](https://stackoverflow.com/q/34943632)
- [git ensure newline at end of each file](https://stackoverflow.com/q/57770972)

To avoid giving false positives during the few days after this PR is merged, we should probably only merge it after https://github.com/pytorch/pytorch/issues/54967.
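
For reference, a minimal sketch of such a check (an assumption; the actual `tools/trailing_newlines.py` may differ):

```
def correct_trailing_newlines(filename):
    # a file passes if it is empty or ends with exactly one newline
    with open(filename, "rb") as f:
        contents = f.read()
    if not contents:
        return True
    return contents.endswith(b"\n") and not contents.endswith(b"\n\n")
```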

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54737

Test Plan:
Running the shell script from the "Ensure correct trailing newlines" step in the `quick-checks` job of `.github/workflows/lint.yml` should print no output and exit in a fraction of a second with a status of 0. That was not the case prior to this PR, as shown by this failing GHA workflow run on an earlier draft of this PR:

- https://github.com/pytorch/pytorch/runs/2197446987?check_suite_focus=true

In contrast, this run (after correcting the trailing newlines in this PR) succeeded:

- https://github.com/pytorch/pytorch/pull/54737/checks?check_run_id=2197553241

To unit-test `tools/trailing_newlines.py` itself (this is run as part of our "Test tools" GitHub Actions workflow):
```
python tools/test/test_trailing_newlines.py
```

Reviewed By: malfet

Differential Revision: D27409736

Pulled By: samestep

fbshipit-source-id: 46f565227046b39f68349bbd5633105b2d2e9b19
2021-03-30 13:09:52 -07:00
Michael Melesse
2620bce42a [ROCM] load only hipfft separately past rocm4.1 (#54349)
Summary:
This PR is a follow-up to https://github.com/pytorch/pytorch/pull/53408.

It only loads hipfft if the ROCm version is 4.1 or later and stops loading rocfft. This was done to resolve some issues observed in our internal CI due to conflicts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54349

Reviewed By: ezyang

Differential Revision: D27374252

Pulled By: ngimel

fbshipit-source-id: 724e80df5011ea8fabd81739e18ae8a13d3a7ea0
2021-03-26 19:54:25 -07:00
Chester Liu
6a4d2c61d5 Allow linking against vcomp on Windows (#54132)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54054

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54132

Reviewed By: zou3519

Differential Revision: D27181524

Pulled By: malfet

fbshipit-source-id: b79b34afb7edcc594d9b5907c5a7505b9cc5683b
2021-03-19 14:36:07 -07:00
Leonard Lausen
90bbe0b38b cmake: auto-detect ccache to speed up developer builds (#49389)
Summary:
https://ccache.dev/ is a compiler cache that speeds up subsequent builds. Auto-detecting ccache ensures that it is used on systems where it is available, greatly improving build times for developers. There is no risk in enabling ccache in practice. Please refer to https://ccache.dev/ for a short summary / motivation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49389

Reviewed By: ejguan

Differential Revision: D27169957

Pulled By: malfet

fbshipit-source-id: 673b60bbceb0d323901c8a992a75792c6da9b805
2021-03-18 14:20:53 -07:00
Michael Melesse
4c1af249fb [ROCM] load hipfft separately from rocfft (#53408)
Summary:
This PR makes changes to how hipfft is loaded in PyTorch. hipfft is packaged in a separate library from rocfft starting with ROCm 4.1.

We check the ROCm version, and if it is 4.1 or later we load hipfft in addition to rocfft.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53408

Reviewed By: albanD

Differential Revision: D26952702

Pulled By: malfet

fbshipit-source-id: f42be304b587c060816e39d36f5c1a2cdc37bfab
2021-03-11 09:18:33 -08:00
Sam Estep
8c798e0622 Forbid trailing whitespace (#53406)
Summary:
Context: https://github.com/pytorch/pytorch/pull/53299#discussion_r587882857

These are the only hand-written parts of this diff:
- the addition to `.github/workflows/lint.yml`
- the file endings changed in these four files (to appease FB-internal land-blocking lints):
  - `GLOSSARY.md`
  - `aten/src/ATen/core/op_registration/README.md`
  - `scripts/README.md`
  - `torch/csrc/jit/codegen/fuser/README.md`

The rest was generated by running this command (on macOS):
```
git grep -I -l ' $' -- . ':(exclude)**/contrib/**' ':(exclude)third_party' | xargs gsed -i 's/ *$//'
```

I looked over the auto-generated changes and didn't see anything that looked problematic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53406

Test Plan:
This run (after adding the lint but before removing existing trailing spaces) failed:
- https://github.com/pytorch/pytorch/runs/2043032377

This run (on the tip of this PR) succeeded:
- https://github.com/pytorch/pytorch/runs/2043296348

Reviewed By: walterddr, seemethere

Differential Revision: D26856620

Pulled By: samestep

fbshipit-source-id: 3f0de7f7c2e4b0f1c089eac9b5085a58dd7e0d97
2021-03-05 17:22:55 -08:00
Ilia Cherniavskii
795ed5ca3f Enable Kineto in CPU builds (#53174)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53174

Enable Kineto also in the CPU builds (non-mobile, non-Windows(atm))

Test Plan: CI

Reviewed By: gdankel

Differential Revision: D26776112

Pulled By: ilia-cher

fbshipit-source-id: 8733f65c2993105136c853f2a7b6e497d0fa53bf
2021-03-04 19:15:52 -08:00
Ashkan Aliabadi
e5ecd1ddf8 [Vulkan]Fix build warnings-treated-as-error on Linux. (#52781)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52781

Test Plan: Imported from OSS

Reviewed By: SS-JIA

Differential Revision: D26669311

Pulled By: AshkanAliabadi

fbshipit-source-id: 78b08d0b264d4d5cf8af964c589b9b7d0ddc7311
2021-03-03 13:48:43 -08:00
Nikita Shulga
b3c4ac6319 Fix OpenBLAS discovery (#53168)
Summary:
Fix accidental regression introduced by https://github.com/pytorch/pytorch/issues/47940

`FIND_PACKAGE(OpenBLAS)` does not validate that the discovered library can actually be used, while `check_fortran_libraries` does.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53168

Test Plan: Build PyTorch with static OpenBLAS and check that `torch.svd(torch.ones(3, 3)).S` does not raise an exception
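
Spelled out, the smoke test from the Test Plan is simply:

```
import torch
# An SVD routes through the linked LAPACK/BLAS and raises an exception
# if the discovered OpenBLAS is unusable
print(torch.svd(torch.ones(3, 3)).S)
```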

Reviewed By: walterddr

Differential Revision: D26772345

Pulled By: malfet

fbshipit-source-id: 3e4675c176b30dfe4f0490d7d3dfe4f9a4037134
2021-03-03 08:23:02 -08:00
peter
8870c391e9 Update mkl to 2020.2.254 (#52964)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/52907

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52964

Reviewed By: H-Huang

Differential Revision: D26726464

Pulled By: seemethere

fbshipit-source-id: 8f3067292e6416e299b4b040c8fb73510134f02e
2021-03-01 11:13:57 -08:00
Chen Lai
14f7bf0629 [PyTorch] update CMake to build libtorch lite (#51419)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51419

## Summary

1. Add an option `BUILD_LITE_INTERPRETER` in `caffe2/CMakeLists.txt`, set to `OFF` by default.
2. Update `build_android.sh` with an argument to switch `BUILD_LITE_INTERPRETER`, `OFF` by default.
3. Add a mini demo app `lite_interpreter_demo` linked with the `libtorch` library, which can be used for quick tests.

## Test Plan
Built the lite interpreter version of libtorch and tested with the Image Segmentation demo app ([android version](https://github.com/pytorch/android-demo-app/tree/master/ImageSegmentation)/[ios version](https://github.com/pytorch/ios-demo-app/tree/master/ImageSegmentation))

### Android
1. **Prepare model**: Prepare the lite interpreter version of the model by running the script below to generate the scripted models `deeplabv3_scripted.pt` and `deeplabv3_scripted.ptl`
```
import torch

model = torch.hub.load('pytorch/vision:v0.7.0', 'deeplabv3_resnet50', pretrained=True)
model.eval()

scripted_module = torch.jit.script(model)
# Export the full JIT version of the model (not compatible with the lite interpreter); kept here for comparison
scripted_module.save("deeplabv3_scripted.pt")
# Export lite interpreter version model (compatible with lite interpreter)
scripted_module._save_for_lite_interpreter("deeplabv3_scripted.ptl")

```
2. **Build libtorch lite for android**: Build libtorch for android for all 4 android abis (armeabi-v7a, arm64-v8a, x86, x86_64) with `BUILD_LITE_INTERPRETER=1 ./scripts/build_pytorch_android.sh`. This PR was tested on a Pixel 4 emulator with x86, so use `BUILD_LITE_INTERPRETER=1 ./scripts/build_pytorch_android.sh x86` to specify the abi and save build time. After the build finishes, it will show the library path:
```
...
BUILD SUCCESSFUL in 55s
134 actionable tasks: 22 executed, 112 up-to-date
+ find /Users/chenlai/pytorch/android -type f -name '*aar'
+ xargs ls -lah
-rw-r--r--  1 chenlai  staff    13M Feb 11 11:48 /Users/chenlai/pytorch/android/pytorch_android/build/outputs/aar/pytorch_android-release.aar
-rw-r--r--  1 chenlai  staff    36K Feb  9 16:45 /Users/chenlai/pytorch/android/pytorch_android_torchvision/build/outputs/aar/pytorch_android_torchvision-release.aar
```
3. **Use the PyTorch Android libraries built from source in the ImageSegmentation app**: Create a folder 'libs' in the path; the path from the repository root will be `ImageSegmentation/app/libs`. Copy `pytorch_android-release` to the path `ImageSegmentation/app/libs/pytorch_android-release.aar`. Copy `pytorch_android_torchvision` (downloaded from [here](https://oss.sonatype.org/#nexus-search;quick~torchvision_android)) to the path `ImageSegmentation/app/libs/pytorch_android_torchvision.aar`. Update the `dependencies` part of `ImageSegmentation/app/build.gradle` to
```
dependencies {
    implementation 'androidx.appcompat:appcompat:1.2.0'
    implementation 'androidx.constraintlayout:constraintlayout:2.0.2'
    testImplementation 'junit:junit:4.12'
    androidTestImplementation 'androidx.test.ext:junit:1.1.2'
    androidTestImplementation 'androidx.test.espresso:espresso-core:3.3.0'

    implementation(name:'pytorch_android-release', ext:'aar')
    implementation(name:'pytorch_android_torchvision', ext:'aar')

    implementation 'com.android.support:appcompat-v7:28.0.0'
    implementation 'com.facebook.fbjni:fbjni-java-only:0.0.3'
}
```
Update `allprojects` part in `ImageSegmentation/build.gradle` to
```

allprojects {
    repositories {
        google()
        jcenter()
        flatDir {
            dirs 'libs'
        }
    }
}
```
4. **Update the model loader API**: Update `ImageSegmentation/app/src/main/java/org/pytorch/imagesegmentation/MainActivity.java` by:
4.1 Adding a new import: `import org.pytorch.LiteModuleLoader;`
4.2 Replacing the way the PyTorch lite model is loaded
```
//            mModule = Module.load(MainActivity.assetFilePath(getApplicationContext(), "deeplabv3_scripted.pt"));
            mModule = LiteModuleLoader.load(MainActivity.assetFilePath(getApplicationContext(), "deeplabv3_scripted.ptl"));
```
5. **Test the app**: Build and run the ImageSegmentation app in Android Studio:
![image](https://user-images.githubusercontent.com/16430979/107696279-9cea5900-6c66-11eb-8286-4d1d68abff61.png)

### iOS
1. **Prepare model**: Same as Android.
2. **Build libtorch lite for iOS**: `BUILD_PYTORCH_MOBILE=1 IOS_PLATFORM=SIMULATOR BUILD_LITE_INTERPRETER=1 ./scripts/build_ios.sh`
3. **Remove Cocoapods from the project**: run `pod deintegrate`
4. **Link ImageSegmentation demo app with the custom built library**:
Open your project in Xcode, go to your project Target’s **Build Phases - Link Binaries With Libraries**, click the **+** sign and add all the library files located in `build_ios/install/lib`. Navigate to the project **Build Settings**, set the value **Header Search Paths** to `build_ios/install/include` and **Library Search Paths** to `build_ios/install/lib`.
In the build settings, search for **other linker flags**. Add a custom linker flag below
```
-all_load
```
Finally, disable bitcode for your target by selecting the Build Settings, searching for Enable Bitcode, and setting the value to No.

5. **Update library and API**
5.1 Update `TorchModule.mm`
To use the custom-built libraries in the project, replace `#import <LibTorch/LibTorch.h>` (in `TorchModule.mm`), which is needed when using LibTorch via Cocoapods, with the code below:

```
//#import <LibTorch/LibTorch.h>
#include "ATen/ATen.h"
#include "caffe2/core/timer.h"
#include "caffe2/utils/string_utils.h"
#include "torch/csrc/autograd/grad_mode.h"
#include "torch/script.h"
#include <torch/csrc/jit/mobile/function.h>
#include <torch/csrc/jit/mobile/import.h>
#include <torch/csrc/jit/mobile/interpreter.h>
#include <torch/csrc/jit/mobile/module.h>
#include <torch/csrc/jit/mobile/observer.h>
```
5.2 Update `ViewController.swift`
```
//        if let filePath = Bundle.main.path(forResource:
//            "deeplabv3_scripted", ofType: "pt"),
//            let module = TorchModule(fileAtPath: filePath) {
//            return module
//        } else {
//            fatalError("Can't find the model file!")
//        }
        if let filePath = Bundle.main.path(forResource:
            "deeplabv3_scripted", ofType: "ptl"),
            let module = TorchModule(fileAtPath: filePath) {
            return module
        } else {
            fatalError("Can't find the model file!")
        }
```

### Unit test
Add `test/cpp/lite_interpreter`, with one unit test `test_cores.cpp` and a light model `sequence.ptl` to test `_load_for_mobile()`, `bc.find_method()` and `bc.forward()` functions.

### Size:
**With the change:**
Android:
x86: `pytorch_android-release.aar` (**13.8 MB**)

IOS:
`pytorch/build_ios/install/lib` (lib: **66 MB**):
```
(base) chenlai@chenlai-mp lib % ls -lh
total 135016
-rw-r--r--  1 chenlai  staff   3.3M Feb 15 20:45 libXNNPACK.a
-rw-r--r--  1 chenlai  staff   965K Feb 15 20:45 libc10.a
-rw-r--r--  1 chenlai  staff   4.6K Feb 15 20:45 libclog.a
-rw-r--r--  1 chenlai  staff    42K Feb 15 20:45 libcpuinfo.a
-rw-r--r--  1 chenlai  staff    39K Feb 15 20:45 libcpuinfo_internals.a
-rw-r--r--  1 chenlai  staff   1.5M Feb 15 20:45 libeigen_blas.a
-rw-r--r--  1 chenlai  staff   148K Feb 15 20:45 libfmt.a
-rw-r--r--  1 chenlai  staff    44K Feb 15 20:45 libpthreadpool.a
-rw-r--r--  1 chenlai  staff   166K Feb 15 20:45 libpytorch_qnnpack.a
-rw-r--r--  1 chenlai  staff   384B Feb 15 21:19 libtorch.a
-rw-r--r--  1 chenlai  staff    **60M** Feb 15 20:47 libtorch_cpu.a
```
`pytorch/build_ios/install`:
```
(base) chenlai@chenlai-mp install % du -sh *
 14M	include
 66M	lib
2.8M	share
```

**Master (baseline):**
Android:
x86: `pytorch_android-release.aar` (**16.2 MB**)

IOS:
`pytorch/build_ios/install/lib` (lib: **84 MB**):
```
(base) chenlai@chenlai-mp lib % ls -lh
total 172032
-rw-r--r--  1 chenlai  staff   3.3M Feb 17 22:18 libXNNPACK.a
-rw-r--r--  1 chenlai  staff   969K Feb 17 22:18 libc10.a
-rw-r--r--  1 chenlai  staff   4.6K Feb 17 22:18 libclog.a
-rw-r--r--  1 chenlai  staff    42K Feb 17 22:18 libcpuinfo.a
-rw-r--r--  1 chenlai  staff   1.5M Feb 17 22:18 libeigen_blas.a
-rw-r--r--  1 chenlai  staff    44K Feb 17 22:18 libpthreadpool.a
-rw-r--r--  1 chenlai  staff   166K Feb 17 22:18 libpytorch_qnnpack.a
-rw-r--r--  1 chenlai  staff   384B Feb 17 22:19 libtorch.a
-rw-r--r--  1 chenlai  staff    78M Feb 17 22:19 libtorch_cpu.a
```
`pytorch/build_ios/install`:
```
(base) chenlai@chenlai-mp install % du -sh *
 14M	include
 84M	lib
2.8M	share
```

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D26518778

Pulled By: cccclai

fbshipit-source-id: 4503ffa1f150ecc309ed39fb0549e8bd046a3f9c
2021-02-21 01:43:54 -08:00
Jiakai Liu
c9c4b871a5 [pytorch] reintroduce static dispatch (#51957)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51957

This is a simplified version of #51554.

Compared to #51554, this version only supports statically dispatching to
a specific backend. The benefit is that it skips the dispatch key
computation logic and thus has less framework overhead. The downside is that
if input tensors do not match the specified backend it will throw an error
instead of falling back to regular dispatch.

Sample code:
```
Tensor empty(IntArrayRef size, TensorOptions options, c10::optional<MemoryFormat> memory_format) {
    return at::cpu::empty(size, options, memory_format);
}

// aten::conj(Tensor(a) self) -> Tensor(a)
Tensor conj(const Tensor & self) {
    return at::math::conj(self);
}

// aten::conj.out(Tensor self, *, Tensor(a!) out) -> Tensor(a!)
Tensor & conj_out(Tensor & out, const Tensor & self) {
    return at::cpu::conj_out(out, self);
}

// aten::conj.out(Tensor self, *, Tensor(a!) out) -> Tensor(a!)
Tensor & conj_outf(const Tensor & self, Tensor & out) {
    return at::cpu::conj_out(out, self);
}

// aten::_conj(Tensor self) -> Tensor
Tensor _conj(const Tensor & self) {
    return at::defaultbackend::_conj(self);
}
```

For ops without the specific backend dispatch, it will throw an error:
```
// aten::_use_cudnn_ctc_loss(Tensor log_probs, Tensor targets, int[] input_lengths, int[] target_lengths, int blank) -> bool
bool _use_cudnn_ctc_loss(const Tensor & log_probs, const Tensor & targets, IntArrayRef input_lengths, IntArrayRef target_lengths, int64_t blank) {
    TORCH_CHECK(false, "Static dispatch does not support _use_cudnn_ctc_loss for CPU.");
}
```

Differential Revision: D26337857

Test Plan: Imported from OSS

Reviewed By: bhosmer

Pulled By: ljk53

fbshipit-source-id: a8e95799115c349de3c09f04a26b01d21a679364
2021-02-19 11:41:39 -08:00
Nikita Shulga
08017f4598 Add explicit cudart_static dependency for cublas_static (#52509)
Summary:
Fixes the following error during static linking by enforcing that the cudart dependency is placed after cublasLt:
```
/usr/bin/ld: /usr/local/cuda/lib64/libcublasLt_static.a(libcublasLt_static.a.o): undefined reference to symbol 'cudaStreamWaitEvent@libcudart.so.11.0'
/usr/local/cuda/lib64/libcudart.so: error adding symbols: DSO missing from command line
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52509

Reviewed By: janeyx99

Differential Revision: D26547622

Pulled By: malfet

fbshipit-source-id: 4e17f18cf0ab5479a549299faf2583a79fbda4b9
2021-02-19 10:45:49 -08:00
David Kyle
dbeda994db Update FindvecLib.cmake for macOS 10.14, 10.15 and Big Sur (#51288)
Summary:
When compiling libtorch on macOS there is the option to use the `vecLib` BLAS library from Apple's [Accelerate](https://developer.apple.com/documentation/accelerate) framework. Recent versions of macOS have changed the location of veclib.h; this change adds the new locations to `FindvecLib.cmake`.

To test, run the following command:
```
BLAS=vecLib python setup.py install --cmake --cmake-only
```

The choice of BLAS library is confirmed in the output:
```
-- Trying to find preferred BLAS backend of choice: vecLib
-- Found vecLib: /Library/Developer/CommandLineTools/SDKs/MacOSX10.15.sdk/System/Library/Frameworks/Accelerate.framework/Versions/Current/Frameworks/vecLib.framework/Versions/Current/Headers
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51288

Reviewed By: jbschlosser

Differential Revision: D26531136

Pulled By: malfet

fbshipit-source-id: ce86807ccbf66973f33b3acb99b7f40cfd182b9b
2021-02-19 08:04:10 -08:00
Nikita Shulga
de4c9ecc35 Fix libnvrtc discoverability in package patched by auditwheel (#52184)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52184

`auditwheel` inserts the first 8 characters of the sha256 checksum of the library into its name before relocating it into the wheel package. This change adds logic for computing the same short sha sum and embedding it into LazyNVRTC as an alternative name for libnvrtc.so.

Fixes https://github.com/pytorch/pytorch/issues/52075
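
For illustration, the short hash in question can be computed like this (a sketch of the naming scheme, not the actual C++ logic in LazyNVRTC; the file path is hypothetical):

```
import hashlib

def short_sha256(path, length=8):
    # first `length` hex characters of the file's SHA-256 digest
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()[:length]

# auditwheel renames the bundled library using such a suffix,
# e.g. libnvrtc.so.11.0 -> libnvrtc-<shorthash>.so.11.0
print(short_sha256("/usr/local/cuda/lib64/libnvrtc.so"))
```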

Test Plan: Imported from OSS

Reviewed By: seemethere

Differential Revision: D26417403

Pulled By: malfet

fbshipit-source-id: e366dd22e95e219979f6c2fa39acb11585b34c72
2021-02-13 19:38:27 -08:00
Nikita Shulga
bf841b25e4 [cmake] Add explicit cublas->cudart dependency (#52243)
Summary:
Necessary to ensure correct link order, especially if libraries are
linked statically. Otherwise, one might run into:
```
/usr/bin/ld: /usr/local/cuda/lib64/libcublasLt_static.a(libcublasLt_static.a.o): undefined reference to symbol 'cudaStreamWaitEvent@libcudart.so.11.0'
/usr/local/cuda/lib64/libcudart.so: error adding symbols: DSO missing from command line
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52243

Reviewed By: seemethere, ngimel

Differential Revision: D26437159

Pulled By: malfet

fbshipit-source-id: 33b8bb5040bda10537833f3ad737f535488452ea
2021-02-13 18:21:33 -08:00
Chen Lai
ba7a2f6513 Add debug helper function to check target property (#52093)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52093

# Summary
The previous debug if statement only prints the file list, but it's not clear whether the target includes the file list correctly. This function can examine the target, so it's more accurate. This PR includes the following changes:
1. Add a file `DebugHelper.cmake` with a `print_target_properties` function.
2. Replace the debug if statement `if(FALSE)` by adding the variable `PRINT_CMAKE_DEBUG_INFO` and applying it accordingly.

Note: example output of the previous debug if statement:
```
-- CPU sources:
--   /Users/chenlai/pytorch/aten/src/ATen/BatchedFallback.cpp
--   /Users/chenlai/pytorch/aten/src/ATen/BatchedTensorImpl.cpp
--   /Users/chenlai/pytorch/aten/src/ATen/BatchingRegistrations.cpp
--   /Users/chenlai/pytorch/aten/src/ATen/CPUGeneratorImpl.cpp
--   /Users/chenlai/pytorch/aten/src/ATen/Context.cpp
--   /Users/chenlai/pytorch/aten/src/ATen/DLConvertor.cpp
--   /Users/chenlai/pytorch/aten/src/ATen/DynamicLibrary.cpp
--   /Users/chenlai/pytorch/aten/src/ATen/ExpandUtils.cpp
--   /Users/chenlai/pytorch/aten/src/ATen/LegacyTHFunctionsCPU.cpp
--   /Users/chenlai/pytorch/aten/src/ATen/MemoryOverlap.cpp
...
-- GPU sources:
-- CPU include:
--   /Users/chenlai/pytorch/build_android/caffe2/aten/src/TH
--   /Users/chenlai/pytorch/aten/src/TH
--   /Users/chenlai/pytorch/aten/src
--   /Users/chenlai/pytorch/build_android/caffe2/aten/src
...
-- GPU include:
--   /Users/chenlai/pytorch/build_android/caffe2/aten/src/TH
--   /Users/chenlai/pytorch/aten/src/TH
--   /Users/chenlai/pytorch/build_android/caffe2/aten/src/TH
--   /Users/chenlai/pytorch/aten/src/TH
```

# Test plan
Set `PRINT_CMAKE_DEBUG_INFO` to true by adding `-DPRINT_CMAKE_DEBUG_INFO` in `./scripts/build_pytorch_android.sh`, then run `./scripts/build_pytorch_android.sh x86`

`print_target_properties(torch)` shows
```
torch ANDROID_ARCH = x86
torch ANDROID_STL_TYPE = c++_static
torch ARCHIVE_OUTPUT_DIRECTORY = /Users/chenlai/pytorch/build_android_x86/lib
torch AUTOGEN_ORIGIN_DEPENDS = ON
torch AUTOMOC_COMPILER_PREDEFINES = ON
torch AUTOMOC_MACRO_NAMES = Q_OBJECT;Q_GADGET;Q_NAMESPACE;Q_NAMESPACE_EXPORT
torch AUTOMOC_PATH_PREFIX = OFF
torch BINARY_DIR = /Users/chenlai/pytorch/build_android_x86/caffe2
torch BINARY_DIR = /Users/chenlai/pytorch/build_android_x86/caffe2
torch BUILD_WITH_INSTALL_RPATH = FALSE
torch CXX_STANDARD = 14
torch C_STANDARD = 11
torch IMPORTED = FALSE
torch IMPORTED_GLOBAL = FALSE
torch INCLUDE_DIRECTORIES = /Users/chenlai/pytorch/build_android_x86/aten/src;/Users/chenlai/pytorch/aten/src;/Users/chenlai/pytorch/build_android_x86;/Users/chenlai/pytorch;/Users/chenlai/pytorch/third_party/XNNPACK/include;/Users/chenlai/Library/Android/sdk/ndk/21.3.6528147/sources/third_party/vulkan/src/common;/Users/chenlai/pytorch/cmake/../third_party/eigen;/Users/chenlai/pytorch/cmake/../third_party/pybind11/include
torch INCLUDE_DIRECTORIES = /Users/chenlai/pytorch/build_android_x86/aten/src;/Users/chenlai/pytorch/aten/src;/Users/chenlai/pytorch/build_android_x86;/Users/chenlai/pytorch;/Users/chenlai/pytorch/third_party/XNNPACK/include;/Users/chenlai/Library/Android/sdk/ndk/21.3.6528147/sources/third_party/vulkan/src/common;/Users/chenlai/pytorch/cmake/../third_party/eigen;/Users/chenlai/pytorch/cmake/../third_party/pybind11/include
torch INCLUDE_DIRECTORIES = /Users/chenlai/pytorch/build_android_x86/aten/src;/Users/chenlai/pytorch/aten/src;/Users/chenlai/pytorch/build_android_x86;/Users/chenlai/pytorch;/Users/chenlai/pytorch/third_party/XNNPACK/include;/Users/chenlai/Library/Android/sdk/ndk/21.3.6528147/sources/third_party/vulkan/src/common;/Users/chenlai/pytorch/cmake/../third_party/eigen;/Users/chenlai/pytorch/cmake/../third_party/pybind11/include
torch INSTALL_RPATH = $ORIGIN
torch INSTALL_RPATH_USE_LINK_PATH = TRUE
torch INTERFACE_LINK_LIBRARIES = torch_cpu_library
torch ISPC_HEADER_SUFFIX = _ispc.h
torch LIBRARY_OUTPUT_DIRECTORY = /Users/chenlai/pytorch/build_android_x86/lib
torch LINK_LIBRARIES = torch_cpu_library
torch NAME = torch
torch PCH_INSTANTIATE_TEMPLATES = ON
torch PCH_WARN_INVALID = ON
torch POSITION_INDEPENDENT_CODE = TRUE
torch RUNTIME_OUTPUT_DIRECTORY = /Users/chenlai/pytorch/build_android_x86/bin
torch SKIP_BUILD_RPATH = FALSE
torch SOURCES = /Users/chenlai/pytorch/build_android_x86/empty.cpp
torch SOURCE_DIR = /Users/chenlai/pytorch/caffe2
torch SOURCE_DIR = /Users/chenlai/pytorch/caffe2
torch TYPE = STATIC_LIBRARY
torch TYPE = STATIC_LIBRARY
torch UNITY_BUILD_BATCH_SIZE = 8
torch UNITY_BUILD_MODE = BATCH
```

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D26377725

Pulled By: cccclai

fbshipit-source-id: dbe21ad533759f33711a0ce5328205bbcd5cf0f3
2021-02-11 15:37:14 -08:00
Jeff Daily
d02ea9a141 [ROCm] add hipMAGMA support (#51238)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48831.

- CI image is updated to build hipMAGMA from source and set env MAGMA_HOME.
- CMake is updated to separate different requirements for CUDA versus ROCm MAGMA.
- Some unit tests that become enabled with MAGMA are currently skipped for ROCm due to failures.  Fixing these failures will be follow-on work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51238

Reviewed By: ngimel

Differential Revision: D26184918

Pulled By: malfet

fbshipit-source-id: ada632f1ae7b413e8cae6543fe931dcd46985821
2021-02-01 22:09:33 -08:00
Jane Xu
88af2149e1 Add build option to split torch_cuda library into torch_cuda_cu and torch_cuda_cpp (#49050)
Summary:
Because of the size of our `libtorch_cuda.so`, linking it with other hefty binaries presents a problem where 32-bit relocation markers are too small and end up overflowing. This PR attempts to break up `torch_cuda` into `torch_cuda_cu` and `torch_cuda_cpp`.

`torch_cuda_cu`: all the files previously in `Caffe2_GPU_SRCS` that are
* pure `.cu` files in `aten`
* all the BLAS files
* all the THC files, except for THCAllocator.cpp, THCCachingHostAllocator.cpp and THCGeneral.cpp
* all files in `detail`
* LegacyDefinitions.cpp and LegacyTHFunctionsCUDA.cpp
* Register*CUDA.cpp
* CUDAHooks.cpp
* CUDASolver.cpp
* TensorShapeCUDA.cpp

`torch_cuda_cpp`: all other files in `Caffe2_GPU_SRCS`

Accordingly, TORCH_CUDA_API and TORCH_CUDA_BUILD_MAIN_LIB usages are split as well into TORCH_CUDA_CU_API and TORCH_CUDA_CPP_API.

To test this locally, you can run `export BUILD_SPLIT_CUDA=ON && python setup.py develop`. In your `build/lib` folder, you should find binaries for both `torch_cuda_cpp` and `torch_cuda_cu`. To see that the SPLIT_CUDA option was toggled, you can grep the Summary of running cmake and make sure `Split CUDA` is ON.

This build option is tested on CI for CUDA 11.1 builds (linux for now, but windows soon).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49050

Reviewed By: walterddr

Differential Revision: D26114310

Pulled By: janeyx99

fbshipit-source-id: 0180f2519abb5a9cdde16a6fb7dd3171cff687a6
2021-02-01 18:42:35 -08:00
Luca Wehrstedt
b77f72b5a0 Enable TensorPipe's SHM transport (#50760)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50760

The SHM transport uses shared-memory-backed ringbuffers to transfer small payloads between processes on the same machine.

It was disabled in v1.6 due to a CMake mishap but we've since realized that it also doesn't work that well in docker and other setups. Enabling it here to see whether CircleCI fails.
ghstack-source-id: 120470890

Test Plan: Exported three times to CircleCI with tests consistently passing

Reviewed By: mrshenli

Differential Revision: D23814828

fbshipit-source-id: f355cb6515776debad536924de4f4d3fbb05a874
2021-01-27 11:45:09 -08:00
Will Constable
4bbff92014 Refactor build targets for torch::deploy (#50288)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50288

torch::deploy will bundle the objects contained in libtorch-python together with frozenpython into a shared library.  Therefore, the libtorch-python objs can't bring with them a dependency on system python.

Buck TARGETS are added throughout the caffe2 tree to make available objects or headers that will be needed by torch::deploy but would have brought unsuitable dependencies if accessed using existing targets.

CMakeLists are modified to separate out a torch-python-objs object library, which lets torch::deploy compile these objs with the same compile flags as libtorch_python used, but without some of the link-time dependencies such as python.

CudaIPCTypes is moved from libtorch_python to libtorch_cuda because it is really not a python binding, and it statically registers a cuda_ipc_callback which would be duplicated if included in each copy of torch::deploy.

Test Plan: no new functionality, just ensure existing tests continue to pass

Reviewed By: malfet

Differential Revision: D25850785

fbshipit-source-id: b0b81c050cbee04e9de96888f8a09d29238a9db8
2021-01-22 09:16:32 -08:00
Jeff Daily
b2e5617553 [ROCm] rename HIP_HCC_FLAGS to HIP_CLANG_FLAGS (#50917)
Summary:
ROCm 3.5 replaced hcc with hip-clang and deprecated HIP_HCC_FLAGS.
HIP_CLANG_FLAGS should be used moving forward. HIP_HCC_FLAGS will
be removed soon.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50917

Reviewed By: ejguan

Differential Revision: D26008094

Pulled By: walterddr

fbshipit-source-id: cfec4f96fbd9bd338834a841c37267f6a4703cab
2021-01-22 07:24:05 -08:00
Ilia Cherniavskii
e34992ebee Set USE_KINETO=1 (#49897)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49897

Resend of https://github.com/pytorch/pytorch/pull/49201

Test Plan: see 49201

Reviewed By: malfet

Differential Revision: D25717102

Pulled By: ilia-cher

fbshipit-source-id: 5e794a7f5fe160ca64ac9d190c4fd3e8f1e443e6
2021-01-22 00:09:21 -08:00
Luca Wehrstedt
112a583467 Enable TensorPipe's CMA channel (#50759)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50759

ghstack-source-id: 120032288

Test Plan: Exported to CircleCI and tested

Reviewed By: mrshenli

Differential Revision: D25959326

fbshipit-source-id: be6df209ff3a79a8961acbda64ee7805a5c434a9
2021-01-20 10:53:47 -08:00
Mark Final
e1bb476980 Issue #48724. Only set the CMake IMPORTED_LOCATION property in static… (#49173)
Summary:
… library builds, as it is already set in shared library builds from the target that was imported from Caffe2.

This was identified on Windows builds when PyTorch was built in shared Release mode, and a testapp was built with RelWithDebInfo in CMake.
The problem appeared to be that, because IMPORTED_LOCATION (in TorchConfig.cmake) and IMPORTED_LOCATION_RELEASE (in Caffe2Targets.cmake) were both set, the build became confused as to which was correct. The symptom is the error:

ninja: error: 'torch-NOTFOUND', needed by 'test_pytorch.exe', missing and no known rule to make it

in a noddy consuming test application.

Fixes https://github.com/pytorch/pytorch/issues/48724

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49173

Reviewed By: malfet

Differential Revision: D25974151

Pulled By: ezyang

fbshipit-source-id: 3454c0d29cbbe7a37608beedaae3efbb624b0479
2021-01-20 09:23:27 -08:00
Rong Rong (AI Infra)
ebd142e94b initial commit to enable fast_nvcc (#49773)
Summary:
Draft enablement of fast_nvcc:
* cleaned up some non-standard usages
* added a fallback to wrap_nvcc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49773

Test Plan:
Configuration to enable fast nvcc:
  - install and enable `ccache` but delete `.ccache/` folder before each build.
  - `TORCH_CUDA_ARCH_LIST=6.0;6.1;6.2;7.0;7.5`
  - Toggling `USE_FAST_NVCC=ON/OFF` cmake config and run `cmake --build` to verify the build time.

Initial statistic for a full compilation:
* `cmake --build . -- -j $(nproc)`:
  - fast NVCC
```
        real    48m55.706s
        user    1559m14.218s
        sys     318m41.138s
```
  - normal NVCC:
```
        real    43m38.723s
        user    1470m28.131s
        sys     90m46.879s
```
* `cmake --build . -- -j $(nproc/4)`:
  - fast NVCC:
```
        real    53m44.173s
        user    1130m18.323s
        sys     71m32.385s
```
  - normal  NVCC:
```
        real    81m53.768s
        user    858m45.402s
        sys     61m15.539s
```
* Conclusion: fast NVCC doesn't provide much gain when the compiler is set to use full CPU utilization; in fact it is **even worse** because of the thread switching.

Initial statistics for a partial recompile (editing .cu files)

* `cmake --build . -- -j $(nproc)`
  - fast NVCC:
```
[2021-01-13 18:10:24] [ 86%] Building NVCC (Device) object caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/torch_cuda_generated_BinaryMiscOpsKernels.cu.o
[2021-01-13 18:11:08] [ 86%] Linking CXX shared library ../lib/libtorch_cuda.so
```
  - normal NVCC:
```
[2021-01-13 17:35:40] [ 86%] Building NVCC (Device) object caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/torch_cuda_generated_BinaryMiscOpsKernels.cu.o
[2021-01-13 17:38:08] [ 86%] Linking CXX shared library ../lib/libtorch_cuda.so
```
* Conclusion: Effective compilation time for a single CU file modification was reduced from 2min30sec to only 40sec when compiling for multiple architectures. This shows a **4X** speedup using fast NVCC -- approaching the theoretical limit of 5X when compiling 5 gencode architectures at the same time.

Follow up PRs:
- add a better fallback mechanism to detect whether a build is supported by fast_nvcc, instead of dry-running and then failing over to wrap_nvcc.
- add performance measurement instrumentation to compare the total compile time vs. the parallel tasks' critical-path time.
- figure out why `-j $(nproc)` gives significant sys overhead (`sys 318m41.138s` vs `sys 90m46.879s`) over normal nvcc; the guess is context switching, but not exactly sure.

Reviewed By: malfet

Differential Revision: D25692758

Pulled By: walterddr

fbshipit-source-id: c244d07b9b71f146e972b6b3682ca792b38c4457
2021-01-19 14:50:54 -08:00
Richard Barnes
30a8ba93b1 Remove a blacklist reference (#50477)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50477

See task for context

Test Plan: Sandcastle+OSS tests

Reviewed By: xush6528

Differential Revision: D25893906

fbshipit-source-id: c9b86d0292aa751597d75e8d1b53f99b99c924b9
2021-01-13 13:39:06 -08:00
Nathan John Sircombe
664126bab5 Enables build with oneDNN (MKL-DNN) on AArch64 (#50400)
Summary:
Since version 1.6, oneDNN has provided limited support for AArch64 builds.

This minor change is to detect an AArch64 CPU and permit the use of
`USE_MKLDNN` in that case.

Build flags for oneDNN are also modified accordingly.

Note: oneDNN on AArch64, by default, will use oneDNN's reference C++ kernels.
These are not optimised for AArch64, but oneDNN v1.7 onwards provides support
for a limited set of primitives based on the Arm Compute Library.
See: https://github.com/oneapi-src/oneDNN/pull/795
and: https://github.com/oneapi-src/oneDNN/pull/820
for more details. Support for ACL-based oneDNN primitives in PyTorch
will require some further modification.
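
A quick runtime sanity check (generic, not part of this change) of whether the oneDNN backend made it into a given build:

```
import torch
# True if this PyTorch build includes the oneDNN (MKL-DNN) backend
print(torch.backends.mkldnn.is_available())
```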

Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50400

Reviewed By: izdeby

Differential Revision: D25886589

Pulled By: malfet

fbshipit-source-id: 2c81277a28ad4528c2d2211381e7c6692d952bc1
2021-01-13 08:41:44 -08:00
Gemfield
deba3bd1d0 Fix TORCH_LIBRARIES variables when do static build (#49458)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/21737

With this fix, the TORCH_LIBRARIES variable provides all necessary static libraries built from the pytorch repo.
A user program (when doing a static build) can now just link with ${TORCH_LIBRARIES} + MKL + the CUDA runtime.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49458

Reviewed By: mrshenli

Differential Revision: D25895354

Pulled By: malfet

fbshipit-source-id: 8ff47d14ae1f90036522654d4354256ed5151e5c
2021-01-13 07:56:27 -08:00
Chester Liu
bee6b0be58 Fix warning when running scripts/build_ios.sh (#49457)
Summary:
* Fixes `cmake implicitly converting 'string' to 'STRING' type`
* Fixes `clang: warning: argument unused during compilation: '-mfpu=neon-fp16' [-Wunused-command-line-argument]`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49457

Reviewed By: zhangguanheng66

Differential Revision: D25871014

Pulled By: malfet

fbshipit-source-id: fa0c181ae7a1b8668e47f5ac6abd27a1c735ffce
2021-01-11 19:31:32 -08:00
Jane Xu
c2d37cd990 Change CMake config to enable universal binary for Mac (#50243)
Summary:
This PR is a step towards enabling cross compilation from x86_64 to arm64.

The following has been added:
1. When cross compilation is detected, compile a local universal fatfile to use as protoc.
2. For the simple compile check in MiscCheck.cmake, make sure to compile the small snippet as a universal binary in order to run the check.

**Test plan:**

Kick off a minimal build on a mac intel machine with the macOS 11 SDK with this command:
```
CMAKE_OSX_ARCHITECTURES=arm64 USE_MKLDNN=OFF USE_QNNPACK=OFF USE_PYTORCH_QNNPACK=OFF BUILD_TEST=OFF USE_NNPACK=OFF python setup.py install
```
(If you run the above command before this change, or without macOS 11 SDK set up, it will fail.)

Then check the platform of the built binaries using this command:
```
lipo -info build/lib/libfmt.a
```
Output:
- Before this PR, running a regular build via `python setup.py install` (instead of using the flags listed above):
  ```
  Non-fat file: build/lib/libfmt.a is architecture: x86_64
  ```
- Using this PR:
  ```
  Non-fat file: build/lib/libfmt.a is architecture: arm64
  ```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50243

Reviewed By: malfet

Differential Revision: D25849955

Pulled By: janeyx99

fbshipit-source-id: e9853709a7279916f66aa4c4e054dfecced3adb1
2021-01-08 17:26:08 -08:00
Nikita Shulga
7b4a7661d6 Make PyTorch partially cross-compilable for Apple M1 (#49701)
Summary:
Update CPUINFO to include https://github.com/pytorch/cpuinfo/pull/51
Update sleef to include https://github.com/shibatch/sleef/pull/376
Modify aten/src/ATen/native/quantized/cpu/qnnpack/CMakeLists.txt to recognize CMAKE_OSX_ARCHITECTURES

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49701

Test Plan: `cmake -DCMAKE_OSX_ARCHITECTURES=x86_64 -DPYTHON_EXECUTABLE=/usr/bin/python3  -DUSE_XNNPACK=NO -DBUILD_TEST=YES .. -G Ninja; ninja basic` finishes successfully on Apple M1

Reviewed By: janeyx99

Differential Revision: D25669219

Pulled By: malfet

fbshipit-source-id: 5ee36b64e3a7ac76448f2a300ac4993375a26de5
2020-12-22 09:33:12 -08:00
Jane Xu
71ca600af9 Renaming CAFFE2_API to TORCH_API (#49496)
Summary:
Since caffe2 and torch have been consolidated, CAFFE2_API should be merged with TORCH_API. Addresses a TODO.

Manually edited some references of the removed `CAFFE2_API`:
* `CONTRIBUTING.md`
* `caffe2/proto/CMakeLists.txt`
* `cmake/ProtoBuf.cmake`
* `c10/macros/Export.h`
* `torch/csrc/WindowsTorchApiMacro.h`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49496

Reviewed By: malfet, samestep

Differential Revision: D25600726

Pulled By: janeyx99

fbshipit-source-id: 7e068d959e397ac183c097d7e9a9afeca5ddd782
2020-12-18 10:54:50 -08:00
Ilia Cherniavskii
72b00a8a52 Revert D25480770: Set USE_KINETO=1
Test Plan: revert-hammer

Differential Revision:
D25480770 (1a92802bde)

Original commit changeset: 037cd774f554

fbshipit-source-id: 6a6062195033ca91fcc0cfa1e890e47efc774ac1
2020-12-18 07:06:28 -08:00
Ilia Cherniavskii
1a92802bde Set USE_KINETO=1 (#49201)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49201

This unblocks the Kineto profiler for the 1.8 release.
This PR supersedes https://github.com/pytorch/pytorch/pull/48391
Note: this will somewhat increase the size of Linux server binaries, because
we add libkineto.a and libcupti_static.a:
-rw-r--r-- 1 jenkins jenkins 1107502 Dec 10 21:16 build/lib/libkineto.a
-rw-r--r-- 1 root root 13699658 Nov 13  2019 /usr/local/cuda/lib64/libcupti_static.a

Test Plan:
CI
https://github.com/pytorch/pytorch/pull/48391

Imported from OSS

Reviewed By: ngimel

Differential Revision: D25480770

fbshipit-source-id: 037cd774f5547d9918d6055ef5cc952a54e48e4c
2020-12-18 01:48:10 -08:00
Rong Rong
88b3d3371b add additional arm64 checker in cmake files (#48952)
Summary:
Tentatively fixes https://github.com/pytorch/pytorch/issues/48873
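
The linked issue is about arm64 hosts being misdetected; such a checker amounts to a guard of roughly this shape (the flag name is hypothetical):

```
# Treat both the Linux "aarch64" and Apple "arm64" spellings as 64-bit ARM.
if(CMAKE_SYSTEM_PROCESSOR MATCHES "^(aarch64|arm64)$")
  set(CPU_AARCH64 ON)   # hypothetical flag consumed elsewhere in the build
else()
  set(CPU_AARCH64 OFF)
endif()
```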

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48952

Reviewed By: H-Huang

Differential Revision: D25463266

Pulled By: walterddr

fbshipit-source-id: 40afefffe8ab98ae7261c770316cb9c25225285f
2020-12-11 08:10:09 -08:00
Abdelrauf
95a1725a4a Vsx initial support issue27678 (#41541)
Summary:
### Pytorch Vec256 ppc64le support
Implemented types:

- double
- float
- int16
- int32
- int64
- qint32
- qint8
- quint8
- complex_float
- complex_double

Notes:
All basic vector operations are implemented. There are a few known problems:
- minimum/maximum NaN propagation for ppc64le is missing and was not checked
- complex multiplication, division, sqrt, and abs are implemented the same way as on PyTorch's x86 side; they can overflow and have more precision problems than the std ones, which is why they were either excluded or tested over a smaller domain range
- the precision of the implemented float math functions

~~Besides, I added CPU_CAPABILITY for power, but because of quantization errors for DEFAULT I had to undef and use VSX for DEFAULT too~~

#### Details
##### Supported math functions

Legend:
- `+` means vectorized; `-` means missing (implementation notes are added inside parentheses). Example: `-(both)` means it was also missing on the x86 side
- `f(func_name)` means the vectorization is implemented in terms of func_name
- `sleef` means the call is redirected to the Sleef library
- `unsupported` means the operation is not supported for that type

function_name | float | double | complex float | complex double
|-- | -- | -- | -- | --|
acos | sleef | sleef | f(asin) | f(asin)
asin | sleef | sleef | +(pytorch impl) | +(pytorch impl)
atan | sleef | sleef | f(log) | f(log)
atan2 | sleef | sleef | unsupported | unsupported
cos | +((ppc64le:avx_mathfun) ) | sleef | -(both) | -(both)
cosh | f(exp)   | -(both) | -(both) |
erf | sleef | sleef | unsupported | unsupported
erfc | sleef | sleef | unsupported | unsupported
erfinv | - (both) | - (both) | unsupported | unsupported
exp | + | sleef | - (x86:f()) | - (x86:f())
expm1 | f(exp)  | sleef | unsupported | unsupported
lgamma | sleef | sleef |   |
log | +  | sleef | -(both) | -(both)
log10 | f(log)  | sleef | f(log) | f(log)
log1p | f(log)  | sleef | unsupported | unsupported
log2 | f(log)  | sleef | f(log) | f(log)
pow | + f(exp)  | sleef | -(both) | -(both)
sin | +((ppc64le:avx_mathfun) ) | sleef | -(both) | -(both)
sinh | f(exp)  | sleef | -(both) | -(both)
tan | sleef | sleef | -(both) | -(both)
tanh | f(exp)  | sleef | -(both) | -(both)
hypot | sleef | sleef | -(both) | -(both)
nextafter | sleef  | sleef | -(both) | -(both)
fmod | sleef | sleef | -(both) | -(both)

[Vec256 Test cases Pr https://github.com/pytorch/pytorch/issues/42685](https://github.com/pytorch/pytorch/pull/42685)
Current list:

- [x] Blends
- [x] Memory: UnAlignedLoadStore
- [x] Arithmetics: Plus,Minu,Multiplication,Division
- [x] Bitwise: BitAnd, BitOr, BitXor
- [x] Comparison: Equal, NotEqual, Greater, Less, GreaterEqual, LessEqual
- [x] MinMax: Minimum, Maximum, ClampMin, ClampMax, Clamp
- [x] SignManipulation: Absolute, Negate
- [x] Interleave: Interleave, DeInterleave
- [x] Rounding: Round, Ceil, Floor, Trunc
- [x] Mask: ZeroMask
- [x] SqrtAndReciprocal: Sqrt, RSqrt, Reciprocal
- [x] Trigonometric: Sin, Cos, Tan
- [x] Hyperbolic: Tanh, Sinh, Cosh
- [x] InverseTrigonometric: Asin, ACos, ATan, ATan2
- [x] Logarithm: Log, Log2, Log10, Log1p
- [x] Exponents: Exp, Expm1
- [x] ErrorFunctions: Erf, Erfc, Erfinv
- [x] Pow: Pow
- [x] LGamma: LGamma
- [x] Quantization: quantize, dequantize, requantize_from_int
- [x] Quantization: widening_subtract, relu, relu6
Missing:
- [ ] Constructors, initializations
- [ ] Conversion , Cast
- [ ] Additional: imag, conj, angle (note: imag and conj only checked for float complex)

#### Notes on tests and testing framework
- some math functions are tested within a restricted domain range
- the testing framework mostly tests randomly against the std implementation, within the general domain or, for some math functions, within the implementation's domain
- some functions are tested against a local reference version. ~~For example, std::round and the vector version of round differ, so round was tested against the local version~~
- round was tested against PyTorch's at::native::round_impl. ~~For the double type on **VSX, vec_round failed for (even)+0.5 values**~~; this was solved by using vec_rint
- ~~**complex types are not tested**~~ **After enabling complex testing, some of the complex functions failed for VSX and x86 AVX as well, due to precision and domain issues. They will either be tested against a local implementation or checked within the accepted domain**
- ~~quantizations are not tested~~ Added tests for the quantize, dequantize, requantize_from_int, relu, relu6, and widening_subtract functions
- the testing framework should be improved further
- ~~For now, `-DBUILD_MOBILE_TEST=ON` will be used for Vec256Test too~~
  Vec256 test cases will be built for each CPU_CAPABILITY

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41541

Reviewed By: zhangguanheng66

Differential Revision: D23922049

Pulled By: VitalyFedyunin

fbshipit-source-id: bca25110afccecbb362cea57c705f3ce02f26098
2020-12-10 13:42:39 -08:00
peterjc123
5450614cf6 Correctly apply WIN32_LEAN_AND_MEAN to the whole repo (#49025)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48895
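
A minimal sketch of one way to apply the macro repo-wide from CMake (the actual PR may wire it through the build differently):

```
# Keep <windows.h> lean for every translation unit in the build.
if(WIN32)
  add_definitions(-DWIN32_LEAN_AND_MEAN)
endif()
```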

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49025

Reviewed By: zhangguanheng66

Differential Revision: D25399912

Pulled By: ezyang

fbshipit-source-id: 9b7225b0e43511e0b8981c39035d814a4406c523
2020-12-08 19:38:23 -08:00
Rong Rong
b89c328493 Add fftw3 cmake as alternative for FFT/DFT (#48808)
Summary:
Added CMake discovery in Dependencies.cmake for fftw3.
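
A hedged sketch of what such FFTW3 discovery can look like in Dependencies.cmake; variable names are illustrative:

```
# Look for the FFTW3 headers and library as an alternative FFT backend.
find_path(FFTW3_INCLUDE_DIR NAMES fftw3.h)
find_library(FFTW3_LIBRARY NAMES fftw3)
if(FFTW3_INCLUDE_DIR AND FFTW3_LIBRARY)
  message(STATUS "Found FFTW3: ${FFTW3_LIBRARY}")
  include_directories(SYSTEM ${FFTW3_INCLUDE_DIR})
  list(APPEND Caffe2_DEPENDENCY_LIBS ${FFTW3_LIBRARY})
endif()
```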

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48808

Reviewed By: janeyx99

Differential Revision: D25375320

Pulled By: walterddr

fbshipit-source-id: cde3afc51eef9c621c7d19be7ad7573fc8b838c2
2020-12-08 10:35:33 -08:00
Nikita Shulga
c29f51642e Modify NEON check for ARM64 on OS X (#48982)
Summary:
Use CMAKE_SYSTEM_PROCESSOR rather than running sysctl

Fixes https://github.com/pytorch/pytorch/issues/48874
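
The distinction matters for cross-compilation: sysctl reports the CPU of the machine running CMake, while CMAKE_SYSTEM_PROCESSOR describes the build target. A before/after sketch with a hypothetical result variable:

```
# Before: asks the machine running CMake, which is wrong when cross-compiling.
# execute_process(COMMAND sysctl -n hw.optional.neon OUTPUT_VARIABLE NEON_FOUND)

# After: trust the toolchain's description of the target.
if(CMAKE_SYSTEM_NAME STREQUAL "Darwin" AND CMAKE_SYSTEM_PROCESSOR STREQUAL "arm64")
  set(ASIMD_FOUND TRUE)  # hypothetical result variable; arm64 Macs always have NEON
endif()
```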

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48982

Reviewed By: walterddr

Differential Revision: D25385883

Pulled By: malfet

fbshipit-source-id: 47b6dc5be8d75f6d4a66a11c564abdfe31ac90b4
2020-12-08 07:58:22 -08:00
Jithun Nair
5f62308739 Hipify revamp [REDUX] (#48715)
Summary:
[Refiled version of earlier PR https://github.com/pytorch/pytorch/issues/45451]

This PR revamps the hipify module in PyTorch to overcome a long list of shortcomings in the original implementation. However, these improvements are applied only when using hipify to build PyTorch extensions, not for PyTorch or Caffe2 itself.

Correspondingly, changes are made to cpp_extension.py to match these improvements.

The list of improvements to hipify is as follows:

1. Hipify files in the same directory as the original file, unless there's a "cuda" subdirectory in the original file path, in which case the hipified file will be in the corresponding file path with "hip" subdirectory instead of "cuda".
2. Never hipify the file in-place if changes are introduced due to hipification i.e. always ensure the hipified file either resides in a different folder or has a different filename compared to the original file.
3. Prevent re-hipification of already hipified files. This avoids creation of unnecessary "hip/hip" etc. subdirectories and additional files which have no actual use.
4. Do not write out hipified versions of files if they are identical to the original file. This results in a cleaner output directory, with minimal number of hipified files created.
5. Update header rewrite logic so that it accounts for the previous improvement.
6. Update header rewrite logic so it respects the rules for finding header files depending on whether "" or <> is used.
7. Return a dictionary of mappings of original file paths to hipified file paths from hipify function.
8. Introduce a version for hipify module to allow extensions to contain back-compatible code that targets a specific point in PyTorch where the hipify functionality changed.
9. Update cuda_to_hip_mappings.py to account for the ROCm component subdirectories inside /opt/rocm/include. This also results in cleanup of the Caffe2_HIP_INCLUDE path to remove unnecessary additions to the include path.

The list of changes to cpp_extension.py is as follows:

1. Call hipify when building a CUDAExtension for ROCm.
2. Prune the list of source files to CUDAExtension to include only the hipified versions of any source files in the list (if both original and hipified versions of the source file are in the list)
3. Add subdirectories of /opt/rocm/include to the include path for extensions, so that ROCm headers for subcomponent libraries are found automatically

cc jeffdaily sunway513 ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48715

Reviewed By: bdhirsh

Differential Revision: D25272824

Pulled By: ezyang

fbshipit-source-id: 8bba68b27e41ca742781e1c4d7b07c6f985f040e
2020-12-02 18:03:23 -08:00
Daily, Jeff
7f869dca70 [ROCm] update debug flags (#46717)
Summary:
Improves support for rocgdb when setting DEBUG=1 and building for ROCm.
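
Illustrative only (the exact flags the change sets are not reproduced here): making device code debuggable with rocgdb typically means compiling the HIP sources with debug info and without optimization:

```
# Make HIP device code debuggable with rocgdb in DEBUG builds.
if(CMAKE_BUILD_TYPE STREQUAL "Debug")
  list(APPEND HIP_CLANG_FLAGS -g -O0)
endif()
```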

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46717

Reviewed By: mrshenli

Differential Revision: D25171544

Pulled By: malfet

fbshipit-source-id: b4699ba2277dcb89f07efb86f7153fae82a80dc3
2020-11-30 15:27:30 -08:00
Ashkan Aliabadi
66440d1b29 Tweak Vulkan memory use. (#47728)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47728

Test Plan: Imported from OSS

Reviewed By: SS-JIA

Differential Revision: D25032740

Pulled By: AshkanAliabadi

fbshipit-source-id: 7eb72538dc1aa3feb4e2f8c4ff9c675eb8e97057
2020-11-30 14:28:09 -08:00
Rong Rong
af520d9d04 [cmake] clean up blas discovery (#47940)
Summary:
Remove useless variable changes in BLAS discovery.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47940

Reviewed By: malfet

Differential Revision: D25122228

Pulled By: walterddr

fbshipit-source-id: 12bc3ce9e4f89a72b6a92c10d14024e5941f4b96
2020-11-30 10:29:50 -08:00
Nikita Shulga
e7ca62be08 Fix PyTorch compilation on Apple M1 (#48275)
Summary:
Update cpuinfo and sleef to contain build fixes for M1

Fixes https://github.com/pytorch/pytorch/issues/48145

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48275

Reviewed By: walterddr

Differential Revision: D25135153

Pulled By: malfet

fbshipit-source-id: 2a82e14407d6f40c7dacd11109a8499d808c8ec1
2020-11-26 07:08:33 -08:00
Ilia Cherniavskii
f7a8bf2855 Use libkineto in profiler (#46470)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46470

Adding ability to use Kineto (CUPTI) to profile CUDA kernels

Test Plan:
USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
python test/test_profiler.py

python test/test_autograd.py -k test_profile
python test/test_autograd.py -k test_record

```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                       Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       1.000us             2
                                      sgemm_32x32x32_NN         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        33.33%       2.000us       2.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                       Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        16.67%       1.000us       1.000us             1
                                            aten::randn         5.17%      74.000us         6.71%      96.000us      48.000us       0.000us         0.00%       0.000us       0.000us             2
                                            aten::empty         1.33%      19.000us         1.33%      19.000us       4.750us       0.000us         0.00%       0.000us       0.000us             4
                                          aten::normal_         1.05%      15.000us         1.05%      15.000us       7.500us       0.000us         0.00%       0.000us       0.000us             2
                                               aten::to        77.90%       1.114ms        91.61%       1.310ms     436.667us       0.000us         0.00%       3.000us       1.000us             3
                                    aten::empty_strided         2.52%      36.000us         2.52%      36.000us      12.000us       0.000us         0.00%       0.000us       0.000us             3
                                            aten::copy_         2.73%      39.000us        11.19%     160.000us      53.333us       0.000us         0.00%       3.000us       1.000us             3
                                        cudaMemcpyAsync         4.34%      62.000us         4.34%      62.000us      20.667us       0.000us         0.00%       0.000us       0.000us             3
                                  cudaStreamSynchronize         1.61%      23.000us         1.61%      23.000us       7.667us       0.000us         0.00%       0.000us       0.000us             3
                                               aten::mm         0.21%       3.000us         7.20%     103.000us     103.000us       0.000us         0.00%       2.000us       2.000us             1
                                           aten::stride         0.21%       3.000us         0.21%       3.000us       1.000us       0.000us         0.00%       0.000us       0.000us             3
                                       cudaLaunchKernel         2.45%      35.000us         2.45%      35.000us      17.500us       0.000us         0.00%       0.000us       0.000us             2
                                              aten::add         0.49%       7.000us         4.27%      61.000us      61.000us       0.000us         0.00%       1.000us       1.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
```

benchmark: https://gist.github.com/ilia-cher/a5a9eb6b68504542a3cad5150fc39b1a

Reviewed By: Chillee

Differential Revision: D25142223

Pulled By: ilia-cher

fbshipit-source-id: b0dff46c28da5fb0a8e01cf548aa4f2b723fde80
2020-11-25 04:32:16 -08:00
Ilia Cherniavskii
f2da18af14 Add USE_KINETO build option (#45888)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45888

Adding USE_LIBKINETO build option
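
A rough sketch of what such an option gate can look like; the paths and list names are illustrative:

```
option(USE_KINETO "Use Kineto profiling library" OFF)
if(USE_KINETO)
  add_definitions(-DUSE_KINETO)
  add_subdirectory(third_party/kineto/libkineto)  # illustrative path
  list(APPEND Caffe2_DEPENDENCY_LIBS kineto)
endif()
```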

Test Plan:
USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install --cmake

Reviewed By: Chillee

Differential Revision: D25142221

Pulled By: ilia-cher

fbshipit-source-id: d1634a8f9599604ff511fac59b9072854289510c
2020-11-21 20:20:32 -08:00
Bert Maher
8a996dd139 [te] Make BUILD_TENSOREXPR_BENCHMARK a real CMake option (#48158)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48158
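
Making it "a real CMake option" presumably means declaring it with option() so it appears in the CMake cache; a minimal sketch with an illustrative subdirectory path:

```
# Visible in the cache and toggled with -DBUILD_TENSOREXPR_BENCHMARK=ON.
option(BUILD_TENSOREXPR_BENCHMARK "Build tensorexpr benchmarks" OFF)
if(BUILD_TENSOREXPR_BENCHMARK)
  add_subdirectory(benchmarks/tensorexpr)  # illustrative path
endif()
```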

Test Plan: Imported from OSS

Reviewed By: Chillee

Differential Revision: D25059877

Pulled By: bertmaher

fbshipit-source-id: a98b6c18a91b4fe89d12bf5f7ead604e3cc0c8b0
2020-11-18 12:19:14 -08:00
Nikita Shulga
8af9f2cc23 Revert D24924736: [pytorch][PR] Hipify revamp
Test Plan: revert-hammer

Differential Revision:
D24924736 (10b490a3e0)

Original commit changeset: 4af42b8ff4f2

fbshipit-source-id: 7f8f90d55d8a69a2890ec73622fcea559189e381
2020-11-18 11:48:30 -08:00
Jithun Nair
10b490a3e0 Hipify revamp (#45451)
Summary:
This PR revamps the hipify module in PyTorch to overcome a long list of shortcomings in the original implementation. However, these improvements are applied only when using hipify to build PyTorch extensions, **not for PyTorch or Caffe2 itself**.

Correspondingly, changes are made to `cpp_extension.py` to match these improvements.

The list of improvements to hipify is as follows:

1. Hipify files in the same directory as the original file, unless there's a "cuda" subdirectory in the original file path, in which case the hipified file will be in the corresponding file path with "hip" subdirectory instead of "cuda".
2. Never hipify the file in-place if changes are introduced due to hipification i.e. always ensure the hipified file either resides in a different folder or has a different filename compared to the original file.
3. Prevent re-hipification of already hipified files. This avoids creation of unnecessary "hip/hip" etc. subdirectories and additional files which have no actual use.
4. Do not write out hipified versions of files if they are identical to the original file. This results in a cleaner output directory, with minimal number of hipified files created.
5. Update header rewrite logic so that it accounts for the previous improvement.
6. Update header rewrite logic so it respects the rules for finding header files depending on whether `""` or `<>` is used.
7. Return a dictionary of mappings of original file paths to hipified file paths from `hipify` function.
8. Introduce a version for hipify module to allow extensions to contain back-compatible code that targets a specific point in PyTorch where the hipify functionality changed.
9. Update `cuda_to_hip_mappings.py` to account for the ROCm component subdirectories inside `/opt/rocm/include`. This also results in cleanup of the `Caffe2_HIP_INCLUDE` path to remove unnecessary additions to the include path.

The list of changes to `cpp_extension.py` is as follows:
1. Call `hipify` when building a CUDAExtension for ROCm.
2. Prune the list of source files to CUDAExtension to include only the hipified versions of any source files in the list (if both original and hipified versions of the source file are in the list)
3. Add subdirectories of /opt/rocm/include to the include path for extensions, so that ROCm headers for subcomponent libraries are found automatically

cc jeffdaily sunway513 hgaspar lcskrishna ashishfarmer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45451

Reviewed By: ezyang

Differential Revision: D24924736

Pulled By: malfet

fbshipit-source-id: 4af42b8ff4f21c3782dedb8719b8f9f86b34bd2d
2020-11-18 08:37:49 -08:00
Samuel Aldana
8c00221fe2 Fix inconsistent environment variable naming for setting NVTOOLEXT_HOME in TorchConfig.cmake (#48012)
Summary:
When building libtorch with CUDA installed in some unconventional
location, the CMake files rely on environment variables to set CMake
variables; in particular, the NVTOOLSEXT_PATH environment variable is used to
set NVTOOLEXT_HOME in cmake/public/cuda.cmake. Later, when consuming
such a build through the generated CMake finder TorchConfig.cmake, another
convention is used, which feels rather inconsistent: it relies on a
completely new environment variable, NVTOOLEXT_HOME, although the former
mechanism is still in place, since cmake/public/cuda.cmake is transitively
called via Caffe2Config.cmake, which is itself called by TorchConfig.cmake.
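
One way to reconcile the two conventions, sketched under the assumption that both spellings should keep working:

```
# Accept either spelling; NVTOOLSEXT_PATH is what cmake/public/cuda.cmake
# already reads, NVTOOLEXT_HOME is what TorchConfig.cmake introduced.
if(DEFINED ENV{NVTOOLEXT_HOME})
  set(NVTOOLEXT_HOME "$ENV{NVTOOLEXT_HOME}")
elseif(DEFINED ENV{NVTOOLSEXT_PATH})
  set(NVTOOLEXT_HOME "$ENV{NVTOOLSEXT_PATH}")
endif()
```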

Fixes https://github.com/pytorch/pytorch/issues/48032

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48012

Reviewed By: gchanan

Differential Revision: D25031260

Pulled By: ezyang

fbshipit-source-id: 0d6ab8ba9f52dd10be418b1a92b0f53c889f3f2d
2020-11-18 07:37:53 -08:00
Tao Xu
3846e35a55 [GPU] Enable Metal on macosx (#47635)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47635

Add macOS support for Metal. The supported OS version is 10.13 and above.
ghstack-source-id: 116845318

Test Plan:
1. Sandcastle Tests
2. CircleCI Jobs
3. In the next diff, we'll run the person segmentation model inside a macos app

Reviewed By: dreiss

Differential Revision: D24825088

fbshipit-source-id: 10d7976c953e765599002dc42d7f8d248d7c9846
2020-11-17 14:44:34 -08:00
Rong Rong
147a48fb27 [cmake] clean up cmake/Utils.cmake (#47923)
Summary:
Consolidate into cmake/public/utils.cmake

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47923

Reviewed By: samestep

Differential Revision: D24955961

Pulled By: walterddr

fbshipit-source-id: 9d5f6af2b353a8c6f6d521c841fd0989393755cd
2020-11-16 08:12:32 -08:00
Rong Rong
7391edb591 [hotfix] fix misleading BLAS=MKL summary when there's no BLAS installed (#47803)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47803

Reviewed By: samestep

Differential Revision: D24907453

Pulled By: walterddr

fbshipit-source-id: a3e41041f6aa506b054eb0ffc61f8525ba02cbf1
2020-11-12 16:05:14 -08:00
Nikita Shulga
e8a73fbf34 Workaround PyTorch debug build crash using old GCC (#47805)
Summary:
gcc 7.4.x or older fails to compile XNNPACK in debug mode with an internal compiler error.
Work around this in the build script by passing the -O1 optimization flag to XNNPACK when compiling with these older compilers.
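
The class of workaround, sketched here with an assumed version cut-off and target name:

```
# Old GCC hits an internal compiler error building XNNPACK without
# optimization; compile just that dependency at -O1 in Debug builds.
if(CMAKE_COMPILER_IS_GNUCXX AND CMAKE_CXX_COMPILER_VERSION VERSION_LESS 7.5)
  target_compile_options(XNNPACK PRIVATE $<$<CONFIG:Debug>:-O1>)
endif()
```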

Fixes https://github.com/pytorch/pytorch/issues/47292

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47805

Reviewed By: seemethere

Differential Revision: D24905758

Pulled By: malfet

fbshipit-source-id: 93f4e3b3b5c10b69734627c50e36b2eb544699c8
2020-11-11 16:33:47 -08:00