This PR introduces selective build to lightweight dispatch CI job. By doing so we can't run the `test_lite_intepreter_runtime` test suite anymore because it requires some other operators.
From now on, if we are adding a new unit test in `test_codegen_unboxing`, we will have to export the operators for the unit test model and add them into `lightweight_dispatch_ops.yaml`. This can be automated by introducing tracing based selective build, but that's for next PR to do.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78983
Approved by: https://github.com/kit1980
Otherwise, its possible to build TensorPipe with one version of libuv
and gloo with another.
Also, delete strange `GLOO_INSTALL` logic, as none of the install artifacts are really packaged as part of PyTorch (and it were probably used by Caffe2 builds)
This helps solve problem for compiling PyTorch for M1, where `libuv` is not available in conda
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77312
Approved by: https://github.com/seemethere
Re-landing #68111/#74596
## Description
v0.5 PR of this [RFC](https://github.com/pytorch/pytorch/issues/49444).
On the basis of #50256, the below improvements are included:
* The [v0.5 release branch](https://github.com/oneapi-src/oneDNN/releases/tag/graph-v0.5) of the oneDNN Graph API is used
* The fuser now works with the profiling graph executor. We have inserted type check nodes to guard the profiled tensor properties.
### User API:
The optimization pass is disabled by default. Users could enable it by:
```
torch.jit.enable_onednn_fusion(True)
```
`torch.jit.freeze` should be used after tracing (recommended) or scripting a model.
### Performance:
[pytorch/benchmark](https://github.com/pytorch/benchmark) tool is used to compare the performance:
* SkyLake 8180 (1 socket of 28 cores):

* SkyLake 8180 (single thread):

* By mapping hardswish to oneDNN Graph, it’s 8% faster than PyTorch JIT (NNC + OFI)
** We expect performance gain after mapping transpose, contiguous & view to oneDNN graph ops
### Directory structure of the integration code
Fuser-related code is placed under:
```
torch/csrc/jit/codegen/onednn/
```
Optimization pass registration is done in:
```
torch/csrc/jit/passes/onednn_graph_fuser.h
```
CMake for the integration code is in:
```
caffe2/CMakeLists.txt
cmake/public/mkldnn.cmake
cmake/Modules/FindMKLDNN.cmake
```
## Limitations
* In this PR, we only support Pytorch-oneDNN-Graph integration on Linux platform. Support on Windows and MacOS will be enabled as a next step.
* We have only optimized the inference use-case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76622
Approved by: https://github.com/eellison
This functionality does not seem to be used
and there are some requests to update dependency.
Add `third_party` to torch_cpu include directories if compiling with
Caffe2 support, as `caffe2/quantization/server/conv_dnnlowp_op.cc` depends on `third_party/fbgemm/src/RefImplementations.h`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75394
Approved by: https://github.com/janeyx99, https://github.com/seemethere
This was causing the shaders to be incorrectly templated because
both the precision argument and the format argument were being treated
as a single argument by argparse and therefore pasted into shaders
incorrectly. In turn this meant that shaders couldn't be compiled
when the precision or format options were turned on.
Fixes#76195
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76196
Approved by: https://github.com/dagitses
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75605
Usecase: Milan models have multiple backends and need to use static dispatch to save on static initialization time and to hit native functions directly from the unboxed APIs.
This change passes in List[BackendIndex] and adds ability to generate code for multiple static backends with 1 or 0 kernels
ghstack-source-id: 154525738
(Note: this ignores all push blocking failures!)
Test Plan:
Builds lite_predictor_flatbuffer with multiple backends
```
buck build --config pt.enable_lightweight_dispatch=1 --config pt.static_dispatch_backend=CPU,QuantizedCPU,CompositeExplicitAutograd //xplat/caffe2/fb/lite_predictor:lite_predictor_flatbuffer
```
Reviewed By: larryliu0820
Differential Revision: D35510644
fbshipit-source-id: f985718ad066f8578b006b4759c4a3bd6caac176
(cherry picked from commit a6999729c8cc26c54b8d5684f6585d6c50d8d913)
Summary:
[Comment](https://github.com/pytorch/pytorch/pull/62445/files#r680132022) claims, it got added for consistency with top level CMakeLists.txt, but `-Wno-unused-variable` is not mentioned there.
Modify violations in 50+ files that were added in the interim by either removing unused variables, or decorating the code with `C10_UNUSED` if local variable is likely used to extend object lifetime until the end of the block.
Caused preventable revert in https://github.com/pytorch/pytorch/pull/72633#issuecomment-1092300787
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75538
Reviewed By: anjali411
Differential Revision: D35747333
Pulled By: malfet
fbshipit-source-id: 3fc5828e44a4c05ba0e89e92613e6ebbdb260626
(cherry picked from commit c179fba21cfa2a0093fad50ccad5a22dd7cff52c)
Summary:
RCCL is required by two components in hipified Pytorch: (1) gloo and (2) hipified ProcessGroupNCCL.
- For (1) the RCCL dependency is managed in `./third_party/gloo/cmake/Dependencies.cmake` and can be enabled/disabled via `USE_RCCL`.
- For (2) the RCCL dependency is managed via `./cmake/Dependencies.cmake` and can be on/off via `USE_NCCL`.
The additional dependency removed in this commit forced hipified Pytorch to load librccl.so even when USE_RCCL=OFF USE_NCCL=OFF is set, i.e., when using torch_ucc/ucc for AMD GPU mem type. This caused conflicts when we use a non-system default librccl.so (i.e., not in ROCM_PATH) for torch_ucc/ucc.
This commit removes the unnecessary RCCL dependency. This will ensure a cleaner way to use torch_ucc with a user-specified RCCL library.
Test Plan:
## Verify OSS pytorch on an AMD GPU machine (MI100)
```
ROCM_PATH=/opt/rocm-4.5.2
git clone https://github.com/pytorch/pytorch.git
cd pytorch
python3 tools/amd_build/build_amd.py
USE_NCCL=0 USE_RCCL=0 USE_KINETO=0 with-proxy python3 setup.py develop
USE_NCCL=0 USE_RCCL=0 USE_KINETO=0 with-proxy python3 setup.py install
```
log for develop: P492778257
log for install: P492778277
## Verify OSS pytorch + TorchUCC on an AMD GPU machine (MI100)
```
export RCCL_INSTALL_DIR=/opt/rccl-rocm-rel-4.4
git clone https://github.com/facebookresearch/torch_ucc.git
cd torch_ucc
UCX_HOME=$RCCL_INSTALL_DIR UCC_HOME=$RCCL_INSTALL_DIR WITH_CUDA=$ROCM_PATH python setup.py
# run param comm
export HSA_ENABLE_SDMA=0
export LD_LIBRARY_PATH=$RCCL_INSTALL_DIR
cd test
git clone https://github.com/facebookresearch/param
cd ..
/bin/bash ./test/start_test.sh ./test/param/train/comms/pt/comms.py --backend ucc --device cuda --b 4 --e 4M --c 1 --collective all_reduce
```
- log for param comm: P493033836
- Verified librccl.so in `/opt/rccl-rocm-rel-4.4` is used via checking version string in log. "[localbuild]" is added in RCCL source.
```
RCCL version 2.9.9+hip4.4 [localbuild]
```
Differential Revision: D35476911
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75547
Approved by: https://github.com/malfet, https://github.com/jeffdaily
Summary:
This is the very first step for the UCC-NCCL integration. This PR lets `ProcessGroupNCCL` load the `torch_ucc.so` if the user specifies an environmental variable `TORCH_UCC_LIBRARY_PATH`. If this environment variable is not specified by the user, then there will be no visible change.
In the future, we may want to make PyTorch smart enough to automatically detect the `torch_ucc.so` in the user's system, but before doing that, I believe we should first make sure that `ProcessGroupUCC` is very well tested.
Note that in this PR, `ProcessGroupNCCL` just loads the library but will not use it. I am trying to make PRs small, so the usage of `torch_ucc.so` will be submitted in later PRs.
This PR requires the change in https://github.com/facebookresearch/torch_ucc/pull/56, otherwise `torch_ucc.so` can not be successfully loaded. But his PR can be landed separately without waiting for https://github.com/facebookresearch/torch_ucc/pull/56 because, in PyTorch's unit tests, UCC is never used or tested.
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69552
Reviewed By: mruberry
Differential Revision: D34675212
Pulled By: jiayisuse
fbshipit-source-id: a3d1fb98340dbe3a931af555423863efd381f1ae
(cherry picked from commit 3778b6fabe70c26b5a65e6ddec641d2ef9113cd1)
Summary:
Also enables bazel build to run lazy codegen. Bazel (oss) build feeds off the same filelists as cmake/buck (build_variables.bzl), so enabling it is easier than keeping it disabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74111
Test Plan: Run CI and verify test_lazy_ops is running via OSS cmake builds
Reviewed By: bdhirsh
Differential Revision: D34772403
fbshipit-source-id: 8a63f58b9536e6ac1be530667932176ef2549496
(cherry picked from commit e807ffb1918853d10b924fdc24f85ee5b1a39021)
Summary:
## Description
Preview4 PR of this [RFC](https://github.com/pytorch/pytorch/issues/49444).
On the basis of https://github.com/pytorch/pytorch/pull/50256, the below improvements are included:
- The [preview4 release branch](https://github.com/oneapi-src/oneDNN/releases/tag/graph-v0.4.1) of the oneDNN Graph API is used
- The fuser now works with the profiling graph executor. We have inserted type check nodes to guard the profiled tensor properties.
### User API:
The optimization pass is disabled by default. Users could enable it by:
```
torch.jit.enable_onednn_fusion(True)
```
### Performance:
[pytorch/benchmark](https://github.com/pytorch/benchmark) tool is used to compare the performance:
- SkyLake 8180 (1 socket of 28 cores):

- SkyLake 8180 (single thread):

\* By mapping hardswish to oneDNN Graph, it’s 8% faster than PyTorch JIT (NNC + OFI)
\** We expect performance gain after mapping transpose, contiguous & view to oneDNN graph ops
### Directory structure of the integration code
Fuser-related code are placed under:
```
torch/csrc/jit/codegen/onednn/
```
Optimization pass registration is done in:
```
torch/csrc/jit/passes/onednn_graph_fuser.h
```
CMake for the integration code is:
```
caffe2/CMakeLists.txt
```
## Limitations
- In this PR, we have only supported the optimization on Linux platform. The support on Windows and MacOS will be enabled as the next step.
- We have only optimized the inference use case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68111
Reviewed By: eellison
Differential Revision: D34584878
Pulled By: malfet
fbshipit-source-id: ce817aa8cc9052ee9ed930c9cf66be83449e61a4
(cherry picked from commit cd17683aa7d9c0947df45a1ab53627feff795587)
Per https://github.com/pytorch/pytorch/issues/57744 statically linked CUPTI
causes exception handling to break on certain compiler configurations, likely
because CUPTI comes with incompatible libstdc++ symbols. Rather than pray that
something reasonable happens, use the safer configuration (dynamic linking) by
default and give a warning if the user inverts the setting.
Signed-off-by: Edward Z. Yang <ezyangfb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74009
Approved by: https://github.com/malfet
Summary:
RFC: https://github.com/pytorch/rfcs/pull/40
This PR (re)introduces python codegen for unboxing wrappers. Given an entry of `native_functions.yaml` the codegen should be able to generate the corresponding C++ code to convert ivalues from the stack to their proper types. To trigger the codegen, run
```
tools/jit/gen_unboxing.py -d cg/torch/share/ATen
```
Merged changes on CI test. In https://github.com/pytorch/pytorch/issues/71782 I added an e2e test for static dispatch + codegen unboxing. The test exports a mobile model of mobilenetv2, load and run it on a new binary for lite interpreter: `test/mobile/custom_build/lite_predictor.cpp`.
## Lite predictor build specifics
1. Codegen: `gen.py` generates `RegisterCPU.cpp` and `RegisterSchema.cpp`. Now with this PR, once `static_dispatch` mode is enabled, `gen.py` will not generate `TORCH_LIBRARY` API calls in those cpp files, hence avoids interaction with the dispatcher. Once `USE_LIGHTWEIGHT_DISPATCH` is turned on, `cmake/Codegen.cmake` calls `gen_unboxing.py` which generates `UnboxingFunctions.h`, `UnboxingFunctions_[0-4].cpp` and `RegisterCodegenUnboxedKernels_[0-4].cpp`.
2. Build: `USE_LIGHTWEIGHT_DISPATCH` adds generated sources into `all_cpu_cpp` in `aten/src/ATen/CMakeLists.txt`. All other files remain unchanged. In reality all the `Operators_[0-4].cpp` are not necessary but we can rely on linker to strip them off.
## Current CI job test coverage update
Created a new CI job `linux-xenial-py3-clang5-mobile-lightweight-dispatch-build` that enables the following build options:
* `USE_LIGHTWEIGHT_DISPATCH=1`
* `BUILD_LITE_INTERPRETER=1`
* `STATIC_DISPATCH_BACKEND=CPU`
This job triggers `test/mobile/lightweight_dispatch/build.sh` and builds `libtorch`. Then the script runs C++ tests written in `test_lightweight_dispatch.cpp` and `test_codegen_unboxing.cpp`. Recent commits added tests to cover as many C++ argument type as possible: in `build.sh` we installed PyTorch Python API so that we can export test models in `tests_setup.py`. Then we run C++ test binary to run these models on lightweight dispatch enabled runtime.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69881
Reviewed By: iseeyuan
Differential Revision: D33692299
Pulled By: larryliu0820
fbshipit-source-id: 211e59f2364100703359b4a3d2ab48ca5155a023
(cherry picked from commit 58e1c9a25e3d1b5b656282cf3ac2f548d98d530b)
Summary:
Fixes : https://github.com/pytorch/pytorch/issues/73377
We've migrated to CUDA-11.3 as default toolkit in 1.9, it's time to stop builds (especially considering forward-compatibility guarantee across CUDA-11.x drivers)
Hence we are removing CUDA 11.1 support. We should also cleanup old cuda related code from our builder and pytorch repo making scripts a little more clean.
We have code that references cuda 9.2 , 10.1 , 11.0, 11.1, 11.2 and none of these are currently use
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73514
Reviewed By: janeyx99
Differential Revision: D34551989
Pulled By: atalman
fbshipit-source-id: 9ceaaa9b25ad49689986f4b29a26d20370d9d011
(cherry picked from commit fe109c62daf429e9053c03f6e374568ba23cd041)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73040
This patch fixes a compilation error in PyTorch with ROCm when `NDEBUG` is passed.
## Problem
Forward declaration of `__host__ __device__ __assert_fail()` is used in `c10/macros/Macros.h` for HIP compilation when `NDEBUG` is set However, HIP has `__device__ __assert_fail()` in `hip/amd_detail/amd_device_functions.h`, causing a function type error.
This issue does not appear in ROCm CI tests since it happens only when `NDEBUG` is passed.
## Solution
[EDIT] After the discussion on GitHub, we chose to entirely disable `CUDA_KERNEL_ASSERT()` for ROCm.
---
To solve this compilation error, this patch disables `CUDA_KERNEL_ASSERT()`, which uses `__assert_fail()` when
1. `c10/macros/Macros.h` is included for `*.hip` (precisely speaking, `__HIP__` or `__HIP_ARCH__` is defined), and
2. `NDEBUG` is passed.
Note that there's no impact on default compilation because, without a special compilation flag, those HIP files are compiled without `-NDEBUG`. And that's why this issue has not been found.
### Justification
[1] We cannot declare one host-and-device function for two separate host and device functions.
```
__device__ int func() {return 0};
__host__ int func() {return 0};
// Compile error (hipcc)
// __device__ __host__ int func();
```
[2] Forward declaration of a correct `__device__` only `__assert_fail()` for `__HIP__` causes the following error:
```
pytorch/c10/util/TypeCast.h:135:7: error: reference to __device__ function '__assert_fail' in __host__ __device__ function
ERROR_UNSUPPORTED_CAST
^
pytorch/c10/util/TypeCast.h:118:32: note: expanded from macro 'ERROR_UNSUPPORTED_CAST'
#define ERROR_UNSUPPORTED_CAST CUDA_KERNEL_ASSERT(false);
^
pytorch/c10/macros/Macros.h:392:5: note: expanded from macro 'CUDA_KERNEL_ASSERT'
__assert_fail(
```
[3] Maybe there's a way to properly define `__assert_fail()` for HIP + NDEBUG, but this might be too much. Please let me just disable it.
### Technical details
Error
```
pytorch/c10/macros/Macros.h:368:5: error: __host__ __device__ function '__assert_fail' cannot overload __device__ function '__assert_fail'
__assert_fail(
^
/opt/rocm/hip/include/hip/amd_detail/amd_device_functions.h:1173:6: note: previous declaration is here
void __assert_fail(const char *assertion,
```
CUDA definition (9.x) of `__assert_fail()`
```
#elif defined(__GNUC__)
extern __host__ __device__ __cudart_builtin__ void __assert_fail(
const char *, const char *, unsigned int, const char *)
__THROW;
```
ROCm definition (the latest version)
```
// 2b59661f3e/include/hip/amd_detail/amd_device_functions.h (L1172-L1177)
extern "C" __device__ __attribute__((noinline)) __attribute__((weak))
void __assert_fail(const char *assertion,
const char *file,
unsigned int line,
const char *function);
```
Test Plan:
CI + reproducer
```
python3 tools/amd_build/build_amd.py
python3 setup.py develop --cmake-only
cmake -DHIP_HIPCC_FLAGS_RELEASE="-DNDEBUG" build
cmake --build build
```
Reviewed By: xw285cornell
Differential Revision: D34310555
fbshipit-source-id: 7542288912590533ced3f20afd2e704b6551991b
(cherry picked from commit 9e52196e36820abe36bf6427cabc7389d3ea6cb5)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65851
Design doc: https://docs.google.com/document/d/12rtlHnPUpaJ-I52Iob3L0WA3rKRr_OY7fXqeCvn2MVY/edit
First read the design doc to understand the user syntax. In this PR, we have converted add to use ufunc codegen; most of the cpp changes are deleting the preexisting implementations of add, and ufunc/add.h are the new implementations in the ufunc format.
The bulk of this PR is in the new codegen machinery. Here's the order to read the files:
* `tools/codegen/model.py`
* Some self-explanatory utility classes: `ScalarType`, `DTYPE_CLASSES`
* New classes for representing ufunc entries in `native_functions.yaml`: `UfuncKey` and `UfuncInnerLoop`, as well as parsing logic for these entries. UfuncKey has some unusual entries (e.g., CPUScalar) that don't show up in the documentation, more on these below).
* A predicate `is_ufunc_dispatch_key` for testing which dispatch keys should get automatically generated when an operator opts into ufuncs (CPU and CUDA, for now!)
* `tools/codegen/api/types.py`
* More self-explanatory utility stuff: ScalarTypeToCppMapping mapping ScalarType to CppTypes; Binding.rename for changing the name of a binding (used when we assign constructor variables to member variables inside CUDA functors)
* New VectorizedCType, representing `at::vec::Vectorized<T>`. This is used inside vectorized CPU codegen.
* New `scalar_t` and `opmath_t` BaseCppTypes, representing template parameters that we work with when doing codegen inside ufunc kernel loops (e.g., where you previously had Tensor, now you have `scalar_t`)
* `StructuredImplSignature` represents a `TORCH_IMPL_FUNC` definition, and straightforwardly follows from preexisting `tools.codegen.api.structured`
* `tools/codegen/translate.py` - Yes, we use translate a LOT in this PR. I improved some of the documentation, the only substantive changes are adding two new conversions: given a `scalar_t` or a `const Scalar&`, make it convertible to an `opmath_t`
* `tools/codegen/api/ufunc.py`
* OK, now we're at the meaty stuff. This file represents the calling conventions of three important concepts in ufunc codegen, which we'll describe shortly. All of these APIs are relatively simple, since there aren't any complicated types by the time you get to kernels.
* stubs are the DispatchStub trampolines that CPU kernels use to get to their vectorized versions. They drop all Tensor arguments (as they are in TensorIterator) but otherwise match the structured calling convention
* ufuncs are the inner loop template functions that you wrote in ufunc/add.h which do the actual computation in question. Here, all the Tensors and Scalars have been converted into the computation type (`opmath_t` in CUDA, `scalar_t` in CPU)
* ufunctors are a CUDA-only concept representing functors that take some of their arguments on a host-side constructor, and the rest in the device-side apply. Once again, Tensors and Scalars are converted into the computation type, `opmath_t`, but for clarity all the functions take `scalar_t` as argument (as this is the type that is most salient at the call site). Because the constructor and apply are code generated separately, `ufunctor_arguments` returns a teeny struct `UfunctorBindings`
* `tools/codegen/dest/ufunc.py` - the workhorse. This gets its own section below.
* `tools/codegen/gen.py` - just calling out to the new dest.ufunc implementation to generate UfuncCPU_add.cpp, UFuncCPUKernel_add.cpp and UfuncCUDA_add.cu files per ufunc operator. Each of these files does what you expect (small file that registers kernel and calls stub; CPU implementation; CUDA implementation). There is a new file manager for UFuncCPUKernel files as these need to get replicated by cmake for vectorization. One little trick to avoid recompilation is we directly replicate code generated forward declarations in these files, to reduce the number of headers we depend on (this is codegen, we're just doing the preprocessors job!)
* I'll talk about build system adjustments below.
OK, let's talk about tools/codegen/dest/ufunc.py. This file can be roughly understood in two halves: one for CPU code generation, and the other for CUDA code generation.
**CPU codegen.** Here's roughly what we want to generate:
```
// in UfuncCPU_add.cpp
using add_fn = void (*)(TensorIteratorBase&, const at::Scalar&);
DECLARE_DISPATCH(add_fn, add_stub);
DEFINE_DISPATCH(add_stub);
TORCH_IMPL_FUNC(ufunc_add_CPU)
(const at::Tensor& self, const at::Tensor& other, const at::Scalar& alpha, const at::Tensor& out) {
add_stub(device_type(), *this, alpha);
}
// in UfuncCPUKernel_add.cpp
void add_kernel(TensorIteratorBase& iter, const at::Scalar& alpha) {
at::ScalarType st = iter.common_dtype();
RECORD_KERNEL_FUNCTION_DTYPE("add_stub", st);
switch (st) {
AT_PRIVATE_CASE_TYPE("add_stub", at::ScalarType::Bool, bool, [&]() {
auto _s_alpha = alpha.to<scalar_t>();
cpu_kernel(iter, [=](scalar_t self, scalar_t other) {
return ufunc::add(self, other, _s_alpha);
});
})
AT_PRIVATE_CASE_TYPE(
"add_stub", at::ScalarType::ComplexFloat, c10::complex<float>, [&]() {
auto _s_alpha = alpha.to<scalar_t>();
auto _v_alpha = at::vec::Vectorized<scalar_t>(_s_alpha);
cpu_kernel_vec(
iter,
[=](scalar_t self, scalar_t other) {
return ufunc::add(self, other, _s_alpha);
},
[=](at::vec::Vectorized<scalar_t> self,
at::vec::Vectorized<scalar_t> other) {
return ufunc::add(self, other, _v_alpha);
});
})
...
```
The most interesting change about the generated code is what previously was an `AT_DISPATCH` macro invocation is now an unrolled loop. This makes it easier to vary behavior per-dtype (you can see in this example that the entry for bool and float differ) without having to add extra condtionals on top.
Otherwise, to generate this code, we have to hop through several successive API changes:
* In TORCH_IMPL_FUNC(ufunc_add_CPU), go from StructuredImplSignature to StubSignature (call the stub). This is normal argument massaging in the classic translate style.
* In add_kernel, go from StubSignature to UfuncSignature. This is nontrivial, because we must do various conversions outside of the inner kernel loop. These conversions are done by hand, setting up the context appropriately, and then the final ufunc call is done using translate. (BTW, I introduce a new convention here, call on a Signature, for code generating a C++ call, and I think we should try to use this convention elsewhere)
The other piece of nontrivial logic is the reindexing by dtype. This reindexing exists because the native_functions.yaml format is indexed by UfuncKey:
```
Generic: add (AllAndComplex, BFloat16, Half)
ScalarOnly: add (Bool)
```
but when we do code generation, we case on dtype first, and then we generate a `cpu_kernel` or `cpu_kernel_vec` call. We also don't care about CUDA code generation (which Generic) hits. Do this, we lower these keys into two low level keys, CPUScalar and CPUVector, which represent the CPU scalar and CPU vectorized ufuncs, respectively (Generic maps to CPUScalar and CPUVector, while ScalarOnly maps to CPUScalar only). Reindexing then gives us:
```
AllAndComplex:
CPUScalar: add
CPUVector: add
Bool:
CPUScalar: add
...
```
which is a good format for code generation, but too wordy to force native_functions.yaml authors to write. Note that when reindexing, it is possible for there to be a conflicting definition for the same dtype; we just define a precedence order and have one override the other, so that it is easy to specialize on a particular dtype if necessary. Also note that because CPUScalar/CPUVector are part of UfuncKey, technically you can manually specify them in native_functions.yaml, although I don't expect this functionality to be used.
**CUDA codegen.** CUDA code generation has many of the same ideas as CPU codegen, but it needs to know about functors, and stubs are handled slightly differently. Here is what we want to generate:
```
template <typename scalar_t>
struct CUDAFunctorOnSelf_add {
using opmath_t = at::opmath_type<scalar_t>;
opmath_t other_;
opmath_t alpha_;
CUDAFunctorOnSelf_add(opmath_t other, opmath_t alpha)
: other_(other), alpha_(alpha) {}
__device__ scalar_t operator()(scalar_t self) {
return ufunc::add(static_cast<opmath_t>(self), other_, alpha_);
}
};
... two more functors ...
void add_kernel(TensorIteratorBase& iter, const at::Scalar & alpha) {
TensorIteratorBase& iter = *this;
at::ScalarType st = iter.common_dtype();
RECORD_KERNEL_FUNCTION_DTYPE("ufunc_add_CUDA", st);
switch (st) {
AT_PRIVATE_CASE_TYPE("ufunc_add_CUDA", at::ScalarType::Bool, bool, [&]() {
using opmath_t = at::opmath_type<scalar_t>;
if (false) {
} else if (iter.is_cpu_scalar(1)) {
CUDAFunctorOnOther_add<scalar_t> ufunctor(
iter.scalar_value<opmath_t>(1), (alpha).to<opmath_t>());
iter.remove_operand(1);
gpu_kernel(iter, ufunctor);
} else if (iter.is_cpu_scalar(2)) {
CUDAFunctorOnSelf_add<scalar_t> ufunctor(
iter.scalar_value<opmath_t>(2), (alpha).to<opmath_t>());
iter.remove_operand(2);
gpu_kernel(iter, ufunctor);
} else {
gpu_kernel(iter, CUDAFunctor_add<scalar_t>((alpha).to<opmath_t>()));
}
})
...
REGISTER_DISPATCH(add_stub, &add_kernel);
TORCH_IMPL_FUNC(ufunc_add_CUDA)
(const at::Tensor& self,
const at::Tensor& other,
const at::Scalar& alpha,
const at::Tensor& out) {
add_kernel(*this, alpha);
}
```
The functor business is the bulk of the complexity. Like CPU, we decompose CUDA implementation into three low-level keys: CUDAFunctor (normal, all CUDA kernels will have this), and CUDAFunctorOnOther/CUDAFunctorOnScalar (these are to support Tensor-Scalar specializations when the Scalar lives on CPU). Both Generic and ScalarOnly provide ufuncs for CUDAFunctor, but for us to also lift these into Tensor-Scalar specializations, the operator itself must be eligible for Tensor-Scalar specialization. At the moment, this is hardcoded to be all binary operators, but in the future we can use tags in native_functions.yaml to disambiguate (or perhaps expand codegen to handle n-ary operators).
The reindexing process not only reassociates ufuncs by dtype, but it also works out if Tensor-Scalar specializations are needed and codegens the ufunctors necessary for the level of specialization here (`compute_ufunc_cuda_functors`). Generating the actual kernel (`compute_ufunc_cuda_dtype_body`) just consists of, for each specialization, constructing the functor and then passing it off to `gpu_kernel`. Most of the hard work is in functor generation, where we take care to make sure `operator()` has the correct input and output types (which `gpu_kernel` uses to arrange for memory accesses to the actual CUDA tensor; if you get these types wrong, your kernel will still work, it will just run very slowly!)
There is one big subtlety with CUDA codegen: this won't work:
```
Generic: add (AllAndComplex, BFloat16, Half)
ScalarOnly: add_bool (Bool)
```
This is because, even though there are separate Generic/ScalarOnly entries, we only generate a single functor to cover ALL dtypes in this case, and the functor has the ufunc name hardcoded into it. You'll get an error if you try to do this; to fix it, just make sure the ufunc is named the same consistently throughout. In the code, you see this because after testing for the short circuit case (when a user provided the functor themselves), we squash all the generic entries together and assert their ufunc names are the same. Hypothetically, if we generated a separate functor per dtype, we could support differently named ufuncs but... why would you do that to yourself. (One piece of nastiness is that the native_functions.yaml syntax doesn't stop you from shooting yourself in the foot.)
A brief word about CUDA stubs: technically, they are not necessary, as there is no CPU/CPUKernel style split for CUDA kernels (so, if you look, structured impl actually calls add_kernel directly). However, there is some code that still makes use of CUDA stubs (in particular, I use the stub to conveniently reimplement sub in terms of add), so we still register it. This might be worth frying some more at a later point in time.
**Build system changes.** If you are at FB, you should review these changes in fbcode, as there are several changes in files that are not exported to ShipIt.
The build system changes in this patch are substantively complicated by the fact that I have to implement these changes five times:
* OSS cmake build
* OSS Bazel build
* FB fbcode Buck build
* FB xplat Buck build (selective build)
* FB ovrsource Buck build
Due to technical limitations in the xplat Buck build related to selective build, it is required that you list every ufunc header manually (this is done in tools/build_variables.bzl)
The OSS cmake changes are entirely in cmake/Codegen.cmake there is a new set of files cpu_vec_generated (corresponding to UfuncCPUKernel files) which is wired up in the same way as other files. These files are different because they need to get compiled multiple times under different vectorization settings. I adjust the codegen, slightly refactoring the inner loop into its own function so I can use different base path calculation depending on if the file is traditional (in the native/cpu folder) or generated (new stuff from this diff.
The Bazel/Buck changes are organized around tools/build_variables.bzl, which contain the canonical list of ufunc headers (aten_ufunc_headers), and tools/ufunc_defs.bzl (added to ShipIt export list in D34465699) which defines a number of functions that compute the generated cpu, cpu kernel and cuda files based on the headers list. For convenience, these functions take a genpattern (a string with a {} for interpolation) which can be used to easily reformat the list of formats in target form, which is commonly needed in the build systems.
The split between build_variables.bzl and ufunc_defs.bzl is required because build_variables.bzl is executed by a conventional Python interpreter as part of the OSS cmake, but we require Skylark features to implement the functions in ufunc_defs.bzl (I did some quick Googling but didn't find a lightweight way to run the Skylark interpreter in open source.)
With these new file lists, the rest of the build changes are mostly inserting references to these files wherever necessary; in particular, cpu kernel files have to be worked into the multiple vectorization build flow (intern_build_aten_ops in OSS Bazel). Most of the subtlety relates to selective build. Selective build requires operator files to be copied per overall selective build; as dhruvbird explains to me, glob expansion happens during the action graph phase, but the selective build handling of TEMPLATE_SOURCE_LIST is referencing the target graph. In other words, we can't use a glob to generate deps for another rule, because we need to copy files from wherever (included generated files) to a staging folder so the rules can pick them up.
It can be somewhat confusing to understand which bzl files are associated with which build. Here are the relevant mappings for files I edited:
* Used by everyone - tools/build_tools.bzl, tools/ufunc_defs.bzl
* OSS Bazel - aten.bzl, BUILD.bazel
* FB fbcode Buck - TARGETS
* FB xplat Buck -BUCK, pt_defs.bzl, pt_template_srcs.bzl
* FB ovrsource Buck - ovrsource_defs.bzl, pt_defs.bzl
Note that pt_defs.bzl is used by both xplat and ovrsource. This leads to the "tiresome" handling for enabled backends, as selective build is CPU only, but ovrsource is CPU and CUDA.
BTW, while I was at it, I beefed up fb/build_arvr.sh to also do a CUDA ovrsource build, which was not triggered previously.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D31306586
Pulled By: ezyang
fbshipit-source-id: 210258ce83f578f79cf91b77bfaeac34945a00c6
(cherry picked from commit d65157b0b894b6701ee062f05a5f57790a06c91c)
Summary:
- Target Sha1: ae108ef49aa5623b896fc93d4298c49d1750d9ba
- Make USE_XNNPACK a dependent option on cmake minimum version 3.12
- Print USE_XNNPACK under cmake options summary, and print the
availability from collet_env.py
- Skip XNNPACK based tests when XNNPACK is not available
- Add SkipIfNoXNNPACK wrapper to skip tests
- Update cmake version for xenial-py3.7-gcc5.4 image to 3.12.4
- This is required for the backwards compatibility test.
The PyTorch op schema is XNNPACK dependent. See,
aten/src/ATen/native/xnnpack/RegisterOpContextClass.cpp for
example. The nightly version is assumed to have USE_XNNPACK=ON,
so with this change we ensure that the test build can also
have XNNPACK.
- HACK: skipping test_xnnpack_integration tests on ROCM
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72642
Reviewed By: kimishpatel
Differential Revision: D34456794
Pulled By: digantdesai
fbshipit-source-id: 85dbfe0211de7846d8a84321b14fdb061cd6c037
(cherry picked from commit 6cf48e7b64d6979962d701b5d493998262cc8bfa)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72869
The ordering here doesn't really matter, but in a future patch
I will make a change where vectorized CPU codegen does have to
be here, and moving it ahead of time (with no code changes)
will make the latter diff cleaner.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D34282229
Pulled By: ezyang
fbshipit-source-id: 3397cb0e062d63cc9853f6248f17c3558013798b
(cherry picked from commit 98c616024969f9df90c7fb09741ed9be7b7a20f1)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72761
By default, the CUPTI_INCLUDE_DIR will pick up cupti.h from /usr/include which is old (from 2017 on AWS), and missing many cupti headers. Use NO_DEFAULT_PATH to avoid that, instead search from the list of locations provided.
Test Plan:
Fixes missing headers error when building on AWS. (Avoids old cupti.h from /usr/include). Instead uses cupti.h from cuda/extras/CUPTI/include.
```
In file included from /scratch/aaronshi/pytorch/third_party/kineto/libkineto/src/CuptiRangeProfilerApi.cpp:13:0:
/scratch/aaronshi/pytorch/third_party/kineto/libkineto/src/CuptiRangeProfilerApi.h:12:10: fatal error: cupti_profiler_target.h: No such file or directory
#include <cupti_profiler_target.h>
^~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
```
and
```
/scratch/aaronshi/pytorch/third_party/kineto/libkineto/src/CuptiRangeProfilerApi.cpp:7:10: fatal error: nvperf_host.h: No such file or directory
#include <nvperf_host.h>
^~~~~~~~~~~~~~~
compilation terminated.
```
Reviewed By: briancoutinho
Differential Revision: D34191123
Pulled By: aaronenyeshi
fbshipit-source-id: d84f80308c9939ba8ed504e667847d136a261453
(cherry picked from commit 33368bd93b)
Summary:
In ROCm 5.0 and later the version of the ROCm platform can be obtained via
an api call vs reading from a flat file.
If the header file /opt/rocm/include/rocm_version.h exists,
LoadHIP.cmake compiles source referencing the api and prints out the
ROCM Versions.
If the file does not exist, LoadHIP.cmake will revert to the previous
approach of looking for the version-dev file.
Fixes #{issue number}
cc jeffdaily sunway513 jithunnair-amd ROCmSupport KyleCZH
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69481
Reviewed By: seemethere, janeyx99
Differential Revision: D34153435
Pulled By: malfet
fbshipit-source-id: f8c0650d27666d2a3cf47d812807798c47210b37
(cherry picked from commit 6cbb4f7a0c)
Summary:
`include_directories` is old-style CMake which adds the include path to every file being compiled. This instead makes `python`, `numpy` and `pybind11` into targets that only `torch_python` and `caffe2_pybind_state` are linked to. So, python libraries can't be accidentally included elsewhere.
Resubmit of https://github.com/pytorch/pytorch/issues/65654, Closes https://github.com/pytorch/pytorch/issues/65828
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69085
Reviewed By: anjali411
Differential Revision: D33776456
Pulled By: malfet
fbshipit-source-id: 018b0f6cd5a4f8c9e36df961deff832bc4afd479
(cherry picked from commit 57063107d6)
Summary:
Remove forcing CUDNN_STATIC when CAFFE2_STATIC_LINK_CUDA is set
Since we are transitioning to using dynamic loading for multiple pytorch dependecies and CUDNN is the first step in this transition, hence we want to remove forcing CUDNN to statically load, and instead load it dynamically.
Tested using following workflow:
https://github.com/pytorch/pytorch/actions/runs/1790666862
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72290
Reviewed By: albanD
Differential Revision: D34003793
Pulled By: atalman
fbshipit-source-id: 41bda7ac019a612ee53ceb18d1e372b1bb3cb68e
(cherry picked from commit 4a01940e68)
Summary:
This PR upgrades oneDNN to v2.5.2, and includes some building support for oneDNN v2.5.2.
v2.4 changes:
- Improved performance for future Intel Xeon Scalable processor (code name Sapphire Rapids). The functionality is disabled by default and should be enabled via CPU dispatcher control.
- Improved binary primitive performance for cases when one of the tensors is broadcasted.
- Improved performance of reduction primitive, reorder, shuffle primitives.
- Improved performance of depthwise convolution forward propagation for processors with Intel AVX5-12 support
- Improved performance of forward inner product primitive for the shapes with minibatch equal to 1 for processors with Intel AVX-512 support
- Improved performance of int8 matmul and inner product primitives for processors with Intel AVX2 and Intel DL Boost support
v2.5 changes:
- Improved performance for future Intel Xeon Scalable processors (code name Sapphire Rapids). The functionality is now enabled by default and requires Linux kernel 5.16.
- Improved performance of matmul primitive for processors with Intel AVX-512 support.
v2.5.2 changes:
- Fixed performance regression in binary primitive with broadcast
- Fixed segmentation fault in depthwise convolution primitive for shapes with huge spatial size for processors with Intel AVX-512 support
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71546
Reviewed By: george-qi
Differential Revision: D33827108
Pulled By: VitalyFedyunin
fbshipit-source-id: 8f5a19b331c82af5b0783f081e061e1034a93952
(cherry picked from commit 9705212fe9)