Summary:
Clang from Xcode does not support the `-fopenmp` option, so there is no need to try compiling with it.
Instead, infer whether OpenMP is supported by checking the `_OPENMP` define.
Also, use the clang compiler if the host app was compiled with clang rather than gcc.
Fix a few range-loop warnings and add `static_assert`s that range-loop variables are raw pointers.
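As an illustration, here is a minimal sketch (with hypothetical helper names, not the PR's actual code) of inferring the compiler and flags from macros the host build already defines, rather than test-compiling with `-fopenmp` and retrying on failure:
```cpp
#include <cstdio>
#include <string>

// If the host app was built with OpenMP, the compiler defines _OPENMP,
// so the same toolchain can compile fused kernels with -fopenmp too.
static std::string fuserFlags() {
  std::string flags = "-O3";
#ifdef _OPENMP
  flags += " -fopenmp";
#endif
  return flags;
}

// Match the fuser's compiler to the one that built the host app.
static const char* fuserCompiler() {
#ifdef __clang__
  return "clang++";
#else
  return "g++";
#endif
}

int main() {
  std::printf("%s %s\n", fuserCompiler(), fuserFlags().c_str());
  return 0;
}
```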
This change makes fuser tests on OS X a bit faster.
Before:
```
% python3 test_jit.py -v TestScript.test_batchnorm_fuser_cpu
Fail to import hypothesis in common_utils, tests are not derandomized
CUDA not available, skipping tests
test_batchnorm_fuser_cpu (__main__.TestScript) ... clang: error: unsupported option '-fopenmp'
clang: error: unsupported option '-fopenmp'
warning: pytorch jit fuser failed to compile with openmp, trying without it...
ok
----------------------------------------------------------------------
Ran 1 test in 0.468s
OK
```
After:
```
% python3 test_jit.py -v TestScript.test_batchnorm_fuser_cpu
Fail to import hypothesis in common_utils, tests are not derandomized
CUDA not available, skipping tests
test_batchnorm_fuser_cpu (__main__.TestScript) ... ok
----------------------------------------------------------------------
Ran 1 test in 0.435s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51504
Reviewed By: smessmer
Differential Revision: D26186875
Pulled By: malfet
fbshipit-source-id: 930b3bcf543fdfad0f493d687072aaaf5f9e2bfc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50228
```
fastmod -m 'expect(<((at|c10)::)?\w+Type>\(\)\s*)->' 'expectRef${1}.'
```
Presuming it builds, this is a safe change: the result of `expect()`
was never stored anywhere, so the owning `shared_ptr` it returns was
unnecessary and we can take a reference instead.
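For context, here is a self-contained analogue (simplified, hypothetical types; not the actual JIT Type API) of the pattern being rewritten, where `expect()` returns an owning `shared_ptr` whose refcount traffic is wasted when the result is used only once, and `expectRef()` borrows:
```cpp
#include <cassert>
#include <memory>

// Simplified stand-in for a JIT type; the real API is richer.
struct TensorType {
  int dim() const { return 4; }
};

struct Type {
  std::shared_ptr<TensorType> impl_ = std::make_shared<TensorType>();

  // expect<T>() analogue: returns a new shared_ptr (atomic refcount bump).
  std::shared_ptr<TensorType> expect() const { return impl_; }

  // expectRef<T>() analogue: hands out a reference, no refcount traffic.
  const TensorType& expectRef() const { return *impl_; }
};

int main() {
  Type t;
  // Before the codemod: the temporary shared_ptr is never stored.
  int a = t.expect()->dim();
  // After: the same call through a borrowed reference.
  int b = t.expectRef().dim();
  assert(a == b);
  return 0;
}
```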
ghstack-source-id: 119782961
Test Plan: CI
Reviewed By: SplitInfinity
Differential Revision: D25837374
fbshipit-source-id: 86757b70b1520e3dbaa141001e7976400cdd3b08
Summary:
fmax/fmin propagate the number (i.e., return the non-NaN operand) when one argument is NaN, which does not match eager-mode behavior, where max/min propagate the NaN.
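A standalone illustration of the mismatch (plain C++, not the fuser's generated code):
```cpp
#include <cassert>
#include <cmath>

int main() {
  double nan = std::nan("");
  // C's fmax/fmin treat NaN as missing data and return the other operand.
  assert(std::fmax(nan, 1.0) == 1.0);
  assert(std::fmin(nan, 1.0) == 1.0);
  // A NaN-propagating max, matching eager-mode elementwise behavior:
  auto max_propagate = [](double a, double b) {
    return (std::isnan(a) || std::isnan(b)) ? std::nan("") : std::fmax(a, b);
  };
  assert(std::isnan(max_propagate(nan, 1.0)));
  return 0;
}
```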
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43590
Reviewed By: mruberry
Differential Revision: D23338664
Pulled By: bertmaher
fbshipit-source-id: b0316a6f01fcf8946ba77621efa18f339379b2d0
Summary:
JIT pointwise kernels currently do not do vectorized loads/stores, which can lead to suboptimal performance for shorter data types such as half and int8.
In this PR, a fixed length of 4 elements per load/store is added for supported tensor shapes, implemented as a runtime check inside the kernel.
Supported tensor shapes (see the sketch after the list):
- all input/output data pointers are aligned to `4*sizeof(dtype)`
- the last dimension is contiguous (stride 1) and its size is a multiple of 4
- all other dimensions have strides that are multiples of 4
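A host-side C++ sketch of that eligibility check, as a direct translation of the conditions above (not the PR's actual code):
```cpp
#include <cstdint>
#include <vector>

// Returns true if a tensor qualifies for 4-wide vectorized load/store.
bool canVectorize4(uintptr_t data_ptr,
                   const std::vector<int64_t>& sizes,
                   const std::vector<int64_t>& strides,
                   int64_t elem_size) {
  // All data pointers must be aligned to 4 * sizeof(dtype).
  if (data_ptr % (4 * elem_size) != 0) return false;
  const size_t last = sizes.size() - 1;
  // Last dimension: contiguous (stride 1) with size a multiple of 4.
  if (strides[last] != 1 || sizes[last] % 4 != 0) return false;
  // All other dimensions: strides must be multiples of 4.
  for (size_t i = 0; i < last; ++i) {
    if (strides[i] % 4 != 0) return false;
  }
  return true;
}
```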
All test_jit* tests passed; here are performance results for a simple `ax+by+c` fusion.
result before PR:
```
torch.float32 kernel time: 0.748 ms.
torch.float16 kernel time: 0.423 ms.
torch.int8 kernel time: 0.268 ms.
```
result after PR:
```
torch.float32 kernel time: 0.733 ms.
torch.float16 kernel time: 0.363 ms.
torch.int8 kernel time: 0.191 ms.
```
test code:
```
import torch
import time

# disable profiling to test all data types
torch._C._jit_set_profiling_mode(False)
torch._C._jit_set_profiling_executor(False)

@torch.jit.script
def axpby(x, y):
    return x * 2 - y * 3 + 1

for test_dtype in [torch.float32, torch.float16, torch.int8]:
    a = torch.randn(12345, 4096, device="cuda").to(test_dtype)
    b = torch.randn(12345, 4096, device="cuda").to(test_dtype)
    # warm up
    for _ in range(100):
        c = axpby(a, b)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(1000):
        c = axpby(a, b)
    torch.cuda.synchronize()
    end = time.time()
    print("{} kernel time: {:.3f} ms.".format(test_dtype, end - start))
```
Generated code:
[log_with_generated_code.txt](https://github.com/pytorch/pytorch/files/4472813/log_with_generated_code.txt)
Additional note:
The double type is excluded from the vectorized code path.
We can later improve this with support for dynamic vectorization lengths and fewer in-kernel checks, once tensor shape information can be used in codegen. For now, the implementation follows the caching through the TensorDesc mechanism, which does not carry enough compile-time information.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36555
Differential Revision: D21142762
Pulled By: ngimel
fbshipit-source-id: 1cfdc5807a944c4670b040dc2d2dfa480377e7d7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35115
This commit runs the newly added tools/clang_format.py on the JIT
codebase and includes all of the formatting changes thus produced.
Testing:
Ran the script, CI.
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D20568523
Pulled By: SplitInfinity
fbshipit-source-id: e09bdb982ccf090eecfb7c7b461b8d0681eef82b