Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os

# Collect compiled files from compile_commands.json, mapping generated
# build/<...>.DEFAULT.cpp entries back to their source paths.
def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files

# Run clang-tidy on a single file and commit the result if anything changed.
def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname, "-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit", "--all", "-m", f"NOLINT stubs for {fname}"])

def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)

if __name__ == "__main__":
    main()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892
Reviewed By: H-Huang
Differential Revision: D27991944
Pulled By: malfet
fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56816
This doesn't actually work. For some reason the linker can't find
at::cpu::logit_out, and it's not worth digging into why not.
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D27977406
Pulled By: bertmaher
fbshipit-source-id: d0235a393f25243e2c8a011e9baf267daf483ae4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56455
CPU convolution performance is pretty important for inference, so
tracking performance for CNNs often boils down to finding shapes that have
either regressed or need optimization. This diff adds a benchmark harness that
lets you pretty easily add new sets of convolution parameters to benchmark.
I've started with an exhaustive list of layers from MobileNetV3, ResNet-18 and
ResNet-50, which are fairly popular torchvision models. More to come if these
prove useful.
I've also added four backend configurations:
- native: uses at::conv2d, which applies its own backend selection heuristics
- mkldnn_none: uses mkldnn but applies no prepacking; uses the NCHW default
- mkldnn_weight: prepacks weights in an mkldnn-friendly format
- mkldnn_input: also prepacks the inputs in NCHW16c
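As a rough Python analogue of what the "native" configuration measures (illustrative shapes and a hand-rolled timer, not the actual C++ googlebenchmark harness):
```
# Rough Python analogue of the "native" configuration: time conv2d over a
# few layer shapes. The shapes below are illustrative placeholders, not the
# exact MobileNetV3/ResNet lists added in this diff.
import time
import torch
import torch.nn.functional as F

# (batch, in_ch, out_ch, spatial, kernel, stride)
SHAPES = [
    (1, 3, 64, 224, 7, 2),    # ResNet-style stem
    (1, 64, 64, 56, 3, 1),    # ResNet-style 3x3
    (1, 16, 16, 112, 3, 1),   # MobileNet-style 3x3
]

def bench_native(n, ci, co, hw, k, stride, iters=50):
    x = torch.randn(n, ci, hw, hw)
    w = torch.randn(co, ci, k, k)
    for _ in range(5):                                   # warm-up
        F.conv2d(x, w, stride=stride, padding=k // 2)
    start = time.perf_counter()
    for _ in range(iters):
        F.conv2d(x, w, stride=stride, padding=k // 2)
    return (time.perf_counter() - start) / iters

for shape in SHAPES:
    print(shape, f"{bench_native(*shape) * 1e6:.1f} us")
```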
ghstack-source-id: 127027784
Test Plan: Ran this on my Skylake Xeon
Reviewed By: ngimel
Differential Revision: D27876139
fbshipit-source-id: 950e1dfa09a33cc3acc7efd579f56df8453af1f2
Summary:
There is a build failure in `bench_approx.cpp` due to a namespace change for `log_out` and `tanh_out`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56278
Reviewed By: bertmaher, nikithamalgifb
Differential Revision: D27825621
Pulled By: navahgar
fbshipit-source-id: 0bccd324af92a3460610bf475514449f0223de2b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55825
The mask has never been used (in vectorization we generate an explicit
`IfThenElse` construct when we need to mask out some elements). This PR
removes it and cleans up all traces of it from the tests.
Differential Revision: D27717776
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: 41d1feeea4322da75b3999d661801c2a7f82b9db
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55824
It seems that some of my recent changes (namely, removing the dep-tracker)
broke the TE benchmarks. This PR fixes them.
Differential Revision: D27717778
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: 48584bc0cfd4879a3e44cb45ee1f0d5c91b5afbc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55324
With this change `rfactor` only affects the passed loop and its body,
never touching anything outside it (that was the root cause of a bug in the
previous implementation). Also, we no longer have an `insertion_point`
parameter - its meaning was vague, and its effect should be achievable
with other transformations anyway.
The new `rfactor` semantics are as follows:
```
Requirements:
* S is the reduction store
* S is the only statement in the innermost loop
* There are at least two reduction arguments in S
* OUTER_REDUCTION_FOR loop corresponds to the outermost reduction variable
used in the store and all other reduction variables are index variables of
children loops of OUTER_REDUCTION_FOR
* OUTER_REDUCTION_FOR is a perfect loop nest, i.e. it has only loops
corresponding to the other reduction variables and the store, nested into
each other
What it does:
* Introduce a new buffer with an extra dimension of a size equal to the
span of the loop OUTER_REDUCTION_FOR (the new buffer is returned via
RFAC_BUF_PTR)
* Insert an initialization store for the new buffer in
OUTER_REDUCTION_FOR before its nested loop
* Replace the reduction store to the original buffer with the reduction
store to the temp buffer, removing the index var of OUTER_REDUCTION_FOR
from reduction arguments
* Insert a final reduction store over the extra dimension of the new
buffer to the original buffer
* Returns TRUE if the transformation succeeded and FALSE otherwise
Example:
Original IR:
S1: for i # normal axis
S2: X[i] = 0
S3: for j # reduction axis
S4: for k # reduction axis
S5: X[i] = ReduceOp(X[i] + Y[i,j,k], reduce_axis={j,k})
After RFACTOR(S5, S3)
S1: for i # normal axis
S2: X[i] = 0
S3: for j # reduction axis for X, normal axis for X_rfac
X_rfac[i,j] = 0
S4: for k # reduction axis
X_rfac[i,j] = ReduceOp(X_rfac[i,j] + Y[i,j,k], reduce_axis={k})
X[i] = ReduceOp(X[i] + X_rfac[i,j], reduce_axis={j})
```
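Numerically, the rewrite amounts to the following (a NumPy illustration of the IR above, not the NNC API):
```
# NumPy illustration of the rfactor rewrite: the two-axis reduction over
# (j, k) becomes a partial reduction over k into an rfactor buffer,
# followed by a reduction over j into the original buffer.
import numpy as np

Y = np.random.rand(4, 5, 6)   # axes: i (normal), j, k (reduction)

# Original: reduce over both j and k at once.
X = Y.sum(axis=(1, 2))

# After RFACTOR(S5, S3): X_rfac[i, j] accumulates over k only ...
X_rfac = Y.sum(axis=2)
# ... and the final store reduces the extra j dimension into X.
X_after = X_rfac.sum(axis=1)

assert np.allclose(X, X_after)
```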
Differential Revision: D27694960
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: 076fa6a1df2c23f5948302aa6b43e82cb222901c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54997
DepTracker was used to automatically pull in dependent computations from
output ones. While it seems quite convenient, it has led to several
architectural issues, which this stack fixes.
DepTracker worked on Tensors, where a Tensor is a pair of a Buf and a Stmt.
However, the Stmt could become stale and there was no way to reliably update
the corresponding Tensor. We are now using Bufs and Stmts directly and moving
away from using Tensors to avoid these problems.
Removing DepTracker allowed us to unify Loads and FunctionCalls, which were
essentially duplicates of each other.
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D27446414
Pulled By: ZolotukhinM
fbshipit-source-id: a2a32749d5b28beed92a601da33d126c0a2cf399
Summary:
Context: https://github.com/pytorch/pytorch/pull/53299#discussion_r587882857
These are the only hand-written parts of this diff:
- the addition to `.github/workflows/lint.yml`
- the file endings changed in these four files (to appease FB-internal land-blocking lints):
- `GLOSSARY.md`
- `aten/src/ATen/core/op_registration/README.md`
- `scripts/README.md`
- `torch/csrc/jit/codegen/fuser/README.md`
The rest was generated by running this command (on macOS):
```
git grep -I -l ' $' -- . ':(exclude)**/contrib/**' ':(exclude)third_party' | xargs gsed -i 's/ *$//'
```
I looked over the auto-generated changes and didn't see anything that looked problematic.
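For reference, a minimal Python sketch of the kind of check the new lint step performs (an illustration only, not the actual lint.yml contents):
```
# Hypothetical stand-in for the trailing-whitespace lint: list tracked files
# containing a line that ends with a space, mirroring the git grep above.
import subprocess
import sys

# git grep exits with a non-zero status when nothing matches,
# so don't use check=True here.
result = subprocess.run(
    ["git", "grep", "-I", "-l", " $", "--", ".",
     ":(exclude)**/contrib/**", ":(exclude)third_party"],
    capture_output=True, text=True)
if result.stdout.strip():
    print("Files with trailing whitespace:")
    print(result.stdout, end="")
    sys.exit(1)
```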
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53406
Test Plan:
This run (after adding the lint but before removing existing trailing spaces) failed:
- https://github.com/pytorch/pytorch/runs/2043032377
This run (on the tip of this PR) succeeded:
- https://github.com/pytorch/pytorch/runs/2043296348
Reviewed By: walterddr, seemethere
Differential Revision: D26856620
Pulled By: samestep
fbshipit-source-id: 3f0de7f7c2e4b0f1c089eac9b5085a58dd7e0d97
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50995
This change makes 'Tensor' a thin wrapper over 'Buf' and 'Stmt', and
merges it with the recently introduced 'CompoundTensor'. A statement for the
tensor is either passed directly to the Tensor constructor (as with
'CompoundTensor') or built immediately in the constructor.
LoopNest is no longer responsible for constructing statements from
tensors - it simply stitches together the already constructed statements
contained in the Tensors. A side effect is that we can no longer construct
several loop nests from the same tensors - we need to explicitly clone the
statements if we want to do that. A special copy constructor was added to
LoopNest to make this more convenient (note: this only affects tests; we
don't usually create multiple loop nests elsewhere).
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D26038223
Pulled By: ZolotukhinM
fbshipit-source-id: 27a2e5900437cfb0c151e8f89815edec53608e17
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50193
* Supports aten, native reference, and NNC TE implementations.
* Supports functionality checks against aten, in addition to performance checks.
Test plans:
* After enabling "BUILD_TENSOREXPR_BENCHMARK" in CMakeLists.txt, run
* bin/tensorexpr_bench --benchmark_filter=Reduce1D
Measurements, on a Broadwell E5-2686 CPU:
```
Reduce1D/Torch/16777216            5638547 ns   5638444 ns  119  BYTES=11.902G/s
Reduce1D/Naive/16777216           19308235 ns  19308184 ns   36  BYTES=3.47567G/s
Reduce1D/NativeRfactor/16777216    8433348 ns   8433038 ns   85  BYTES=7.95785G/s
Reduce1D/NativeVector/16777216     5608836 ns   5608727 ns  124  BYTES=11.9651G/s
Reduce1D/NativeTiled/16777216      5550233 ns   5550221 ns  126  BYTES=12.0912G/s
Reduce1D/TeNaive/16777216         21451047 ns  21450752 ns   33  BYTES=3.12851G/s
Reduce1D/TeSplitTail/16777216     23701732 ns  23701229 ns   30  BYTES=2.83145G/s
Reduce1D/TeSplitMask/16777216     23683589 ns  23682978 ns   30  BYTES=2.83363G/s
Reduce1D/TeRfactorV2/16777216      5378019 ns   5377909 ns  131  BYTES=12.4786G/s
```
Result summary:
* The single-threaded performance of NNC TeRfactorV2 matches and exceeds its aten and avx2 naive counterparts.
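As a rough point of reference, the Torch row above corresponds to something like the following in Python (a sketch of what is measured, not the C++ harness):
```
# Rough Python approximation of the Reduce1D/Torch row: reduce a 16M-element
# float tensor and report effective bandwidth, mirroring the BYTES counter.
# Timings will of course differ from the C++ harness.
import time
import torch

N = 16777216
x = torch.randn(N)

for _ in range(5):                        # warm-up
    x.sum()

iters = 100
start = time.perf_counter()
for _ in range(iters):
    x.sum()
elapsed = (time.perf_counter() - start) / iters

gbytes_per_s = N * 4 / elapsed / 1e9      # 4 bytes per float32 element
print(f"Reduce1D/Torch/{N}: {elapsed * 1e9:.0f} ns  BYTES={gbytes_per_s:.3f}G/s")
```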
Follow-up items:
* rfactor does not work well with split
* We don't have a multi-threaded implementation yet.
* We are missing a "parallel" scheduling primitive, which is no different from what we need for pointwise ops.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D25821880
Pulled By: zheng-xq
fbshipit-source-id: 8df3f40d1eed8749c8edcaacae5f0544dbf6bed3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50543
Original commit changeset: 2d2f07f79986
Was part of a stack that got reverted. This is just a benchmark.
ghstack-source-id: 119825594
Test Plan: CI
Reviewed By: navahgar
Differential Revision: D25912439
fbshipit-source-id: 5d9ca45810fff8931a3cfbd03965e11050180676
Summary:
This is a fast log implementation.
Benchmark:
```
buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench -c 'fbcode.caffe2_gpu_type=none'
```
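For intuition, the usual shape of such a fast log approximation looks roughly like this (an illustrative sketch; the coefficients and vectorized bit tricks in the actual NNC kernel differ):
```
# Illustrative fast-log sketch: split x into mantissa * 2**e and use a short
# polynomial for log(mantissa). Not the NNC implementation; the real kernel
# uses tuned coefficients and vectorized bit manipulation.
import numpy as np

def fast_log(x):
    m, e = np.frexp(x)                  # x = m * 2**e, with m in [0.5, 1)
    t = m - 1.0                         # log(1 + t) via a low-order series
    poly = t - 0.5 * t * t + t ** 3 / 3.0
    return poly + e * np.log(2.0)

x = np.linspace(0.1, 10.0, 1000)
print(np.abs(fast_log(x) - np.log(x)).max())   # crude, but shows the structure
```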
Test Plan: buck test mode/no-gpu //caffe2/test/cpp/tensorexpr:tensorexpr -- *.fastLogFloat
Reviewed By: bertmaher
Differential Revision: D25445815
fbshipit-source-id: 20696eacd12a55e797f606f4a6dbbd94c9652888
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46124
We want to make sure we can actually fuse kernels within a fairly
tight time budget. So here's a quick benchmark of codegen for a simple
pointwise activation function (swish). I kept all the intermediate tensors
separate to force TE to actually do inlining.
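In Python terms, keeping the intermediate tensors separate means building swish out of individual pointwise ops rather than calling a fused kernel, roughly as follows (an illustration of the workload, not the benchmark's TE-building code):
```
# Illustration of the unfused swish workload: each pointwise step produces
# its own intermediate tensor, which gives the compiler real inlining work.
import torch

def swish_unfused(x):
    neg = -x                 # intermediate 1
    e = torch.exp(neg)       # intermediate 2
    denom = e + 1.0          # intermediate 3
    sig = 1.0 / denom        # intermediate 4: sigmoid(x)
    return x * sig           # swish(x) = x * sigmoid(x)

x = torch.randn(1024)
assert torch.allclose(swish_unfused(x), x * torch.sigmoid(x), atol=1e-6)
```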
Test Plan:
```
buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench
```
I've only run this in debug mode, so the results aren't super meaningful, but even
in that mode it's 18ms for compilation, 15ms of which is in LLVM.
Update, opt build mode:
```
----------------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------------
BM_CompileSwish 5123276 ns 5119846 ns 148
BM_CompileSwishLLVMOnly 4754361 ns 4753701 ns 160
```
Reviewed By: asuhan
Differential Revision: D24232801
fbshipit-source-id: d58a8b7f79bcd9244c49366af7a693e09f24bf76
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45875
Adds a googlebenchmark harness for perf testing programs generated by
tensorexpr, sans any pytorch wrappings (for python-level benchmarks of
tensorexpr, see benchmarks/tensorexpr).
Currently there's a harness for gemm that sets up the problem using torch (and
also measures the perf of a torch::mm to give a baseline).
Right now there's just an unoptimized implementation, which is not expected to be
very fast. More optimized versions are coming.
Sample output from my dev box:
```
Run on (48 X 2501 MHz CPU s)
CPU Caches:
L1 Data 32K (x24)
L1 Instruction 32K (x24)
L2 Unified 256K (x24)
L3 Unified 30720K (x2)
--------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
--------------------------------------------------------------------------------------------
Gemm/Torch/128/128/128 73405 ns 73403 ns 8614 GFLOPS=57.1411G/s
Gemm/TensorExprNoopt/128/128/128 3073003 ns 3072808 ns 229 GFLOPS=1.36497G/s
```
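For a sense of what the unoptimized kernel is up against, here is a hedged Python stand-in for the comparison (a naive triple-loop matmul checked against torch.mm; the real benchmark generates its kernel with TE):
```
# Python stand-in for the Gemm comparison: a naive triple-loop matmul (playing
# the role of the unoptimized generated kernel) checked against torch.mm,
# which is the baseline measured above.
import torch

def gemm_naive(a, b):
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    out = torch.zeros(m, n)
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += float(a[i, p]) * float(b[p, j])
            out[i, j] = acc
    return out

a, b = torch.randn(16, 16), torch.randn(16, 16)
assert torch.allclose(gemm_naive(a, b), torch.mm(a, b), atol=1e-4)
```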
Test Plan: Imported from OSS
Reviewed By: SplitInfinity
Differential Revision: D24142403
Pulled By: bertmaher
fbshipit-source-id: 3354aaa56868a43a553acd1ad9a192f28d8e3597