A number of OSS PRs were reverted because they introduced new signed-unsigned comparison warnings, which are treated as errors in some internal builds.
It is not clear how those selective rules are applied, but this PR removes `-Wno-sign-compare` from the PyTorch codebase.
The only tricky part of this PR is making sure that non-ASCII character detection works for both signed and unsigned chars here:
6e3d51b08a/torch/csrc/jit/serialization/python_print.cpp (L926)
Several files are excluded from sign-compare checks when flash attention is used, due to a violation in cutlass that is to be fixed by https://github.com/NVIDIA/cutlass/pull/869
Sign-compare violations in the caffe2 codebase are deliberately left unfixed.
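In sketch form, a signedness-safe check looks like this (illustrative only, not the actual code at the link):
```cpp
#include <string>

// Whether a byte is outside the ASCII range, regardless of whether the
// platform's plain `char` is signed or unsigned: cast to unsigned char
// first so bytes 0x80-0xFF never compare as negative values.
bool has_non_ascii(const std::string& s) {
  for (char c : s) {
    if (static_cast<unsigned char>(c) > 127) {
      return true;
    }
  }
  return false;
}
```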
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96723
Approved by: https://github.com/albanD
Summary:
Usage of fast math in the BatchBoxCox kernel produced different numerical results between the dev and optimized builds, which caused a few internal tests to fail.
For now, disable the compiler-optimized version and rely on ATen vectorization.
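For context, a scalar sketch of the transform the kernel computes, assuming the two-parameter form used by caffe2's batch_box_cox (lambda1 as the power, lambda2 as a shift); fast math may legally reassociate or approximate the pow/log here, which is how the results diverged:
```cpp
#include <cmath>

// Scalar reference for one element; the real kernel is vectorized.
float box_cox(float x, float lambda1, float lambda2) {
  const float v = x + lambda2;
  return lambda1 == 0.0f ? std::log(v)
                         : (std::pow(v, lambda1) - 1.0f) / lambda1;
}
```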
Differential Revision: D41211784
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88875
Approved by: https://github.com/hyuen
Summary:
1) Add an MKL/AVX2-based implementation to perfkernels. This implementation is similar to caffe2/operators/batch_box_cox_op.cc
2) Migrate caffe2's batch_box_cox_op to use this implementation
Test Plan: CI
Differential Revision: D40208074
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86569
Approved by: https://github.com/hyuen
We're no longer building Caffe2 mobile as part of our CI, and it adds a lot of clutter to our makefiles. Any lingering internal dependencies will use the buck build and so won't be affected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84338
Approved by: https://github.com/dreiss
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70248
Modified loops in files under fbsource/fbcode/caffe2/ from the format
```
for (TYPE var = x0; var < x_max; var++)
```
to the format
```
for (const auto var : irange(x_max))
```
This was achieved by running r-barnes's loop upgrader script (D28874212), with some modifications to exclude all files under /torch/jit, plus a number of reversions and unused-variable warning suppressions added by hand.
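For illustration, a minimal before/after of the transformation (using PyTorch's c10::irange):
```cpp
#include <c10/util/irange.h>

void example(int n) {
  // before: index type chosen ad hoc, prone to sign-compare warnings
  for (int i = 0; i < n; i++) { /* ... */ }
  // after: the index type is deduced from the bound, and the index
  // is immutable inside the loop body
  for (const auto i : c10::irange(n)) { /* ... */ }
}
```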
Test Plan: Sandcastle
Reviewed By: malfet
Differential Revision: D32813863
fbshipit-source-id: 527244b4a2b220fdfe7f17dee3599603f492a2ca
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66743
Modified loops in files under fbsource/fbcode/caffe2/ from the format
`for (TYPE var = x0; var < x_max; var++)`
to the format
`for (const auto var : irange(x_max))`
This was achieved by running r-barnes's loop upgrader script (D28874212), with some modifications to exclude all files under /torch/jit, plus a number of reversions and unused-variable warning suppressions added by hand.
Test Plan: Sandcastle
Reviewed By: malfet
Differential Revision: D31705359
fbshipit-source-id: c9ea2fbc0f9cd29e97a52dcb203addc5f2abb09b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66234
Modified loops in files under fbsource/fbcode/caffe2/ from the format
`for (TYPE var = x0; var < x_max; var++)`
to the format
`for (const auto var : irange(x_max))`
This was achieved by running r-barnes's loop upgrader script (D28874212), with some modifications to exclude all files under /torch/jit, plus a number of reversions and unused-variable warning suppressions added by hand.
bypass_size_limit
allow-large-files
Test Plan: Sandcastle
Reviewed By: ngimel
Differential Revision: D30652629
fbshipit-source-id: 0ae6c4bbbb554bad42e372792a6430e1acf15e3e
Summary:
This PR deletes some code in `MiscCheck.cmake` that performs the exact
same function as `FindAVX.cmake`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61748
Reviewed By: ejguan
Differential Revision: D29791282
Pulled By: malfet
fbshipit-source-id: 6595fd1b61c8ae12b821fad8c9a34892dd52d213
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60677
Add a rule to wrap conversions.h and depend on that, rather than
relying on a glob which violates package boundaries.
Test Plan: `buck2 build fbcode//caffe2/caffe2:caffe2_core`
Reviewed By: mzlee
Differential Revision: D29370841
fbshipit-source-id: d4dd383eb8457d4f5118574e34e6f17c32fde647
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os
def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files

def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname, "-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit", "--all", "-m", f"NOLINT stubs for {fname}"])

def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)

if __name__ == "__main__":
    main()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892
Reviewed By: H-Huang
Differential Revision: D27991944
Pulled By: malfet
fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
Summary:
* `#if` with an undefined name is a warning when `-Wundef` is specified (which it is in ovrsource, for example)
* identifiers starting with two underscores are [reserved for compiler internals](https://en.cppreference.com/w/cpp/language/identifiers)
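Both issues in a small sketch (macro names hypothetical):
```cpp
// With -Wundef, an undefined identifier in #if is a warning rather than
// silently evaluating to 0:
#if MY_FEATURE            // warning: 'MY_FEATURE' is not defined [-Wundef]
#endif
#if defined(MY_FEATURE)   // fine: explicitly tests whether it is defined
#endif

// Identifiers containing a double underscore (or starting with an
// underscore followed by a capital) belong to the implementation:
#define __MY_GUARD__ 1    // reserved name; prefer MY_GUARD instead
```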
Test Plan: CI
Reviewed By: ezyang
Differential Revision: D27318070
fbshipit-source-id: 4989fc6a3bf3c176eddd7c25aca47414e4973edd
Summary:
Follow-up to https://github.com/pytorch/pytorch/issues/53951.
This PR fixes the remaining Semmle warnings: comparison of narrow type with wide type in loop condition.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54471
Reviewed By: bdhirsh
Differential Revision: D27262493
Pulled By: malfet
fbshipit-source-id: 05765758da79699936af11de237c3ff3d34373d6
Summary:
Fix Semmle warning: comparison of narrow type with wide type in loop condition.
For example, consider the following piece of code:
for (int i = 0; i < array.size(); ++i) {}
The problem is that array.size() returns size_t, which can be a wider type than int depending on the implementation, so there is a chance that i overflows (for a very large array whose size is beyond the range of int) and the loop never terminates.
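A minimal sketch of the usual fix, assuming a standard container, is to give the counter the same width as the bound:
```cpp
#include <vector>

void iterate(const std::vector<float>& array) {
  // size_t matches the return type of size(), so the comparison never
  // mixes widths and the counter cannot overflow before reaching the bound
  for (size_t i = 0; i < array.size(); ++i) {
    /* ... */
  }
}
```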
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53951
Reviewed By: zou3519
Differential Revision: D27181495
Pulled By: malfet
fbshipit-source-id: 0612c5cedcdc656c193085e7fbb87dd163f20688
Summary:
When building libtorch as a static library, these three static libraries are generated but won't be installed to CMAKE_INSTALL_LIBDIR:
- libCaffe2_perfkernels_avx2.a
- libCaffe2_perfkernels_avx512.a
- libCaffe2_perfkernels_avx.a
This PR will fix this issue.
Note that after this fix there are still static libraries missing from CMAKE_INSTALL_LIBDIR, but they belong to third_party repos, and we need to fix them in the corresponding repos:
- libfoxi_loader.a
- libonnx.a
- libonnx_proto.a
- libfmt.a
- libnccl_static.a
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53825
Reviewed By: ngimel
Differential Revision: D27013844
Pulled By: malfet
fbshipit-source-id: 8a84cc72b6ae87393ca26c4e474f5526a7b18ab2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46758
It's in general helpful to support int32 indices and offsets, especially when such tensors are large and need to be transferred to accelerator backends. Since it may not be very useful to support the combination of int32 indices and int64 offsets, here we enforce that these two must have the same type.
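A minimal sketch of how such a constraint can be enforced (illustrative, not the exact check in this PR):
```cpp
#include <ATen/ATen.h>

// indices and offsets may both be int32 or both int64, but not mixed
void check_index_types(const at::Tensor& indices, const at::Tensor& offsets) {
  TORCH_CHECK(
      indices.scalar_type() == offsets.scalar_type(),
      "expected indices and offsets to have the same dtype, got ",
      indices.scalar_type(), " and ", offsets.scalar_type());
}
```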
Test Plan: unit tests
Reviewed By: ngimel
Differential Revision: D24470808
fbshipit-source-id: 94b8a1d0b7fc9fe3d128247aa042c04d7c227f0b
Summary: I think this preprocessor check is incorrect. The fused multiply-add (FMA) instructions are not part of AVX2.
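A sketch of the corrected gating, assuming GCC/Clang feature macros (FMA has its own __FMA__ macro and -mfma flag, separate from AVX2):
```cpp
#if defined(__AVX2__) && defined(__FMA__)
#include <immintrin.h>
// Guard FMA intrinsics on __FMA__ rather than assuming __AVX2__ implies it:
// a compiler can target AVX2 without enabling FMA.
__m256 fma_example(__m256 a, __m256 b, __m256 c) {
  return _mm256_fmadd_ps(a, b, c);  // a * b + c, fused into one instruction
}
#endif
```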
Test Plan: CI
Reviewed By: jspark1105
Differential Revision: D24237836
fbshipit-source-id: 44f9b9179918332eb85ac087827726300f56224e
Summary:
The `2to3` tool has a `future` fixer that can be targeted specifically to remove these; the `caffe2` directory has the most redundant imports:
```2to3 -f future -w caffe2```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45033
Reviewed By: seemethere
Differential Revision: D23808648
Pulled By: bugra
fbshipit-source-id: 38971900f0fe43ab44a9168e57f2307580d36a38
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/387
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39985
AVX2-optimized 2/4-bit row-wise quantization/dequantization in perfkernels.
This diff slightly changes the numerics of quantization by multiplying by the inverse of the scale instead of dividing by the scale.
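A scalar sketch of the change (names hypothetical; the real kernels use AVX2 intrinsics):
```cpp
void quantize_row(const float* x, float* q, int n, float xmin, float scale) {
  // before: q[i] = (x[i] - xmin) / scale  (one division per element)
  // after: hoist the division; a * (1/b) is not bit-identical to a / b,
  // which is the slight numerics change mentioned above
  const float inv_scale = 1.0f / scale;
  for (int i = 0; i < n; ++i) {
    q[i] = (x[i] - xmin) * inv_scale;
  }
}
```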
Test Plan:
On my devserver:
for i in 2 4 8; do echo $i; buck run mode/opt :fused_rowwise_nbit_conversion_bench -- --bit-rate=$i; done
Before this diff
2-bit
3.35394 ms. 100%. FloatToFused2BitRowwiseQuantized
4-bit
3.60351 ms. 100%. FloatToFused4BitRowwiseQuantized
8-bit
0.434467 ms. 100%. FloatToFused8BitRowwiseQuantized
After this diff
2-bit
0.606386 ms. 100%. FloatToFused2BitRowwiseQuantized
4-bit
0.446683 ms. 100%. FloatToFused4BitRowwiseQuantized
8-bit
0.4349 ms. 100%. FloatToFused8BitRowwiseQuantized
Reviewed By: choudharydhruv, jianyuh
Differential Revision: D22033195
fbshipit-source-id: d3a219e47b8345268d90a160c9314ed0d5b71467
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37705
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37372
Posted note: [Regularizing SparseNN Against Over-fitting](https://fb.workplace.com/notes/taiqing-wang/regularizing-sparsenn-against-over-fitting/220306075902708/)
**Problem formulation**
L(w) = J(w) + lambda/2 * ||w||^2
J(w) is the empirical loss, and ||w||^2 is the squared L2 norm of the parameters, a.k.a. L2 regularizer.
dL(w)/dw_i = dJ(w)/dw_i + lambda * w_i
dL(w)/dw_i is the gradient of L(w) w.r.t. w_i.
To implement the L2 regularizer, lambda * w_i is added to the gradient of J(w) w.r.t. w_i. lambda is called weight decay in this implementation.
**Code changes**
* In the initialization method of AdagradOptimizer, a new input argument, weight_decay, is added.
* In the _run function of AdagradOptimizer, the weight decay will be skipped for 1d bias vectors.
* In the parameter update functions of Adagrad, the gradient is updated by weight_decay * w_i, as sketched below. The default value for weight_decay is zero.
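A minimal scalar sketch of the update described above, using the conventional gradient-descent sign (the real kernels are vectorized and follow caffe2's own conventions):
```cpp
#include <cmath>

// One Adagrad step for a single weight, with the L2 term folded into the
// gradient as dL/dw = dJ/dw + lambda * w.
void adagrad_step_with_weight_decay(float& w, float& moment, float grad,
                                    float lr, float epsilon,
                                    float weight_decay) {
  grad += weight_decay * w;   // lambda * w_i added to the empirical gradient
  moment += grad * grad;      // Adagrad accumulator
  w -= lr * grad / (std::sqrt(moment) + epsilon);
}
```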
Test Plan:
`buck build caffe2/caffe2/fb/dper/layer_models/tests/split_1:sparse_nn_test_weight_decay`
`./buck-out/gen/caffe2/caffe2/fb/dper/layer_models/tests/split_1/sparse_nn_test_weight_decay#binary.par`
Reviewed By: jspark1105
Differential Revision: D21258652
fbshipit-source-id: d2366ddcd736a03205a2d16f914703b16d9fce8f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36371
It allows us to drop a circular dependency and remove unknown_symbols in the Buck build.
It'd be good to get rid of GetCpuId altogether in favor of cpuinfo, but it's not really blocking anything.
Reviewed By: malfet
Differential Revision: D20958000
fbshipit-source-id: ed17a2a90a51dc1adf9e634af56c85f0689f8f29
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35556
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35542
Apply explicit vectorization to lstm_unit operator.
Enabled by -DENABLE_VECTORIZATION=1
This optimization requires vector library support and was tested with Intel SVML & clang.
However, compilers that support OpenMP 4.5 with the omp simd extension should also benefit.
After the code changes
In file included from caffe2/caffe2/operators/lstm_unit_op.cc:1:
caffe2/caffe2/operators/lstm_unit_op.h:60:1: remark: vectorized loop (vectorization width: 8, interleaved count: 1) [-Rpass=loop-vectorize]
VECTOR_LOOP for (int d = 0; d < D; ++d) {
caffe2/caffe2/operators/lstm_unit_op.h:60:1: remark: vectorized loop (vectorization width: 8, interleaved count: 1) [-Rpass=loop-vectorize]
caffe2/caffe2/operators/lstm_unit_op.h:112:1: remark: vectorized loop (vectorization width: 8, interleaved count: 1) [-Rpass=loop-vectorize]
VECTOR_LOOP for (int d = 0; d < D; ++d) {
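For illustration, a VECTOR_LOOP macro along these lines can be built on OpenMP's simd pragma (a sketch; the macro in the tree may be defined differently):
```cpp
// Expands to an omp simd hint when explicit vectorization is enabled,
// and to nothing otherwise, so the loop still compiles everywhere.
#if defined(ENABLE_VECTORIZATION) && defined(_OPENMP)
#define VECTOR_LOOP _Pragma("omp simd")
#else
#define VECTOR_LOOP
#endif

void scale(float* x, int D, float a) {
  VECTOR_LOOP
  for (int d = 0; d < D; ++d) {
    x[d] *= a;
  }
}
```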
Test Plan:
Checked the failures in OSS CI:
- No build failures related to this change
- Failing tests are:
- py3.6-clang7-rocmdeb-ubuntu16.04-test2
>RuntimeError: fft: ATen not compiled with MKL support
- caffe2_onnx_ort2_py3_6_clang7_ubuntu16_04_test:
>gradient_check_test.py::TestMakeTwo
Exited with code exit status 1
- pytorch_macos_10_13_py3_test, test errors like:
> ERROR [0.014s]: test_boolean_indexing_weirdness_cpu (__main__.NumpyTestsCPU)
RuntimeError: shape mismatch: indexing tensors could not be broadcast together with shapes [0], [2]
- caffe2_onnx_ort1_py3_6_clang7_ubuntu16_04_test
- No failure info
Reviewed By: jspark1105
Differential Revision: D20484640
fbshipit-source-id: 8fb82dbd6698c8de3e0bbbc0b48d15b70e36ca94
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32974
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/286
Re-attempt of D18805426. Decided to be consistent with PyTorch Adagrad.
There was an inconsistency in the order of operations between the scalar and SIMD code when we compute Adagrad. This diff makes them consistent by doing w += lr * grad / (sqrt(moment) + epsilon) in Adagrad and w += lr / (sqrt(moment) + epsilon) * grad in RowWiseSparseAdagrad.
The Adagrad order is consistent with PyTorch (see the addcmul_cpu_kernel function in aten/src/ATen/native/cpu/PointwiseOpsKernel.cpp). The RowWiseSparseAdagrad order makes the computation more efficient: lr / (sqrt(moment) + epsilon) is shared among all elements in the row.
Also, we're not going to use FMA, to be consistent with PyTorch (even though it provides a small accuracy benefit).
Test Plan: CI
Reviewed By: wx1988
Differential Revision: D19342865
fbshipit-source-id: e950c16f2e1c4a2f2a3ef53b1705db373c67f341
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32683
Pull Request resolved: https://github.com/pytorch/glow/pull/4079
Similar to D17768404, we changed the 8-bit fused version of the EmbeddingBag operator to add the option to include the last offset, and parallelized the op.
ghstack-source-id: 97404645
Test Plan:
To generate the AVX2 code (`embedding_lookup_fused_8bit_rowwise_idx_avx2.cc`):
```
python hp_emblookup_codegen.py --fused --use-offsets
```
To test the correctness:
```
buck test //caffe2/torch/fb/sparsenn:test -- test_embedding_bag_byte_rowwise_offsets --print-passing-details
```
Reviewed By: yinghai
Differential Revision: D19592761
fbshipit-source-id: f009d675ea3f2228f62e9f86b7ccb94700a0dfe0
Summary:
Pull Request resolved: https://github.com/pytorch/glow/pull/4049
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27477
We would like to add the intra-op parallelization support for the EmbeddingBag operator.
This should bring a speedup for the DLRM benchmark:
https://github.com/pytorch/pytorch/pull/24385
Benchmark code:
```
from __future__ import absolute_import, division, print_function, unicode_literals
import torch
import time
eb = torch.nn.EmbeddingBag(1000000, 64, mode='sum')
input = torch.LongTensor(1500).random_(0, 1000000)
offsets = torch.zeros(64, dtype=torch.int64)
niter = 10000
s = time.time()
for _ in range(niter):
    out = eb(input, offsets)
time_per_iter = (time.time() - s) / niter
print('time_per_iter', time_per_iter)
print('GB/s', (input.numel() * 64 * 4 + out.numel() * 4) / time_per_iter / 1e9)
```
The following results are single core on Skylake T6:
- Before our change (with the original caffe2::EmbeddingLookup)
time_per_iter 6.313693523406982e-05
GB/s 6.341517821789133
- After our change, using the EmbeddingLookupIdx API, which takes offsets instead of lengths.
time_per_iter 5.7627105712890626e-05
GB/s 6.947841559053659
- With Intel's PR: https://github.com/pytorch/pytorch/pull/24385
time_per_iter 7.393271923065185e-05
GB/s 5.415518381664018
As for multi-core performance: because Clang doesn't work with OpenMP, I can only measure single-core performance on SKL T6.
ghstack-source-id: 97124557
Test Plan:
With D16990830:
```
buck run mode/dev //caffe2/caffe2/perfkernels:embedding_bench
```
With D17750961:
```
buck run mode/opt //experimental/jianyuhuang/embeddingbag:eb
buck run mode/opt-lto //experimental/jianyuhuang/embeddingbag:eb
```
OSS test
```
python run_test.py -i nn -- TestNNDeviceTypeCPU.test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu
```
Buck test
```
buck test mode/dev-nosan //caffe2/test:nn -- "test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu"
OMP_NUM_THREADS=3 buck test mode/opt -c pytorch.parallel_backend=tbb //caffe2/test:nn -- "test_EmbeddingBag_per_sample_weights_and_new_offsets" --print-passing-details
```
Generate the AVX2 code for embedding_lookup_idx_avx2.cc:
```
python hp_emblookup_codegen.py --use-offsets
```
Differential Revision: D17768404
fbshipit-source-id: 8dcd15a62d75b737fa97e0eff17f347052675700
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31470
Optimize performance of these two operators.
Additionally use nearbyint instead of round to be consistent with 4-bit embedding table quantization.
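The difference in a nutshell: under the default rounding mode, nearbyint rounds ties to even, while round rounds ties away from zero. A tiny sketch:
```cpp
#include <cmath>
#include <cstdio>

int main() {
  // ties-to-even vs ties-away-from-zero
  std::printf("%.1f %.1f\n", std::nearbyint(2.5), std::round(2.5));  // 2.0 3.0
  std::printf("%.1f %.1f\n", std::nearbyint(3.5), std::round(3.5));  // 4.0 4.0
  return 0;
}
```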
Reviewed By: hyuen
Differential Revision: D19072103
fbshipit-source-id: efe96f14aeff7958cceb453ed625d3fd693891ff
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30737
Original commit changeset: 2a8b2a3f5401
Reverting this to be safe until we address test failures in T58528495
Test Plan: CI
Reviewed By: wx1988
Differential Revision: D18812384
fbshipit-source-id: 2a3ac554024773022ec827f259127e4c8cffe6e2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30449
There was an inconsistency in the order of operations between the scalar and SIMD code when we compute Adagrad.
In this diff we first compute effective_lr = lr / (sqrt(moment) + epsilon) and then multiply by the gradient.
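In sketch form, the reordering (scalar version, names hypothetical):
```cpp
#include <cmath>

// before: w += lr * grad / (std::sqrt(moment) + epsilon);
// after:  divide once, then multiply; mathematically equal but not
//         bit-identical, so scalar and SIMD paths must both use this order
void adagrad_update(float& w, float moment, float grad,
                    float lr, float epsilon) {
  const float effective_lr = lr / (std::sqrt(moment) + epsilon);
  w += effective_lr * grad;
}
```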
Test Plan: CI
Reviewed By: protonu
Differential Revision: D18703416
fbshipit-source-id: 2a8b2a3f5401466549561412bd22f07abac3c598
Summary:
We (me, fnabulsi, bmcdb) have a handful of fixes used locally to build and run with clang-cl. I am aware of https://github.com/pytorch/pytorch/issues/8784, but it has not been touched in almost a year.
It may be more practical to upstream the non-controversial fixes piecewise. For example, this one.
Here, the dummy version of `_cvtsh_ss` for MSVC is not required (and in fact causes conflicts) when using clang-cl, so it can be #ifdef'd out.
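In sketch form (clang-cl defines both _MSC_VER and __clang__, which is what such a guard exploits; the stub body here is a placeholder, not the real emulation):
```cpp
#if defined(_MSC_VER) && !defined(__clang__)
// Provide the stub only for real MSVC; clang-cl supplies the actual
// intrinsic, and redefining it there is what caused the conflicts.
static inline float _cvtsh_ss(unsigned short /*h*/) {
  return 0.0f;  // placeholder body for illustration only
}
#endif
```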
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29726
Differential Revision: D18478120
Pulled By: ezyang
fbshipit-source-id: cdcd94251e68347446f2ad1ac5a0e71089f7d0ab