Commit Graph

105 Commits

Author SHA1 Message Date
PyTorch MergeBot
b5594f7df0 Revert "Use missing-prototypes in torch_cpu (#103725)"
This reverts commit 716b3b893d.

Reverted https://github.com/pytorch/pytorch/pull/103725 on behalf of https://github.com/osalpekar due to broken caffe2 builds. More info at [D46920675](https://www.internalfb.com/diff/D46920675) ([comment](https://github.com/pytorch/pytorch/pull/103725#issuecomment-1603129273))
2023-06-22 18:30:31 +00:00
cyy
716b3b893d Use missing-prototypes in torch_cpu (#103725)
This PR enables -Wmissing-prototypes in torch_cpu, except for some generated cpp files and the mps and metal backends.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103725
Approved by: https://github.com/albanD
2023-06-21 13:19:55 +00:00
Nikita Shulga
a229e78544 [BE] Enforce sign-compare (#96723)
A number of OSS PRs were reverted because of new signed-unsigned comparison warnings, which are treated as errors in some internal builds.
It is not clear how those selective rules are applied, but this PR removes `-Wno-sign-compare` from the PyTorch codebase.

The only tricky part in this PR is making sure that non-ASCII character detection works for both signed and unsigned chars here:
6e3d51b08a/torch/csrc/jit/serialization/python_print.cpp (L926)

Exclude several files from sign-compare if flash attention is used, due to violations in cutlass (to be fixed by https://github.com/NVIDIA/cutlass/pull/869).
Do not try to fix sign-compare violations in the caffe2 codebase.
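
For illustration, a minimal sketch of the "tricky part" above: detecting non-ASCII characters in a way that is correct whether `char` is signed or unsigned and that does not trip `-Werror=sign-compare` (a simplified stand-in, not the exact code in python_print.cpp):
```
#include <cstdint>

// Compare on an explicitly unsigned value instead of relying on the
// platform's char signedness; avoids signed/unsigned comparison warnings.
bool is_non_ascii(char c) {
  return static_cast<uint8_t>(c) > 127;
}
```
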
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96723
Approved by: https://github.com/albanD
2023-03-15 06:04:20 +00:00
haozhe.zhu
7cd6e6acad add bf16 in fp32 out fast path for embeddingbag in caffe2 perfkernels (#89198)
Add a BF16-in/FP32-out kernel to the Caffe2 embedding perfkernels, and also update the Python code-gen files to generate the kernel.
The unit test will be covered in the next PR (#89199) in this stack (tested via nn.EmbeddingBag with the BF16 data type).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89198
Approved by: https://github.com/jgong5, https://github.com/kit1980
2022-11-30 13:06:13 +00:00
efiks
ea0ec9d71c [torch] BatchBoxCox - fix numerical issue in vectorized code (#88875)
Summary:
Use of fast math in the BatchBoxCox kernel produced different results between the dev and optimized versions, which caused a few internal tests to fail.
For now, disable the compiler-optimized version and rely on ATen vectors.
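
For context, a minimal scalar sketch of the Box-Cox transform this kernel computes (reference semantics only; the actual implementation uses ATen vectorized math). Fast-math flags let the compiler reorder and approximate the log/pow arithmetic, which is how the two versions can diverge:
```
#include <cmath>

// Reference Box-Cox transform for a single element:
//   y = (x^lambda - 1) / lambda   if lambda != 0
//   y = log(x)                    if lambda == 0
float box_cox(float x, float lambda) {
  if (lambda == 0.0f) {
    return std::log(x);
  }
  return (std::pow(x, lambda) - 1.0f) / lambda;
}
```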

Differential Revision: D41211784

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88875
Approved by: https://github.com/hyuen
2022-11-11 21:58:23 +00:00
efiks
dcefea2706 [caffe2][torch] Optimize BatchBoxCox (#87585)
Differential Revision: D40215424

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87585
Approved by: https://github.com/hyuen
2022-11-10 06:11:05 +00:00
efiks
2e4c89eba9 [torch] Unify batch_box_cox implementations into perfkernels folder (#86569)
Summary:
1) Add an MKL/AVX2-based implementation to perfkernels. This implementation is similar to caffe2/operators/batch_box_cox_op.cc.
2) Migrate caffe2's batch_box_cox_op to use this implementation.

Test Plan: CI

Differential Revision: D40208074

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86569
Approved by: https://github.com/hyuen
2022-10-23 19:29:25 +00:00
John Detloff
e0229d6517 Remove caffe2 mobile (#84338)
We're no longer building Caffe2 mobile as part of our CI, and it adds a lot of clutter to our make files. Any lingering internal dependencies will use the buck build and so won't be affected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84338
Approved by: https://github.com/dreiss
2022-09-08 01:49:55 +00:00
Nikita Shulga
6d85e7dafa Fix sign-compare in caffe2
Prerequisite change for enabling `-Werror=sign-compare` across PyTorch repo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75082

Approved by: https://github.com/ngimel
2022-04-05 00:08:05 +00:00
Richard Barnes
1622546050 use irange for loops (#70248)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70248

Modified loops in files under fbsource/fbcode/caffe2/ from the format
```
for(TYPE var=x0;var<x_max;x++)
```
to the format
```
for(const auto var: irange(xmax))
```

This was achieved by running r-barnes's loop upgrader script (D28874212), with some modifications to exclude all files under /torch/jit, plus a number of reversions and unused-variable warning suppressions added by hand.
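
For illustration, a small self-contained example of the rewritten loop form, using c10::irange as elsewhere in the codebase:
```
#include <c10/util/irange.h>
#include <vector>

// Index-based loop written with irange instead of a hand-rolled counter.
int sum(const std::vector<int>& v) {
  int total = 0;
  for (const auto i : c10::irange(v.size())) {
    total += v[i];
  }
  return total;
}
```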

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D32813863

fbshipit-source-id: 527244b4a2b220fdfe7f17dee3599603f492a2ca
2022-01-06 23:14:29 -08:00
Rick Weyrauch
8acd0a8b2f Allow row sizes to support int64/size_t. (#69303)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69303

Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/792

Follow up to D32715453 (e60fd10659), allowing row size to be 64-bit.

Test Plan:
buck test mode/opt -c fbcode.caffe2_gpu_type=v100,a100 //deeplearning/fbgemm/fbgemm_gpu:quantize_ops_test
   buck test mode/opt -c fbcode.caffe2_gpu_type=none //deeplearning/fbgemm/fbgemm_gpu:quantize_ops_test
   buck test mode/opt //caffe2/test:

Reviewed By: jspark1105, jianyuh

Differential Revision: D32768838

fbshipit-source-id: 9e2b01d8d23e71f8333820e725379c3fc1c0711a
2021-12-14 10:09:08 -08:00
Ramanpreet Nara
f587267dc7 Revert D31705359: use irange for loops 8
Test Plan: revert-hammer

Differential Revision:
D31705359 (17e5200441)

Original commit changeset: c9ea2fbc0f9c

fbshipit-source-id: 08fff2d12beca953ad30dd0baabf86e39ac84f14
2021-12-02 12:55:08 -08:00
Richard Barnes
17e5200441 use irange for loops 8 (#66743)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66743

Modified loops in files under fbsource/fbcode/caffe2/ from the format

`for(TYPE var=x0;var<x_max;x++)`

to the format

`for(const auto var: irange(xmax))`

This was achieved by running r-barnes's loop upgrader script (D28874212), with some modifications to exclude all files under /torch/jit, plus a number of reversions and unused-variable warning suppressions added by hand.

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D31705359

fbshipit-source-id: c9ea2fbc0f9cd29e97a52dcb203addc5f2abb09b
2021-12-02 10:21:29 -08:00
Shashank Chaudhry
06d1be2447 [NOOP][clangformat][codemod] Enable CLANGFORMAT for caffe2/caffe2/* (#67624)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67624

Test Plan: Visual inspection. Sandcastle.

Reviewed By: malfet

Differential Revision: D31986628

fbshipit-source-id: c872bded7325997a2945dbf5d4d052628dcb3659
2021-11-02 22:14:04 -07:00
Xue Li
2f099c7555 Revert D30652629: use irange for loops
Test Plan: revert-hammer

Differential Revision:
D30652629 (687c2267d4)

Original commit changeset: 0ae6c4bbbb55

fbshipit-source-id: 5c4f067b584a021c8c9656454d1ee60999600fb3
2021-10-15 15:23:10 -07:00
Richard Barnes
687c2267d4 use irange for loops (#66234)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66234

Modified loops in files under fbsource/fbcode/caffe2/ from the format

`for(TYPE var=x0;var<x_max;x++)`

to the format

`for(const auto var: irange(xmax))`

This was achieved by running r-barnes's loop upgrader script (D28874212), with some modifications to exclude all files under /torch/jit, plus a number of reversions and unused-variable warning suppressions added by hand.

bypass_size_limit
allow-large-files

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D30652629

fbshipit-source-id: 0ae6c4bbbb554bad42e372792a6430e1acf15e3e
2021-10-15 13:50:33 -07:00
Shen Li
1022443168 Revert D30279364: [codemod][lint][fbcode/c*] Enable BLACK by default
Test Plan: revert-hammer

Differential Revision:
D30279364 (b004307252)

Original commit changeset: c1ed77dfe43a

fbshipit-source-id: eab50857675c51e0088391af06ec0ecb14e2347e
2021-08-12 11:45:01 -07:00
Zsolt Dollenstein
b004307252 [codemod][lint][fbcode/c*] Enable BLACK by default
Test Plan: manual inspection & sandcastle

Reviewed By: zertosh

Differential Revision: D30279364

fbshipit-source-id: c1ed77dfe43a3bde358f92737cd5535ae5d13c9a
2021-08-12 10:58:35 -07:00
Hong Xu
7acb8b71e1 Remove AVX detection code that duplicates FindAVX.cmake (#61748)
Summary:
This PR deletes some code in `MiscCheck.cmake` that perform the exact
same functionality as `FindAVX.cmake`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61748

Reviewed By: ejguan

Differential Revision: D29791282

Pulled By: malfet

fbshipit-source-id: 6595fd1b61c8ae12b821fad8c9a34892dd52d213
2021-07-20 14:34:36 -07:00
Andrew Gallagher
03de807d81 [caffe2/utils] Add explicit rule to avoid package boundary violation (#60677)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60677

Add a rule to wrap conversions.h and depend on that, rather than
relying on a glob which violates package boundaries.

Test Plan: `buck2 build fbcode//caffe2/caffe2:caffe2_core`

Reviewed By: mzlee

Differential Revision: D29370841

fbshipit-source-id: d4dd383eb8457d4f5118574e34e6f17c32fde647
2021-06-28 14:43:30 -07:00
Cao Gao
bac6bcd6d8 Update call site for FBGemm quantization util functions. (#624)
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/624

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59637

Replace FloatToFusedNBitRowwiseQuantizedSBHalf, FusedNBitRowwiseQuantizedSBHalfToFloat, FloatToFused8BitRowwiseQuantizedSBFloat, and Fused8BitRowwiseQuantizedSBFloatToFloat with their newer versions.

Test Plan: CI tests.

Reviewed By: dskhudia

Differential Revision: D28918581

fbshipit-source-id: a21274add71439c5e51287a0e2ec918a8d8e5392
2021-06-16 10:15:34 -07:00
Nikita Shulga
3a66a1cb99 [clang-tidy] Exclude cppcoreguidelines-avoid-magic-numbers (#57841)
Summary:
Add a cppcoreguidelines-avoid-magic-numbers exclusion to clang-tidy.
Remove existing NOLINT suppressions using the following script:
```
for file in `git ls-files | grep -v \.py`; do gsed '/^ *\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers)/d' -i  $file; done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57841

Reviewed By: samestep

Differential Revision: D28295045

Pulled By: malfet

fbshipit-source-id: 7c6e8d1213c9593f169ed3df6a916498f1a97163
2021-05-07 20:02:33 -07:00
Nikita Shulga
4cb534f92e Make PyTorch code-base clang-tidy compliant (#56892)
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os

def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files

def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname,"-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit","--all", "-m", f"NOLINT stubs for {fname}"])

def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)

if __name__ == "__main__":
    main()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892

Reviewed By: H-Huang

Differential Revision: D27991944

Pulled By: malfet

fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
2021-04-28 14:10:25 -07:00
Stiopa Koltsov
547346d663 [caffe2] Fix -Wundef
Summary:
* `#if` with some undefined name is a warning when `-Wundef` is specified (which is in ovrsource for example)
* identifiers starting with two underscores are [reserved for compiler internals](https://en.cppreference.com/w/cpp/language/identifiers)
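
A minimal sketch of the `-Wundef`-friendly pattern (the macro name here is illustrative): evaluating an undefined name with `#if SOME_FLAG` triggers the warning, while testing it with `defined()` does not, and names should not start with two underscores.
```
#if defined(CAFFE2_PERF_WITH_AVX2)   // OK under -Wundef
constexpr bool kBuiltWithAvx2 = true;
#else
constexpr bool kBuiltWithAvx2 = false;
#endif
```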

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D27318070

fbshipit-source-id: 4989fc6a3bf3c176eddd7c25aca47414e4973edd
2021-03-31 22:24:30 -07:00
frdong
d3f784244e fix comparison of narrow type with wide type in loop condition part2 (#54471)
Summary:
Follow up PR of https://github.com/pytorch/pytorch/issues/53951.
This PR fixes the remaining Semmle warnings: comparison of narrow type with wide type in loop condition.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54471

Reviewed By: bdhirsh

Differential Revision: D27262493

Pulled By: malfet

fbshipit-source-id: 05765758da79699936af11de237c3ff3d34373d6
2021-03-23 23:38:38 -07:00
frdong
92770d25cd fix comparison of narrow type with wide type in loop condition (#53951)
Summary:
Fix the Semmle warning: comparison of narrow type with wide type in loop condition.

For example, consider the following piece of code:
for (int i=0; i<array.size(); ++i) {}

The problem is that array.size() returns size_t, which can be a wider type than int depending on the implementation, so there is a chance that i overflows (for a very large array whose size is beyond the range of int) and this loop never terminates.
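
A minimal sketch of the fix pattern (illustrative, not the exact code touched by this PR): use an index type at least as wide as the container's size type so both sides of the comparison agree.
```
#include <cstddef>
#include <vector>

// size_t index vs. size_t bound: no narrow/wide mismatch, no overflow.
long long sum(const std::vector<int>& array) {
  long long total = 0;
  for (size_t i = 0; i < array.size(); ++i) {
    total += array[i];
  }
  return total;
}
```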

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53951

Reviewed By: zou3519

Differential Revision: D27181495

Pulled By: malfet

fbshipit-source-id: 0612c5cedcdc656c193085e7fbb87dd163f20688
2021-03-22 16:40:35 -07:00
Gemfield
700c817a6a Add install for libCaffe2_perfkernels_avx*.a (#53825)
Summary:
When building the libtorch static library, these three static libraries are generated but are not installed to CMAKE_INSTALL_LIBDIR:
- libCaffe2_perfkernels_avx2.a
- libCaffe2_perfkernels_avx512.a
- libCaffe2_perfkernels_avx.a

This PR will fix this issue.

Please note that after this fix there are still static libraries missing from CMAKE_INSTALL_LIBDIR, but they belong to third_party repos, and we need to fix them in the corresponding repos:
- libfoxi_loader.a
- libonnx.a
- libonnx_proto.a
- libfmt.a
- libnccl_static.a

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53825

Reviewed By: ngimel

Differential Revision: D27013844

Pulled By: malfet

fbshipit-source-id: 8a84cc72b6ae87393ca26c4e474f5526a7b18ab2
2021-03-15 08:37:11 -07:00
Qi Zhou
0ec717c830 Support int32 indices and offsets in nn.EmbeddingBag (#46758)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46758

It's in general helpful to support int32 indices and offsets, especially when such tensors are large and need to be transferred to accelerator backends. Since it may not be very useful to support the combination of int32 indices and int64 offsets, here we enforce that these two must have the same type.

Test Plan: unit tests

Reviewed By: ngimel

Differential Revision: D24470808

fbshipit-source-id: 94b8a1d0b7fc9fe3d128247aa042c04d7c227f0b
2020-11-03 23:33:50 -08:00
Mark Santaniello
1a99689d71 [caffe2] Fix preprocessor checks for FMA
Summary: I think this preprocessor check is incorrect.  The fused multiply-add (FMA) instructions are not part of AVX2.
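
A minimal sketch of the corrected gating (usage is illustrative, not the actual perfkernels code): FMA has its own feature macro and compile flag, so FMA intrinsics should be guarded by __FMA__ rather than __AVX2__.
```
#include <immintrin.h>

// Multiply-accumulate of 8 floats; assumes AVX is enabled for the __m256 type.
__m256 muladd(__m256 a, __m256 b, __m256 c) {
#if defined(__FMA__)
  return _mm256_fmadd_ps(a, b, c);                // fused multiply-add
#else
  return _mm256_add_ps(_mm256_mul_ps(a, b), c);   // separate mul + add
#endif
}
```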

Test Plan: CI

Reviewed By: jspark1105

Differential Revision: D24237836

fbshipit-source-id: 44f9b9179918332eb85ac087827726300f56224e
2020-10-11 11:48:32 -07:00
Bugra Akyildiz
27c7158166 Remove __future__ imports for legacy Python2 supports (#45033)
Summary:
There is a tool called `2to3` whose `future` fixer you can target specifically to remove these; the `caffe2` directory has the most redundant imports:

```2to3 -f future -w caffe2```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45033

Reviewed By: seemethere

Differential Revision: D23808648

Pulled By: bugra

fbshipit-source-id: 38971900f0fe43ab44a9168e57f2307580d36a38
2020-09-23 17:57:02 -07:00
Daya Khudia
09aee06e82 [caffe2] Replace embedding conversion ops with fbgemm functions (#44843)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44843

Replace perfkernels calls with fbgemm kernels to avoid code duplication
ghstack-source-id: 112496292

Test Plan: CI

Reviewed By: radkris-git

Differential Revision: D23675519

fbshipit-source-id: 05c285a9eeb9ea109a04a78cb442a24ee40a4aec
2020-09-22 11:57:01 -07:00
Stanislau Hlebik
b774ce54f8 remediation of S205607
fbshipit-source-id: 798decc90db4f13770e97cdce3c0df7d5421b2a3
2020-07-17 17:19:47 -07:00
Stanislau Hlebik
8fdea489af remediation of S205607
fbshipit-source-id: 5113fe0c527595e4227ff827253b7414abbdf7ac
2020-07-17 17:17:03 -07:00
Jongsoo Park
7a837019a4 [caffe2] optimize 2/4-bit row-wise quantization (#387)
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/387

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39985

avx2 optimized 2/4-bit row-wise quantization/dequantization in perfkernels.
This diff slightly changes the numerics of quantization by multiplying with the inverse of the scale instead of dividing by the scale.
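
A simplified scalar sketch of that change (illustrative names, not the actual perfkernels code): compute the scale's inverse once per row, then multiply in the per-element loop.
```
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Quantize one row to `bit_rate` bits (2, 4, or 8) with min/max scaling;
// packing of sub-byte values is omitted for brevity.
void quantize_row(const float* src, int n, int bit_rate,
                  std::vector<uint8_t>* out, float* scale, float* bias) {
  const int qmax = (1 << bit_rate) - 1;
  const float minv = *std::min_element(src, src + n);
  const float maxv = *std::max_element(src, src + n);
  *scale = (maxv - minv) / qmax;
  *bias = minv;
  // One divide per row instead of one divide per element.
  const float inv_scale = (*scale == 0.f) ? 1.f : 1.f / *scale;
  out->resize(n);
  for (int i = 0; i < n; ++i) {
    const float q = std::nearbyint((src[i] - minv) * inv_scale);
    (*out)[i] = static_cast<uint8_t>(std::min(std::max(q, 0.f), static_cast<float>(qmax)));
  }
}
```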

Test Plan:
In my devserver

for i in 2 4 8; do echo $i; buck run mode/opt :fused_rowwise_nbit_conversion_bench -- --bit-rate=$i; done

Before this diff
2-bit
        3.35394 ms.        100%. FloatToFused2BitRowwiseQuantized
4-bit
        3.60351 ms.        100%. FloatToFused4BitRowwiseQuantized
8-bit
       0.434467 ms.        100%. FloatToFused8BitRowwiseQuantized

After this diff

2-bit
       0.606386 ms.        100%. FloatToFused2BitRowwiseQuantized
4-bit
       0.446683 ms.        100%. FloatToFused4BitRowwiseQuantized
8-bit
         0.4349 ms.        100%. FloatToFused8BitRowwiseQuantized

Reviewed By: choudharydhruv, jianyuh

Differential Revision: D22033195

fbshipit-source-id: d3a219e47b8345268d90a160c9314ed0d5b71467
2020-06-19 21:28:31 -07:00
Taiqing Wang
8cb1f2f9dc implement L2 regularization for Adagrad in caffe2 and dper (#37705)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37705

Pull Request resolved: https://github.com/pytorch/pytorch/pull/37372

Posted note: [Regularizing SparseNN Against Over-fitting](https://fb.workplace.com/notes/taiqing-wang/regularizing-sparsenn-against-over-fitting/220306075902708/)

**Problem formulation**

L(w) = J(w) + lambda/2 * ||w||^2
J(w) is the empirical loss, and ||w||^2 is the squared L2 norm of the parameters, a.k.a. L2 regularizer.

dL(w)/ dw_i = dJ(w)/dw_i + lambda w_i
dL(w)/ dw_i is the gradient of L(w) w.r.t. w_i.

To implement the L2 regularizer, lambda * w_i is added to the gradient of J(w) w.r.t. w_i. lambda is called weight decay in this implementation.
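
A plain scalar sketch of the resulting Adagrad step with weight decay (illustrative only, written in the standard gradient-descent convention, not the actual caffe2/dper kernel):
```
#include <cmath>

// One Adagrad step for a single parameter, with the L2 regularizer folded
// into the gradient as weight decay: grad_i <- dJ/dw_i + weight_decay * w_i.
void adagrad_step(float& w, float& moment, float grad,
                  float lr, float epsilon, float weight_decay) {
  grad += weight_decay * w;                        // L2 contribution
  moment += grad * grad;                           // accumulate squared gradients
  w -= lr * grad / (std::sqrt(moment) + epsilon);  // parameter update
}
```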

**Code changes**
* In the initialization method of AdagradOptimizer, a new input argument, weight_decay, is added.
* In the _run function of AdagradOptimizer, the weight decay will be skipped for 1d bias vectors.
* In the parameter update functions of Adagrad, the gradient is updated by weight_decay * w_i. The default value for weight_decay is zero.

Test Plan:
`
buck build caffe2/caffe2/fb/dper/layer_models/tests/split_1:sparse_nn_test_weight_decay
`

`
./buck-out/gen/caffe2/caffe2/fb/dper/layer_models/tests/split_1/sparse_nn_test_weight_decay#binary.par
`

Reviewed By: jspark1105

Differential Revision: D21258652

fbshipit-source-id: d2366ddcd736a03205a2d16f914703b16d9fce8f
2020-05-03 10:42:49 -07:00
Dmytro Dzhulgakov
7576cf8d00 [caffe2] Use cpuinfo in perfkernels to simplify build dependency (#36371)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36371

It allows dropping a circular dependency and removing unknown_symbols in the Buck build.

It'd be good to get rid of GetCpuId altogether in favor of cpuinfo, but it's not really blocking anything.

Reviewed By: malfet

Differential Revision: D20958000

fbshipit-source-id: ed17a2a90a51dc1adf9e634af56c85f0689f8f29
2020-04-10 13:26:34 -07:00
Evgeny Fiksman
e372f42110 [caffe2] Explicit vectorization of LSTM operator (#35556)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35556

Pull Request resolved: https://github.com/pytorch/pytorch/pull/35542

Apply explicit vectorization to lstm_unit operator.
Enabled by -DENABLE_VECTORIZATION=1

This optimization requires vector library support and was tested with Intel SVML & clang.
However, compilers that support OpenMP 4.5 with the omp simd extension should also benefit.

After the code changes
In file included from caffe2/caffe2/operators/lstm_unit_op.cc:1:
caffe2/caffe2/operators/lstm_unit_op.h:60:1: remark: vectorized loop (vectorization width: 8, interleaved count: 1) [-Rpass=loop-vectorize]
VECTOR_LOOP for (int d = 0; d < D; ++d) {

caffe2/caffe2/operators/lstm_unit_op.h:60:1: remark: vectorized loop (vectorization width: 8, interleaved count: 1) [-Rpass=loop-vectorize]
caffe2/caffe2/operators/lstm_unit_op.h:112:1: remark: vectorized loop (vectorization width: 8, interleaved count: 1) [-Rpass=loop-vectorize]
VECTOR_LOOP for (int d = 0; d < D; ++d) {
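
A minimal sketch of the kind of hint involved (VECTOR_LOOP is the macro name from the remarks above; its exact definition here is an assumption):
```
#include <cmath>

// Assumed sketch: under -DENABLE_VECTORIZATION=1, VECTOR_LOOP expands to an
// OpenMP 4.5 `omp simd` hint; otherwise it is a no-op.
#if defined(ENABLE_VECTORIZATION) && defined(_OPENMP)
#define VECTOR_LOOP _Pragma("omp simd")
#else
#define VECTOR_LOOP
#endif

void sigmoid_inplace(float* x, int D) {
  VECTOR_LOOP
  for (int d = 0; d < D; ++d) {
    x[d] = 1.0f / (1.0f + std::exp(-x[d]));
  }
}
```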

Test Plan:
Check failures at OSS CI
- No build failures related to this change
- Failing tests are:
  - py3.6-clang7-rocmdeb-ubuntu16.04-test2
>RuntimeError: fft: ATen not compiled with MKL support
  - caffe2_onnx_ort2_py3_6_clang7_ubuntu16_04_test -
>gradient_check_test.py::TestMakeTwo
Exited with code exit status 1
- pytorch_macos_10_13_py3_test , Test errors like:
> ERROR [0.014s]: test_boolean_indexing_weirdness_cpu (__main__.NumpyTestsCPU)
RuntimeError: shape mismatch: indexing tensors could not be broadcast together with shapes [0], [2]
- caffe2_onnx_ort1_py3_6_clang7_ubuntu16_04_test
  - No failure info

Reviewed By: jspark1105

Differential Revision: D20484640

fbshipit-source-id: 8fb82dbd6698c8de3e0bbbc0b48d15b70e36ca94
2020-04-01 17:19:56 -07:00
peter
e3daf70184 Fix AVX detection with clang-cl (#35653)
Summary:
Defining macros like `/D__F16C__` or something similar won't work on clang-cl.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35653

Differential Revision: D20735878

Pulled By: ezyang

fbshipit-source-id: 392a664b0a9e74222b1a03b8c3f6ebb2c61d867e
2020-03-30 07:53:37 -07:00
peter
45c9ed825a Formatting cmake (to lowercase without space for if/elseif/else/endif) (#35521)
Summary:
Running commands:
```bash
shopt -s globstar

sed -e 's/IF (/if(/g' -e 's/IF(/if(/g' -e 's/if (/if(/g' -e 's/ELSE (/else(/g' -e 's/ELSE(/else(/g' -e 's/else (/else(/g' -e 's/ENDif(/endif(/g' -e 's/ELSEif(/elseif(/g' -i CMakeLists.txt
sed -e 's/IF (/if(/g' -e 's/IF(/if(/g' -e 's/if (/if(/g' -e 's/ELSE (/else(/g' -e 's/ELSE(/else(/g' -e 's/else (/else(/g' -e 's/ENDif(/endif(/g' -e 's/ELSEif(/elseif(/g' -i caffe2/**/CMakeLists.txt
sed -e 's/IF (/if(/g' -e 's/IF(/if(/g' -e 's/if (/if(/g' -e 's/ELSE (/else(/g' -e 's/ELSE(/else(/g' -e 's/else (/else(/g' -e 's/ENDif(/endif(/g' -e 's/ELSEif(/elseif(/g' -i torch/**/CMakeLists.txt
sed -e 's/IF (/if(/g' -e 's/IF(/if(/g' -e 's/if (/if(/g' -e 's/ELSE (/else(/g' -e 's/ELSE(/else(/g' -e 's/else (/else(/g' -e 's/ENDif(/endif(/g' -e 's/ELSEif(/elseif(/g' -i c10/**/CMakeLists.txt
sed -e 's/IF (/if(/g' -e 's/IF(/if(/g' -e 's/if (/if(/g' -e 's/ELSE (/else(/g' -e 's/ELSE(/else(/g' -e 's/else (/else(/g' -e 's/ENDif(/endif(/g' -e 's/ELSEif(/elseif(/g' -i cmake/**/*.cmake
sed -e 's/IF (/if(/g' -e 's/IF(/if(/g' -e 's/if (/if(/g' -e 's/ELSE (/else(/g' -e 's/ELSE(/else(/g' -e 's/else (/else(/g' -e 's/ENDif(/endif(/g' -e 's/ELSEif(/elseif(/g' -i cmake/**/*.cmake.in
```
We may further convert all the commands into lowercase according to the following issue: 77543bde41.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35521

Differential Revision: D20704382

Pulled By: malfet

fbshipit-source-id: 42186b9b1660c34428ab7ceb8d3f7a0ced5d2e80
2020-03-27 14:25:17 -07:00
Jongsoo Park
a7fe200f5f [caffe2] simplify caffe2 code with fbgemm handling block size 1 emb (#33774)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33774

Simplify caffe2 code using D19246900

Test Plan: CI

Reviewed By: jianyuh

Differential Revision: D20102410

fbshipit-source-id: 8de4d9cfac66898db0718ac6477339fd5e5428e3
2020-02-27 14:45:28 -08:00
Jongsoo Park
c57f8984e6 [caffe2] make order btw div and mul in adgrad consistent (#32974)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32974

Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/286

Re-attempt of D18805426. Decided to be consistent with PyTorch Adagrad.

There was an inconsistency in the order of operations between the scalar and SIMD code when we compute Adagrad. This diff makes them consistent by doing w += lr * grad / (sqrt(moment) + epsilon) in Adagrad and w += lr / (sqrt(moment) + epsilon) * grad in RowWiseSparseAdagrad.

The Adagrad order is consistent with PyTorch (see the addcmul_cpu_kernel function in aten/src/ATen/native/cpu/PointwiseOpsKernel.cpp). The RowWiseSparseAdagrad order makes the compute more efficient: lr / (sqrt(moment) + epsilon) is shared among all elements in the row.

Also, we're not going to use FMA, to stay consistent with PyTorch (even though it provides a small accuracy benefit).
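
For illustration, the two orderings side by side in scalar form (following the update convention written above; illustrative, not the actual perfkernels code):
```
#include <cmath>

// Adagrad: divide the gradient by (sqrt(moment) + epsilon), then scale by lr.
float adagrad_update(float w, float grad, float moment, float lr, float eps) {
  return w + lr * (grad / (std::sqrt(moment) + eps));
}

// RowWiseSparseAdagrad: precompute lr / (sqrt(moment) + epsilon) once per row,
// then multiply each element's gradient by it.
float rowwise_update(float w, float grad, float moment, float lr, float eps) {
  const float effective_lr = lr / (std::sqrt(moment) + eps);
  return w + effective_lr * grad;
}
```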

Test Plan: CI

Reviewed By: wx1988

Differential Revision: D19342865

fbshipit-source-id: e950c16f2e1c4a2f2a3ef53b1705db373c67f341
2020-02-16 22:45:59 -08:00
Jianyu Huang
a840afbeb4 [pytorch][embeddingbag_8bit] Add include_last_offset option to Fused 8bit EmbeddingBag and parallelize the op (#32683)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32683

Pull Request resolved: https://github.com/pytorch/glow/pull/4079

Similar to D17768404, we changed the 8-bit fused version of the EmbeddingBag operator to add the option to include the last offset and to parallelize the op.
ghstack-source-id: 97404645

Test Plan:
To generate the AVX2 code (`embedding_lookup_fused_8bit_rowwise_idx_avx2.cc`):
```
python hp_emblookup_codegen.py --fused --use-offsets
```

To test the correctness:

```
buck test //caffe2/torch/fb/sparsenn:test -- test_embedding_bag_byte_rowwise_offsets  --print-passing-details
```

Reviewed By: yinghai

Differential Revision: D19592761

fbshipit-source-id: f009d675ea3f2228f62e9f86b7ccb94700a0dfe0
2020-01-29 16:04:56 -08:00
Jianyu Huang
3ada2e0d64 [pytorch][embeddingbag] Parallelize the EmbeddingBag operator (#4049)
Summary:
Pull Request resolved: https://github.com/pytorch/glow/pull/4049

Pull Request resolved: https://github.com/pytorch/pytorch/pull/27477

We would like to add the intra-op parallelization support for the EmbeddingBag operator.

This should bring speedup for the DLRM benchmark:
https://github.com/pytorch/pytorch/pull/24385

Benchmark code:
```
from __future__ import absolute_import, division, print_function, unicode_literals

import torch
import time

eb = torch.nn.EmbeddingBag(1000000, 64, mode='sum')

input = torch.LongTensor(1500).random_(0, 1000000)
offsets = torch.zeros(64, dtype=torch.int64)

niter = 10000
s = time.time()
for _ in range(niter):
    out = eb(input, offsets)
time_per_iter = (time.time() - s) / niter
print('time_per_iter', time_per_iter)
print('GB/s', (input.numel() * 64 * 4 + out.numel() * 4) / time_per_iter / 1e9)
```

The following results are single core on Skylake T6:
- Before our change (with the original caffe2::EmbeddingLookup)
time_per_iter 6.313693523406982e-05
GB/s 6.341517821789133

- After our change using the EmbeddingLookupIdx API which takes the offsets instead of lengths.
time_per_iter 5.7627105712890626e-05
GB/s 6.947841559053659

- With Intel's PR: https://github.com/pytorch/pytorch/pull/24385
time_per_iter 7.393271923065185e-05
GB/s 5.415518381664018

For multi-core performance, because Clang doesn't work with OMP, I can only see the single-core performance on SKL T6.
ghstack-source-id: 97124557

Test Plan:
With D16990830:
```
buck run mode/dev //caffe2/caffe2/perfkernels:embedding_bench
```

With D17750961:
```
buck run mode/opt //experimental/jianyuhuang/embeddingbag:eb
buck run mode/opt-lto //experimental/jianyuhuang/embeddingbag:eb
```

OSS test
```
python run_test.py -i nn -- TestNNDeviceTypeCPU.test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu
```

Buck test
```
buck test mode/dev-nosan //caffe2/test:nn -- "test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu"

OMP_NUM_THREADS=3 buck test mode/opt -c pytorch.parallel_backend=tbb //caffe2/test:nn -- "test_EmbeddingBag_per_sample_weights_and_new_offsets"  --print-passing-details
```

Generate the AVX2 code for embedding_lookup_idx_avx2.cc:
```
python hp_emblookup_codegen.py --use-offsets
```

Differential Revision: D17768404

fbshipit-source-id: 8dcd15a62d75b737fa97e0eff17f347052675700
2020-01-23 21:29:44 -08:00
Brian Wignall
f326045b37 Fix typos, via a Levenshtein-type corrector (#31523)
Summary:
Should be non-semantic.

Uses https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines to find likely typos, with https://github.com/bwignall/typochecker to help automate the checking.

Uses an updated version of the tool used in https://github.com/pytorch/pytorch/pull/30606 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31523

Differential Revision: D19216749

Pulled By: mrshenli

fbshipit-source-id: 7fd489cb9a77cd7e4950c1046f925d57524960ea
2020-01-17 16:03:19 -08:00
Hector Yuen
9e9ca6ec37 add conversion functions to embedding tables (#31083)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31083

add (fp32/fp16)<->(int8 rowwise quantized fp32/fp16 scale biases)

Test Plan:
added unit tests
enhanced shape inference tests

Reviewed By: jspark1105

Differential Revision: D18920547

fbshipit-source-id: 6b3d7cb93f9d1669ecf511817d73976177632891
2020-01-08 16:56:12 -08:00
Jongsoo Park
7a12ccd003 optimize FloatToFused8BitRowwiseQuantized and Fused8BitRowwiseQuantizedToFloat (#31470)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31470

Optimize performance of these two operators.
Additionally use nearbyint instead of round to be consistent with 4-bit embedding table quantization.

Reviewed By: hyuen

Differential Revision: D19072103

fbshipit-source-id: efe96f14aeff7958cceb453ed625d3fd693891ff
2019-12-20 10:09:26 -08:00
Jongsoo Park
e09c415387 Back out "make the order btw div and mul in adagrad update consistent" (#30737)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30737

Original commit changeset: 2a8b2a3f5401

Reverting this to be safe until we address test failures in T58528495

Test Plan: CI

Reviewed By: wx1988

Differential Revision: D18812384

fbshipit-source-id: 2a3ac554024773022ec827f259127e4c8cffe6e2
2019-12-04 17:43:45 -08:00
Jongsoo Park
d32f261f16 make the order btw div and mul in adagrad update consistent (#30449)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30449

There was an inconsistency in the order of operation between scalar and SIMD code when we compute Adagrad.
In this diff we first compute effective_lr = lr / (sqrt(moment) + epsilon) and then multiply by the gradient.

Test Plan: CI

Reviewed By: protonu

Differential Revision: D18703416

fbshipit-source-id: 2a8b2a3f5401466549561412bd22f07abac3c598
2019-12-02 13:53:38 -08:00
Jongsoo Park
649e7f057e fix comment index_size->output_size (#29831)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29831

As title. Thanks Aleks Zi for finding this!

Test Plan: Just changing comments

Reviewed By: zlateski

Differential Revision: D18511259

fbshipit-source-id: 5f1ad9ba53db9b22622a556ec214ced361ec016a
2019-11-16 01:49:02 -08:00
James Donald
7f485121a6 Avoid MSVC _cvtsh_ss() workaround with clang-cl (#29726)
Summary:
We (me, fnabulsi, bmcdb) have a handful of fixes used locally to build and run with clang-cl. I am aware of https://github.com/pytorch/pytorch/issues/8784 but it has not been touched in almost a year.

It may be more practical to upstream the non-controversial fixes piecewise. For example, this one.

Here, the dummy version of `_cvtsh_ss` for MSVC is not required (and hence causes conflicts) when using clang-cl so can be #ifdef'd out.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29726

Differential Revision: D18478120

Pulled By: ezyang

fbshipit-source-id: cdcd94251e68347446f2ad1ac5a0e71089f7d0ab
2019-11-13 12:49:13 -08:00