93 Commits

Author SHA1 Message Date
jjsjann123
b21a6ff639 [NVFuser] Upstream push 0811 (#83239)
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Code changes include:

- codegen improvements:
  1. double support in expression evaluator
- bug fixes:
  1. dropout fix - rework RNG to support broadcasted dropout (Fixes #82784)
  2. expand fix - Patch expand+reduction, expand+view, rework view analysis and guard
- scheduler:
  1. manual transpose schedule example
  2. WIP transpose scheduler

Commits in this PR from the devel branch:

```
b7435afcd22c917713c2f41a7237bc26e1183f14 Transpose scheduler, step 1 (#1854)
8a45dbf72034684eb8e18b1835b533e90b68f184 Add an example on how to manually schedule transpose (#1889)
83dbf56a9554b2efbd5416461d938fff477b0b27 Patch dropout fix (#1898)
69d3519a532250719b1aa8341b50e067b181b42d Expand+Reduction, Expand+View support, rework View analysis and guards (#1883)
15091c488e96343bdc49e3990acbf238a3b3da51 Rework RNG to correctly support broadcasted dropout (#1888)
aafe2d048aaac596e503596a41303423619f3954 Make ExpressionEvaluator support Double (#1885)
```

RUN_TORCHBENCH: nvfuser

Differential Revision: [D38657074](https://our.internmc.facebook.com/intern/diff/D38657074)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83239
Approved by: https://github.com/davidberard98
2022-08-25 02:23:22 +00:00
Richard Barnes
b745e5f115 Check all CUDA API calls for errors in benchmarks/cpp/nvfuser (#74920) (#81817)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74920

Test Plan: Sandcastle

Differential Revision: D35194656

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81817
Approved by: https://github.com/malfet
2022-08-24 18:59:05 +00:00
jjsjann123
df741c589f [NVFuser] Upstream push 0809 (#83067)
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Code changes include:

- codegen improvements:
  1. redundant thread compute analysis to avoid unnecessary sync insertion
  2. symmetric API for BestEffortReplay
  3. support merge on trivial reductions
  4. Ampere async copy improvements
- bug fixes:
  1. vectorization bug fixes
  2. type inference patch: fixes upstream #81725
  3. segmenter bug fix with deterministic iteration ordering
- parser update
  1. added leaky_relu
- scheduler
  1. normalization scheduler clean up.
  2. simplifies matmul scheduling with new transform propagator
  3. merge all dimensions in PW scheduler
  4. various gemm related improvements
- debuggability
  1. nsight compute support
  2. debug dump for InlinePropagator
  3. Add `UnaryOpType::Print`

Commits are squashed to work around (WAR) the GitHub API.
Commits actually in this PR from the devel branch:

```
dfe02f3faed4c64477e5f5c678f21f33415d0195 Merge remote-tracking branch 'csarofeen/devel' into HEAD
16173732ecfafc4797e93c2449cfb778015a6c7a Add `TensorViewBuilder::shape(std::vector<Val*> shape)` (#1884)
7cfb7796bdcf055eb61d600b7b5c9df292950290 Merge pull request #1887 from csarofeen/upstream_merge_0803
3399f6de62061d30781de50ef1862bbfb1615173 Merge remote-tracking branch 'origin/viable/strict' into HEAD
01208f5bba3bc158d41ccbefa0ee2c5ceea7aedb Add `UnaryOpType::Print` which can be helpful for debugging (#1878)
0646522454aa715ef164c88a73fb8bdddc706805 Remove redundant TORCH_INTERNAL_ASSERT in lower_magic_zero.cpp (#1881)
7bc76aa219293a59e4166e258d76289fe13633ca Fix most inlined propagator for mismatched dims (#1875)
501f4aa270bf4dd47b0d2f4860bc6f23ebc32a38 Nonaffine swizzle formulation ep.2: Loop swizzle variant. (#1826)
d863d690f923047a85b5229a787118708f810741 Ampere async copy ep.2: circular buffering extension to support pipelined matmul operand load (#1827)
e0ae11a61c87cd998e88ddd79a496548171c31e0 Larger sized mma instructions to support full vectorization (#1824)
9bb4cf7a66b098f04c9d95a2d34ab2bceee151b3 fragment iteration to support fully unrolled mma ops (#1823)
a48270a18dc2d3accc2626758d14d5858ae55032 Merge all dims in pointwise scheduler (#1872)
172fb3673fb4aaf4c1e889922a4fc5c06cbd59f7 Make MostInlined and BestEffort inline propagation no longer assert replayed (#1868)
a64462a5ac2fcf57a177bf36b0f26c61a4e252a4 Allow trivial reduction to be merged (#1871)
440102bcda6eb1dcd42d5fa5aeab9d6b049956bc Symmetric API for BestEffortReplay (#1870)
d1caf330c08ea8002f7133ca655bbd5b28c4eb98 Some misc cleanups/refactor split out from #1854 (#1867)
1013eda50be38eac96c00ba781340ac199d5a136 Remove some welford specific logic. (#1864)
51589d36be5a101d06e641fe0400b39028b7cb81 Some cleanups on tests and heuristics params (#1866)
a6b3e70da5dee51dbc246347228ea21384e46ac3 Segmenter bug fix, and deterministic iteration ordering.  (#1865)
1b665b9b5e562d6f0caba5e7319e83e5df64104f Add nullptr checks to IrBuilder (#1861)
1cd9451d7493f631c2837ba07c1ea93a74e83a15 Simplify matmul scheduling with the new transform propagator.  (#1817)
bbc1fb9b8c454f557ab9fcf5b1c3cef9b9e136d0 Add leaky_relu operation (#1852)
e842a9bab5e9f7289b7ce33ee37a682b22373f49 Minor cleanup in pointwise scheduler (#1858)
9ee850ca2f7f51dd5269bffb1255e485f809282d Fix stringstream usage (#1857)
20a36c1e4f28c4ff9837e56784be2686d17435f3 Improve nsight compute support (#1855)
405910308301097297b55c34d560aab6a360e897 Remove debugging `true ||` from getPointwiseHeuristics (#1822)
01117bfe8fdfacdbfdcfba9a624cdf900fe044d4 Misc cleanup (#1853)
5cc64943dc381a568223140bce0f22163c01e29f Apply the magic-zero protection to each indexed domain individually for predicate indexing (#1846)
92e6f0207e3a89fe90fd5cd3ffc575dfd766ba00 Cleanup normalization scheduler (#1845)
db89c6591a2f21130599a93675e0615e55564e41 Type inference patch (#1848)
102fe93a4605ca465cda26ebaee4ba1af2026901 Add debug dump for InlinePropagator (#1847)
b7a4d93d375a6e2ddef483763c93ffddc62ec452 Redundant thread compute analysis to avoid un-necessary sync insertion (#1687)
942be5b256056d0e02877361b814ae6af32ca15f Upstream ci build fixes (#1842)
0b83645915029d67f9345aa4649b8c6f62b0061b Fix vectorization bug introduced in #1831 (#1840)
63630f1ae091180e541932a9d9dc598e0a9902dd Move MaxProducerPosUpdater into InlinePropagator::tearDown (#1825)
9135a963c01d97ba34b1a7d2f106e78a13fd6651 Fix transpose benchmark dtype (#1839)
2c9a6c02312d5bf4f83cde653b847b4f85849432 Add extra configurability to `parallelizeAllLike` (#1831)
```

RUN_TORCHBENCH: nvfuser

Differential Revision: [D38543000](https://our.internmc.facebook.com/intern/diff/D38543000)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83067
Approved by: https://github.com/davidberard98
2022-08-10 21:02:56 +00:00
Nikita Shulga
62c8d30f9f [BE] Add append_cxx_flag_if_supported macro (#82883)
And use it throughout the CMakeLists and rectify `IF(APPLE)`/`IF(GNU_CXX_VERSION VERSION_GREATER A.B)` and so on

Also, add `target_compile_options_if_supported` and use it in `Dependencies.cmake` as well as in the tests' `CMakeLists.txt`

Delete `-Wno-unknown-warning-option` to test that the conditions indeed work as expected
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82883
Approved by: https://github.com/seemethere
2022-08-10 14:32:26 +00:00
PyTorch MergeBot
d3a1f17fc7 Revert "[BE] Add append_cxx_flag_if_supported macro (#82883)"
This reverts commit d7e6aaa59b.

Reverted https://github.com/pytorch/pytorch/pull/82883 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally
2022-08-10 10:27:59 +00:00
Nikita Shulga
d7e6aaa59b [BE] Add append_cxx_flag_if_supported macro (#82883)
And use it throughout the CMakeLists and rectify `IF(APPLE)`/`IF(GNU_CXX_VERSION VERSION_GREATER A.B)` and so on

Also, add `target_compile_options_if_supported` and use it in `Dependencies.cmake` as well as in the tests' `CMakeLists.txt`

Delete `-Wno-unknown-warning-option` to test that the conditions indeed work as expected
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82883
Approved by: https://github.com/seemethere
2022-08-08 21:04:09 +00:00
jjsjann123
8d753c8062 [WIP] Upstream push 0627 (#80355)
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Code changes include:

- TransformPropagator refactor: switched to Dijkstra instead of exhaustive enumeration on all possible paths to reduce compilation time on transform propagation;
- Indexing refactor: remove reference tensor creation in all tensor indexing logic (#1690)
- (more) generic grouped grid reduction kernel;
- Minor parser/fuser patches:
  1. zero-dim tensor reduction support
  2. no-op binary removal within fused graph
  3. expand supported in fusion
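
The switch from exhaustive path enumeration to Dijkstra mentioned above can be illustrated with a generic sketch. This is not nvfuser's implementation: the graph, the node names (`tv0` to `tv3`), the edge costs and the function name `shortest_costs` are all hypothetical. It only shows why a priority queue avoids walking every possible path.

```python
import heapq

def shortest_costs(graph, source):
    """Dijkstra: cheapest cost from source to every reachable node.

    graph maps a node to a list of (neighbor, non-negative cost) pairs.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        # Skip stale heap entries that were superseded by a cheaper path.
        if d > dist.get(node, float("inf")):
            continue
        for neighbor, cost in graph.get(node, ()):
            candidate = d + cost
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist

# Hypothetical DAG of tensors; edge costs are made up for illustration.
graph = {
    "tv0": [("tv1", 1), ("tv2", 4)],
    "tv1": [("tv2", 1), ("tv3", 5)],
    "tv2": [("tv3", 1)],
    "tv3": [],
}
costs = shortest_costs(graph, "tv0")
print(costs)
```

Each node is settled once with its cheapest cost, so the work grows with the number of edges rather than with the number of distinct paths.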

Commits are squashed to work around (WAR) the GitHub API.
Commits actually in this PR from the devel branch:

```
a054b3efcf5af58ea518de283f55aaf9fe06ff5f Refactor TransormPropagator to allow specifying a position and propagating to part of the DAG (#1775)
d67e1cda9b802036841a371318014a818a849b0a Indexing refactor stage 1: remove reference tensor creation in all tensor indexing logic (#1690)
1b6529956a1ace220898ad09dde0bf85e49827f7 Issue 1770 (#1774)
35b04276b648c9b55cdb6a67f3889f54e745c3d2 Avoid compilation errors like below: (#1773)
452c77326a340d2a4130b7802f4f319aec60e72a Ignore reductions of zero-dim tensors per PyTorch conventions (#1771)
31d6c56d88afba09ac53b2d5dd3493d625f8cd57 TransformPropagator refactor (#1769)
570c5a84b91a3cf67207331be9650d26a2d37e3d Merge pull request #1767 from csarofeen/upstream_merge_0621
9d6c3d84be86da643df6fd51695543938111f20d merging upstream 61305cd638
0ed815f76b08f285bda855dd500692ff10a8abce New TransformPropagator algorithm (#1763)
6c195200c0a92fb0f38c833431a8940ed07569b9 no-op binary removal (#1764)
ec7fa4187c177186527409dfc5c7b1754d30bc92 Proper propagation of IterType (#1762)
b263562dbc3c865007ad7d7d42a58a20be8d7922 Fix dimensionality check (#1759)
2d6343f6cc1e47b63ef20a50d1446f6480736478 More generic grouped grid reduction kernel (#1740)
64e2b56df2c8b9fd22a362d9cc05974a8607ef3d [nvfuser] prevent spamming warning message (#77777) (#1758)
0c431624ff15b6458b9f9b674a3852373fc426b1 [nvFuser] Improving bitwise ops support (#77158) (#1757)
b93a14777fde3b9b39684b9cf1715651a806b281 Parser expand (#1754)
```

RUN_TORCHBENCH: nvfuser
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80355
Approved by: https://github.com/davidberard98
2022-07-13 19:34:31 +00:00
jjsjann123
c9c402eae9 [nvfuser_upstream_push] Reland: nvfuser code base bump 060822 (#79406)
Landing reverted PR #79147.

Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Bug fixes and minor refactor

Commits are squashed to work around (WAR) the GitHub API.
Commits actually in this PR from the devel branch:

```
4c60e7dff22a494632370e5df55c011007340d06 Add examples infrastructure for using nvFuser in a standalone program (#1725)
02a05d98334ffa580d73ccb28fdb8c577ad296fe Fix issue #1751 (#1753)
8a69aa320bd7629e1709fe5ceb7104d2c88ec84c Refactor NvFuser transpose API to match eager mode behavior (#1746)
ffdf6b7709048170d768217fcd7083fc8387f932 Remove BroadcastWithoutStride. (#1738)
02bab16035e70734450c02124f5cdaa95cf5749d Fix flipping of a boolean flag (#1745)
465d66890c8242e811224359cbdb1c2915490741 cleanup (#1744)
26d354e68720bc7dd2d3b1338ac01b707a230b6a fixing noncontig broadcast (#1742)
856b6b2f9073662dd98ca22ba6c3540e20eb1cdd Add IterDomainBuilder (#1736)
1fd974f912cd4c1e21cbd16e2abb23598d66a02f fixing warning for gcc7 (#1732)
de2740a43a869f8272c2648e091d7b8235097db9 disabling complex in python tests for #1730 (#1733)
fbbbe0a2e7c7a63e0e2719b8bfccb759b714221a fixing MSVC build (#1728)
b5feee5e2b28be688dbddc766f3c0220389c8175 Fix the fused reduction runtime kernel (#1729)
5247682dff5980bb66edf8d3aac25dea2ef2ced5 Re-entrant GroupedGridReduction (#1727)
```

RUN_TORCHBENCH: nvfuser
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79406
Approved by: https://github.com/davidberard98
2022-06-16 17:52:21 +00:00
PyTorch MergeBot
d28e9e145b Revert "[nvfuser_upstream_push] nvfuser code base bump 060822 (#79147)"
This reverts commit 49c41b87a2.

Reverted https://github.com/pytorch/pytorch/pull/79147 on behalf of https://github.com/janeyx99 due to Broke 11.3 builds on trunk 49c41b87a2
2022-06-10 20:55:10 +00:00
jjsjann123
49c41b87a2 [nvfuser_upstream_push] nvfuser code base bump 060822 (#79147)
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Bug fixes and minor refactor

Commits are squashed to work around (WAR) the GitHub API.
Commits actually in this PR from the devel branch:

```
4c60e7dff22a494632370e5df55c011007340d06 Add examples infrastructure for using nvFuser in a standalone program (#1725)
02a05d98334ffa580d73ccb28fdb8c577ad296fe Fix issue #1751 (#1753)
8a69aa320bd7629e1709fe5ceb7104d2c88ec84c Refactor NvFuser transpose API to match eager mode behavior (#1746)
ffdf6b7709048170d768217fcd7083fc8387f932 Remove BroadcastWithoutStride. (#1738)
02bab16035e70734450c02124f5cdaa95cf5749d Fix flipping of a boolean flag (#1745)
465d66890c8242e811224359cbdb1c2915490741 cleanup (#1744)
26d354e68720bc7dd2d3b1338ac01b707a230b6a fixing noncontig broadcast (#1742)
856b6b2f9073662dd98ca22ba6c3540e20eb1cdd Add IterDomainBuilder (#1736)
1fd974f912cd4c1e21cbd16e2abb23598d66a02f fixing warning for gcc7 (#1732)
de2740a43a869f8272c2648e091d7b8235097db9 disabling complex in python tests for #1730 (#1733)
fbbbe0a2e7c7a63e0e2719b8bfccb759b714221a fixing MSVC build (#1728)
b5feee5e2b28be688dbddc766f3c0220389c8175 Fix the fused reduction runtime kernel (#1729)
5247682dff5980bb66edf8d3aac25dea2ef2ced5 Re-entrant GroupedGridReduction (#1727)
```

RUN_TORCHBENCH: nvfuser
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79147
Approved by: https://github.com/davidberard98
2022-06-10 19:37:42 +00:00
jjsjann123
a2802ad0b9 Upstream master bump 0513 (#77471)
Updating nvfuser code base.

This should fix the indexing issue observed in https://github.com/pytorch/vision/issues/6015.

Running tests locally as well. Will update the description here at a later point.

@bypass-github-export-checks
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77471
Approved by: https://github.com/seemethere, https://github.com/eellison
2022-05-18 11:48:50 -07:00
Wang, Eikan
429a80dded [NNC] Lowering function generates the output buffer with the specified stride (#76529)
Summary:
Pass stride information to the lowering function to generate the output buffer with the proper memory layout.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76529

Reviewed By: ZolotukhinM

Differential Revision: D36116712

Pulled By: IvanKobzarev

fbshipit-source-id: d3901f756b3710ecce172d6db3ecb0b7c12fb929
(cherry picked from commit b6cd53c91c01db36ea0e99167dc0ce0ae1d3aa23)
2022-05-04 20:04:22 +00:00
zengk95
1d55518198 Revert "[nnc] Strides to Tensor (#72962)"
This reverts commit 939060925f.

Fixes https://github.com/pytorch/vision/issues/5873

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76332
Approved by: https://github.com/seemethere
2022-04-25 19:50:00 +00:00
Ivan Kobzarev
939060925f [nnc] Strides to Tensor (#72962)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72962

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM, cpuhrsch

Differential Revision: D34589306

Pulled By: IvanKobzarev

fbshipit-source-id: ecee5249760ecc0c8b2edb1842b90218899bc944
(cherry picked from commit 9e310c4c67389da30da89126d838ffe3864aba6f)
2022-04-23 19:35:15 +00:00
Nikita Shulga
80ea6955af Add cuda-11.3+clang9 build workflow (take 2)
To be able to detect unused captures in GPU code lambdas (as gcc does not support this diagnostic)

Remove unused opts lambda capture in `ProcessGroupMPI.cpp` and `Distributions.cu`

Fix sign-compare in nvfuser benchmark and ignore signed unsigned comparison in nvfuser tests
Fixes https://github.com/pytorch/pytorch/issues/75475 by aliasing CMAKE_CUDA_HOST_COMPILER to C_COMPILER when clang is used
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75293
Approved by: https://github.com/atalman, https://github.com/seemethere
2022-04-11 17:13:01 +00:00
PyTorch MergeBot
8fe43d76d5 Revert "Add cuda-11.3+clang9 build workflow"
This reverts commit 709fcc862e.

Reverted https://github.com/pytorch/pytorch/pull/75293 on behalf of https://github.com/janeyx99
2022-04-11 15:24:59 +00:00
Nikita Shulga
709fcc862e Add cuda-11.3+clang9 build workflow
To be able to detect unused captures in GPU code lambdas (as gcc does not support this diagnostic)

Remove unused opts lambda capture in `ProcessGroupMPI.cpp` and `Distributions.cu`

Fix sign-compare in nvfuser benchmark and ignore signed unsigned comparison in nvfuser tests
Fixes https://github.com/pytorch/pytorch/issues/75475 by aliasing CMAKE_CUDA_HOST_COMPILER to C_COMPILER when clang is used
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75293
Approved by: https://github.com/atalman, https://github.com/seemethere
2022-04-11 14:10:57 +00:00
jjsjann123
873ced7cd0 Nvfuser code bump 030122 (#73627)
Summary:
Things changed in this PR that require review:

test/forward_backward_compatibility/check_forward_backward_compatibility.py

Our previous function overload extension names were wrong and have been updated in this PR, hence the update to the compatibility list.

The nvfuser code updates include bug fixes for failures we encountered in OpInfoTests as well as failures reported by the AOTAutograd team.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/73627

Reviewed By: Chillee

Differential Revision: D34765458

Pulled By: davidberard98

fbshipit-source-id: c81f3d6a1b723fb3a8ba419b7f82227f70440ca7
(cherry picked from commit b6a2c362c37051e44fac31687b2fe272f776551e)
2022-03-31 08:18:22 +00:00
jiej
2d110d514f Nvfuser code bump 2_1_2022 (#72127)
Summary:
Things changed in this PR that require review:
1. aten/src/ATen/core/interned_strings.h
2. torch/csrc/jit/ir/alias_analysis.h : exposing createValue to allow efficient mutation
3. torch/csrc/jit/runtime/symbolic_shape_registry.cpp : added gelu/tanh/erf in registry
4. torch/jit/_script.py : throws when the model being scripted uses autocast as a decorator, since that is not supported

nvfuser code update:
1. codegen improvements and performance tuning
2. integration bug fixes for shape expression logic
3. kernel segmentation update to address perf regression from horizontal fusion
4. scalar cpu tensor promotion to support inter-device operation between cpu scalar tensor and cuda tensor

Things reverted from local changes:
aten::gelu with approximation (tracked in PR: https://github.com/pytorch/pytorch/pull/61439)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72127

Reviewed By: HamidShojanazeri

Differential Revision: D34113233

Pulled By: jbschlosser

fbshipit-source-id: b82cde32b71e324eca0ea57cb8c9f9647278ca74
(cherry picked from commit e009bc5c4e)
2022-02-15 00:43:16 +00:00
Mikhail Zolotukhin
1855b14922 [TensorExpr] Delet DimArg class. (#72390)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72390

This class didn't add much value and only caused more boilerplate code.
This change removes the class and updates all the use cases with
uses of `ExprHandle`.

A side effect of this change is different names in loop variables, which
caused massive mechanical changes in our tests.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D34030296

Pulled By: ZolotukhinM

fbshipit-source-id: 2ba4e313506a43ab129a10d99e72b638b7d40108
(cherry picked from commit c2ec46a058)
2022-02-11 01:21:59 +00:00
Raghavan Raman
4eb277ac61 [bench] Adding a cpp benchmark to compare performance of nnc with static and symbolic shapes (#72197)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72197

Test Plan: Imported from OSS

Reviewed By: huiguoo

Differential Revision: D33951742

Pulled By: navahgar

fbshipit-source-id: 0412d61da158e98429f377469e1c331587390b14
(cherry picked from commit c043fdfc79)
2022-02-07 07:01:19 +00:00
Raghavan Raman
237e960ec9 [bench] Fix build issues with TensorExpr cpp benchmarks (#72196)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72196

Test Plan: Imported from OSS

Reviewed By: dagitses

Differential Revision: D33951743

Pulled By: navahgar

fbshipit-source-id: f1b36bb3ba9cd649f0dbf0911f5a9e4791089e65
(cherry picked from commit fbe5cadb5f)
2022-02-07 07:01:19 +00:00
Raghavan Raman
38f696c0cd [nnc] Add a API to unroll loops by a given factor (#72071)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72071

Reviewed By: ngimel

Differential Revision: D33946250

Pulled By: navahgar

fbshipit-source-id: 3f3f92054174620025a9d71154d006f1738953e2
(cherry picked from commit d8b53598e9)
2022-02-03 18:41:21 +00:00
Richard Barnes
29d759948e use irange for loops 2 (#66746)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66746

Modified loops in files under fbsource/fbcode/caffe2/ from the format

`for(TYPE var=x0;var<x_max;var++)`

to the format

`for(const auto var: irange(x_max))`

This was achieved by running r-barnes's loop upgrader script (D28874212), modified to exclude all files under /torch/jit, plus a number of reversions and unused-variable warning suppressions added by hand.

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D31705361

fbshipit-source-id: 33fd22eb03086d114e2c98e56703e8ec84460268
2021-12-10 04:26:23 -08:00
CodemodService FBSourceClangFormatLinterBot
143491e0ad [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D32484422

fbshipit-source-id: 5c836dc7d06f12e64cc4bb1e85d8fa4b62a29b85
2021-11-17 07:27:04 -08:00
jjsjann123
0dc3f829d9 Nvfuser code bump 11 5 (#67943)
Summary:
nvfuser code update:
1. Tuning heuristics on schedulers for reduction/normalization kernels;
2. bfloat16 on IO tensor support;
3. Refactored memory format support: we can now support dimension collapsing with non-coherent input tensors that have different memory formats, e.g. a channels-last tensor input to batch normalization. Note that memory format is currently limited to Contiguous and Channels last;
4. Refactored nvfuser graph partitioning in `graph_fuser.cpp`, separated node merge and profile node API. Updated `profiling_record.cpp`.

Things that are reverted from our local branch:
1. changes on some entries in autodiff
2. aten::gelu with approximation
3. native_dropout(_backward)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67943

Reviewed By: ngimel

Differential Revision: D32288709

Pulled By: dzhulgakov

fbshipit-source-id: fc9491182ea7e0158bc112c66f096823c588eaf1
2021-11-17 01:22:17 -08:00
Hao Lu
938bab0bfd [PyTorch] Add int version of vectorized PrefixSum to Benchmark (#67865)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67865

- Add int version of vectorized PrefixSum
- Use unaligned load/store instructions
- Add exclusive scan version. "exclusive" means that the i-th input element is not included in the i-th sum. For details see https://en.cppreference.com/w/cpp/algorithm/exclusive_scan
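
To make the inclusive/exclusive distinction above concrete, here is a minimal scalar sketch. It is not the vectorized benchmark code from this commit; the function names are illustrative only.

```python
from itertools import accumulate

def inclusive_scan(values):
    """Element i of the result is the sum of values[0..i]."""
    return list(accumulate(values))

def exclusive_scan(values, init=0):
    """Element i of the result is init plus the sum of values[0..i-1].

    The i-th input element is not included in the i-th sum.
    """
    out = []
    running = init
    for v in values:
        out.append(running)  # record the sum before adding the current element
        running += v
    return out

data = [3, 1, 4, 1, 5]
print(inclusive_scan(data))  # [3, 4, 8, 9, 14]
print(exclusive_scan(data))  # [0, 3, 4, 8, 9]
```

The exclusive result is the inclusive result shifted right by one position, with the initial value in front.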

Test Plan:
```
buck build mode/opt-clang //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench
OMP_NUM_THREADS=1 numactl -m 0 -C 5 \
./buck-out/opt/gen/caffe2/benchmarks/cpp/tensorexpr/tensorexpr_bench --benchmark_filter=PrefixSumBench
```

For full benchmark results, see P465274613

```
PrefixSumBench/LocalInt/64                            57 ns         56 ns   12414048 GB/s=9.06239G/s
PrefixSumBench/LocalInt/256                          221 ns        221 ns    3160853 GB/s=9.28635G/s
PrefixSumBench/LocalInt/1024                         818 ns        817 ns     857922 GB/s=10.0235G/s
PrefixSumBench/LocalInt/4096                        3211 ns       3210 ns     217614 GB/s=10.2093G/s
PrefixSumBench/LocalInt/16384                      12806 ns      12804 ns      54805 GB/s=10.2364G/s
PrefixSumBench/LocalInt/65536                      51115 ns      51079 ns      13741 GB/s=10.2643G/s
PrefixSumBench/LocalInt/262144                    205974 ns     205912 ns       3401 GB/s=10.1847G/s
PrefixSumBench/LocalInt/1048576                   829523 ns     828859 ns        845 GB/s=10.1207G/s
PrefixSumBench/LocalIntAVX2/64                        45 ns         45 ns   15568113 GB/s=11.3549G/s
PrefixSumBench/LocalIntAVX2/256                      208 ns        208 ns    3371174 GB/s=9.86913G/s
PrefixSumBench/LocalIntAVX2/1024                     893 ns        892 ns     783154 GB/s=9.18629G/s
PrefixSumBench/LocalIntAVX2/4096                    3618 ns       3613 ns     193834 GB/s=9.06838G/s
PrefixSumBench/LocalIntAVX2/16384                  14416 ns      14411 ns      48564 GB/s=9.09543G/s
PrefixSumBench/LocalIntAVX2/65536                  57650 ns      57617 ns      12156 GB/s=9.09952G/s
PrefixSumBench/LocalIntAVX2/262144                230855 ns     230612 ns       3035 GB/s=9.09386G/s
PrefixSumBench/LocalIntAVX2/1048576               924265 ns     923777 ns        758 GB/s=9.08077G/s
PrefixSumBench/LocalIntAVX512/64                      23 ns         23 ns   24876551 GB/s=22.0697G/s
PrefixSumBench/LocalIntAVX512/256                     95 ns         95 ns    7387386 GB/s=21.556G/s
PrefixSumBench/LocalIntAVX512/1024                   435 ns        435 ns    1609682 GB/s=18.8425G/s
PrefixSumBench/LocalIntAVX512/4096                  1815 ns       1815 ns     385462 GB/s=18.0561G/s
PrefixSumBench/LocalIntAVX512/16384                 7479 ns       7476 ns      93660 GB/s=17.5335G/s
PrefixSumBench/LocalIntAVX512/65536                30171 ns      29879 ns      23430 GB/s=17.5468G/s
PrefixSumBench/LocalIntAVX512/262144              125805 ns     125631 ns       5570 GB/s=16.6929G/s
PrefixSumBench/LocalIntAVX512/1048576             504216 ns     503983 ns       1384 GB/s=16.6446G/s
PrefixSumBench/ExclusiveScanIntAVX512/64              23 ns         23 ns   30058295
PrefixSumBench/ExclusiveScanIntAVX512/256            101 ns        101 ns    7398498
PrefixSumBench/ExclusiveScanIntAVX512/1024           435 ns        434 ns    1403877
PrefixSumBench/ExclusiveScanIntAVX512/4096          1979 ns       1978 ns     354016
PrefixSumBench/ExclusiveScanIntAVX512/16384         7828 ns       7819 ns      89551
PrefixSumBench/ExclusiveScanIntAVX512/65536        31206 ns      31192 ns      22408
PrefixSumBench/ExclusiveScanIntAVX512/262144      130106 ns     130023 ns       5388
PrefixSumBench/ExclusiveScanIntAVX512/1048576     525515 ns     524976 ns       1244
```
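
One possible reading of the `GB/s` counter, inferred from the numbers above rather than stated in the commit: each of the N elements is counted as one 4-byte read plus one 4-byte write, so the counter is roughly 2 * 4 * N bytes divided by the CPU time. The helper name and the read-plus-write assumption below are hypothetical.

```python
def estimated_gb_per_s(num_elements, cpu_time_ns, bytes_per_element=4):
    """Estimate throughput, ASSUMING one read and one write per element."""
    bytes_moved = 2 * bytes_per_element * num_elements
    # Bytes per nanosecond is numerically equal to GB/s (1e9 bytes per second).
    return bytes_moved / cpu_time_ns

# CPU times (second column) are taken from the table above.
print(estimated_gb_per_s(64, 56))           # LocalInt/64, reported 9.06239
print(estimated_gb_per_s(1024, 817))        # LocalInt/1024, reported 10.0235
print(estimated_gb_per_s(1048576, 503983))  # LocalIntAVX512/1048576, reported 16.6446
```

The estimates land close to the reported values; small differences are expected because the printed times are rounded to whole nanoseconds.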

Reviewed By: navahgar, swolchok

Differential Revision: D32011740

fbshipit-source-id: 7962de710bd588291dd6bf0c719f579c55f7c063
2021-11-04 14:00:19 -07:00
Shashank Chaudhry
89c4e8c22b [NOOP][clangformat][codemod] Enable CLANGFORMAT for some folders in caffe2/* (#67746)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67746

Test Plan: Visual inspection. Sandcastle.

Reviewed By: zertosh

Differential Revision: D31986646

fbshipit-source-id: 91885c20c3cead3853c49abb9fe0a94a67f33cc8
2021-11-03 12:23:14 -07:00
Xue Li
2f099c7555 Revert D30652629: use irange for loops
Test Plan: revert-hammer

Differential Revision:
D30652629 (687c2267d4)

Original commit changeset: 0ae6c4bbbb55

fbshipit-source-id: 5c4f067b584a021c8c9656454d1ee60999600fb3
2021-10-15 15:23:10 -07:00
Richard Barnes
687c2267d4 use irange for loops (#66234)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66234

Modified loops in files under fbsource/fbcode/caffe2/ from the format

`for(TYPE var=x0;var<x_max;var++)`

to the format

`for(const auto var: irange(x_max))`

This was achieved by running r-barnes's loop upgrader script (D28874212), modified to exclude all files under /torch/jit, plus a number of reversions and unused-variable warning suppressions added by hand.

bypass_size_limit
allow-large-files

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D30652629

fbshipit-source-id: 0ae6c4bbbb554bad42e372792a6430e1acf15e3e
2021-10-15 13:50:33 -07:00
Nikita Shulga
4c4525fa5c Compile without -Wno-unused-variable (take 2) (#66041)
Summary:
Delete `-Wno-unused-variable` from top level `CMakeLists.txt`
Still suppress those warnings for tests and `torch_python`

Delete a number of unused variables from caffe2 code
Use `(void)var;` to suppress unused variable in range loops
Use `C10_UNUSED` for global constructors and use `constexpr` instead of `static` for global constants

Do not delete `caffe2::OperatorBase::Output` calls as they have side effects

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66041

Reviewed By: ngimel

Differential Revision: D31360142

Pulled By: malfet

fbshipit-source-id: 6fdfb9f91efdc49ca984a2f2a17ee377d28210c8
2021-10-04 20:39:39 -07:00
Nikita Shulga
e4ee5ca698 Revert D31326599: [pytorch][PR] Compile without -Wno-unused-variable
Test Plan: revert-hammer

Differential Revision:
D31326599 (a6280ab653)

Original commit changeset: 924155f1257a

fbshipit-source-id: b8ee5bc0298637443232f5ee9ec79e51ed256faf
2021-10-01 20:40:47 -07:00
Nikita Shulga
a6280ab653 Compile without -Wno-unused-variable (#65954)
Summary:
Delete `-Wno-unused-variable` from top level `CMakeLists.txt`
Still suppress those warnings for tests and `torch_python`

Delete a number of unused variables from caffe2 code
Use `(void)var;` to suppress unused variable in range loops
Use `C10_UNUSED` for global constructors and use `constexpr` instead of `static` for global constants

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65954

Reviewed By: ngimel

Differential Revision: D31326599

Pulled By: malfet

fbshipit-source-id: 924155f1257a2ba1896c50512f615e45ca1f61f3
2021-10-01 17:40:47 -07:00
Mikhail Zolotukhin
3a0165da49 [TensorExpr] Port NNC lowerings to the new registry mechanism. (#65551)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65551

Previously we had a big switch on Op kind to decide how to lower a given
JIT operator to NNC. This PR changes this switch to a hash table lookup.

Why? This helps us with at least two things:
1) With this approach we can easily check if we know how to handle a
given node in advance - i.e. we can inspect the entire graph and tell
whether it's possible to compile it or not without actually trying to do
that and dying in the middle. This would allow us to, say, provide
user-friendly error messages in AOT workflow.
2) We can switch to using the schema instead of the op kind to determine the correct
lowering. Unlike the op schema, the op kind might be ambiguous (see e.g. #64963),
and using it instead of the schema can lead to bugs.
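
The idea of replacing a switch with a table lookup can be sketched as follows. This is a simplified illustration, not NNC's actual registry: `LOWERINGS`, `register_lowering`, `unsupported_nodes` and `lower_graph` are hypothetical names, and a "graph" here is just a list of (schema, arguments) pairs.

```python
# Hypothetical registry mapping an operator schema string to a lowering function.
LOWERINGS = {}

def register_lowering(schema):
    """Decorator that records fn as the lowering for the given schema string."""
    def decorator(fn):
        LOWERINGS[schema] = fn
        return fn
    return decorator

@register_lowering("aten::relu(Tensor self) -> Tensor")
def lower_relu(args):
    return f"max({args[0]}, 0)"

@register_lowering("aten::neg(Tensor self) -> Tensor")
def lower_neg(args):
    return f"-({args[0]})"

def unsupported_nodes(graph):
    """Return the schemas in the graph that have no registered lowering."""
    return [schema for schema, _ in graph if schema not in LOWERINGS]

def lower_graph(graph):
    """Check the whole graph up front, then lower each node via table lookup."""
    missing = unsupported_nodes(graph)
    if missing:
        raise NotImplementedError(f"no lowering registered for: {missing}")
    return [LOWERINGS[schema](args) for schema, args in graph]

graph = [
    ("aten::relu(Tensor self) -> Tensor", ["x"]),
    ("aten::neg(Tensor self) -> Tensor", ["y"]),
]
print(lower_graph(graph))  # ['max(x, 0)', '-(y)']
```

Because support is a dictionary membership test, the whole graph can be inspected before any lowering starts, which is what makes an early, user-friendly error possible.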

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D31148926

Pulled By: ZolotukhinM

fbshipit-source-id: ac12684e2126c899426ef5e4cc1e3f70fa01f704
2021-09-30 22:56:18 -07:00
Raghavan Raman
8f3983254b [MicroBench] Added a micro benchmark for prefix sum (#65790)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65790

Here are the results of the benchmark:

* ATen - version that calls `at::cumsum`
* NNC - a simple prefix-sum loop implemented in NNC (not vectorized)
* Local - a C++ implementation of the simple prefix-sum loop
* LocalAVX2 - a vectorized C++ implementation of prefix-sum, only using AVX2
* LocalAVX512 - a vectorized C++ implementation of prefix-sum, using AVX512.

The vectorized implementations are from the paper "Parallel Prefix Sum with SIMD" in ADMS '20.

```
$ OMP_NUM_THREADS=1 ./buck-out/opt/gen/caffe2/benchmarks/cpp/tensorexpr/tensorexpr_bench --benchmark_filter=PrefixSumBench
Run on (36 X 1601 MHz CPU s)
2021-09-28 23:13:12
------------------------------------------------------------------------------------------
Benchmark                                   Time           CPU Iterations UserCounters...
------------------------------------------------------------------------------------------
PrefixSumBench/ATen/64                   1289 ns       1289 ns     543199 GB/s=397.069M/s
PrefixSumBench/ATen/256                  1867 ns       1867 ns     374232 GB/s=1096.8M/s
PrefixSumBench/ATen/1024                 4169 ns       4169 ns     167889 GB/s=1.9649G/s
PrefixSumBench/ATen/4096                14137 ns      14136 ns      49266 GB/s=2.31806G/s
PrefixSumBench/ATen/16384               49887 ns      49883 ns      13988 GB/s=2.6276G/s
PrefixSumBench/ATen/65536              193742 ns     193686 ns       3628 GB/s=2.7069G/s
PrefixSumBench/ATen/262144             764803 ns     764774 ns        917 GB/s=2.74219G/s
PrefixSumBench/ATen/1048576           3040653 ns    3040277 ns        231 GB/s=2.75916G/s
PrefixSumBench/Local/64                   586 ns        586 ns    1197003 GB/s=873.244M/s
PrefixSumBench/Local/256                 1077 ns       1077 ns     646265 GB/s=1.90143G/s
PrefixSumBench/Local/1024                3050 ns       3050 ns     229458 GB/s=2.68579G/s
PrefixSumBench/Local/4096               11910 ns      11910 ns      58953 GB/s=2.75132G/s
PrefixSumBench/Local/16384              43204 ns      43202 ns      16081 GB/s=3.03393G/s
PrefixSumBench/Local/65536             167966 ns     167966 ns       4154 GB/s=3.12139G/s
PrefixSumBench/Local/262144            667631 ns     667613 ns       1048 GB/s=3.14127G/s
PrefixSumBench/Local/1048576          2654785 ns    2654631 ns        264 GB/s=3.15999G/s
PrefixSumBench/NNC/64                     642 ns        642 ns    1095277 GB/s=797.442M/s
PrefixSumBench/NNC/256                   1139 ns       1138 ns     617214 GB/s=1.799G/s
PrefixSumBench/NNC/1024                  3103 ns       3103 ns     225531 GB/s=2.63979G/s
PrefixSumBench/NNC/4096                 12053 ns      12052 ns      58084 GB/s=2.71883G/s
PrefixSumBench/NNC/16384                43227 ns      43225 ns      16192 GB/s=3.03231G/s
PrefixSumBench/NNC/65536               168065 ns     168056 ns       4153 GB/s=3.11972G/s
PrefixSumBench/NNC/262144              668974 ns     668921 ns       1045 GB/s=3.13513G/s
PrefixSumBench/NNC/1048576            2657464 ns    2657341 ns        263 GB/s=3.15677G/s
PrefixSumBench/LocalAVX2/64               523 ns        523 ns    1351308 GB/s=979.537M/s
PrefixSumBench/LocalAVX2/256              755 ns        755 ns     927762 GB/s=2.71159G/s
PrefixSumBench/LocalAVX2/1024            1759 ns       1759 ns     400355 GB/s=4.65609G/s
PrefixSumBench/LocalAVX2/4096            6708 ns       6706 ns     103959 GB/s=4.88649G/s
PrefixSumBench/LocalAVX2/16384          22143 ns      22142 ns      31229 GB/s=5.91951G/s
PrefixSumBench/LocalAVX2/65536          83649 ns      83642 ns       8350 GB/s=6.26828G/s
PrefixSumBench/LocalAVX2/262144        330433 ns     330427 ns       2133 GB/s=6.34679G/s
PrefixSumBench/LocalAVX2/1048576      1302301 ns    1302179 ns        537 GB/s=6.44198G/s
PrefixSumBench/LocalAVX512/64             474 ns        474 ns    1459151 GB/s=1080.8M/s
PrefixSumBench/LocalAVX512/256            576 ns        576 ns    1217442 GB/s=3.55524G/s
PrefixSumBench/LocalAVX512/1024           994 ns        994 ns     703387 GB/s=8.24434G/s
PrefixSumBench/LocalAVX512/4096          3642 ns       3641 ns     190646 GB/s=8.99857G/s
PrefixSumBench/LocalAVX512/16384        10140 ns      10140 ns      68947 GB/s=12.9267G/s
PrefixSumBench/LocalAVX512/65536        35739 ns      35736 ns      19567 GB/s=14.6711G/s
PrefixSumBench/LocalAVX512/262144      156415 ns     156413 ns       4467 GB/s=13.4078G/s
PrefixSumBench/LocalAVX512/1048576     613952 ns     613876 ns       1144 GB/s=13.665G/s
```
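
For reference, the kind of loop the `Local` variant is described as using can be sketched like this. It is an assumption-laden sketch, not the benchmark source: the function name and the use of `std::vector<float>` are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Inclusive prefix sum: out[i] = in[0] + in[1] + ... + in[i].
// Each iteration reads the running total produced by the previous one,
// so the loop carries a dependency from one element to the next.
void prefixSum(const std::vector<float>& in, std::vector<float>& out) {
  assert(in.size() == out.size());
  float running = 0.0f;
  for (std::size_t i = 0; i < in.size(); ++i) {
    running += in[i];
    out[i] = running;
  }
}
```

That loop-carried dependency is what the AVX2 and AVX512 variants have to work around, which is consistent with the gap between the `Local` and `LocalAVX*` rows above.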

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D31253849

Pulled By: navahgar

fbshipit-source-id: f33e7be787c86a09e90babddd66b16e2e0777eb4
2021-09-30 14:44:52 -07:00
jiej
127c9402d0 Revert "Revert D30752939: [pytorch][PR] nvfuser update" (#65137)
Summary:
This reverts commit 03389dc851.

Attempting again for PR: https://github.com/pytorch/pytorch/issues/63745
Fixes the Windows build failure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65137

Reviewed By: seemethere, dzhulgakov, heitorschueroff

Differential Revision: D30994556

Pulled By: malfet

fbshipit-source-id: f1925b6c5cc1a1a441a96499667c91e8dfc1b53d
2021-09-22 04:54:51 -07:00
Eli Uriegas
03389dc851 Revert D30752939: [pytorch][PR] nvfuser update
Test Plan: revert-hammer

Differential Revision:
D30752939 (cfaecaf40b)

Original commit changeset: ce122e80f01b

fbshipit-source-id: 57685df8f9946032a06eff1de8a3d1498500d2d2
2021-09-15 17:38:47 -07:00
jiej
cfaecaf40b nvfuser update (#63745)
Summary:
Syncing the nvfuser code base from the devel branch. A few of our developments since the last sync:

- Extends support to normalization and reduction kernels.
- Multiple kernel launches for a single `CudaFusionGroup`. The hierarchical caching system has been updated to cache graph segmentation.
- `profile_ivalue` is enabled to convert dynamic scalars (e.g. reduction axes) into compile-time constants, which are required by the codegen.

To keep this PR simple and relatively review-free, we stripped most external changes and submitted them as separate PRs, so this gigantic PR is easier to handle.

Internal updates are in the following files:
1. updates in nvfuser codegen `torch/csrc/jit/codegen/cuda`
2. added nvfuser-specific benchmarks `benchmarks/cpp/nvfuser`
3. nvfuser jit cpp tests `test/cpp/jit/test_gpu.cpp` `test/cpp/jit/test_gpu_shift.cpp` `test/cpp/jit/test_gpu_validator.h`

updates affecting integration:

1. profile_ivalue enabled for nvfuser; related changes are in `torch/csrc/jit/runtime/*`
2. exposed a few more symbols in `aten/src/ATen/core/*` used by codegen

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63745

Reviewed By: saketh-are

Differential Revision: D30752939

Pulled By: malfet

fbshipit-source-id: ce122e80f01bcd3865f5bd3c4dfde660665fd84c
2021-09-15 14:42:55 -07:00
Mikhail Zolotukhin
f23f21dafe [TensorExpr] Remove 'Placeholder' class. (#64887)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64887

BufHandle has exactly the same functionality and should be used instead.

Differential Revision:
D30889483

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: 365fe8e396731b88920535a3de96bd3301aaa3f3
2021-09-14 00:22:44 -07:00
Raghavan Raman
2cc9778495 [MicroBench] Added a log_vml version of the signed log1p kernel (#64205)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64205

The log_vml version of the micro-bench is over **2x** faster than the log1p version. Here are the perf numbers:

```
---------------------------------------------------------------------------------------------
Benchmark                                   Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------
SignedLog1pBench/ATen/10/1467           45915 ns        45908 ns        14506 GB/s=2.5564G/s
SignedLog1pBench/NNC/10/1467            40469 ns        40466 ns        17367 GB/s=2.9002G/s
SignedLog1pBench/NNCLogVml/10/1467      19560 ns        19559 ns        35902 GB/s=6.00016G/s
```

Thanks to bertmaher for pointing this out.
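
The two variants being compared can be sketched in scalar form. The commit message does not define the kernel, so this assumes "signed log1p" means sign(x) * log1p(|x|) and that the log_vml version evaluates log(1 + |x|) with a vectorized log; the function names are illustrative.

```cpp
#include <cmath>

// Assumed definition: sign(x) * log1p(|x|), using the library log1p.
inline float signedLog1p(float x) {
  return std::copysign(std::log1p(std::fabs(x)), x);
}

// The same quantity written with a plain log: sign(x) * log(1 + |x|).
// This is the form a vectorized log routine can evaluate directly.
inline float signedLogOnePlus(float x) {
  return std::copysign(std::log(1.0f + std::fabs(x)), x);
}
```

The two agree closely for moderate inputs, but `log(1 + |x|)` loses precision when |x| is very small, since `1 + |x|` rounds toward 1. That is the accuracy traded for the speedup in a rewrite of this kind.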

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D30644716

Pulled By: navahgar

fbshipit-source-id: ba2b32c79d4265cd48a2886b0c62d0e89ff69c19
2021-09-10 16:49:06 -07:00
Raghavan Raman
dc4fd3bdda [MicroBench] Added a micro benchmark for a signed log1p kernel. (#64032)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64032

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D30579198

Pulled By: navahgar

fbshipit-source-id: a53d68225fba768b26491d14b535f8f2dcf50c0e
2021-08-30 09:27:51 -07:00
Mikhail Zolotukhin
f0d274294d [TensorExpr] Nuke KernelArena and KernelScope. (#63587)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63587

Now that there are no classes using KernelArena for memory management, we
can remove it.

Differential Revision:
D30429115

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: 375f6f9294d27790645eeb7cb5a8e87047a57544
2021-08-24 00:32:16 -07:00
Mikhail Zolotukhin
62d02f2b57 [TensorExpr] Make 'Tensor' a value type. (#63586)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63586

This is another commit in transition from KernelArena memory management.
Tensor is essentially just a pair of <BufPtr, StmtPtr> and we don't need
to dynamically allocate it at all - it's cheap to pass it by value, and
that's what we're switching to in this commit.

After this change nothing uses KernelScope/KernelArena and they can be
safely removed.
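
The shape of such a value type can be sketched as below. This is not the TensorExpr source: `Buf` and `Stmt` are empty stand-ins, and the sketch assumes `BufPtr` and `StmtPtr` are shared-pointer aliases.

```cpp
#include <memory>
#include <utility>

// Empty stand-ins for the real IR node classes (hypothetical, for illustration).
struct Buf {};
struct Stmt {};
using BufPtr = std::shared_ptr<Buf>;    // assumed alias
using StmtPtr = std::shared_ptr<Stmt>;  // assumed alias

// A value type holding just the two pointers. Copying it copies two
// shared_ptrs; no Tensor object is ever allocated on the heap.
class Tensor {
 public:
  Tensor(BufPtr buf, StmtPtr stmt)
      : buf_(std::move(buf)), stmt_(std::move(stmt)) {}

  const BufPtr& buf() const { return buf_; }
  const StmtPtr& stmt() const { return stmt_; }

 private:
  BufPtr buf_;
  StmtPtr stmt_;
};

// Taken and returned by value; the result shares the same Buf and Stmt.
inline Tensor passThrough(Tensor t) {
  return t;
}
```

Because the object owns nothing beyond the two pointers, there is no separate lifetime for an arena or scope to manage.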

Differential Revision:
D30429114

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: f90b859cfe863692b7beffbe9bd0e4143df1e819
2021-08-24 00:32:13 -07:00
Mikhail Zolotukhin
dd96c26066 [TensorExpr] More NFC changes like Expr* -> ExprPtr. (#63778)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63778

This is a preparation for a switch from raw pointers to shared pointers
as a memory model for TE expressions and statements.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D30487425

Pulled By: ZolotukhinM

fbshipit-source-id: 9cbe817b7d4e5fc2f150b29bb9b3bf578868f20c
2021-08-24 00:30:49 -07:00
Philip Meier
99203580a9 Updates internal assert_allclose callsites in favor of assert_close (#61841)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61841

Redo of #60863.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D30408145

Pulled By: mruberry

fbshipit-source-id: 0b34ebc7f23ba38ecd89640b61d8aca59b7eab58
2021-08-19 12:50:41 -07:00
Nikita Shulga
a9b0a921d5 Disable avoid-non-const-global-variables lint check (#62008)
Summary:
Both the GoogleTest `TEST` macro and `DEFINE_DISPATCH` are non-compliant with this check.

All changes but the ones to `.clang-tidy` are generated using the following script:
```
for i in `find . -type f -iname "*.c*" -or -iname "*.h"|xargs grep cppcoreguidelines-avoid-non-const-global-variables|cut -f1 -d:|sort|uniq`;  do sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" $i; done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008

Reviewed By: driazati, r-barnes

Differential Revision: D29838584

Pulled By: malfet

fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
2021-07-22 18:04:40 -07:00
Bert Maher
93772792e3 [nnc] Get rid of fuser trigger counters (#57334)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57334

Here's a possibly controversial PR.  These counters got in the way of
generalizing the fuser tests to handle arbitrary devices, and I guess I'm just
generally skeptical that they provide much value.  While true that they let us
observe whether fusion groups were created, we already have assertions based on
the shape of the graph, and I'm not sure that I trust those any less than these
counters.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D29471484

Pulled By: bertmaher

fbshipit-source-id: f6d76f6e72dbfb581acff1d834b0c74500941b57
2021-06-29 22:22:15 -07:00
Bert Maher
10e11dbdcd Reland D29190420: [nnc][tests] Tests and benchmarks for computeSum (#60550)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60550

Original commit changeset: ed655497a981

Whatever gcc version OSS Bazel uses wasn't happy move-constructing the
SimpleIREvaluator, so use a unique_ptr instead.
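
The workaround can be sketched as follows. `Evaluator` is a hypothetical stand-in for a class that cannot be move-constructed on some toolchain. It is not the real SimpleIREvaluator, and the actual failure may have looked different.

```cpp
#include <memory>

// Hypothetical stand-in for a class with no usable copy or move constructor.
class Evaluator {
 public:
  explicit Evaluator(int scale) : scale_(scale) {}
  Evaluator(const Evaluator&) = delete;
  Evaluator(Evaluator&&) = delete;

  int run(int x) const { return scale_ * x; }

 private:
  int scale_;
};

// Handing an Evaluator around by value generally needs a usable copy or move
// constructor. Returning a unique_ptr moves only the pointer, so the
// Evaluator itself is constructed once and never relocated.
std::unique_ptr<Evaluator> makeEvaluator(int scale) {
  return std::make_unique<Evaluator>(scale);
}
```

The cost is one heap allocation per evaluator, which is negligible when the object is created once during benchmark setup.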

Test Plan:
CI.  Hope that the gcc version used by OSS Bazel build is
happier with this (it should be), since actually testing it locally is
an intractable pain.

Reviewed By: navahgar

Differential Revision: D29333116

fbshipit-source-id: c3e4b5d8c91eb96a43ae5315a01ca0c0f4d4a99d
2021-06-23 10:50:03 -07:00
Anjali Chourdia
b14f19b6fe Revert D29190420: [nnc][tests] Tests and benchmarks for computeSum
Test Plan: revert-hammer

Differential Revision:
D29190420 (21479ad20c)

Original commit changeset: 86246df82098

fbshipit-source-id: ed655497a981783da4c8f13e2d7fec104e3cb184
2021-06-23 06:59:37 -07:00
Bert Maher
21479ad20c [nnc][tests] Tests and benchmarks for computeSum (#60160)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60160

Adds a few simple tests and benchmarks for the `computeSum` op
(equivalent to `at::sum`).

The benchmarks test 1D reduction and 2D row and column reduction.  Performance
is in the ballpark of aten (14-15 GB/s) on my skylake devserver for all cases,
and occasionally better (e.g. 256k * 64 row reduction goes from 9 GB/s to 13).

Results (on my skylake-avx512, with turbo disabled):
```
------------------------------------------------------------------------------------------
Benchmark                                   Time           CPU Iterations UserCounters...
------------------------------------------------------------------------------------------
Reduce1D/Torch/16777216               4746995 ns    4746722 ns        150 BYTES=14.1379G/s
Reduce1D/Naive/16777216              34063215 ns   34061388 ns         21 BYTES=1.97023G/s
Reduce1D/NativeRfactor/16777216       5057175 ns    5057167 ns        139 BYTES=13.2701G/s
Reduce1D/TeNaive/16777216            33868945 ns   33868851 ns         21 BYTES=1.98143G/s
Reduce1D/TeSplitTail/16777216        33902786 ns   33900436 ns         21 BYTES=1.97959G/s
Reduce1D/TeSplitMask/16777216        33922509 ns   33920604 ns         21 BYTES=1.97841G/s
Reduce1D/TeRfactorV1/16777216         5141150 ns    5141002 ns        135 BYTES=13.0537G/s
Reduce1D/Op/16777216                  5140390 ns    5140091 ns        135 BYTES=13.056G/s
Reduce2DCol/Torch/8/2097152          12824403 ns   12823563 ns         55 BYTES=5.8874G/s
Reduce2DCol/Torch/64/262144           8306873 ns    8306743 ns         83 BYTES=8.20507G/s
Reduce2DCol/Torch/4096/4096           7992364 ns    7992239 ns         87 BYTES=8.3988G/s
Reduce2DCol/OpSchedule/8/2097152/0    4866144 ns    4865766 ns        138 BYTES=15.5161G/s
Reduce2DCol/OpSchedule/64/262144/0   36668978 ns   36666415 ns         19 BYTES=1.85885G/s
Reduce2DCol/OpSchedule/4096/4096/0  155862459 ns  155801266 ns          4 BYTES=430.839M/s
Reduce2DCol/OpSchedule/8/2097152/1    8067683 ns    8061117 ns         85 BYTES=9.36563G/s
Reduce2DCol/OpSchedule/64/262144/1    7496686 ns    7496562 ns         93 BYTES=9.09183G/s
Reduce2DCol/OpSchedule/4096/4096/1    5262821 ns    5262186 ns        131 BYTES=12.7562G/s
Reduce2DCol/OpSchedule/8/2097152/2    6237899 ns    6237210 ns        109 BYTES=12.1044G/s
Reduce2DCol/OpSchedule/64/262144/2    5258012 ns    5257655 ns        127 BYTES=12.9635G/s
Reduce2DCol/OpSchedule/4096/4096/2    5231686 ns    5228241 ns        132 BYTES=12.839G/s
Reduce2DCol/OpSchedule/8/2097152/3   11088573 ns   11087557 ns         62 BYTES=6.80921G/s
Reduce2DCol/OpSchedule/64/262144/3    5338843 ns    5338326 ns        127 BYTES=12.7676G/s
Reduce2DCol/OpSchedule/4096/4096/3    4311617 ns    4308102 ns        162 BYTES=15.5812G/s
Reduce2DRow/Torch/8/2097152           4642244 ns    4641794 ns        151 BYTES=14.4575G/s
Reduce2DRow/Torch/64/262144           4628311 ns    4628245 ns        151 BYTES=14.4999G/s
Reduce2DRow/Torch/4096/4096           4894012 ns    4893316 ns        143 BYTES=13.7177G/s
Reduce2DRow/Torch/262144/64          10469098 ns   10468027 ns         68 BYTES=6.51101G/s
Reduce2DRow/Hand/262144/64            5554380 ns    5554059 ns        126 BYTES=12.2716G/s
Reduce2DRow/OpSchedule/8/2097152/0   33890363 ns   33888931 ns         21 BYTES=1.98026G/s
Reduce2DRow/OpSchedule/64/262144/0   33901317 ns   33899436 ns         21 BYTES=1.97965G/s
Reduce2DRow/OpSchedule/4096/4096/0   33500358 ns   33498815 ns         21 BYTES=2.00381G/s
Reduce2DRow/OpSchedule/262144/64/0   13132231 ns   13131049 ns         53 BYTES=5.19056G/s
Reduce2DRow/OpSchedule/8/2097152/1    5200423 ns    5200025 ns        134 BYTES=12.9055G/s
Reduce2DRow/OpSchedule/64/262144/1    5204428 ns    5204327 ns        133 BYTES=12.8949G/s
Reduce2DRow/OpSchedule/4096/4096/1    8724355 ns    8723370 ns         80 BYTES=7.69488G/s
Reduce2DRow/OpSchedule/262144/64/1 1811861280 ns 1811352083 ns          1 BYTES=37.6279M/s
Reduce2DRow/OpSchedule/8/2097152/2    9169829 ns    9168946 ns         76 BYTES=7.31915G/s
Reduce2DRow/OpSchedule/64/262144/2    9159901 ns    9158560 ns         76 BYTES=7.32747G/s
Reduce2DRow/OpSchedule/4096/4096/2    9217398 ns    9215557 ns         76 BYTES=7.28391G/s
Reduce2DRow/OpSchedule/262144/64/2   10820450 ns   10818998 ns         66 BYTES=6.29979G/s
Reduce2DRow/OpSchedule/8/2097152/3    5227921 ns    5226544 ns        133 BYTES=12.84G/s
Reduce2DRow/OpSchedule/64/262144/3    5194362 ns    5194082 ns        133 BYTES=12.9203G/s
Reduce2DRow/OpSchedule/4096/4096/3    5196080 ns    5195349 ns        134 BYTES=12.9203G/s
Reduce2DRow/OpSchedule/262144/64/3    5235189 ns    5234728 ns        133 BYTES=13.0202G/s
```
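
One plausible reading of the gap between the `Naive` and `Rfactor` rows in the 1D case is the difference between a single accumulator and several independent partial sums. The sketch below illustrates that idea only. The function names and the lane count `kLanes` are assumptions and are not taken from the benchmark.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Naive reduction: one accumulator, so every add waits on the previous one.
float sumNaive(const std::vector<float>& v) {
  float acc = 0.0f;
  for (float x : v) {
    acc += x;
  }
  return acc;
}

constexpr std::size_t kLanes = 8;  // illustrative width, not from the benchmark

// rfactor-style reduction: kLanes independent partial sums over the bulk of
// the input, combined at the end, plus a scalar loop for the leftover tail.
float sumRfactor(const std::vector<float>& v) {
  std::array<float, kLanes> partial{};
  const std::size_t body = v.size() / kLanes * kLanes;
  for (std::size_t i = 0; i < body; i += kLanes) {
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
      partial[lane] += v[i + lane];
    }
  }
  float acc = 0.0f;
  for (float p : partial) {
    acc += p;
  }
  for (std::size_t i = body; i < v.size(); ++i) {
    acc += v[i];
  }
  return acc;
}
```

The two functions add the elements in a different order, so their floating-point results can differ slightly on general inputs.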

ghstack-source-id: 131753875

Test Plan: these tests

Reviewed By: navahgar

Differential Revision: D29190420

fbshipit-source-id: 86246df82098da4f5493d6c4f34a40016d95a9f0
2021-06-22 23:04:09 -07:00