Commit Graph

13 Commits

Author SHA1 Message Date
Wang, Eikan
429a80dded [NNC] Lowering function generates the output buffer with the specified stride (#76529)
Summary:
Pass stride information to lowering function to generate the output bufer with proper memory layout.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76529

Reviewed By: ZolotukhinM

Differential Revision: D36116712

Pulled By: IvanKobzarev

fbshipit-source-id: d3901f756b3710ecce172d6db3ecb0b7c12fb929
(cherry picked from commit b6cd53c91c01db36ea0e99167dc0ce0ae1d3aa23)
2022-05-04 20:04:22 +00:00
zengk95
1d55518198 Revert "[nnc] Strides to Tensor (#72962)"
This reverts commit 939060925f.

Fixes https://github.com/pytorch/vision/issues/5873

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76332
Approved by: https://github.com/seemethere
2022-04-25 19:50:00 +00:00
Ivan Kobzarev
939060925f [nnc] Strides to Tensor (#72962)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72962

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM, cpuhrsch

Differential Revision: D34589306

Pulled By: IvanKobzarev

fbshipit-source-id: ecee5249760ecc0c8b2edb1842b90218899bc944
(cherry picked from commit 9e310c4c67389da30da89126d838ffe3864aba6f)
2022-04-23 19:35:15 +00:00
Nikita Shulga
81d765ef1f Fix sign-compare violations in cpp tests
Prerequisite change for enabling `-Werror=sign-compare` across PyTorch repo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75080

Approved by: https://github.com/atalman
2022-04-04 23:05:31 +00:00
Nikita Shulga
43313cbde3 Revert D34647822: [tensorexpr] Add support for aten::stack
Test Plan: revert-hammer

Differential Revision:
D34647822 (954c7e2a77)

Original commit changeset: 3b863c71886c

Original Phabricator Diff: D34647822 (954c7e2a77)

fbshipit-source-id: e9ce06c9c8d7caf0fbb2565f0d99035bad685793
(cherry picked from commit b2ff355e9dbaa4e940fb221254223984c3c8a215)
2022-03-31 04:25:43 +00:00
Hui Guo
954c7e2a77 [tensorexpr] Add support for aten::stack (#73801)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73801

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D34647822

Pulled By: huiguoo

fbshipit-source-id: 3b863c71886c7c6616b16f5d3313079714c8b82a
(cherry picked from commit c71778cf6a5724d26b671bf3ee0478add24990e8)
2022-03-30 21:25:15 +00:00
Mikhail Zolotukhin
3a0165da49 [TensorExpr] Port NNC lowerings to the new registry mechanism. (#65551)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65551

Previously we had a big switch on Op kind to decide how to lower a given
JIT operator to NNC. This PR changes this switch to a hash table lookup.

Why? This helps us with at least two things:
1) With this approach we can easily check if we know how to handle a
given node in advance - i.e. we can inspect the entire graph and tell
whether it's possible to compile it or not without actually trying to do
that and dying in the middle. This would allow us to, say, provide
user-friendly error messages in AOT workflow.
2) We can switch to use schema instead of op kind to determine correct
lowering. Unlike op schema, op kind might be ambigous (see e.g. #64963)
and using it instead of schema can lead to bugs.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D31148926

Pulled By: ZolotukhinM

fbshipit-source-id: ac12684e2126c899426ef5e4cc1e3f70fa01f704
2021-09-30 22:56:18 -07:00
Mikhail Zolotukhin
f23f21dafe [TensorExpr] Remove 'Placeholder' class. (#64887)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64887

BufHandle has exactly the same functionality and should be used instead.

Differential Revision:
D30889483
D30889483

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: 365fe8e396731b88920535a3de96bd3301aaa3f3
2021-09-14 00:22:44 -07:00
Mikhail Zolotukhin
f0d274294d [TensorExpr] Nuke KernelArena and KernelScope. (#63587)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63587

Now that there is no classes using KernelArena for memory management we
can remove it.

Differential Revision:
D30429115
D30429115

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: 375f6f9294d27790645eeb7cb5a8e87047a57544
2021-08-24 00:32:16 -07:00
Mikhail Zolotukhin
62d02f2b57 [TensorExpr] Make 'Tensor' a value type. (#63586)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63586

This is another commit in transition from KernelArena memory management.
Tensor is essentially just a pair of <BufPtr, StmtPtr> and we don't need
to dynamically allocate it at all - it's cheap to pass it by value, and
that's what we're switching to in this commit.

After this change nothing uses KernelScope/KernelArena and they can be
safely removed.

Differential Revision:
D30429114
D30429114

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: f90b859cfe863692b7beffbe9bd0e4143df1e819
2021-08-24 00:32:13 -07:00
Bert Maher
10e11dbdcd Reland D29190420: [nnc][tests] Tests and benchmarks for computeSum (#60550)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60550

Original commit changeset: ed655497a981

Whatever gcc version OSS Bazel uses wasn't happy move-constructing the
SimpleIREvaluator, so use a unique_ptr instead.

Test Plan:
CI.  Hope that the gcc version used by OSS Bazel build is
happier with this (it should be), since actually testing it locally is
an intractable pain.

Reviewed By: navahgar

Differential Revision: D29333116

fbshipit-source-id: c3e4b5d8c91eb96a43ae5315a01ca0c0f4d4a99d
2021-06-23 10:50:03 -07:00
Anjali Chourdia
b14f19b6fe Revert D29190420: [nnc][tests] Tests and benchmarks for computeSum
Test Plan: revert-hammer

Differential Revision:
D29190420 (21479ad20c)

Original commit changeset: 86246df82098

fbshipit-source-id: ed655497a981783da4c8f13e2d7fec104e3cb184
2021-06-23 06:59:37 -07:00
Bert Maher
21479ad20c [nnc][tests] Tests and benchmarks for computeSum (#60160)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60160

Adds a few simple tests and benchmarks for the `computeSum` op
(equivalent to `at::sum`).

The benchmarks test 1D reduction and 2D row and column reduction.  Performance
is in the ballpark of aten (14-15 GB/s) on my skylake devserver for all cases,
and occasionally better (e.g. 256k * 64 row reduction goes from 9 GB/s to 13).

Results (on my skylake-avx512, with turbo disabled):
```
------------------------------------------------------------------------------------------
Benchmark                                   Time           CPU Iterations UserCounters...
------------------------------------------------------------------------------------------
Reduce1D/Torch/16777216               4746995 ns    4746722 ns        150 BYTES=14.1379G/s
Reduce1D/Naive/16777216              34063215 ns   34061388 ns         21 BYTES=1.97023G/s
Reduce1D/NativeRfactor/16777216       5057175 ns    5057167 ns        139 BYTES=13.2701G/s
Reduce1D/TeNaive/16777216            33868945 ns   33868851 ns         21 BYTES=1.98143G/s
Reduce1D/TeSplitTail/16777216        33902786 ns   33900436 ns         21 BYTES=1.97959G/s
Reduce1D/TeSplitMask/16777216        33922509 ns   33920604 ns         21 BYTES=1.97841G/s
Reduce1D/TeRfactorV1/16777216         5141150 ns    5141002 ns        135 BYTES=13.0537G/s
Reduce1D/Op/16777216                  5140390 ns    5140091 ns        135 BYTES=13.056G/s
Reduce2DCol/Torch/8/2097152          12824403 ns   12823563 ns         55 BYTES=5.8874G/s
Reduce2DCol/Torch/64/262144           8306873 ns    8306743 ns         83 BYTES=8.20507G/s
Reduce2DCol/Torch/4096/4096           7992364 ns    7992239 ns         87 BYTES=8.3988G/s
Reduce2DCol/OpSchedule/8/2097152/0    4866144 ns    4865766 ns        138 BYTES=15.5161G/s
Reduce2DCol/OpSchedule/64/262144/0   36668978 ns   36666415 ns         19 BYTES=1.85885G/s
Reduce2DCol/OpSchedule/4096/4096/0  155862459 ns  155801266 ns          4 BYTES=430.839M/s
Reduce2DCol/OpSchedule/8/2097152/1    8067683 ns    8061117 ns         85 BYTES=9.36563G/s
Reduce2DCol/OpSchedule/64/262144/1    7496686 ns    7496562 ns         93 BYTES=9.09183G/s
Reduce2DCol/OpSchedule/4096/4096/1    5262821 ns    5262186 ns        131 BYTES=12.7562G/s
Reduce2DCol/OpSchedule/8/2097152/2    6237899 ns    6237210 ns        109 BYTES=12.1044G/s
Reduce2DCol/OpSchedule/64/262144/2    5258012 ns    5257655 ns        127 BYTES=12.9635G/s
Reduce2DCol/OpSchedule/4096/4096/2    5231686 ns    5228241 ns        132 BYTES=12.839G/s
Reduce2DCol/OpSchedule/8/2097152/3   11088573 ns   11087557 ns         62 BYTES=6.80921G/s
Reduce2DCol/OpSchedule/64/262144/3    5338843 ns    5338326 ns        127 BYTES=12.7676G/s
Reduce2DCol/OpSchedule/4096/4096/3    4311617 ns    4308102 ns        162 BYTES=15.5812G/s
Reduce2DRow/Torch/8/2097152           4642244 ns    4641794 ns        151 BYTES=14.4575G/s
Reduce2DRow/Torch/64/262144           4628311 ns    4628245 ns        151 BYTES=14.4999G/s
Reduce2DRow/Torch/4096/4096           4894012 ns    4893316 ns        143 BYTES=13.7177G/s
Reduce2DRow/Torch/262144/64          10469098 ns   10468027 ns         68 BYTES=6.51101G/s
Reduce2DRow/Hand/262144/64            5554380 ns    5554059 ns        126 BYTES=12.2716G/s
Reduce2DRow/OpSchedule/8/2097152/0   33890363 ns   33888931 ns         21 BYTES=1.98026G/s
Reduce2DRow/OpSchedule/64/262144/0   33901317 ns   33899436 ns         21 BYTES=1.97965G/s
Reduce2DRow/OpSchedule/4096/4096/0   33500358 ns   33498815 ns         21 BYTES=2.00381G/s
Reduce2DRow/OpSchedule/262144/64/0   13132231 ns   13131049 ns         53 BYTES=5.19056G/s
Reduce2DRow/OpSchedule/8/2097152/1    5200423 ns    5200025 ns        134 BYTES=12.9055G/s
Reduce2DRow/OpSchedule/64/262144/1    5204428 ns    5204327 ns        133 BYTES=12.8949G/s
Reduce2DRow/OpSchedule/4096/4096/1    8724355 ns    8723370 ns         80 BYTES=7.69488G/s
Reduce2DRow/OpSchedule/262144/64/1 1811861280 ns 1811352083 ns          1 BYTES=37.6279M/s
Reduce2DRow/OpSchedule/8/2097152/2    9169829 ns    9168946 ns         76 BYTES=7.31915G/s
Reduce2DRow/OpSchedule/64/262144/2    9159901 ns    9158560 ns         76 BYTES=7.32747G/s
Reduce2DRow/OpSchedule/4096/4096/2    9217398 ns    9215557 ns         76 BYTES=7.28391G/s
Reduce2DRow/OpSchedule/262144/64/2   10820450 ns   10818998 ns         66 BYTES=6.29979G/s
Reduce2DRow/OpSchedule/8/2097152/3    5227921 ns    5226544 ns        133 BYTES=12.84G/s
Reduce2DRow/OpSchedule/64/262144/3    5194362 ns    5194082 ns        133 BYTES=12.9203G/s
Reduce2DRow/OpSchedule/4096/4096/3    5196080 ns    5195349 ns        134 BYTES=12.9203G/s
Reduce2DRow/OpSchedule/262144/64/3    5235189 ns    5234728 ns        133 BYTES=13.0202G/s
```

ghstack-source-id: 131753875

Test Plan: these tests

Reviewed By: navahgar

Differential Revision: D29190420

fbshipit-source-id: 86246df82098da4f5493d6c4f34a40016d95a9f0
2021-06-22 23:04:09 -07:00