Commit Graph

234 Commits

Author SHA1 Message Date
jjsjann123
c9c402eae9 [nvfuser_upstream_push] Reland: nvfuser code base bump 060822 (#79406)
Landing reverted PR #79147.

Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Bug fixes and minor refactor

Squashed commits to work around (WAR) the GitHub API.
Commits that are actually in this PR from the devel branch:

```
4c60e7dff22a494632370e5df55c011007340d06 Add examples infrastructure for using nvFuser in a standalone program (#1725)
02a05d98334ffa580d73ccb28fdb8c577ad296fe Fix issue #1751 (#1753)
8a69aa320bd7629e1709fe5ceb7104d2c88ec84c Refactor NvFuser transpose API to match eager mode behavior (#1746)
ffdf6b7709048170d768217fcd7083fc8387f932 Remove BroadcastWithoutStride. (#1738)
02bab16035e70734450c02124f5cdaa95cf5749d Fix flipping of a boolean flag (#1745)
465d66890c8242e811224359cbdb1c2915490741 cleanup (#1744)
26d354e68720bc7dd2d3b1338ac01b707a230b6a fixing noncontig broadcast (#1742)
856b6b2f9073662dd98ca22ba6c3540e20eb1cdd Add IterDomainBuilder (#1736)
1fd974f912cd4c1e21cbd16e2abb23598d66a02f fixing warning for gcc7 (#1732)
de2740a43a869f8272c2648e091d7b8235097db9 disabling complex in python tests for #1730 (#1733)
fbbbe0a2e7c7a63e0e2719b8bfccb759b714221a fixing MSVC build (#1728)
b5feee5e2b28be688dbddc766f3c0220389c8175 Fix the fused reduction runtime kernel (#1729)
5247682dff5980bb66edf8d3aac25dea2ef2ced5 Re-entrant GroupedGridReduction (#1727)
```

RUN_TORCHBENCH: nvfuser
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79406
Approved by: https://github.com/davidberard98
2022-06-16 17:52:21 +00:00
Michael Andreas Dagitses
acd072967a canonicalize includes of form <aten/src/ATen/...>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78033

This was never intended to be supported.

@override-unit-failures
(Note: this ignores all push blocking failures!)

Differential Revision: [D36567054](https://our.internmc.facebook.com/intern/diff/D36567054/)

Approved by: https://github.com/kit1980
2022-06-16 17:46:45 +00:00
Ivan Yashchuk
e10b762537 Enable torch._refs.var for nvFuser executor (#79517)
This PR adds a variance function with a correction argument to nvFuser.

Now it's possible to run
```py
import torch
import torch._refs
from torch._prims.executor import make_traced

def foo1(a):
    return torch._refs.var(a, keepdim=False, unbiased=False)

def foo2(a):
    return torch._refs.var(a, keepdim=False, correction=2)

a = torch.randn(3, 3, device='cuda')
make_traced(foo1)(a, executor="nvfuser")
make_traced(foo2)(a, executor="nvfuser")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79517
Approved by: https://github.com/mruberry, https://github.com/jjsjann123
2022-06-14 23:08:53 +00:00
Ivan Yashchuk
8895862744 Enable torch._refs.mean for nvFuser executor (#79444)
This PR fixes a bug with `broadcast_in_dim` that prevented reduction ops from being used before `broadcast_in_dim`.

With this PR it's possible to run
```py
import torch
import torch._refs
from torch._prims.executor import make_traced

def foo(a):
    return torch._refs.mean(a, keepdim=False)

a = torch.randn(3, 3, device='cuda')
make_traced(foo)(a, executor="nvfuser")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79444
Approved by: https://github.com/mruberry, https://github.com/jjsjann123
2022-06-14 19:42:07 +00:00
Michael Andreas Dagitses
52a5266aab turn on -Werror=unused-but-set-variable
Summary:
Also fix the one violation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79305

Approved by: https://github.com/malfet
2022-06-13 20:23:50 +00:00
Michael Andreas Dagitses
606b234336 turn on -Werror=unused-function in our Bazel CPU build
Summary:
We also fix any existing issues. Note that we only do this for the CPU
build because nvcc is considered a C++ toolchain but it does not have
the same flag support. Adding flags to the GPU build will cause nvcc
errors.

Test Plan: Built locally, rely on CI to confirm.

Reviewers: malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79154

Approved by: https://github.com/seemethere, https://github.com/osalpekar, https://github.com/albanD
2022-06-10 22:11:54 +00:00
PyTorch MergeBot
d28e9e145b Revert "[nvfuser_upstream_push] nvfuser code base bump 060822 (#79147)"
This reverts commit 49c41b87a2.

Reverted https://github.com/pytorch/pytorch/pull/79147 on behalf of https://github.com/janeyx99 due to it breaking 11.3 builds on trunk (49c41b87a2)
2022-06-10 20:55:10 +00:00
PyTorch MergeBot
bcd7a20953 Revert "turn on -Werror=unused-function in our Bazel CPU build"
This reverts commit 67d313a032.

Reverted https://github.com/pytorch/pytorch/pull/79154 on behalf of https://github.com/malfet due to it breaking the Bazel build: 67d313a032
2022-06-10 20:43:03 +00:00
jjsjann123
49c41b87a2 [nvfuser_upstream_push] nvfuser code base bump 060822 (#79147)
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Bug fixes and minor refactor

Squashed commits to work around (WAR) the GitHub API.
Commits that are actually in this PR from the devel branch:

```
4c60e7dff22a494632370e5df55c011007340d06 Add examples infrastructure for using nvFuser in a standalone program (#1725)
02a05d98334ffa580d73ccb28fdb8c577ad296fe Fix issue #1751 (#1753)
8a69aa320bd7629e1709fe5ceb7104d2c88ec84c Refactor NvFuser transpose API to match eager mode behavior (#1746)
ffdf6b7709048170d768217fcd7083fc8387f932 Remove BroadcastWithoutStride. (#1738)
02bab16035e70734450c02124f5cdaa95cf5749d Fix flipping of a boolean flag (#1745)
465d66890c8242e811224359cbdb1c2915490741 cleanup (#1744)
26d354e68720bc7dd2d3b1338ac01b707a230b6a fixing noncontig broadcast (#1742)
856b6b2f9073662dd98ca22ba6c3540e20eb1cdd Add IterDomainBuilder (#1736)
1fd974f912cd4c1e21cbd16e2abb23598d66a02f fixing warning for gcc7 (#1732)
de2740a43a869f8272c2648e091d7b8235097db9 disabling complex in python tests for #1730 (#1733)
fbbbe0a2e7c7a63e0e2719b8bfccb759b714221a fixing MSVC build (#1728)
b5feee5e2b28be688dbddc766f3c0220389c8175 Fix the fused reduction runtime kernel (#1729)
5247682dff5980bb66edf8d3aac25dea2ef2ced5 Re-entrant GroupedGridReduction (#1727)
```

RUN_TORCHBENCH: nvfuser
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79147
Approved by: https://github.com/davidberard98
2022-06-10 19:37:42 +00:00
Michael Andreas Dagitses
67d313a032 turn on -Werror=unused-function in our Bazel CPU build
Summary:
We also fix any existing issues. Note that we only do this for the CPU
build because nvcc is considered a C++ toolchain but it does not have
the same flag support. Adding flags to the GPU build will cause nvcc
errors.

Test Plan: Built locally, rely on CI to confirm.

Reviewers: malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79154

Approved by: https://github.com/seemethere, https://github.com/osalpekar, https://github.com/albanD
2022-06-10 18:30:08 +00:00
Michael Andreas Dagitses
f96d96a7fc turn on -Werror=type-limits in our Bazel CPU build
Summary:
We also fix any existing issues.

Test Plan: Built locally, rely on CI to confirm.

Reviewers: malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79139

Approved by: https://github.com/seemethere, https://github.com/osalpekar, https://github.com/albanD
2022-06-10 10:04:08 +00:00
jjsjann123
462874f418 adding a quick link to nvfuser README.md in jit doc for 1.12 release (#78160)
Adding a link to the nvFuser README.md on the GitHub 1.12 release branch in the JIT docs.

Note that this PR is intended to be cherry-picked into the 1.12 release; we'll have a follow-up PR to update the link once this PR is merged.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78160
Approved by: https://github.com/davidberard98
2022-06-09 17:28:17 +00:00
jjsjann123
9e52ad28c9 [nvfuser_upstream_push] nvfuser code base bump 052422 (#78244)
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

A few bigger updates:
1. Initial support of cp.async and cp.async.wait: https://github.com/csarofeen/pytorch/pull/1619
2. Emulate ampere's mma 16816 with Turing's mma 1688, for a unified interface: https://github.com/csarofeen/pytorch/pull/1643
3. Extending the infrastructure to support mma operators on Turing and Ampere architectures: https://github.com/csarofeen/pytorch/pull/1440

Commits that are actually in this PR from the csarofeen branch:
```
* dd2325294e236c5082c642819a1103bcfe4561a3 (csarofeen/devel) Fusion Segmenter: Unify single kernel and multi-kernel runtime path (#1710)
* b3d1c3f446355a2d276bac8272e7aa8b5bb6b1f0 Fix missing cooperative launch (#1726)
* dc670a226cbe52be46cecef47001f38bf9a09433 Async gmem copy support on sm80+ (#1619)
* 5e6a8dab5a71aefe0548bbfa15d1a93c556d23fe Add turing mma support and test (#1643)
* d6d6b7d3f10dd91dafa4cdbd5e460bbb38173af4 Fix rFactor when there are indirect root domain(s), and refactor (#1723)
* 7093e39150c6d80e0f9f767d56654714a2e8a927 Mma op integration on ampere (#1440)
* fade8da55e60a118c5595378896d34b862b2fcc3 patch python test for bfloat16 (#1724)
* 8fbd0b18743a72ac10478857c3d2351204375685 Fine-grained kernel profiling (#1720)
* 77c1b4fa633f9e631d267923f4537336fa328939 Adding dry run mode to skip arch dependent checks (#1702)
* 151d95b97bebefc94199bb4a53423ede32b55451 More precise concretization analysis (#1719)
* f4d3630ed54d7069dd377a64be1f91013b285b66 Enable complex python tests (#1667)
* 4ceeee509774cc2ce6c834a4dc1e313f71d94503 Minor bugfix in transform_rfactor.cpp (#1715)
* 3675c70faf218e86d2c78dbd3874b175a3b0a203 Separate root domain and rfactor domain in TransformPrinter (#1716)
* f68b830d5def65dadfe29d4edf52fc703369c84a Fix scheduling with polymorphic broadcast (#1714)
* 4ab5ef7ae2cfd8fffad1e1d882ae7c50631211dc updating_ci_machine (#1718)
* 56585c58b1ff338704cafb0cd6be2b3d536bed5a Merge pull request #1711 from csarofeen/upstream_master_bump_0517
* 174d453d3be0c11a5acb0fff3b3f36e19cfdaf81 Allow using nvFuser on CUDA extension (#1701)
* 18bee67495454b9a79625799776e746bd5e81c4c Validate LOOP concrete IDs have complete IterDomains (#1676)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78244
Approved by: https://github.com/csarofeen, https://github.com/malfet
2022-06-07 17:30:51 -07:00
David Berard
38bc10ae25 retry - enable NVFuser by default
Enable NVFuser in OSS.

Retry of #77213, which was reverted because it broke torchvision tests.

The fix in #77471 has been verified by jjsjann123.
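
As a quick sanity check once this lands, a minimal sketch using PyTorch's internal JIT toggles (`torch._C._jit_nvfuser_enabled` / `torch._C._jit_set_nvfuser_enabled` are internal APIs and an assumption here, not a stable public interface):
```py
import torch

# Internal toggles (assumed available in this era of the codebase; not a public API).
print(torch._C._jit_nvfuser_enabled())   # expected: True once NVFuser is on by default
torch._C._jit_set_nvfuser_enabled(False) # opt out at runtime
torch._C._jit_set_nvfuser_enabled(True)  # opt back in
```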

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77579

Approved by: https://github.com/eellison, https://github.com/malfet, https://github.com/atalman, https://github.com/seemethere
2022-05-20 14:21:18 +00:00
jjsjann123
6583c0384b fixing trivial reduction & broadcast scheduling (#77884)
cherry-picked fixes from https://github.com/csarofeen/pytorch/pull/1714

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77884
Approved by: https://github.com/csarofeen, https://github.com/davidberard98
2022-05-20 02:00:42 +00:00
jjsjann123
17fbb85734 [nvfuser] prevent spamming warning message (#77777)
updating TORCH_WARN to TORCH_WARN_ONCE to prevent spamming the log
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77777
Approved by: https://github.com/davidberard98
2022-05-19 20:43:14 +00:00
jjsjann123
a2802ad0b9 Upstream master bump 0513 (#77471)
Updating nvfuser code base.

This should fix the indexing issue observed in https://github.com/pytorch/vision/issues/6015.

Running tests locally as well. Will update the description here at a later point

@bypass-github-export-checks
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77471
Approved by: https://github.com/seemethere, https://github.com/eellison
2022-05-18 11:48:50 -07:00
Xiang Gao
4eec865f58 [nvFuser] Improving bitwise ops support (#77158)
- Some renaming to better match the PyTorch eager-mode API (eager equivalents sketched after this list):
  - `lshift` -> `bitwise_left_shift`
  - `rshift` -> `bitwise_right_shift`
  - `andOp` -> `bitwise_and`
  - `orOp` -> `bitwise_or`
  - `xorOp` -> `bitwise_xor`
  - `notOp` -> `bitwise_not`
- Fix type inferences and type checking of these ops
- Add `bitwise_*` to parser and python frontend
- Improve test coverage
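
For reference, a small illustration (not from this PR's test suite) of the eager-mode ops whose names the renamed nvFuser ops now match:
```py
import torch

a = torch.tensor([0b1100, 0b1010], dtype=torch.int32, device="cuda")
b = torch.tensor([2, 1], dtype=torch.int32, device="cuda")

torch.bitwise_left_shift(a, b)   # was `lshift`
torch.bitwise_right_shift(a, b)  # was `rshift`
torch.bitwise_and(a, b)          # was `andOp`
torch.bitwise_or(a, b)           # was `orOp`
torch.bitwise_xor(a, b)          # was `xorOp`
torch.bitwise_not(a)             # was `notOp`
```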
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77158
Approved by: https://github.com/kevinstephano, https://github.com/jjsjann123
2022-05-18 17:21:34 +00:00
PyTorch MergeBot
2a905aef09 Revert "enable NVFuser by default"
This reverts commit 24f7dcd816.

Reverted https://github.com/pytorch/pytorch/pull/77213 on behalf of https://github.com/davidberard98
2022-05-16 18:23:39 +00:00
David Berard
e175065c4e [NVFuser] fix force-disable flag
This prevents the std::call_once() check from erroring if:

* PYTORCH_JIT_USE_NNC_NOT_NVFUSER=1
* PYTORCH_JIT_ENABLE_NVFUSER=1
* user has not set a flag.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77395

Approved by: https://github.com/eellison
2022-05-14 19:31:32 +00:00
David Berard
36f7a6cc4a [NVFuser] don't decompose conv2d if we don't have shape info
Sometimes the bias won't have shape info (e.g., in the added test, the conv is run twice in a loop, each time with different shapes). In that case we should just skip the decomposition instead of erroring out, as sketched below.
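
A hypothetical repro of that situation (the module and shapes are placeholders, not the PR's actual test):
```py
import torch

# The same scripted conv sees two different spatial shapes, so profiled
# shape information (including for the bias) is incomplete on later runs.
model = torch.jit.script(
    torch.nn.Sequential(torch.nn.Conv2d(4, 8, kernel_size=3, padding=1),
                        torch.nn.ReLU()).cuda()
)

for hw in (16, 32):
    x = torch.randn(2, 4, hw, hw, device="cuda")
    for _ in range(3):  # let the profiling executor record (conflicting) shapes
        model(x)
# With this change the conv is left undecomposed rather than raising an error.
```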

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77440

Approved by: https://github.com/jjsjann123
2022-05-13 22:39:43 +00:00
David Berard
24f7dcd816 enable NVFuser by default
Enable NVFuser in OSS.

Tests are passing, and we've also run tests in [torchvision](https://github.com/pytorch/vision/pull/5959) and [torchaudio](https://github.com/pytorch/audio/pull/2372)

Retry of #76006, because that PR had GH1/ghstack issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77213

Approved by: https://github.com/eellison
2022-05-11 19:59:31 +00:00
David Berard
6fd14ba9db [NVFuser] Add environment variable to force disable NVFuser
PYTORCH_JIT_USE_NNC_NOT_NVFUSER=1 will force NVFuser to be disabled,
regardless of other environment variables or values set at runtime. It
will be used for guarding certain parts of the internal rollout.
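
A usage sketch of this kill switch (assumption: the variable must be visible to the process before the JIT initializes, e.g. set in the launching environment or before importing torch):
```py
import os

# Assumption: set before `import torch` so the JIT sees it during initialization.
os.environ["PYTORCH_JIT_USE_NNC_NOT_NVFUSER"] = "1"

import torch  # CUDA fusions now fall back to NNC regardless of other settings
```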

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77168

Approved by: https://github.com/jjsjann123, https://github.com/eellison
2022-05-11 16:12:19 +00:00
David Berard
3c2e0dc657 [NVFuser] assert that vectors are the same size in translateSingleWelford
Before, sometimes out_root.size() < in_root.size(), which would result
in a segfault while accessing out_root[i]. If we instead error out
here, an exception is thrown and we run the fallback rather than
crashing outright.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77010

Approved by: https://github.com/eellison, https://github.com/jjsjann123
2022-05-11 15:48:44 +00:00
Kevin Stephano
752d496c91 Fix broadcast_in_dim support in NVFuser Frontend (#76790)
This PR primarily addresses augmenting the frontend to properly support `broadcast_in_dim`. This required making a new version of `define_tensor()` that takes in the `sizes` and `strides` of input tensors in order to properly determine broadcasts.

This PR also has a fix for `python_example.py`, which broke when a new argument was added to reductions to allow the user to specify an output data type.

`define_tensor()` Interface Example:

```
fusion2 = Fusion()

input1 = torch.ones(1, 1, 4, device='cuda')
input2 = torch.ones(2, 3, 4, device='cuda')

with FusionDefinition(fusion2) as fd :
    t0 = fd.define_tensor(sizes=input1.size(), strides=input1.stride())
    t1 = fd.define_tensor(sizes=input2.size(), strides=input2.stride())

    fd.add_input(t0)
    fd.add_input(t1)

    t0_b = fd.Ops.broadcast_in_dim(t0, [2, 3, 4], [0, 1, 2])
    print("Broadcast TensorView", t0_b)
    t2 = fd.Ops.add(t0_b, t1)

    fd.add_output(t2)
```
Print statement of defined broadcast tensor:

```
Broadcast TensorView T2_l[ sbS6{1}, sbS7{1}, iS8{i2} ] DataType: float Contiguity: ttt
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76790
Approved by: https://github.com/mruberry, https://github.com/jjsjann123
2022-05-10 18:13:22 +00:00
jjsjann123
489818e7c6 disabling squeeze/unsqueeze; disabling BN/BN_BWD for perf concern (#77017)
Fixes #76883 (via disabling squeeze/unsqueeze)

Disabling BN fwd/bwd due to our perf concerns. I need to update our Python tests; awaiting the build to finish so I can update them accordingly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77017
Approved by: https://github.com/csarofeen, https://github.com/davidberard98
2022-05-09 22:57:20 +00:00
jjsjann123
b4f3f9c651 Torchvision patch (#77001)
Fixes #76791

Note that this is a hot patch so we can run upstream tests. I'm doing the proper fix in our local repo and will update the upstream code once those changes are reviewed and merged.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77001
Approved by: https://github.com/davidberard98
2022-05-09 16:53:23 +00:00
Xiang Gao
104f0bf09e [Reland] Add atan2 isfinite isinf isnan isneginf isposinf isreal to nvfuser and its frontend (#76769)
This reverts commit 4bb5944133.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76769
Approved by: https://github.com/csarofeen, https://github.com/mruberry
2022-05-07 21:26:00 +00:00
David Berard
6c615a21a0 [NVFuser] prep for on-by-default
1. fix tests that expected nvfuser off-by-default behavior
2. skip nvfuser if getExecutorMode() == false

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76937

Approved by: https://github.com/eellison
2022-05-06 18:18:53 +00:00
sanchitintel
4ee29d6033 [Reland take-2] Add JIT graph fuser for oneDNN Graph API (v0.5)
Re-landing #68111/#74596

## Description
v0.5 PR of this [RFC](https://github.com/pytorch/pytorch/issues/49444).

On the basis of #50256, the below improvements are included:

 * The [v0.5 release branch](https://github.com/oneapi-src/oneDNN/releases/tag/graph-v0.5) of the oneDNN Graph API is used
 * The fuser now works with the profiling graph executor. We have inserted type check nodes to guard the profiled tensor properties.

 ### User API:
The optimization pass is disabled by default. Users could enable it by:

```
 torch.jit.enable_onednn_fusion(True)
```
`torch.jit.freeze` should be used after tracing (recommended) or scripting a model.
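
A minimal usage sketch following those API notes (the model and input shape are placeholder assumptions, not part of this PR):
```py
import torch
import torchvision  # assumption: any inference model works; resnet50 is just an example

torch.jit.enable_onednn_fusion(True)          # the pass is off by default

model = torchvision.models.resnet50().eval()
example = torch.randn(32, 3, 224, 224)

# Freeze after tracing, as recommended above.
frozen = torch.jit.freeze(torch.jit.trace(model, example))

with torch.no_grad():
    out = frozen(example)  # oneDNN Graph partitions execute inside the frozen graph
```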

 ### Performance:
 [pytorch/benchmark](https://github.com/pytorch/benchmark) tool is used to compare the performance:

 * SkyLake 8180 (1 socket of 28 cores):
   ![image](https://user-images.githubusercontent.com/65992142/151162305-05e44425-a24e-4d5e-94e1-743b40b87a8c.png)
* SkyLake 8180 (single thread):
   ![image](https://user-images.githubusercontent.com/65992142/151162528-69f90b79-d08d-46b8-8775-d80a6ccbce8a.png)
   * By mapping hardswish to oneDNN Graph, it’s 8% faster than PyTorch JIT (NNC + OFI)
   ** We expect performance gain after mapping transpose, contiguous & view to oneDNN graph ops

 ### Directory structure of the integration code
 Fuser-related code is placed under:

 ```
 torch/csrc/jit/codegen/onednn/
 ```

 Optimization pass registration is done in:

 ```
 torch/csrc/jit/passes/onednn_graph_fuser.h
 ```

 CMake for the integration code is in:

 ```
 caffe2/CMakeLists.txt
 cmake/public/mkldnn.cmake
 cmake/Modules/FindMKLDNN.cmake
 ```

 ## Limitations
 * In this PR, we only support the PyTorch-oneDNN Graph integration on the Linux platform. Support on Windows and macOS will be enabled as a next step.
 * We have only optimized the inference use-case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76622
Approved by: https://github.com/eellison
2022-05-05 16:57:03 +00:00
CodemodService FBSourceClangFormatLinterBot
fa3e0d5f4c [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT (#76802)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/76802

Reviewed By: ivanmurashko

Differential Revision: D36124688

fbshipit-source-id: d6921d373500ec56bf20db073030df781f635f56
(cherry picked from commit 8047422f3c42c095065ab1622c898a8c742de2f1)
2022-05-04 09:52:23 +00:00
David Berard
e33f3229a2 [NVFuser] environment variable to turn nvfuser on or off (#76485)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76485

Adds an environment variable `PYTORCH_JIT_ENABLE_NVFUSER` for
controlling whether or not nvfuser is enabled. This required changing
the PassManager behavior to support the case where nvfuser gets enabled
by default when PYTORCH_JIT_ENABLE_NVFUSER=1.
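
A usage sketch for the new variable (assumption: it needs to be set before torch initializes the JIT, e.g. in the launching environment or prior to importing torch):
```py
import os

os.environ["PYTORCH_JIT_ENABLE_NVFUSER"] = "1"  # assumption: set before importing torch

import torch

@torch.jit.script
def fused(x, y):
    return (x * y).sigmoid()
# On CUDA inputs, the profiling executor can now hand fusible subgraphs to nvfuser.
```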

Previously the solution for turning nvfuser on or off was to use the
PassManager to register or un-register the pass. That works fine if the
pass starts off _disabled_, but causes issues once we try to enable the
pass by default.

The main issue with enabling by default is with the validation check to
see whether NVFuser can be turned on. The check relies on
at::globalContext().hasCUDA(), which requires CUDAHooks to be registered
before hasCUDA() wil work correctly. At static initialization time it's
difficult to ensure that CUDAHooks will be registered _before_ we
attempt to register the nvfuser pass. In OSS it worked fine, but in
internal builds it would fail on ROCm builds.

To fix this, we switch the control of NVFuser enablement to a check in
the pass. i.e. previously, we enabled/disabled nvfuser by registering or
de-registering the pass in pass manager; now, the pass is always
registered in pass manager, and enablement is done by a check within the
nvfuser pass.

Remaining TODO: Connect this with NNC so that in cases where NNC is
available but not NVFuser (i.e. on AMD gpus), NNC can be turned on
automatically.

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D35982618

Pulled By: davidberard98

fbshipit-source-id: fd5b76bc0b8c8716c96fdc04bebfb15026a7ef60
(cherry picked from commit ff14603ff5ac8d9b6c749c4f111f4a8be8023b7f)
2022-05-03 23:05:40 +00:00
PyTorch MergeBot
4bb5944133 Revert "Add atan2 isfinite isinf isnan isneginf isposinf isreal to nvfuser and its frontend"
This reverts commit 92d10decc4.

Reverted https://github.com/pytorch/pytorch/pull/76598 on behalf of https://github.com/malfet
2022-05-03 19:53:28 +00:00
Xiang Gao
92d10decc4 Add atan2 isfinite isinf isnan isneginf isposinf isreal to nvfuser and its frontend
Fixes: https://github.com/csarofeen/pytorch/issues/1632
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76598
Approved by: https://github.com/csarofeen, https://github.com/mruberry
2022-05-03 16:31:40 +00:00
jjsjann123
d23619b030 Permutation extended
Extended permutation support in integration (see more details in https://github.com/csarofeen/pytorch/issues/1601). This update allows us to better support permutation propagation on tensors, specifically for binary ops with inputs of different ranks. Our goal is to avoid permuting tensors unless absolutely necessary. We try to preserve the permutation propagation rule in ATen, with some known limitations at this time.

The idea in this implementation is the same as in our existing code, which is to permute input/output tensors outside of codegen. For a simplified binary op scenario, `output = binaryOp(input0, input1)`:

1. In a simple case where `input0` and `input1` come with the same rank & permutation order, our output would preserve the same permutation;
2. For cases where `input0` and `input1` come with different ranks but with **compatible** permutation, the tensor with the higher rank dictates the permutation of the output;
3. For cases where `input0` and `input1` come with different ranks but with **incompatible** permutation, permutation propagation fails and the output tensor will be contiguous.

By **compatible** permutation, we mean that we can permute the higher-rank tensor to contiguous format and then apply a second permutation to the lower-rank tensor to match their axes. This check is implemented in `MemoryFormat::broadcastToRank(int lower_rank)`.

Some concrete examples (note that we comply with eager-mode propagation in cases 1-3, but diverge in behavior for cases 4 and 5):
1. different rank & same permutation
```
    t0 = torch.randn(b, h, w, c).cuda().permute([0, 3, 1, 2])  # stride (hwc, 1, wc, c)
    t1 = torch.randn(h, w, c).cuda().permute([2, 0, 1])  # stride (1, wc, c)
    out = scripted_add(t0, t1)  # stride (hwc, 1, wc, c) preserving memory format of t0
```
2. different rank & compatible permutation
```
    t0 = torch.randn(b, h, w, c).cuda().permute([0, 3, 1, 2])  # stride (hwc, 1, wc, c)
    t1 = torch.randn(c, h, w).cuda()  # stride (hw, w, 1)
    out = scripted_add(t0, t1)  # stride (hwc, 1, wc, c) preserving memory format of t0
```
3. different rank & compatible permutation with broadcasting
```
    t0 = torch.randn(b, h, w, c).cuda().permute([0, 3, 1, 2])  # stride (hwc, 1, wc, c)
    t1 = torch.randn(c).cuda().unsqueeze(-1).unsqueeze(-1)  # stride (1, 1, 1)
    out = scripted_add(t0, t1)  # stride (hwc, 1, wc, c) preserving memory format of t0
```
4. different rank & incompatible permutation
```
    t0 = torch.randn(b, h, w, c).cuda().permute([0, 3, 1, 2])  # stride (hwc, 1, wc, c)
    t1 = torch.randn(h, w).cuda()  # stride (w, 1)
    jit_out = scripted_add(t0, t1)  # stride (hwc, 1, wc, c)  # stride (hwc, wc, c, 1)  # nvfuser outputs contiguous tensor
    eager_out = eager_add(t0, t1)  # stride (hwc, 1, wc, c)  # stride (hwc, 1, wc, c)  # TI preserves memory format of LHS operand
```
5. different rank & incompatible permutation
```
    t0 = torch.randn(c, h, w).cuda()  # stride (hw, w, 1)
    t1 = torch.randn(b, h, w, c).cuda().permute([0, 3, 1, 2])  # stride (hwc, 1, wc, c)
    jit_out = scripted_add(t0, t1)  # stride (hwc, 1, wc, c)  # stride (hwc, 1, wc, c)  # nvfuser preserves memory format of highest rank tensors
    eager_out = eager_add(t0, t1)  # stride (hwc, 1, wc, c)  # stride (hwc, hw, w, 1)  # TensorIterator preserves memory format of LHS operand
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76563
Approved by: https://github.com/kevinstephano, https://github.com/ngimel
2022-05-02 22:09:56 +00:00
CodemodService FBSourceClangFormatLinterBot
461cc0a960 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: adamjernst

Differential Revision: D36061557

fbshipit-source-id: 61ea6017a3550dbbb13b7b288a1d537253c25fe2
(cherry picked from commit 2ea13c57e96763318859159b6441c02f7c72caf2)
2022-05-02 22:07:42 +00:00
Peter Bell
2e480fc2db Cleanup ATen-core forward declarations
I noticed that when `SymInt` was introduced, `jit_type_base.h` was
added as an include to the `Operator.h` template, which is supposed to
be kept extremely clean and only use forward declarations. Also,
forward declarations for `OptionalArrayRef` were missing.

So, I've refactored the forward declarations into
`ATen/core/ATen_fwd.h` and cleaned up some of the `c10`
headers that were masking these missing declarations. I've also
re-generated the pre-compiled header so `SymInt` is included.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76576
Approved by: https://github.com/albanD
2022-05-02 14:50:48 +00:00
Ryan Spring
33be4c94c0 [Nvfuser] Add cast support between double and half types
Fixes `RuntimeError: Illegal Cast value from  DataType: __half to DataType: double`

Example:
```
with FusionDefinition(fusion) as fd :
    t0 = fd.define_tensor(2, DataType.Half)
    t1 = fd.define_tensor(2, DataType.Double)

    fd.add_input(t0)
    fd.add_input(t1)

    t2 = fd.Ops.add(t0, t1)
    t5 = fd.Ops.relu(t2)

    fd.add_output(t5)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76605
Approved by: https://github.com/csarofeen
2022-05-01 19:04:48 +00:00
jjsjann123
100e72f54b Nvfuser faster fallback
Follow up to #76505

Addressing https://github.com/pytorch/pytorch/pull/76505#discussion_r861260818 to further improve fallback perf during compilation failure. This allows us to reuse the fallback instead of re-constructing new code every time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76604
Approved by: https://github.com/davidberard98
2022-04-30 05:59:18 +00:00
PyTorch MergeBot
3dcd67a1b3 Revert "[Re-landing 68111] Add JIT graph fuser for oneDNN Graph API (Preview4.1)"
This reverts commit 8b11d81058.

Reverted https://github.com/pytorch/pytorch/pull/74596 on behalf of https://github.com/janeyx99
2022-04-29 15:40:17 +00:00
chunyuan
8b11d81058 [Re-landing 68111] Add JIT graph fuser for oneDNN Graph API (Preview4.1)
Re-landing https://github.com/pytorch/pytorch/pull/68111

## Description
Preview4 PR of this [RFC](https://github.com/pytorch/pytorch/issues/49444).

On the basis of https://github.com/pytorch/pytorch/pull/50256, the below improvements are included:

- The [preview4 release branch](https://github.com/oneapi-src/oneDNN/releases/tag/graph-v0.4.1) of the oneDNN Graph API is used
- The fuser now works with the profiling graph executor. We have inserted type check nodes to guard the profiled tensor properties.

### User API:
The optimization pass is disabled by default. Users could enable it by:
```
torch.jit.enable_onednn_fusion(True)
```

### Performance:
[pytorch/benchmark](https://github.com/pytorch/benchmark) tool is used to compare the performance:
- SkyLake 8180 (1 socket of 28 cores):

  ![image](https://user-images.githubusercontent.com/65992142/151162305-05e44425-a24e-4d5e-94e1-743b40b87a8c.png)

- SkyLake 8180 (single thread):

  ![image](https://user-images.githubusercontent.com/65992142/151162528-69f90b79-d08d-46b8-8775-d80a6ccbce8a.png)
 \* By mapping hardswish to oneDNN Graph, it’s 8% faster than PyTorch JIT (NNC + OFI)
  \** We expect performance gain after mapping transpose, contiguous & view to oneDNN graph ops

### Directory structure of the integration code
Fuser-related code is placed under:
```
torch/csrc/jit/codegen/onednn/
```

Optimization pass registration is done in:
```
torch/csrc/jit/passes/onednn_graph_fuser.h
```

CMake for the integration code is:
```
caffe2/CMakeLists.txt
```

## Limitations

- In this PR, we have only supported the optimization on the Linux platform. Support on Windows and macOS will be enabled as the next step.
- We have only optimized the inference use case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74596
Approved by: https://github.com/malfet
2022-04-29 01:01:33 +00:00
jjsjann123
ac31e5d4a3 Add a matching lerp implementation to eager mode. (#1612)
Fixes part of #76046

Add a matching lerp to eager mode.
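
For reference, a small illustration (not from this PR) of the lerp definition being matched, `lerp(start, end, weight) = start + weight * (end - start)`:
```py
import torch

start = torch.randn(4, device="cuda")
end = torch.randn(4, device="cuda")
weight = torch.full_like(start, 0.25)

# Eager-mode lerp agrees with the explicit formula.
assert torch.allclose(torch.lerp(start, end, weight),
                      start + weight * (end - start))
```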

Co-authored-by: jjsjann123 <alex.jann2012@gmail.com>
Co-authored-by: jjsjann123 <jiej@nvidia.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76459
Approved by: https://github.com/ngimel
2022-04-28 23:37:01 +00:00
David Berard
e52dc9888b Retry - [NVFuser] always use fallback if fusion fails
Retry of #75983. The change is to handle cases where attr::cache_id is
not set. This can happen if compilation fails.

Original message:

1) remember when fusions fail; and on subsequent runs, always take the fallback.
2) during the first fallback, cache the Code object.

On autogen-69 from the nvfuser microbenchmarks (https://github.com/pytorch/benchmark/pull/801) this improved performance as follows:
* Original (always attempt fusion): 25ms
* Always take fallback after first failure: 0.79ms
* Always take fallback + cache Code object: 0.62ms
* Eager: 0.58ms

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76505
Approved by: https://github.com/eellison
2022-04-28 20:38:37 +00:00
Ryan Spring
e9f17da2cf Nvfuser - Type Promotion Fix
Fix Type Promotion failures in [Issue 76046](https://github.com/pytorch/pytorch/issues/76046)

1. Updated nvfuser type promotion rule for codegen kernel;
2. Updated casting for output of nvfuser kernel to respect profiling/TorchScript scalar type;
3. Updated type_inference.cpp to only update device/scalar_type when profiling information is missing.

Additional Type Promotion Fixes (a minimal repro sketch follows this list):
-  test_nvfuser_correctness_softmax_with_dtype_cuda_float32
-  test_nvfuser_correctness_softmax_with_dtype_cuda_bfloat16
-  test_nvfuser_correctness_softmax_with_dtype_cuda_float16
-  test_nvfuser_correctness_log_softmax_dtype_cuda_bfloat16
-  test_nvfuser_correctness_log_softmax_dtype_cuda_bool
-  test_nvfuser_correctness_log_softmax_dtype_cuda_float16
-  test_nvfuser_correctness_log_softmax_dtype_cuda_float32
-  test_nvfuser_correctness_sum_cuda_int32
-  test_nvfuser_correctness_sum_to_size_cuda_int32
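
A hypothetical minimal repro in the spirit of the `*_softmax_with_dtype` tests above (not taken from the test suite): the scripted/nvfuser output dtype should now match eager-mode type promotion.
```py
import torch

def f(x):
    return torch.nn.functional.softmax(x, dim=-1, dtype=torch.float32)

scripted = torch.jit.script(f)
x = torch.randn(8, 8, device="cuda", dtype=torch.float16)

for _ in range(3):  # warm up the profiling executor / fusion
    out = scripted(x)

assert out.dtype == f(x).dtype == torch.float32
```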
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76343
Approved by: https://github.com/jjsjann123, https://github.com/mruberry
2022-04-28 16:08:38 +00:00
Kevin Stephano
f51516df1c Adding broadcast_in_dim and non-contiguous Tensor support NVFuser Python Frontend
Adding new features.

1. `broadcast_in_dim` support and example.
2. Adding non-contiguous `TensorView` support and example.

`broadcast_in_dim` example:
```
with FusionDefinition(fusion) as fd :
    t0 = fd.define_tensor(1)
    t1 = fd.define_tensor(3)

    fd.add_input(t0)
    fd.add_input(t1)

    t0_b = fd.Ops.broadcast_in_dim(t0, [2, 3, 4], [1])
    t2 = fd.Ops.add(t0_b, t1)

    fd.add_output(t2)
```

Non-contiguous tensor support example:

```
with FusionDefinition(fusion) as fd :
    t0 = fd.define_tensor(3, [False, False, False])
    t1 = fd.define_tensor(3, [True, True, True])

    fd.add_input(t0)
    fd.add_input(t1)
    print("Input1 Contiguity:", t0)
    print("Input2 Contiguity:", t1)

    t2 = fd.Ops.add(t0, t1)

    print("Output Contiguity:", t2, "\n")
    fd.add_output(t2)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76484
Approved by: https://github.com/mruberry
2022-04-28 08:36:57 +00:00
PyTorch MergeBot
bfb39e577c Revert "[NVFuser] always use fallback if fusion fails"
This reverts commit da984c507c.

Reverted https://github.com/pytorch/pytorch/pull/75983 on behalf of https://github.com/davidberard98
2022-04-26 15:21:23 +00:00
Kevin Stephano
b17b2b1cc7 Add NVFuser Python Frontend
New functionality.

1. Adds Pybind11 bindings for NVFuser.
2. Requires a build file change and a JIT Python file change outside of NVFuser's code area.

Example:
```
import torch

from torch._C._nvfuser import Fusion, FusionDefinition

# Construct and Define Fusion
fusion = Fusion()

with FusionDefinition(fusion) as fd :
    t0 = fd.define_tensor(3)
    t1 = fd.define_tensor(1)
    s0 = fd.define_scalar()

    fd.add_input(t0)
    fd.add_input(t1)
    fd.add_input(s0)

    c0 = fd.define_constant(3.0)

    t1_b = fd.Ops.broadcast(t1, [True, True, False])
    t2 = fd.Ops.add(t0, t1)
    t3 = fd.Ops.mul(t2, c0)
    t4 = fd.Ops.mul(t3, s0)
    t5 = fd.Ops.relu(t4)
    t6 = fd.Ops.sum(t5, [-1], False)

    fd.add_output(t6)

fusion.print_ir()

# Execute Fusion
input1 = torch.ones(2, 4, 8, device='cuda')
input2 = torch.ones(8, device='cuda')

# Kernel compilation should be cached for the 2nd iteration
# with input tensors of the same shape
for _ in range(5) :
    outputs = fusion.execute([input1, input2, 2.0])

print(outputs[0])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76353
Approved by: https://github.com/csarofeen, https://github.com/mruberry
2022-04-26 06:10:19 +00:00
jjsjann123
e48b29b1fb patching 11.1 ptxas issue
Fixes #75708

`--ptxas-options` only passes its immediate argument to ptxas, so we should have put it in front of every ptxas argument.

It's actually strange how this worked in CUDA Toolkit 11.6. I'm following up with the nvrtc team internally; meanwhile, we should merge this PR to avoid register failures in generated kernels.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76226
Approved by: https://github.com/davidberard98
2022-04-25 22:26:24 +00:00
David Berard
f36d348f75 [NVFuser] multithreading nvfuser test
1) add multithreading tests (the concurrency pattern is sketched below)
2) make IrParser thread safe with std::call_once (previously, registerJitOperator could get called twice simultaneously and segfault)
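
A sketch of that concurrency pattern (not the actual C++ test; the function and shapes are placeholders):
```py
import threading
import torch

@torch.jit.script
def f(x, y):
    return (x + y).relu()

x = torch.randn(1024, device="cuda")
y = torch.randn(1024, device="cuda")

def worker():
    for _ in range(10):
        f(x, y)  # concurrent JIT execution hits the parser/fuser from many threads

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```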

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76259

Approved by: https://github.com/jjsjann123
2022-04-25 21:48:50 +00:00
David Berard
da984c507c [NVFuser] always use fallback if fusion fails
1) remember when fusions fail; and on subsequent runs, always take the fallback.
2) during the first fallback, cache the Code object.

On autogen-69 from the nvfuser microbenchmarks (https://github.com/pytorch/benchmark/pull/801) this improved performance as follows:
* Original (always attempt fusion): 25ms
* Always take fallback after first failure: 0.79ms
* Always take fallback + cache Code object: 0.62ms
* Eager: 0.58ms

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75983

Approved by: https://github.com/jjsjann123
2022-04-25 20:48:47 +00:00