Commit Graph

77 Commits

Author SHA1 Message Date
PyTorch MergeBot
d28e9e145b Revert "[nvfuser_upstream_push] nvfuser code base bump 060822 (#79147)"
This reverts commit 49c41b87a2.

Reverted https://github.com/pytorch/pytorch/pull/79147 on behalf of https://github.com/janeyx99 due to Broke 11.3 builds on trunk 49c41b87a2
2022-06-10 20:55:10 +00:00
jjsjann123
49c41b87a2 [nvfuser_upstream_push] nvfuser code base bump 060822 (#79147)
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Bug fixes and minor refactor

Squashed commits to work around (WAR) the GitHub API.
Commits that are actually in this PR from the devel branch:

```
4c60e7dff22a494632370e5df55c011007340d06 Add examples infrastructure for using nvFuser in a standalone program (#1725)
02a05d98334ffa580d73ccb28fdb8c577ad296fe Fix issue #1751 (#1753)
8a69aa320bd7629e1709fe5ceb7104d2c88ec84c Refactor NvFuser transpose API to match eager mode behavior (#1746)
ffdf6b7709048170d768217fcd7083fc8387f932 Remove BroadcastWithoutStride. (#1738)
02bab16035e70734450c02124f5cdaa95cf5749d Fix flipping of a boolean flag (#1745)
465d66890c8242e811224359cbdb1c2915490741 cleanup (#1744)
26d354e68720bc7dd2d3b1338ac01b707a230b6a fixing noncontig broadcast (#1742)
856b6b2f9073662dd98ca22ba6c3540e20eb1cdd Add IterDomainBuilder (#1736)
1fd974f912cd4c1e21cbd16e2abb23598d66a02f fixing warning for gcc7 (#1732)
de2740a43a869f8272c2648e091d7b8235097db9 disabling complex in python tests for #1730 (#1733)
fbbbe0a2e7c7a63e0e2719b8bfccb759b714221a fixing MSVC build (#1728)
b5feee5e2b28be688dbddc766f3c0220389c8175 Fix the fused reduction runtime kernel (#1729)
5247682dff5980bb66edf8d3aac25dea2ef2ced5 Re-entrant GroupedGridReduction (#1727)
```

RUN_TORCHBENCH: nvfuser
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79147
Approved by: https://github.com/davidberard98
2022-06-10 19:37:42 +00:00
jjsjann123
9e52ad28c9 [nvfuser_upstream_push] nvfuser code base bump 052422 (#78244)
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

A few bigger updates:
1. Initial support of cp.async and cp.async.wait: https://github.com/csarofeen/pytorch/pull/1619
2. Emulate ampere's mma 16816 with Turing's mma 1688, for a unified interface: https://github.com/csarofeen/pytorch/pull/1643
3. Extending the infrastructure to support mma operators on turing and ampere arch: https://github.com/csarofeen/pytorch/pull/1440

Commits that are actually in this PR from the csarofeen branch:
```
* dd2325294e236c5082c642819a1103bcfe4561a3 (csarofeen/devel) Fusion Segmenter: Unify single kernel and multi-kernel runtime path (#1710)
* b3d1c3f446355a2d276bac8272e7aa8b5bb6b1f0 Fix missing cooperative launch (#1726)
* dc670a226cbe52be46cecef47001f38bf9a09433 Async gmem copy support on sm80+ (#1619)
* 5e6a8dab5a71aefe0548bbfa15d1a93c556d23fe Add turing mma support and test (#1643)
* d6d6b7d3f10dd91dafa4cdbd5e460bbb38173af4 Fix rFactor when there are indirect root domain(s), and refactor (#1723)
* 7093e39150c6d80e0f9f767d56654714a2e8a927 Mma op integration on ampere (#1440)
* fade8da55e60a118c5595378896d34b862b2fcc3 patch python test for bfloat16 (#1724)
* 8fbd0b18743a72ac10478857c3d2351204375685 Fine-grained kernel profiling (#1720)
* 77c1b4fa633f9e631d267923f4537336fa328939 Adding dry run mode to skip arch dependent checks (#1702)
* 151d95b97bebefc94199bb4a53423ede32b55451 More precise concretization analysis (#1719)
* f4d3630ed54d7069dd377a64be1f91013b285b66 Enable complex python tests (#1667)
* 4ceeee509774cc2ce6c834a4dc1e313f71d94503 Minor bugfix in transform_rfactor.cpp (#1715)
* 3675c70faf218e86d2c78dbd3874b175a3b0a203 Separate root domain and rfactor domain in TransformPrinter (#1716)
* f68b830d5def65dadfe29d4edf52fc703369c84a Fix scheduling with polymorphic broadcast (#1714)
* 4ab5ef7ae2cfd8fffad1e1d882ae7c50631211dc updating_ci_machine (#1718)
* 56585c58b1ff338704cafb0cd6be2b3d536bed5a Merge pull request #1711 from csarofeen/upstream_master_bump_0517
* 174d453d3be0c11a5acb0fff3b3f36e19cfdaf81 Allow using nvFuser on CUDA extension (#1701)
* 18bee67495454b9a79625799776e746bd5e81c4c Validate LOOP concrete IDs have complete IterDomains (#1676)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78244
Approved by: https://github.com/csarofeen, https://github.com/malfet
2022-06-07 17:30:51 -07:00
jjsjann123
6583c0384b fixing trivial reduction & broadcast scheduling (#77884)
cherry-picked fixes from https://github.com/csarofeen/pytorch/pull/1714

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77884
Approved by: https://github.com/csarofeen, https://github.com/davidberard98
2022-05-20 02:00:42 +00:00
jjsjann123
a2802ad0b9 Upstream master bump 0513 (#77471)
Updating nvfuser code base.

This should fix the indexing issue observed in https://github.com/pytorch/vision/issues/6015.

Running tests locally as well. Will update the description here at a later point

@bypass-github-export-checks
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77471
Approved by: https://github.com/seemethere, https://github.com/eellison
2022-05-18 11:48:50 -07:00
Xiang Gao
4eec865f58 [nvFuser] Improving bitwise ops support (#77158)
- Some renaming to better match PyTorch API:
  - `lshift` -> `bitwise_left_shift`
  - `rshift` -> `bitwise_right_shift`
  - `andOp` -> `bitwise_and`
  - `orOp` -> `bitwise_or`
  - `xorOp` -> `bitwise_xor`
  - `notOp` -> `bitwise_not`
- Fix type inferences and type checking of these ops
- Add `bitwise_*` to parser and python frontend
- Improve test coverage
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77158
Approved by: https://github.com/kevinstephano, https://github.com/jjsjann123
2022-05-18 17:21:34 +00:00
David Berard
36f7a6cc4a [NVFuser] don't decompose conv2d if we don't have shape info
Sometimes bias won't have shape info (e.g. in the added test, conv gets run two times in a loop, each with different shapes). In that case we should just skip decomposition instead of erroring out.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77440

Approved by: https://github.com/jjsjann123
2022-05-13 22:39:43 +00:00
jjsjann123
b010c3451c nvfuser opinfo test fixes masked_var/std (#77273)
Enables guard mode in opinfo tests.
Fixes opinfo failures for
    test_nvfuser_correctness__masked_var_cuda_xxxx
    test_nvfuser_correctness__masked_std_cuda_xxxx

The root cause of the failure is that tracing changes stride properties, which causes nvfuser to use the wrong kernel and generate wrong results.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77273
Approved by: https://github.com/davidberard98
2022-05-12 05:04:56 +00:00
David Berard
949cbf1d65 [NVFuser] Opinfos for extremal values in binary ufuncs
Added slow tests for comparing the eager & fused outputs for given extremal inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75917

Approved by: https://github.com/jjsjann123, https://github.com/eellison
2022-05-10 03:22:20 +00:00
jjsjann123
489818e7c6 disabling squeeze/unsqueeze; disabling BN/BN_BWD for perf concern (#77017)
Fixes #76883 (via disabling squeeze/unsqueeze)

Disabling BN fwd/bwd because of a perf concern. I need to update our Python tests. Awaiting the build to finish so I can update the tests accordingly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77017
Approved by: https://github.com/csarofeen, https://github.com/davidberard98
2022-05-09 22:57:20 +00:00
jjsjann123
b4f3f9c651 Torchvision patch (#77001)
Fixes #76791

Note that this is a hot patch so we get to run upstream tests. I'm doing a proper fix in our local repo and will update the upstream code once those changes are merged/reviewed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77001
Approved by: https://github.com/davidberard98
2022-05-09 16:53:23 +00:00
Xiang Gao
104f0bf09e [Reland] Add atan2 isfinite isinf isnan isneginf isposinf isreal to nvfuser and its frontend (#76769)
This reverts commit 4bb5944133.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76769
Approved by: https://github.com/csarofeen, https://github.com/mruberry
2022-05-07 21:26:00 +00:00
PyTorch MergeBot
4bb5944133 Revert "Add atan2 isfinite isinf isnan isneginf isposinf isreal to nvfuser and its frontend"
This reverts commit 92d10decc4.

Reverted https://github.com/pytorch/pytorch/pull/76598 on behalf of https://github.com/malfet
2022-05-03 19:53:28 +00:00
Xiang Gao
92d10decc4 Add atan2 isfinite isinf isnan isneginf isposinf isreal to nvfuser and its frontend
Fixes: https://github.com/csarofeen/pytorch/issues/1632
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76598
Approved by: https://github.com/csarofeen, https://github.com/mruberry
2022-05-03 16:31:40 +00:00
jjsjann123
d23619b030 Permutation extended
Extended permutation support in integration (see more details in https://github.com/csarofeen/pytorch/issues/1601). This update allows us to better support permutation propagation on tensors, specifically for binary ops with inputs of different ranks. Our goal is to avoid permuting tensors unless absolutely necessary. We try to preserve the permutation propagation rule in aten, with some known limitations at the time.

The idea in this implementation is the same as with our existing code, which is to permute input/output tensors outside of codegen: For a simplified binary op scenario: `output = binaryOp(input0, input1)`

1. In a simple case where `input0` and `input1` come with the same rank & permutation order, our output would preserve the same permutation;
2. For cases where `input0` and `input1` come with different ranks but with **compatible** permutation, the tensor with the higher rank dictates the permutation of the output;
3. For cases where `input0` and `input1` come with different ranks but with **incompatible** permutation, permutation propagation fails and the output tensor will be contiguous.

A **compatible** permutation means that we can permute the higher rank tensor to contiguous format, and then apply a second permutation to the tensor with lower rank to match their axes. This check is implemented in `MemoryFormat::broadcastToRank(int lower_rank)`.

Some concrete examples (note that we comply with eager propagation on cases 1-3, but diverge in behavior for cases 4, 5):
1. different rank & same permutation
```
    t0 = torch.randn(b, h, w, c).cuda().permute([0, 3, 1, 2])  # stride (hwc, 1, wc, c)
    t1 = torch.randn(h, w, c).cuda().permute([2, 0, 1])  # stride (1, wc, c)
    out = scripted_add(t0, t1)  # stride (hwc, 1, wc, c) preserving memory format of t0
```
2. different rank & compatible permutation
```
    t0 = torch.randn(b, h, w, c).cuda().permute([0, 3, 1, 2])  # stride (hwc, 1, wc, c)
    t1 = torch.randn(c, h, w).cuda()  # stride (hw, w, 1)
    out = scripted_add(t0, t1)  # stride (hwc, 1, wc, c) preserving memory format of t0
```
3. different rank & compatible permutation with broadcasting
```
    t0 = torch.randn(b, h, w, c).cuda().permute([0, 3, 1, 2])  # stride (hwc, 1, wc, c)
    t1 = torch.randn(c).cuda().unsqueeze(-1).unsqueeze(-1)  # stride (1, 1, 1)
    out = scripted_add(t0, t1)  # stride (hwc, 1, wc, c) preserving memory format of t0
```
4. different rank & incompatible permutation
```
    t0 = torch.randn(b, h, w, c).cuda().permute([0, 3, 1, 2])  # stride (hwc, 1, wc, c)
    t1 = torch.randn(h, w).cuda()  # stride (w, 1)
    jit_out = scripted_add(t0, t1)  # stride (hwc, wc, c, 1)  # nvfuser outputs contiguous tensor
    eager_out = eager_add(t0, t1)  # stride (hwc, 1, wc, c)  # TensorIterator preserves memory format of LHS operand
```
5. different rank & incompatible permutation
```
    t0 = torch.randn(c, h, w).cuda()  # stride (hw, w, 1)
    t1 = torch.randn(b, h, w, c).cuda().permute([0, 3, 1, 2])  # stride (hwc, 1, wc, c)
    jit_out = scripted_add(t0, t1)  # stride (hwc, 1, wc, c)  # nvfuser preserves memory format of highest rank tensors
    eager_out = eager_add(t0, t1)  # stride (hwc, hw, w, 1)  # TensorIterator preserves memory format of LHS operand
```
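The stride annotations in the examples above can be reproduced without a GPU. The following is a minimal sketch in plain Python; `contiguous_strides` and `permute` are hypothetical helpers written for illustration (not PyTorch APIs), and the sizes are arbitrary.

```python
def contiguous_strides(shape):
    """Row-major strides for a densely packed tensor of the given shape."""
    strides = []
    running = 1
    for size in reversed(shape):
        strides.append(running)
        running *= size
    return tuple(reversed(strides))


def permute(shape, strides, dims):
    """Reorder shape and strides by `dims`; no data is moved."""
    return tuple(shape[d] for d in dims), tuple(strides[d] for d in dims)


b, h, w, c = 2, 3, 4, 5  # arbitrary sizes for illustration

# Mirrors t0 = randn(b, h, w, c).permute([0, 3, 1, 2])
shape0, strides0 = permute((b, h, w, c), contiguous_strides((b, h, w, c)), [0, 3, 1, 2])
print(shape0, strides0)  # shape (b, c, h, w), strides (h*w*c, 1, w*c, c)

# Mirrors t1 = randn(h, w, c).permute([2, 0, 1])
shape1, strides1 = permute((h, w, c), contiguous_strides((h, w, c)), [2, 0, 1])
print(shape1, strides1)  # shape (c, h, w), strides (1, w*c, c)
```

Permuting only reorders the stride tuple, which is why a channels-last style layout shows up as a stride of 1 on the channel axis.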
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76563
Approved by: https://github.com/kevinstephano, https://github.com/ngimel
2022-05-02 22:09:56 +00:00
Elias Ellison
bcee215d2b [Testing CI] test exact layout on nvfuser tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76393

Approved by: https://github.com/jjsjann123, https://github.com/davidberard98
2022-04-28 00:18:15 +00:00
Elias Ellison
0d7be81c9c [JIT] Add Context Manager to force strict fusion
Fixes https://github.com/pytorch/pytorch/issues/75464 Adds a context manager that will throw if the ops in the context are not fused.

The API is:
```
with torch.jit.strict_fusion():
    ...
```

A few TODOs:
[+] Compose/figure out how to do with autodiff - right now it will run on autodiff as well
[+] Support all of the nvfuser operators that are added in guarding
[+] Figure out what to do with control flow that isn't taken (right now it will just error). this is probably a source of the original issue :/  - will just error
[+] (After those are figured out) add to docs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75777
Approved by: https://github.com/davidberard98
2022-04-25 16:08:57 +00:00
David Berard
1324410f2e [JIT] Reuse traced fn for jit opinfos
Previously, jit opinfos would only run the traced function once. This is a problem for NNC and NVFuser, where the fused implementation only runs on the second invocation.

This caches the traced function and calls the cached implementation, so that subsequent calls actually perform fusion and use the fused implementation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76000

Approved by: https://github.com/eellison
2022-04-22 20:14:29 +00:00
David Berard
cd0fdccaef Enable windows tests for nvfuser
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75190

Approved by: https://github.com/eellison, https://github.com/jjsjann123
2022-04-19 12:36:50 +00:00
David Berard
ebb60a8b2f [NVFuser] don't decompose linear if we don't have shape info
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75770

Approved by: https://github.com/jjsjann123, https://github.com/robieta
2022-04-18 14:24:37 +00:00
jjsjann123
692ebc8d8b baby steps on patching inf/nan behavior & aten::amin support in nvfuser
Fixes #75622

1. Instead of using max/min_value as the reduction init value, we go with (-)infinity so we can properly preserve inf inputs;
2. Adding inf/(-)inf/nan for float value.
3. Adding aten::amin in nvfuser (@kevinstephano @rdspring1 for review)
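
Point 1 can be illustrated in plain Python. This is only a sketch of the idea, not nvfuser code: `amax_with_init` is a hypothetical helper, and `-sys.float_info.max` stands in for the lowest finite value assumed to have been used as the init value before.

```python
import math
import sys


def amax_with_init(values, init):
    """Max-reduce `values`, starting the accumulator at `init`."""
    acc = init
    for v in values:
        acc = max(acc, v)
    return acc


inputs = [-math.inf, -math.inf]

# Starting from the lowest finite float loses the -inf inputs.
finite_init = amax_with_init(inputs, -sys.float_info.max)
# Starting from -inf preserves them.
inf_init = amax_with_init(inputs, -math.inf)

print(finite_init)  # a finite number, which is wrong when every input is -inf
print(inf_init)
```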
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75646
Approved by: https://github.com/rdspring1, https://github.com/kevinstephano, https://github.com/ngimel
2022-04-13 15:51:17 +00:00
David Berard
790cc8f259 [JIT] nvfuser test - use only major & minor versions
torch.version.cuda.split('.') can have 2 or 3 elements depending on whether the version string contains a patch number. This updates the test so it doesn't error out when the version has > 2 parts.
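
As a rough illustration of the parsing issue, here is a sketch in plain Python. `cuda_major_minor` is a hypothetical helper, not the code used in the test, and unpacking into two names is only one way such an error can arise.

```python
def cuda_major_minor(version):
    """Return (major, minor) from a version string such as '11.3' or '11.3.1'."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


# Unpacking into exactly two names breaks once a patch number is present.
try:
    major, minor = "11.3.1".split(".")
    unpack_failed = False
except ValueError:
    unpack_failed = True

print(unpack_failed)
print(cuda_major_minor("11.3"), cuda_major_minor("11.3.1"))
```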

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75706

Approved by: https://github.com/ngimel
2022-04-13 15:14:12 +00:00
David Berard
a305c078da [JIT] Prevent nvfuser registration on ROCm
Previously, cuda_graph_fuser.h registration of the nvfuser pass used `at::globalContext().hasHIP()` to check whether we were using ROCm/HIP. However, I don't think that check actually does anything; on the ROCm CI jobs the registration would still succeed.

Instead it's replaced with `#ifdef USE_ROCM`.

Verified this by enabling the NVFuser tests on ROCm and running in CI.

Before this change: the NVFuser test in CI on ROCm would throw really long and complex errors.

Now, it errors out immediately when trying to enable nvfuser.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75284

Approved by: https://github.com/eellison
2022-04-12 20:06:14 +00:00
jiej
0203341bbd patching clamp for one sided clamp
Fixes #75088

The solution is simply to avoid substituting an arbitrary value for the unspecified clamp bound, as pointed out in https://github.com/pytorch/pytorch/issues/75088#issuecomment-1093410036
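
The intended semantics of a one-sided clamp can be sketched in plain Python. This `clamp` is a hypothetical scalar stand-in for illustration, not the nvfuser implementation: a bound that was not given is skipped entirely instead of being replaced by a placeholder value.

```python
def clamp(x, min_val=None, max_val=None):
    """Clamp x, applying only the bounds that were actually specified."""
    if min_val is not None:
        x = max(x, min_val)
    if max_val is not None:
        x = min(x, max_val)
    return x


print(clamp(5.0, max_val=2.0))    # upper bound applies
print(clamp(-1e30, max_val=2.0))  # no lower bound, so the input passes through
print(clamp(-3.0, min_val=0.0))   # lower bound applies
```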

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75558
Approved by: https://github.com/ngimel
2022-04-12 03:02:32 +00:00
jjsjann123
f7e7af80e0 disabling reshape
Fixes #75282

Temporarily disables reshape to avoid codegen failure.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75539
Approved by: https://github.com/davidberard98
2022-04-12 02:43:45 +00:00
jjsjann123
2d5e4cff85 disabling view
Disabling view to avoid codegen errors as we resolve them internally.
This is currently done by simply stopping the non-alias transformation for the view op in the fusion pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75235
Approved by: https://github.com/davidberard98
2022-04-07 01:00:04 +00:00
jiej
aac4d6cd63 updating nvfuser tests
Re-enabled the failing test `test_category_rule` since I don't have the repro;
removed `test_linear_1d_weight_mismatch_bias_dtype` since the old behavior is not supported in aten;
disabled `test_int_tensor_input` for pre-volta device since we have reduction `amax` in the test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75340
Approved by: https://github.com/davidberard98
2022-04-07 00:58:04 +00:00
Horace He
5994d68484 Reland NVFuser guard changes
Reland of https://github.com/pytorch/pytorch/pull/75016 with `USE_CUDA` => `USE_NVFUSER`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75303
Approved by: https://github.com/jjsjann123, https://github.com/davidberard98
2022-04-06 06:32:34 +00:00
David Berard
83400e836e [JIT] nvfuser CI fixes
* test_native_batch_norm_backward
* test_reduction_empty_axes
* test_register_fuser
* test_category_rule

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75116

Approved by: https://github.com/jjsjann123, https://github.com/eellison
2022-04-04 22:19:03 +00:00
David Berard
c5b3727e5e [JIT] OpInfo tests for nvfuser (#71299)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71299

These tests verify that for the same inputs, the eager version of an op
and a traced, fused version of the op return the same output.

Currently the tests don't check whether or not fusion actually occurred.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D33595299

Pulled By: davidberard98

fbshipit-source-id: 26fdacf44941808c134953e7a883a02d13a43f19
(cherry picked from commit 8cd084e2e3130fcd5f9c99302d6d9bf4e21c25bb)
2022-04-01 23:48:30 +00:00
David Berard
27deefb5e1 [JIT] Enable NVFuser tests in OSS CI (#73322)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73322

These tests have been disabled in OSS CI since #34785.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D34436844

Pulled By: davidberard98

fbshipit-source-id: c5b14b33e7f369a6fa1e9cfbcb484a30dffc659e
(cherry picked from commit b08f51587c0203c3e8b69f06ea613759e740aa4f)
2022-04-01 23:48:30 +00:00
David Berard
e9e75215e2 [JIT] Optionally validate nvfuser outputs after execution (#74361)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74361

This adds an optional validation after executing an NVFuser node, which checks that the output is the same as the unfused implementation. Then the outputs and the graph are reported via a callback.

```python
import torch

def callback(x, y, graph):
    for i in range(len(x)):  # print every pair of compared outputs
        print(x[i])
        print(y[i])
    print(graph)

with torch.jit.fuser("fuser2"):
    torch._C._jit_nvfuser_set_comparison_callback(True, callback)

    @torch.jit.script
    def g(x, y):
        z = torch.add(x, y)
        return torch.sin(z)

    def f(x, y, a):
        z = torch.add(x, y)
        return g(torch.relu(z), a)

    f_s = torch.jit.script(f)
    x = torch.rand((10, 10), dtype=torch.half).cuda()
    y = torch.rand((10, 10), dtype=torch.half).cuda()
    a = torch.rand((10, 10), dtype=torch.half).cuda()
    f_s(x, y, a)
    f_s(x, y, a)
    f_s(x, y, a)
```

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D34975310

Pulled By: davidberard98

fbshipit-source-id: 2379c9a6f371cd58da6a187c1f16882f3923ab24
(cherry picked from commit 96c87992c65f5e6bb1bdd51791682dd837af99b4)
2022-04-01 23:48:30 +00:00
PyTorch MergeBot
1352c6417a Revert "Nvfuser guard patch"
This reverts commit d86181f745.

Reverted https://github.com/pytorch/pytorch/pull/75016 on behalf of https://github.com/malfet
2022-04-01 23:45:55 +00:00
jjsjann123
d86181f745 Nvfuser guard patch
Fixes issue where CudaFusionGuard would return false on backward graph because `requires_grad` flag doesn't match.

This is due to the fact that autodiff uses GradMode switch to turn on/off requires_grad, which is not taken into consideration by nvfuser guard. We verified the implementation under `TensorType::matchTensor`.

- [x] Add python test to verify no fallback is observed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75016
Approved by: https://github.com/eellison
2022-04-01 14:23:48 +00:00
jjsjann123
873ced7cd0 Nvfuser code bump 030122 (#73627)
Summary:
Things changed in this PR that require review:

test/forward_backward_compatibility/check_forward_backward_compatibility.py

Our previous function overload extension names were wrong and have been updated in this PR, hence the update to the compatibility list.

nvfuser code updates with bug fixes towards failures we encountered in OpInfoTests as well as failures reported by AOTAutograd team.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/73627

Reviewed By: Chillee

Differential Revision: D34765458

Pulled By: davidberard98

fbshipit-source-id: c81f3d6a1b723fb3a8ba419b7f82227f70440ca7
(cherry picked from commit b6a2c362c37051e44fac31687b2fe272f776551e)
2022-03-31 08:18:22 +00:00
jiej
86c817cfa0 Requires grad guard
Adding CudaFusionGuard to guard on device/requires_grad of profiled tensor type.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74780
Approved by: https://github.com/davidberard98
2022-03-29 19:23:10 +00:00
jiej
13f28df460 disable contiguity on cross dimensional overlapped tensor
Unmarked contiguity on stride properties when we have dimensions potentially covering overlapping memory.
This check could be done more accurately, per dimension instead of a global flag per tensor. I'm just keeping it simple here, as the existing code gives us correctness and that's what's important.
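
For readers unfamiliar with overlapping strides, the kind of check involved can be sketched as follows. This is a conservative illustration in plain Python under my own assumptions, not the logic used in this PR: `may_overlap` is a hypothetical helper that sorts dimensions by stride and flags any dimension whose stride does not clear the memory already reachable through smaller-stride dimensions.

```python
def may_overlap(shape, strides):
    """Conservatively report whether two distinct indices could share an address."""
    # Size-1 dimensions never contribute an offset, so they are ignored.
    dims = sorted((st, sz) for sz, st in zip(shape, strides) if sz > 1)
    max_offset = 0  # largest offset reachable using the dimensions seen so far
    for stride, size in dims:
        if stride <= max_offset:
            return True
        max_offset += (size - 1) * stride
    return False


print(may_overlap((2, 3), (3, 1)))             # contiguous layout
print(may_overlap((2, 5, 3, 4), (60, 1, 20, 5)))  # permuted but dense layout
print(may_overlap((2, 3), (2, 1)))             # rows step by 2 but are 3 wide
print(may_overlap((4, 3), (0, 1)))             # expanded (stride 0) dimension
```

The check can report a possible overlap for some layouts that do not actually overlap, which matches the spirit of keeping things simple and favoring correctness.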
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74359
Approved by: https://github.com/ngimel, https://github.com/malfet
2022-03-23 21:17:42 +00:00
jiej
e4e19d5beb nvfuser parser skip api (#74520)
Summary:
Added a Python API to disable nvfuser on certain op kinds.

```
          "_jit_set_nvfuser_skip_node_kind",
          [](const std::string& op_name, bool flip = true) {
            return fuser::cuda::skipNode(op_name, flip);
          })
```

Args:
    `op_name`: Symbol of op;
    `flip`: flag indicating whether to flip the given op in the skip list.
Returns:
    a bool flag indicating if `op_name` was already in the skip list.

A Python example that disables the fusion of `aten::add` from that point on:
`torch._C._jit_set_nvfuser_skip_node_kind("aten::add", True)  # returns False, as no op is in skip list by default`
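
The argument and return semantics described above can be modeled in plain Python. This is a sketch under my reading of "flip" as a toggle; the real logic lives in `fuser::cuda::skipNode`, which is not shown here, and `skip_node` and `_skip_list` are hypothetical names.

```python
_skip_list = set()  # empty by default, as the description states


def skip_node(op_name, flip=True):
    """Return whether op_name was already skipped; toggle its membership if flip is set."""
    was_skipped = op_name in _skip_list
    if flip:
        if was_skipped:
            _skip_list.remove(op_name)
        else:
            _skip_list.add(op_name)
    return was_skipped


print(skip_node("aten::add", True))   # first call: the op was not in the list yet
print(skip_node("aten::add", False))  # query only: the op is in the list now
```

Under this model, calling with `flip=False` acts as a pure query, while `flip=True` both reports the previous state and changes it.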

Pull Request resolved: https://github.com/pytorch/pytorch/pull/74520

Reviewed By: saketh-are

Differential Revision: D35046110

Pulled By: davidberard98

fbshipit-source-id: 689f5286513dbab206768823a852467b9f6b49b6
(cherry picked from commit 9a31129f7591ba2d393ab057b1cd137a6a25e7e8)
2022-03-23 20:56:43 +00:00
PyTorch MergeBot
a7866ada1c Revert "disable contiguity on cross dimensional overlapped tensor"
This reverts commit 6c383dede5.

Reverted https://github.com/pytorch/pytorch/pull/74359 on behalf of https://github.com/malfet
2022-03-23 20:54:22 +00:00
jiej
6c383dede5 disable contiguity on cross dimensional overlapped tensor
Unmarked contiguity on stride properties when we have dimensions potentially covering overlapping memory.
This check could be done more accurately, per dimension instead of a global flag per tensor. I'm just keeping it simple here, as the existing code gives us correctness and that's what's important.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74359
Approved by: https://github.com/ngimel
2022-03-23 17:42:38 +00:00
jiej
2d110d514f Nvfuser code bump 2_1_2022 (#72127)
Summary:
Things changed in this PR that require review:
1. aten/src/ATen/core/interned_strings.h
2. torch/csrc/jit/ir/alias_analysis.h : exposing createValue to allow efficient mutation
3. torch/csrc/jit/runtime/symbolic_shape_registry.cpp : added gelu/tanh/erf in registry
4. torch/jit/_script.py : throws when scripting a model sees autocast used as a decorator, since it's not supported

nvfuser code update:
1. codegen improvements and performance tuning
2. integration bug fixes for shape expression logic
3. kernel segmentation update to address perf regression from horizontal fusion
4. scalar cpu tensor promotion to support inter-device operation between cpu scalar tensor and cuda tensor

Things reverted from local changes:
aten::gelu with approximation (tracked in PR: https://github.com/pytorch/pytorch/pull/61439)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72127

Reviewed By: HamidShojanazeri

Differential Revision: D34113233

Pulled By: jbschlosser

fbshipit-source-id: b82cde32b71e324eca0ea57cb8c9f9647278ca74
(cherry picked from commit e009bc5c4e)
2022-02-15 00:43:16 +00:00
Ryan Spring
4f8b986e28 Implement Tanh Gelu Approximation (#61439)
Summary:
1. Implements https://github.com/pytorch/pytorch/issues/39853
2. Adds an `approximate` flag to Gelu (a string, `'none'` or `'tanh'`, as in the snippet below)
3. Enables Tanh Gelu approximation
4. Adds double backward support for Gelu
5. Enable Tanh Gelu in NvFuser

```
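import math
import torch

def gelu(x, approximate: str = 'none'):
    if approximate == 'tanh':
        # sqrt(2/pi) = 0.7978845608028654
        return 0.5 * x * (1.0 + torch.tanh(0.7978845608028654 * (x + 0.044715 * torch.pow(x, 3.0))))
    else:
        # x * normcdf(x), with the standard normal CDF written via erf
        return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))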
```
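
To get a feel for how close the two branches are, the formulas can be compared on scalars using only the standard library. This is an illustrative sketch: `gelu_exact` and `gelu_tanh` are hypothetical names, and the exact branch assumes the standard normal CDF expressed through `erf`.

```python
import math


def gelu_exact(x):
    """x * Phi(x), with the standard normal CDF expressed through erf."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """Tanh approximation, using the same constants as the snippet above."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(x, gelu_exact(x), gelu_tanh(x))
```

On these sample points the two variants agree to within a small absolute error, which is the trade-off the `approximate='tanh'` option exposes.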

Linking XLA PR - https://github.com/pytorch/xla/pull/3039

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61439

Reviewed By: VitalyFedyunin

Differential Revision: D33894937

Pulled By: jbschlosser

fbshipit-source-id: b65e8fb6ea66168af8f34f45ed50e92737a33851
(cherry picked from commit 6e986f91a9)
2022-02-14 03:40:32 +00:00
jjsjann123
e429a68478 Allow single node fusion for nvfuser (#70000)
Summary:
Setting `PYTORCH_NVFUSER_ONE_OP_FUSION=1` makes nvFuser take all nodes it supports, instead of waiting for a fusion opportunity.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70000

Reviewed By: samdow

Differential Revision: D33292195

Pulled By: davidberard98

fbshipit-source-id: 8ed5ce5e82fbb6737e8ab5ce4223b038eaf47756
2021-12-23 17:07:57 -08:00
jiej
78f06e0690 fixing conv2d decomposition and tests (#70127)
Summary:
The current implementation has a bug where the decomposed `add_optional` from `conv2d` is placed before the producer node, which causes a linter error on the graph.

Cherry-picked from https://github.com/csarofeen/pytorch/pull/1333
Fixing issue posted in https://github.com/csarofeen/pytorch/issues/1325

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70127

Reviewed By: ejguan

Differential Revision: D33199018

Pulled By: jansel

fbshipit-source-id: bce1f14a443811b4d55116a04fd4daa86084cc47
2021-12-19 10:38:23 -08:00
jiej
76d282d447 Nvfuser code bump 12 5 (#69964)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69964

Things added in this PR that require review:
1. cuLaunchCooperativeKernel driver API added
aten/src/ATen/cuda/detail/LazyNVRTC.cpp
aten/src/ATen/cuda/nvrtc_stub/ATenNVRTC.h

nvfuser code update:
1. perf tuning of the codegen scheduler that improves performance.
2. permutation support has been extended beyond contiguous/channels-last. (The improvements could be observed on PW benchmark)

Things reverted from local changes:
1. aten::gelu with approximation
2. local changes that are upstreamed in PR https://github.com/pytorch/pytorch/issues/68804

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69428

Reviewed By: ngimel

Differential Revision: D33073817

Pulled By: wconstab

fbshipit-source-id: e77d32e81d037d7370822b040456fd4c3bd68edb
2021-12-16 08:28:54 -08:00
jjsjann123
fed9b90ed4 fixing removeProfilingNodes duplicated functions (#1282) (#68804)
Summary:
Unfortunately there're two versions of removeProfilingNodes function and one of them is not cleaning up profile_ivalue nodes properly. This leads to a dangling profile_ivalue node, which ended up being profiled multiple times and could give us false assert failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68804

Reviewed By: mrshenli

Differential Revision: D32980157

Pulled By: Krovatkin

fbshipit-source-id: cd57c58a941d10ccd01a6cd37aac5c16256aaea6
2021-12-13 22:54:30 -08:00
jjsjann123
0dc3f829d9 Nvfuser code bump 11 5 (#67943)
Summary:
nvfuser code update:
1. Tuning heuristics on schedulers for reduction/normalization kernels;
2. bfloat16 on IO tensor support;
3. Refactored memory format support, now we can support dimension collapsing with non-coherent input tensors with different memory format. e.g. channels last tensor input to batch normalization. Note that we are currently limiting memory format to only Contiguous and Channels last;
4. Refactored nvfuser graph partitioning in `graph_fuser.cpp`, separated node merge and profile node API. Updated `profiling_record.cpp`.

Things that are reverted from our local branch:
1. changes on some entries in autodiff
2. aten::gelu with approximation
3. native_dropout(_backward)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67943

Reviewed By: ngimel

Differential Revision: D32288709

Pulled By: dzhulgakov

fbshipit-source-id: fc9491182ea7e0158bc112c66f096823c588eaf1
2021-11-17 01:22:17 -08:00
Jane Xu
09c7771e9c Set test owners for jit tests (#66808)
Summary:
Action following https://github.com/pytorch/pytorch/issues/66232

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66808

Reviewed By: mrshenli

Differential Revision: D31761414

Pulled By: janeyx99

fbshipit-source-id: baf8c49ff9c4bcda7b0ea0f6aafd26380586e72d
2021-10-25 07:51:10 -07:00
jjsjann123
d609957c95 patching graph_for (#55139)
Summary:
Allows individual DifferentiableGraphOp to display optimized forward graph. This improves user visibility to graph mutation via optimization pass, especially fusion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55139

Reviewed By: albanD

Differential Revision: D31330909

Pulled By: dzhulgakov

fbshipit-source-id: c745b482fdc34876dc404cbe3bacd99dcf2ac724
2021-10-04 21:50:22 -07:00
jiej
127c9402d0 Revert "Revert D30752939: [pytorch][PR] nvfuser update" (#65137)
Summary:
This reverts commit 03389dc851.

Attempt again for PR: https://github.com/pytorch/pytorch/issues/63745
Fixes the windows build failure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65137

Reviewed By: seemethere, dzhulgakov, heitorschueroff

Differential Revision: D30994556

Pulled By: malfet

fbshipit-source-id: f1925b6c5cc1a1a441a96499667c91e8dfc1b53d
2021-09-22 04:54:51 -07:00