This PR fixes a bug in `broadcast_in_dim` that prevented reduction ops from being used before `broadcast_in_dim`.
With this PR it's possible to run:
```py
import torch
import torch._refs
from torch._prims.executor import make_traced
def foo(a):
    return torch._refs.mean(a, keepdim=False)
a = torch.randn(3, 3, device='cuda')
make_traced(foo)(a, executor="nvfuser")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79444
Approved by: https://github.com/mruberry, https://github.com/jjsjann123
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/
Bug fixes and minor refactoring.
Squashed commits to work around the GitHub API.
Commits that are actually in this PR from the devel branch:
```
4c60e7dff22a494632370e5df55c011007340d06 Add examples infrastructure for using nvFuser in a standalone program (#1725)
02a05d98334ffa580d73ccb28fdb8c577ad296fe Fix issue #1751 (#1753)
8a69aa320bd7629e1709fe5ceb7104d2c88ec84c Refactor NvFuser transpose API to match eager mode behavior (#1746)
ffdf6b7709048170d768217fcd7083fc8387f932 Remove BroadcastWithoutStride. (#1738)
02bab16035e70734450c02124f5cdaa95cf5749d Fix flipping of a boolean flag (#1745)
465d66890c8242e811224359cbdb1c2915490741 cleanup (#1744)
26d354e68720bc7dd2d3b1338ac01b707a230b6a fixing noncontig broadcast (#1742)
856b6b2f9073662dd98ca22ba6c3540e20eb1cdd Add IterDomainBuilder (#1736)
1fd974f912cd4c1e21cbd16e2abb23598d66a02f fixing warning for gcc7 (#1732)
de2740a43a869f8272c2648e091d7b8235097db9 disabling complex in python tests for #1730 (#1733)
fbbbe0a2e7c7a63e0e2719b8bfccb759b714221a fixing MSVC build (#1728)
b5feee5e2b28be688dbddc766f3c0220389c8175 Fix the fused reduction runtime kernel (#1729)
5247682dff5980bb66edf8d3aac25dea2ef2ced5 Re-entrant GroupedGridReduction (#1727)
```
RUN_TORCHBENCH: nvfuser
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79147
Approved by: https://github.com/davidberard98
Adding a link to the GitHub 1.12 release branch nvFuser README.md in the JIT doc.
Note that this PR is intended to be cherry-picked into the 1.12 release; we'll have a follow-up PR to update the link once this PR is merged.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78160
Approved by: https://github.com/davidberard98
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/
A few bigger updates:
1. Initial support for cp.async and cp.async.wait: https://github.com/csarofeen/pytorch/pull/1619
2. Emulate Ampere's mma 16816 with Turing's mma 1688 for a unified interface: https://github.com/csarofeen/pytorch/pull/1643
3. Extending the infrastructure to support mma operators on the Turing and Ampere architectures: https://github.com/csarofeen/pytorch/pull/1440
Commits that are actually in this PR from the csarofeen branch:
```
* dd2325294e236c5082c642819a1103bcfe4561a3 (csarofeen/devel) Fusion Segmenter: Unify single kernel and multi-kernel runtime path (#1710)
* b3d1c3f446355a2d276bac8272e7aa8b5bb6b1f0 Fix missing cooperative launch (#1726)
* dc670a226cbe52be46cecef47001f38bf9a09433 Async gmem copy support on sm80+ (#1619)
* 5e6a8dab5a71aefe0548bbfa15d1a93c556d23fe Add turing mma support and test (#1643)
* d6d6b7d3f10dd91dafa4cdbd5e460bbb38173af4 Fix rFactor when there are indirect root domain(s), and refactor (#1723)
* 7093e39150c6d80e0f9f767d56654714a2e8a927 Mma op integration on ampere (#1440)
* fade8da55e60a118c5595378896d34b862b2fcc3 patch python test for bfloat16 (#1724)
* 8fbd0b18743a72ac10478857c3d2351204375685 Fine-grained kernel profiling (#1720)
* 77c1b4fa633f9e631d267923f4537336fa328939 Adding dry run mode to skip arch dependent checks (#1702)
* 151d95b97bebefc94199bb4a53423ede32b55451 More precise concretization analysis (#1719)
* f4d3630ed54d7069dd377a64be1f91013b285b66 Enable complex python tests (#1667)
* 4ceeee509774cc2ce6c834a4dc1e313f71d94503 Minor bugfix in transform_rfactor.cpp (#1715)
* 3675c70faf218e86d2c78dbd3874b175a3b0a203 Separate root domain and rfactor domain in TransformPrinter (#1716)
* f68b830d5def65dadfe29d4edf52fc703369c84a Fix scheduling with polymorphic broadcast (#1714)
* 4ab5ef7ae2cfd8fffad1e1d882ae7c50631211dc updating_ci_machine (#1718)
* 56585c58b1ff338704cafb0cd6be2b3d536bed5a Merge pull request #1711 from csarofeen/upstream_master_bump_0517
* 174d453d3be0c11a5acb0fff3b3f36e19cfdaf81 Allow using nvFuser on CUDA extension (#1701)
* 18bee67495454b9a79625799776e746bd5e81c4c Validate LOOP concrete IDs have complete IterDomains (#1676)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78244
Approved by: https://github.com/csarofeen, https://github.com/malfet
Sometimes bias won't have shape info (e.g. in the added test, conv gets run twice in a loop, each time with a different shape). In that case we should just skip the decomposition instead of erroring out.
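As a hypothetical repro sketch (the module and shapes here are made up for illustration; the actual test is the one added in the PR), the failing pattern looks roughly like this:
```py
import torch

# The same conv is run in a loop, and the scripted module is then called with
# inputs of two different shapes, so bias shape info may be missing when the
# conv-bias decomposition pass runs.
class Loop(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1, bias=True)

    def forward(self, x, y):
        out = []
        for t in [x, y]:  # two iterations, different spatial sizes
            out.append(self.conv(t))
        return out

m = torch.jit.script(Loop().cuda().eval())
m(torch.randn(1, 3, 16, 16, device='cuda'),
  torch.randn(1, 3, 32, 32, device='cuda'))
```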
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77440
Approved by: https://github.com/jjsjann123
This PR primarily addresses augmenting the frontend to properly support `broadcast_in_dim`. This required making a new version of `define_tensor()` that takes the `sizes` and `strides` of input tensors in order to properly determine broadcasts.
This PR also includes a fix for `python_example.py`, which broke when a new argument was added to reductions to allow the user to specify an output data type.
`define_tensor()` Interface Example:
```
fusion2 = Fusion()
input1 = torch.ones(1, 1, 4, device='cuda')
input2 = torch.ones(2, 3, 4, device='cuda')
with FusionDefinition(fusion2) as fd :
    t0 = fd.define_tensor(sizes=input1.size(), strides=input1.stride())
    t1 = fd.define_tensor(sizes=input2.size(), strides=input2.stride())
    fd.add_input(t0)
    fd.add_input(t1)
    t0_b = fd.Ops.broadcast_in_dim(t0, [2, 3, 4], [0, 1, 2])
    print("Broadcast TensorView", t0_b)
    t2 = fd.Ops.add(t0_b, t1)
    fd.add_output(t2)
```
Print statement of defined broadcast tensor:
```
Broadcast TensorView T2_l[ sbS6{1}, sbS7{1}, iS8{i2} ] DataType: float Contiguity: ttt
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76790
Approved by: https://github.com/mruberry, https://github.com/jjsjann123
Re-landing #68111/#74596
## Description
v0.5 PR of this [RFC](https://github.com/pytorch/pytorch/issues/49444).
Building on #50256, the following improvements are included:
* The [v0.5 release branch](https://github.com/oneapi-src/oneDNN/releases/tag/graph-v0.5) of the oneDNN Graph API is used
* The fuser now works with the profiling graph executor. We have inserted type check nodes to guard the profiled tensor properties.
### User API:
The optimization pass is disabled by default. Users can enable it with:
```
torch.jit.enable_onednn_fusion(True)
```
`torch.jit.freeze` should be used after tracing (recommended) or scripting a model.
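For reference, a minimal usage sketch of the inference flow described above (the model and input shapes are placeholders, not part of this PR):
```py
import torch

torch.jit.enable_onednn_fusion(True)  # the pass is off by default; opt in explicitly

# Placeholder model; this integration targets inference workloads.
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU()).eval()
example = torch.randn(32, 64)

with torch.no_grad():
    traced = torch.jit.trace(model, example)  # tracing (recommended) or scripting
    frozen = torch.jit.freeze(traced)         # freeze after tracing/scripting
    frozen(example)                           # warm-up runs let the profiling
    frozen(example)                           # graph executor specialize and fuse
    out = frozen(example)
```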
### Performance:
The [pytorch/benchmark](https://github.com/pytorch/benchmark) tool is used to compare the performance:
* SkyLake 8180 (1 socket of 28 cores): (performance charts not included here)
* SkyLake 8180 (single thread): (performance charts not included here)
  * By mapping hardswish to oneDNN Graph, it's 8% faster than PyTorch JIT (NNC + OFI)
  * We expect performance gains after mapping transpose, contiguous & view to oneDNN graph ops
### Directory structure of the integration code
Fuser-related code is placed under:
```
torch/csrc/jit/codegen/onednn/
```
Optimization pass registration is done in:
```
torch/csrc/jit/passes/onednn_graph_fuser.h
```
CMake for the integration code is in:
```
caffe2/CMakeLists.txt
cmake/public/mkldnn.cmake
cmake/Modules/FindMKLDNN.cmake
```
## Limitations
* In this PR, we only support the PyTorch-oneDNN Graph integration on Linux. Support for Windows and macOS will be enabled as a next step.
* We have only optimized the inference use case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76622
Approved by: https://github.com/eellison
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76485
Adds an environment variable `PYTORCH_JIT_ENABLE_NVFUSER` for
controlling whether or not nvfuser is enabled. This required changing
the PassManager behavior to support the case where nvfuser gets enabled
by default when PYTORCH_JIT_ENABLE_NVFUSER=1.
Previously the solution for turning nvfuser on or off was to use the
PassManager to register or un-register the pass. That works fine if the
pass starts off _disabled_, but causes issues once we try to enable the
pass by default.
The main issue with enabling by default is with the validation check to
see whether NVFuser can be turned on. The check relies on
at::globalContext().hasCUDA(), which requires CUDAHooks to be registered
before hasCUDA() will work correctly. At static initialization time it's
difficult to ensure that CUDAHooks will be registered _before_ we
attempt to register the nvfuser pass. In OSS it worked fine, but in
internal builds it would fail on ROCm builds.
To fix this, we switch the control of NVFuser enablement to a check in
the pass. i.e. previously, we enabled/disabled nvfuser by registering or
de-registering the pass in pass manager; now, the pass is always
registered in pass manager, and enablement is done by a check within the
nvfuser pass.
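As an illustrative sketch of that pattern (Python pseudocode with made-up names, not the actual C++ pass-manager code): the pass is always registered, and the enable/disable decision happens inside it at run time, when CUDAHooks is already available.
```py
import os

def nvfuser_fusion_pass(graph):
    # Enablement is checked when the pass actually runs, not at static
    # registration time. The default shown here is illustrative only; the
    # real default depends on the build/platform.
    enabled = os.environ.get("PYTORCH_JIT_ENABLE_NVFUSER") != "0"
    if not enabled:
        return graph  # registered but inert when disabled
    # ... validate CUDA availability and run fusion on `graph` here ...
    return graph

# Registered unconditionally; no register/un-register dance needed.
REGISTERED_PASSES = [nvfuser_fusion_pass]
```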
Remaining TODO: Connect this with NNC so that in cases where NNC is
available but not NVFuser (i.e. on AMD gpus), NNC can be turned on
automatically.
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D35982618
Pulled By: davidberard98
fbshipit-source-id: fd5b76bc0b8c8716c96fdc04bebfb15026a7ef60
(cherry picked from commit ff14603ff5ac8d9b6c749c4f111f4a8be8023b7f)
Extended permutation support in integration (see more details in https://github.com/csarofeen/pytorch/issues/1601). This update allows us to better support permutation propagation on tensors, specifically for binary ops with inputs of different ranks. Our goal is to avoid permuting tensors unless absolutely necessary. We try to preserve the permutation propagation rule in ATen, with some known limitations at this time.
The idea in this implementation is the same as in our existing code: permute input/output tensors outside of codegen. Consider a simplified binary op scenario, `output = binaryOp(input0, input1)`:
1. In a simple case where `input0` and `input1` come with the same rank & permutation order, our output would preserve the same permutation;
2. For cases where `input0` and `input1` come with different ranks but with **compatible** permutation, the tensor with the higher rank dictates the permutation of the output;
3. For cases where `input0` and `input1` come with different ranks but with **in-compatible** permutation, this is where permutation propagation fails and the output tensor will be contiguous.
By **compatible** permutation, we mean that we can permute the higher-rank tensor to contiguous format and then apply a second permutation to the lower-rank tensor to match their axes. This check is implemented in `MemoryFormat::broadcastToRank(int lower_rank)`.
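As a quick illustration of the layouts involved, here is a small helper (hypothetical, not part of the nvfuser API) that recovers a tensor's memory-format order from its strides, which is the information these propagation rules operate on:
```py
import torch

def memory_order(t):
    # Dimensions ordered from outermost (largest stride) to innermost (smallest).
    return sorted(range(t.dim()), key=t.stride, reverse=True)

b, h, w, c = 2, 4, 5, 3
t0 = torch.randn(b, h, w, c).permute([0, 3, 1, 2])  # strides (hwc, 1, wc, c)
t1 = torch.randn(h, w, c).permute([2, 0, 1])        # strides (1, wc, c)
print(memory_order(t0))  # [0, 2, 3, 1] -> channel dim is innermost
print(memory_order(t1))  # [1, 2, 0]    -> channel dim is innermost as well (compatible)
```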
Some concrete examples (note that we comply with eager propagation in cases 1-3, but diverge in behavior for cases 4 and 5):
1. different rank & same permutation
```
t0 = torch.randn(b, h, w, c).cuda().permute([0, 3, 1, 2]) # stride (hwc, 1, wc, c)
t1 = torch.randn(h, w, c).cuda().permute([2, 0, 1]) # stride (1, wc, c)
out = scripted_add(t0, t1) # stride (hwc, 1, wc, c) preserving memory format of t0
```
2. different rank & compatible permutation
```
t0 = torch.randn(b, h, w, c).cuda().permute([0, 3, 1, 2]) # stride (hwc, 1, wc, c)
t1 = torch.randn(c, h, w).cuda() # stride (hw, w, 1)
out = scripted_add(t0, t1) # stride (hwc, 1, wc, c) preserving memory format of t0
```
3. different rank & compatible permutation with broadcasting
```
t0 = torch.randn(b, h, w, c).cuda().permute([0, 3, 1, 2]) # stride (hwc, 1, wc, c)
t1 = torch.randn(c).cuda().unsqueeze(-1).unsqueeze(-1) # stride (1, 1, 1)
out = scripted_add(t0, t1) # stride (hwc, 1, wc, c) preserving memory format of t0
```
4. different rank & in-compatible permutation
```
t0 = torch.randn(b, h, w, c).cuda().permute([0, 3, 1, 2]) # stride (hwc, 1, wc, c)
t1 = torch.randn(h, w).cuda() # stride (w, 1)
jit_out = scripted_add(t0, t1) # stride (hwc, 1, wc, c) # stride (hwc, wc, c, 1) # nvfuser outputs contiguous tensor
eager_out = eager_add(t0, t1) # stride (hwc, 1, wc, c) # stride (hwc, 1, wc, c) # TensorIterator preserves memory format of LHS operand
```
5. different rank & in-compatible permutation
```
t0 = torch.randn(c, h, w).cuda() # stride (hw, w, 1)
t1 = torch.randn(b, h, w, c).cuda().permute([0, 3, 1, 2]) # stride (hwc, 1, wc, c)
jit_out = scripted_add(t0, t1) # stride (hwc, 1, wc, c) # stride (hwc, 1, wc, c) # nvfuser preserves memory format of highest rank tensors
eager_out = eager_add(t0, t1) # stride (hwc, 1, wc, c) # stride (hwc, hw, w, 1) # TensorIterator preserves memory format of LHS operand
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76563
Approved by: https://github.com/kevinstephano, https://github.com/ngimel
I noticed that when `SymInt` was introduced, `jit_type_base.h` was
added as an include to the `Operator.h` template, which is supposed to
be kept extremely clean and only use forward declarations. I also noticed
that forward declarations for `OptionalArrayRef` were missing.
So, I've refactored the forward declarations into
`ATen/core/ATen_fwd.h` and cleaned up some of the `c10`
headers that were masking these missing declarations. I've also
re-generated the pre-compiled header so `SymInt` is included.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76576
Approved by: https://github.com/albanD
Retry of #75983. The change is to handle cases where attr::cache_id is
not set. This can happen if compilation fails.
Original message:
1) Remember when fusions fail; on subsequent runs, always take the fallback.
2) During the first fallback, cache the Code object.
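A rough sketch of this strategy (names and structure are hypothetical; the real logic lives in the C++ fuser's fallback path):
```py
compiled_fallbacks = {}   # fusion-group key -> cached fallback Code object
failed_groups = set()     # fusion-group keys whose compilation failed

def run_fusion_group(key, inputs, try_fuse, build_fallback):
    if key in failed_groups:
        # Subsequent runs: skip fusion entirely and reuse the cached fallback.
        return compiled_fallbacks[key](inputs)
    try:
        return try_fuse(inputs)                # attempt the fused kernel
    except RuntimeError:
        failed_groups.add(key)                 # 1) remember the failure
        fallback = build_fallback()            # 2) build the fallback Code once...
        compiled_fallbacks[key] = fallback     #    ...and cache it for reuse
        return fallback(inputs)
```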
On autogen-69 from the nvfuser microbenchmarks (https://github.com/pytorch/benchmark/pull/801) this improved performance as follows:
* Original (always attempt fusion): 25ms
* Always take fallback after first failure: 0.79ms
* Always take fallback + cache Code object: 0.62ms
* Eager: 0.58ms
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76505
Approved by: https://github.com/eellison
Fixes #75708
`--ptxas-options` only passes its immediate argument to ptxas, so we should have put it in front of every ptxas argument.
It's actually strange how this worked in CUDA Toolkit 11.6. I'm following up with the NVRTC team on this internally; meanwhile we should merge this PR to avoid register failures in generated kernels.
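Schematically (the option names below are examples only, not the exact flags nvfuser passes), the fix amounts to giving every ptxas argument its own prefix:
```py
# Hypothetical NVRTC option lists for illustration.

# Before: only the argument immediately following `--ptxas-options` is forwarded
# to ptxas.
before = ["--ptxas-options", "-v", "-maxrregcount=64"]

# After: each ptxas argument is individually prefixed with `--ptxas-options`.
after = ["--ptxas-options", "-v", "--ptxas-options", "-maxrregcount=64"]
```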
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76226
Approved by: https://github.com/davidberard98
1) Remember when fusions fail; on subsequent runs, always take the fallback.
2) During the first fallback, cache the Code object.
On autogen-69 from the nvfuser microbenchmarks (https://github.com/pytorch/benchmark/pull/801) this improved performance as follows:
* Original (always attempt fusion): 25ms
* Always take fallback after first failure: 0.79ms
* Always take fallback + cache Code object: 0.62ms
* Eager: 0.58ms
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75983
Approved by: https://github.com/jjsjann123