Sometimes bias won't have shape info (e.g. in the added test, conv gets run two times in a loop, each with different shapes). In that case we should just skip decomposition instead of erroring out.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77440
Approved by: https://github.com/jjsjann123
Enables guard mode in opinfo tests.
Fixes opinfo failures for
test_nvfuser_correctness__masked_var_cuda_xxxx
test_nvfuser_correctness__masked_std_cuda_xxxx
The root cause of the failure is that tracing changes stride properties and causes nvfuser to use wrong kernel and generate wrong results.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77273
Approved by: https://github.com/davidberard98
Extended permutation support in integration (See more details on https://github.com/csarofeen/pytorch/issues/1601). This update allows us to better support permutation propagation on tensors, specifically for binary ops with inputs of different ranks. Our goal is to avoid permuting tensors unless absolutely necessary. We try to preserve the permutation propagation rule in aten, with some known limitation at the time.
The idea in this implementation is the same as with our existing code, which is to permute input/output tensors outside of codegen: For a simplified binary op scenario: `output = binaryOp(input0, input1)`
1. In a simple case where `input0` and `input1` come with the same rank & permutation order, our output would preserve the same permutation;
2. For cases where `input0` and `input1` come with different ranks but with **compatible** permutation, the tensor with the higher rank dictates the permutation of the output;
3. For cases where `input0` and `input1` come with different ranks but with **in-compatible** permutation, this is where permutation propagation fails and the output tensor will be contiguous.
By **compatible** permutation, it means that we can permute the higher rank tensor to contiguous format, and then apply a second permutation to the tensor with lower rank to match their axes. This check is implemented in `MemoryFormat::broadcastToRank(int lower_rank)`.
Some concrete example (note that we comply with eager propagation on cases 1-3, but diverge in behavior for cases 4, 5):
1. different rank & same permutation
```
t0 = torch.randn(b, h, w, c).cuda().permute([0, 3, 1, 2]) # stride (hwc, 1, wc, c)
t1 = torch.randn(h, w, c).cuda().permute([2, 0, 1]) # stride (1, wc, c)
out = scripted_add(t0, t1) # stride (hwc, 1, wc, c) preserving memory format of t0
```
2. different rank & compatible permutation
```
t0 = torch.randn(b, h, w, c).cuda().permute([0, 3, 1, 2]) # stride (hwc, 1, wc, c)
t1 = torch.randn(c, h, w).cuda() # stride (hw, w, 1)
out = scripted_add(t0, t1) # stride (hwc, 1, wc, c) preserving memory format of t0
```
3. different rank & compatible permutation with broadcasting
```
t0 = torch.randn(b, h, w, c).cuda().permute([0, 3, 1, 2]) # stride (hwc, 1, wc, c)
t1 = torch.randn(c).cuda().unsqueeze(-1).unsqueeze(-1) # stride (1, 1, 1)
out = scripted_add(t0, t1) # stride (hwc, 1, wc, c) preserving memory format of t0
```
4. different rank & in-compatible permutation
```
t0 = torch.randn(b, h, w, c).cuda().permute([0, 3, 1, 2]) # stride (hwc, 1, wc, c)
t1 = torch.randn(h, w).cuda() # stride (w, 1)
jit_out = scripted_add(t0, t1) # stride (hwc, 1, wc, c) # stride (hwc, wc, c, 1) # nvfuser outputs contiguous tensor
eager_out = eager_add(t0, t1) # stride (hwc, 1, wc, c) # stride (hwc, 1, wc, c) # TI preserves memory format of LHS operand
```
5. different rank & in-compatible permutation
```
t0 = torch.randn(c, h, w).cuda() # stride (hw, w, 1)
t1 = torch.randn(b, h, w, c).cuda().permute([0, 3, 1, 2]) # stride (hwc, 1, wc, c)
jit_out = scripted_add(t0, t1) # stride (hwc, 1, wc, c) # stride (hwc, 1, wc, c) # nvfuser preserves memory format of highest rank tensors
eager_out = eager_add(t0, t1) # stride (hwc, 1, wc, c) # stride (hwc, hw, w, 1) # TensorIterator preserves memory format of LHS operand
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76563
Approved by: https://github.com/kevinstephano, https://github.com/ngimel
Fixes https://github.com/pytorch/pytorch/issues/75464 Adds a context manager that will throw if the ops in the context are not fused.
API is :
```
with torch.jit.strict_fusion():
...
```
A few TODOs:
[+] Compose/figure out how to do with autodiff - right now it will run on autodiff as well
[+] Support all of the nvfuser operators that are added in guarding
[+] Figure out what to do with control flow that isn't taken (right now it will just error). this is probably a source of the original issue :/ - will just error
[+] (After those are figured out) add to docs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75777
Approved by: https://github.com/davidberard98
Previously, jit opinfos would only run the traced function once. This is a problem for NNC and NVFuser, where the fused implementation only runs on the second invocation.
This caches the traced function and calls the cached implementation, so that subsequent calls actually perform fusion and use the fused implementation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76000
Approved by: https://github.com/eellison
torch.version.cuda.split('.') can have 2 or 3 elements depending on whether the version string contains a patch number. This updates the test so it doesn't error out when the version has > 2 parts.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75706
Approved by: https://github.com/ngimel
Previously, cuda_graph_fuser.h registration of the nvfuser pass used `at::globalContext().hasHIP()` to check whether we were using ROCm/HIP. However, I don't think that check actually does anything; on the ROCm CI jobs the registration would still succeed.
Instead it's replaced with `#ifdef USE_ROCM`.
Verified this by enabling the NVFuser tests on ROCm and running in CI.
Before this change: the NVFuser test in CI on ROCm would throw really long and complex errors.
Now, it errors out immediately when trying to enable nvfuser.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75284
Approved by: https://github.com/eellison
Re-enabled the failing test `test_category_rule` since I don't have the repro;
removed `test_linear_1d_weight_mismatch_bias_dtype` since the old behavior is not supported in aten;
disabled `test_int_tensor_input` for pre-volta device since we have reduction `amax` in the test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75340
Approved by: https://github.com/davidberard98
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71299
These tests verify that for the same inputs, the eager version of an op
and a traced, fused version of the op return the same output.
Currently the tests don't check whether or not fusion actually occurred.
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D33595299
Pulled By: davidberard98
fbshipit-source-id: 26fdacf44941808c134953e7a883a02d13a43f19
(cherry picked from commit 8cd084e2e3130fcd5f9c99302d6d9bf4e21c25bb)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73322
These tests have been disabled in OSS CI since #34785.
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D34436844
Pulled By: davidberard98
fbshipit-source-id: c5b14b33e7f369a6fa1e9cfbcb484a30dffc659e
(cherry picked from commit b08f51587c0203c3e8b69f06ea613759e740aa4f)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74361
This adds an optional validation after executing an NVFuser node, which checks that the output is the same as the unfused implementation. Then the outputs and the graph are reported via a callback.
```python
import torch
def callback(x, y, graph):
for i in range(len(x)-amt, len(x)):
print(x[i])
print(y[i])
print(graph)
with torch.jit.fuser("fuser2"):
torch._C._jit_nvfuser_set_comparison_callback(True, callback)
torch.jit.script
def g(x, y):
z = torch.add(x, y)
return torch.sin(z)
def f(x, y, a):
z = torch.add(x, y)
return g(torch.relu(z), a)
f_s = torch.jit.script(f)
x = torch.rand((10, 10), dtype=torch.half).cuda()
y = torch.rand((10, 10), dtype=torch.half).cuda()
a = torch.rand((10, 10), dtype=torch.half).cuda()
f_s(x, y, a)
f_s(x, y, a)
f_s(x, y, a)
```
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D34975310
Pulled By: davidberard98
fbshipit-source-id: 2379c9a6f371cd58da6a187c1f16882f3923ab24
(cherry picked from commit 96c87992c65f5e6bb1bdd51791682dd837af99b4)
Fixes issue where CudaFusionGuard would return false on backward graph because `requires_grad` flag doesn't match.
This is due to the fact that autodiff uses GradMode switch to turn on/off requires_grad, which is not taken into consideration by nvfuser guard. We verified the implementation under `TensorType::matchTensor`.
- [x] Add python test to verify no fallback is observed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75016
Approved by: https://github.com/eellison
Summary:
Things changed in this PR that requires review:
test/forward_backward_compatibility/check_forward_backward_compatibility.py
Our previous function overload extension names were wrong and has been updated in this PR, hence the compatibility list updated.
nvfuser code updates with bug fixes towards failures we encountered in OpInfoTests as well as failures reported by AOTAutograd team.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73627
Reviewed By: Chillee
Differential Revision: D34765458
Pulled By: davidberard98
fbshipit-source-id: c81f3d6a1b723fb3a8ba419b7f82227f70440ca7
(cherry picked from commit b6a2c362c37051e44fac31687b2fe272f776551e)
Unmarked contiguity on stride properties when we have dimensions potentially covering overlapping memory.
This check could be done more accurately, per dimension instead of a global flag per tensor. I'm just keeping it simple here, as the existing code gives us correctness and that's what's important.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74359
Approved by: https://github.com/ngimel, https://github.com/malfet
Summary:
added python API to disable nvfuser on certain opkind.
```
"_jit_set_nvfuser_skip_node_kind",
[](const std::string& op_name, bool flip = true) {
return fuser::cuda::skipNode(op_name, flip);
})
```
Args:
`op_name`: Symbol of op;
`flip`: flag indicating whether to flip the given op in the skip list.
Returns:
a bool flag indicating if `op_name` was already in the skip list.
The python example that disables the fusion of `aten::add` afterwards.
`torch._C._jit_set_nvfuser_skip_node_kind("aten::add", True) # returns False, as no op is in skip list by default`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74520
Reviewed By: saketh-are
Differential Revision: D35046110
Pulled By: davidberard98
fbshipit-source-id: 689f5286513dbab206768823a852467b9f6b49b6
(cherry picked from commit 9a31129f7591ba2d393ab057b1cd137a6a25e7e8)
Unmarked contiguity on stride properties when we have dimensions potentially covering overlapping memory.
This check could be done more accurately, per dimension instead of a global flag per tensor. I'm just keeping it simple here, as the existing code gives us correctness and that's what's important.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74359
Approved by: https://github.com/ngimel
Summary:
Things changed in this PR that requires review:
1. aten/src/ATen/core/interned_strings.h
2. torch/csrc/jit/ir/alias_analysis.h : exposing createValue to allow efficient mutation
3. torch/csrc/jit/runtime/symbolic_shape_registry.cpp : added gelu/tanh/erf in registry
4. torch/jit/_script.py : throws scripting model sees autocast as decorator since it's not supported
nvfuser code update:
1. codegen improvements and performance tuning
2. integration bug fixes for shape expression logic
3. kernel segmentation update to address perf regression from horizontal fusion
4. scalar cpu tensor promotion to support inter-device operation between cpu scalar tensor and cuda tensor
Things reverted from local changes:
aten::gelu with approximation (tracked in PR: https://github.com/pytorch/pytorch/pull/61439)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72127
Reviewed By: HamidShojanazeri
Differential Revision: D34113233
Pulled By: jbschlosser
fbshipit-source-id: b82cde32b71e324eca0ea57cb8c9f9647278ca74
(cherry picked from commit e009bc5c4e)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69964
Things added in this PR that requires review:
1. cuLaunchCooperativeKernel driver API added
aten/src/ATen/cuda/detail/LazyNVRTC.cpp
aten/src/ATen/cuda/nvrtc_stub/ATenNVRTC.h
nvfuser code update:
1. perf turning on codegen scheduler that improves performance.
2. permutation support has been extended beyond contiguous/channels-last. (The improvements could be observed on PW benchmark)
Things reverted from local changes:
1. aten::gelu with approximation
2. local changes that is upstreamed in PR https://github.com/pytorch/pytorch/issues/68804
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69428
Reviewed By: ngimel
Differential Revision: D33073817
Pulled By: wconstab
fbshipit-source-id: e77d32e81d037d7370822b040456fd4c3bd68edb
Summary:
Unfortunately there're two versions of removeProfilingNodes function and one of them is not cleaning up profile_ivalue nodes properly. This leads to a dangling profile_ivalue node, which ended up being profiled multiple times and could give us false assert failures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68804
Reviewed By: mrshenli
Differential Revision: D32980157
Pulled By: Krovatkin
fbshipit-source-id: cd57c58a941d10ccd01a6cd37aac5c16256aaea6
Summary:
nvfuser code update:
1. Tuning heuristics on schedulers for reduction/normalization kernels;
2. bfloat16 on IO tensor support;
3. Refactored memory format support, now we can support dimension collapsing with non-coherent input tensors with different memory format. e.g. channels last tensor input to batch normalization. Note that we are currently limiting memory format to only Contiguous and Channels last;
4. Refactored nvfuser graph partitioning in `graph_fuser.cpp`, separated node merge and profile node API. Updated `profiling_record.cpp`.
Things that are reverted from our local branch:
1. changes on some entries in autodiff
2. aten::gelu with approximation
3. native_dropout(_backward)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67943
Reviewed By: ngimel
Differential Revision: D32288709
Pulled By: dzhulgakov
fbshipit-source-id: fc9491182ea7e0158bc112c66f096823c588eaf1
Summary:
Syncing nvfuser code base from devel branch, Listing a few of our development since last sync:
- Extends support to normalization and reduction kernels.
- Multiple kernel launch for single `CudaFusionGroup`. Hierarchical caching system has been updated to cache graph segmentation.
- profile_ivalue is enabled to convert dynamic scalar into compile time constants, which are required by the codegen. (e.g. reduction axes).
To keep this PR simple and relatively review-free. We stripped most external changes and submitted them as separate PRs, so this gigantic PR is easier to handle.
internal updates are files located in:
1. updates in nvfuser codegen `torch/csrc/jit/coddgen/cuda`
2. added nvfuser specific benchmarks `benchmarks/cpp/nvfuser`
3. nvfuser jit cpp tests `test/cpp/jit/test_gpu.cpp` `test/cpp/jit/test_gpu_shift.cpp` `test/cpp/jit/test_gpu_validator.h`
updates affecting integration:
1. profile_ivalue enabled for nvfuser. related changes are in `torch/csrc/jit/runtime/*`,
2. exposed a few more symbols `aten/src/ATen/core/*` used by codegen
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63745
Reviewed By: saketh-are
Differential Revision: D30752939
Pulled By: malfet
fbshipit-source-id: ce122e80f01bcd3865f5bd3c4dfde660665fd84c