Summary:
NNAPI: The internal test infra can't find test_nnapi.py. The easiest solution is to skip these tests when test_nnapi.py can't be found.
test_ivalue: fails because the qscheme op is not implemented for the CPU backend. In OSS it doesn't run because it's not included in test_jit.py.
CPU NNC tests: test_working_byte_cpu_float32 is failing but is hard to reproduce; we don't use CPU NNC internally, so just skip the CPU NNC tests internally.
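A minimal sketch of the skip guard described above (the file path and class name are illustrative, not the actual internal test layout):
```python
import os
import unittest

# Hypothetical guard: skip the NNAPI tests when test_nnapi.py is not present.
HAS_TEST_NNAPI = os.path.exists(os.path.join(os.path.dirname(__file__), "test_nnapi.py"))

@unittest.skipIf(not HAS_TEST_NNAPI, "test_nnapi.py not found; skipping NNAPI tests")
class TestNnapiBackend(unittest.TestCase):
    def test_placeholder(self):
        self.assertTrue(True)
```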
Differential Revision: D48041615
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108937
Approved by: https://github.com/eellison
Applies the remaining flake8-comprehensions fixes and checks. This change replaces all remaining unnecessary generator expressions with list/dict/set comprehensions, which are more succinct, more performant, and better supported by our torch.jit compiler. It also removes useless generators such as `set(a for a in b)`, simplifying them to a plain constructor call.
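An illustrative before/after for the kinds of rewrites applied here (a sketch, not taken from the diff):
```python
b = [1, 2, 2, 3]

# Unnecessary generator expression passed to list() -> list comprehension
squares = list(x * x for x in range(10))   # before
squares = [x * x for x in range(10)]       # after

# Useless generator inside set() -> direct constructor call
unique = set(a for a in b)                 # before
unique = set(b)                            # after
```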
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94676
Approved by: https://github.com/ezyang
AMP inserts `_autocast_to_reduced_precision` and `_autocast_to_full_precision` automatically. The aten implementation provides a fast path that bypasses the conversion if the tensor's data type is already the reduced/full precision. But NNC always performs the conversion, which can cause a >5% end-to-end performance regression.
This PR addresses the performance issue the same way aten does: we no longer pull `_autocast_to_reduced_precision` and `_autocast_to_full_precision` into the NNC fusion group, and instead fall back to aten to trigger its fast path when the tensor's data type is already the reduced/full precision.
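A conceptual sketch of the aten fast path described above (names are illustrative, not the actual aten implementation):
```python
import torch

def to_reduced_precision(x: torch.Tensor, reduced_dtype: torch.dtype = torch.bfloat16) -> torch.Tensor:
    # Fast path: the tensor is already in the reduced precision, so skip the conversion.
    if x.dtype == reduced_dtype:
        return x
    return x.to(reduced_dtype)

x = torch.rand(4).to(torch.bfloat16)
assert to_reduced_precision(x) is x  # no conversion, no copy
```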
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85140
Approved by: https://github.com/frank-wei
buildShapeExpressions skips shape building for nodes if their inputs are
unknown.
Previously, prim::ConstantChunk ops were not skipped when their inputs were
unknown, which caused issues for graphs like:
```
graph(%x.1 : Float(4, 4, strides=[4, 1], requires_grad=0, device=cpu),
%y.1 : Float(4, 4, strides=[4, 1], requires_grad=0, device=cpu)):
%2 : Long(requires_grad=0, device=cpu) = prim::Constant[value={4}]() # skip, constants unsupported
%3 : int = prim::Constant[value=1]() # skip, constants unsupported
%4 : Float(4, 4, strides=[4, 1], requires_grad=0, device=cpu) = aten::add(%x.1, %y.1, %3) # calculate
%5 : Float(4, 4, strides=[4, 1], requires_grad=0, device=cpu) = aten::add(%4, %2, %3) # skip, because %2 doesn't have shapes defined in the map
%6 : Float(4, 2, strides=[4, 1], requires_grad=0, device=cpu), %7 : Float(4, 2, strides=[4, 1], requires_grad=0, device=cpu) = prim::ConstantChunk[chunks=2, dim=1](%5) # <-- FAIL because %5 isn't defined
%8 : Float(4, 2, strides=[2, 1], requires_grad=0, device=cpu) = aten::mul(%6, %7) # ...
```
(buildShapeExpressions would fail with std::out_of_range because the value was not found in the shapes map)
This moves the skip logic before the prim::ConstantChunk case to avoid this issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82698
Approved by: https://github.com/eellison
This PR:
Makes the following testing changes:
- Updates stride testing in test_python_reference_consistency to only check strides of dimensions with length > 1
- Creates reference inputs for reshape
- Creates reference inputs for chunk
- Extends the sample inputs for unsqueeze
- Extends the sample inputs for stack -- test_conj_view and test_neg_view are now xfailed (see https://github.com/pytorch/pytorch/issues/77046)
Makes the following architecture changes:
- Adds the refs.special (sub)module
- Adds the refs.nn.functional (sub)module
Adds the following prims:
- expand_dims
- view_of
- rev
- clone
Adds the following references:
- flatten
- squeeze
- unsqueeze
- special.i0e
- special.i1e
- logical_or
- logical_and
- isclose
- flip
- stack
- nn.functional.elu
- chunk
- clone
- narrow
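A small sketch exercising two of the new references against their eager counterparts (assumes the references are exposed under torch._refs as in this PR):
```python
import torch
import torch._refs as refs

x = torch.rand(4, 6)

# Reference implementations should match the eager ops they mirror.
assert torch.equal(refs.flip(x, (0,)), torch.flip(x, (0,)))
assert all(torch.equal(a, b)
           for a, b in zip(refs.chunk(x, 2, dim=1), torch.chunk(x, 2, dim=1)))
```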
Identifies the following bugs in PyTorch today:
- https://github.com/pytorch/pytorch/issues/77054
- https://github.com/pytorch/pytorch/issues/77055
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77043
Approved by: https://github.com/ngimel
Fixes https://github.com/pytorch/pytorch/issues/75464. Adds a context manager that will throw if the ops in the context are not fused.
The API is:
```
with torch.jit.strict_fusion():
...
```
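A minimal usage sketch (assuming a CUDA device with a fuser backend such as NNC or nvFuser enabled, and that the function has been profiled enough for fusion to run):
```python
import torch

@torch.jit.script
def fn(x, y):
    with torch.jit.strict_fusion():
        # Errors if these pointwise ops do not end up in a fusion group
        # once the graph is optimized.
        return x + y + x

x = torch.rand(8, device="cuda")
y = torch.rand(8, device="cuda")
fn(x, y)  # profiling run(s) warm up the executor
fn(x, y)  # optimized run; strict_fusion checks that fusion actually happened
```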
A few TODOs:
[+] Compose/figure out how this interacts with autodiff - right now it will run on autodiff as well
[+] Support all of the nvfuser operators that are added in guarding
[+] Figure out what to do with control flow that isn't taken; this is probably a source of the original issue :/ - for now it will just error
[+] (After those are figured out) add to docs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75777
Approved by: https://github.com/davidberard98
Previously, jit opinfos would only run the traced function once. This is a problem for NNC and NVFuser, where the fused implementation only runs on the second invocation.
This caches the traced function and calls the cached implementation, so that subsequent calls actually perform fusion and use the fused implementation.
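A sketch of the behavior described above (illustrative function, not one of the OpInfo samples):
```python
import torch

def f(x, y):
    return (x + y).relu()

inputs = (torch.rand(4, 4), torch.rand(4, 4))
traced = torch.jit.trace(f, inputs)

traced(*inputs)  # first call: profiling run, no fusion yet
traced(*inputs)  # second call: optimized graph runs, so the fused kernel can be exercised
```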
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76000
Approved by: https://github.com/eellison
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73875
Previously we had a few settings:
- getExecutor - which toggled between Profiling Executor and Legacy
- getGraphOptimize - if true, overrides PE/Legacy to run with simple executor (no optimizations)
and then...
- getProfilingMode - which would set PE to 0 specializations.
The last mode is redundant with getGraphOptimize; we should just remove it and use getGraphOptimize in these cases. It could also lead to potentially invalid combinations of logic - what does it mean if getProfilingMode is true but getExecutor is set to false? This would lead to a bug in specialize_autograd_zero in this case, see: https://github.com/pytorch/pytorch/blob/master/torch%2Fcsrc%2Fjit%2Fpasses%2Fspecialize_autogradzero.cpp#L93.
The tests here are failing but get fixed with the PR above it, so I'll squash for landing.
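Rough Python-level counterparts of these settings, for reference (a sketch using the private torch._C bindings of this era; names may change):
```python
import torch

# "getExecutor": toggle between the profiling executor and the legacy executor.
torch._C._jit_set_profiling_executor(True)

# "getGraphOptimize": controls whether the graph executor applies optimizations
# or falls back to the simple executor; this is the setting getProfilingMode folds into.
prev = torch._C._get_graph_executor_optimize()
torch._C._set_graph_executor_optimize(prev)
```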
Test Plan: Imported from OSS
Reviewed By: cpuhrsch
Differential Revision: D34938130
Pulled By: eellison
fbshipit-source-id: 1a9c0ae7f6d1cfddc2ed3499a5af611053ae5e1b
(cherry picked from commit cf69ce3d155ba7d334022c42fb2cee54bb088c23)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73762
TestCase.setUp() controls slowTest behavior, so test classes need to call super().setUp() to prevent fast tests from running in the slow-test CI jobs.
example: https://github.com/pytorch/pytorch/runs/5413135014?check_suite_focus=true: despite PYTORCH_TEST_SKIP_FAST=1, TestTEFuserStatic tests are still running
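A sketch of the fix (the class name follows the example above; the test body is illustrative):
```python
from torch.testing._internal.common_utils import TestCase

class TestTEFuserStatic(TestCase):
    def setUp(self):
        # Without this call, TestCase's PYTORCH_TEST_SKIP_FAST / slowTest gating is
        # bypassed and fast tests still run in the slow-test CI jobs.
        super().setUp()

    def test_example(self):
        self.assertTrue(True)
```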
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D34628769
Pulled By: davidberard98
fbshipit-source-id: 84311ec1db2ac60fcafb7b77f377e9ae2ef792e3
(cherry picked from commit 67fdba7fb9b73ce2b9119f4c4bc84e5b38041e21)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72478
`aten::_autocast_to_reduced_precision` and `aten::_autocast_to_full_precision` are essentially just `aten::to` operations, so they can be fused the same way `aten::to` is fused.
Test Plan: Imported from OSS
Reviewed By: bdhirsh
Differential Revision: D34057522
Pulled By: davidberard98
fbshipit-source-id: f3b53641415702a4ac56460587801b9c76d81b3c
(cherry picked from commit 838ce5542e)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70465
These tests check that
(a) the result after NNC fusion (of a single op) is the same as the
unfused op, and
(b) for certain ops where fusion is expected to occur, fusion does
actually occur.
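A sketch of what such a test does (not the actual test file; whether fusion kicks in for a single op depends on fuser settings such as CPU fusion being enabled and the minimum fusion group size):
```python
import torch

def op(x):
    return torch.sigmoid(x)

x = torch.rand(8, 8)
scripted = torch.jit.script(op)

scripted(x)              # profiling run
fused_out = scripted(x)  # optimized run; may execute an NNC kernel

# (a) fused and unfused results agree
assert torch.allclose(fused_out, op(x))

# (b) a fusion group was created (inspect the last optimized graph)
graph = torch.jit.last_executed_optimized_graph()
print("TensorExprGroup" in str(graph))
```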
Test Plan: Imported from OSS
Reviewed By: wenleix
Differential Revision: D33595240
Pulled By: davidberard98
fbshipit-source-id: e2e17a921bc30c313e92e8e5bbc6c1b5fcd14bc1
(cherry picked from commit b1ba221acc)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72266
Within the kernel, we may manipulate `Value *` in `OptimizeCat`, which would invalidate the input `Value *` -> Stride mapping.
Fix for https://github.com/pytorch/pytorch/issues/72173
Test Plan: Imported from OSS
Reviewed By: dagitses, davidberard98
Differential Revision: D33986306
Pulled By: eellison
fbshipit-source-id: dc33cd2b545e49e90d1e46b9fcf1e6dbb4b829db
(cherry picked from commit 5e4555968a)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72032
This contains a few channels-last changes from benchmarking:
- don't permute back to channels-last for dynamic shapes on CPU; perf is not good, and use cases for it are exotic at the moment
- remove the conditional-one handling when permuting a channels-last symbolic tensor on CUDA; it's not needed in the permutation case, as tests show
- remove logic in torch/csrc/jit/tensorexpr/loopnest.cpp that prevented inlining; the condition it checks is always valid given valid construction of the IR
I can split this up as needed.
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D33864652
Pulled By: eellison
fbshipit-source-id: f16674fb02dfff22670d8a2f856c5a317fd15717
(cherry picked from commit a9a0697839)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71651
The only tests that regress do so because chunk is NYI; the other tests that I touched were passing only because `assertAllFused` wasn't working correctly. That, and we're no longer compiling conv/matmul with dynamic shapes.
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D33801500
Pulled By: eellison
fbshipit-source-id: 074118ab4a975b7db876a4fcdfb9483afb879e79
(cherry picked from commit abaa7948c1)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71650
Refactors PE so there is a current fusion strategy set, which takes a vector of, e.g., [(STATIC, 2), (DYNAMIC, 10)], meaning: fuse two static invocations, then fuse 10 dynamic ones, then stop specializing.
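A sketch of the corresponding user-facing knob (assuming the torch.jit.set_fusion_strategy API that accompanies this refactor):
```python
import torch

# Specialize/fuse 2 static invocations, then 10 dynamic ones, then stop specializing.
torch.jit.set_fusion_strategy([("STATIC", 2), ("DYNAMIC", 10)])
```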
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D33801501
Pulled By: eellison
fbshipit-source-id: ebc7ac3c57e35a3b9bb15ab751f0aa1d25cc9bd5
(cherry picked from commit 8dd89088d3)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71642
A missing comma was causing implicit string concatenation in a list of strings.
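An illustration of the bug class (not the actual list from this change): Python silently concatenates adjacent string literals when a comma is missing.
```python
ops = [
    "aten::add",
    "aten::mul"   # <-- missing comma
    "aten::relu",
]
print(ops)  # ['aten::add', 'aten::mulaten::relu'] -- two entries silently merged into one
```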
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D33713185
Pulled By: davidberard98
fbshipit-source-id: a2458629d78202713a5bb2f8c720ff9b81939c31
(cherry picked from commit b077598f1d)
Summary:
The block and thread extent calculations in `cuda_codegen` should be using `int64_t` instead of `int`. The updated test, `test_dynamic_shapes`, fails without this change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71428
Reviewed By: samdow
Differential Revision: D33640374
Pulled By: navahgar
fbshipit-source-id: 64c340ad2a9a1fa1fe066cf1c5dfc3b546b7be6d
(cherry picked from commit 6ea546ce11)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70464
Add handling of strided input tensors to dynamic fusion. This is done with the same set of input striding specializations as https://github.com/pytorch/pytorch/pull/60684/:
```
S_ONE, // STRIDE_ONE: packed
S_CONT, // STRIDE_CONTIGUOUS: stride[i + 1] * sizes[i + 1]
S_TRAN_CONT, // STRIDE_TRANSPOSED_CONTIGUOUS: stride[i-1] * sizes[i-1]
S_AS_ARG, // STRIDE_AS_ARG: stride passed in as runtime value
```
and then two additional specializations for a) contiguous tensors and b) channels-last tensors. Channels-last is a common case and we should optimize for it. Additionally, tensors natively store whether they are contiguous/channels-last contiguous, which makes it faster to check whether tensors follow this pattern.
Output striding will be done in a follow-up.
The striding is stored on both the TensorGroup node and on the guard node. The striding descriptors are stored as a vector of strings on the node for debuggability and to make use of storing ivalues as attributes on nodes.
As an example:
```
%8 : Double(10, 11, 12, 13, strides=[1716, 1, 143, 11], requires_grad=0, device=cpu) = prim::TensorExprGroup_0[symbolic_shape_inputs=[-37, -36, -35, -34], striding_inputs_desc=[["TENSOR_CONT_CHANNELS_LAST"]]](%x, %24, %23, %22, %21)
```
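A small illustration of the contiguity flags mentioned above (a sketch, not from the PR): tensors record whether they are contiguous or channels-last contiguous, so these specialized descriptors can be checked cheaply.
```python
import torch

x = torch.rand(2, 3, 4, 5)                       # NCHW, default contiguous
nhwc = x.to(memory_format=torch.channels_last)   # channels-last contiguous

print(x.is_contiguous())                                      # True
print(nhwc.is_contiguous(memory_format=torch.channels_last))  # True
print(nhwc.stride())                                          # (60, 1, 15, 3)
```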
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D33458649
Pulled By: eellison
fbshipit-source-id: c42616d3c683d70f6258180d23d3841a31a6030d