This PR ...
Makes the following testing changes:
- Updates stride testing in test_python_reference_consistency to only check strides of dimensions with length > 1
- Creates reference inputs for reshape
- Creates reference inputs for chunk
- Extends the sample inputs for unsqueeze
- Extends the sample inputs for stack -- test_conj_view and test_neg_view are now xfailed (tracked in https://github.com/pytorch/pytorch/issues/77046)
Makes the following architecture changes:
- Adds the refs.special (sub)module
- Adds the refs.nn.functional (sub)module
Adds the following prims:
- expand_dims
- view_of
- rev
- clone
Adds the following references:
- flatten
- squeeze
- unsqueeze
- special.i0e
- special.i1e
- logical_or
- logical_and
- isclose
- flip
- stack
- nn.functional.elu
- chunk
- clone
- narrow
Identifies the following bugs in PyTorch today:
- https://github.com/pytorch/pytorch/issues/77054
- https://github.com/pytorch/pytorch/issues/77055
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77043
Approved by: https://github.com/ngimel
Fixes https://github.com/pytorch/pytorch/issues/75464. Adds a context manager that will throw if the ops in the context are not fused.
The API is:
```
with torch.jit.strict_fusion():
...
```
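For instance, a minimal usage sketch inside a scripted function (op choices and device are illustrative; whether the check passes depends on the active fuser and warm-up behavior):
```
import torch

@torch.jit.script
def fused_add_relu(x, y):
    with torch.jit.strict_fusion():
        # both ops are expected to land in one fusion group;
        # strict_fusion raises if anything in this block is left unfused
        return torch.relu(x + y)

x = torch.rand(8, device='cuda')
y = torch.rand(8, device='cuda')
out = fused_add_relu(x, y)
```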
A few TODOs:
[+] Figure out how to compose with autodiff - right now it will run on autodiff as well
[+] Support all of the nvfuser operators that are added in guarding
[+] Figure out what to do with control flow that isn't taken (right now it will just error); this is probably a source of the original issue :/
[+] (After those are figured out) add to docs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75777
Approved by: https://github.com/davidberard98
Previously, JIT OpInfos would only run the traced function once. This is a problem for NNC and NVFuser, where the fused implementation only runs on the second invocation.
This caches the traced function and calls the cached implementation, so that subsequent calls actually perform fusion and use the fused implementation.
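As a rough illustration of the behavior being tested (the function and shapes are illustrative, not the actual OpInfo machinery):
```
import torch

def f(x):
    return (x + 1).relu()

traced = torch.jit.trace(f, (torch.rand(4),))
traced(torch.rand(4))  # first call: profiling run, no fusion yet
traced(torch.rand(4))  # second call: the fused implementation is exercised
```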
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76000
Approved by: https://github.com/eellison
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73875
Previously we had a few settings:
- getExecutor - which toggled between Profiling Executor and Legacy
- getGraphOptimize - if true, overrides PE/Legacy to run with simple executor (no optimizations)
and then...
- getProfilingMode - which would set PE to 0 specializations.
The last mode is redundant with getGraphOptimize; we should just remove it and use getGraphOptimize in these cases. Keeping it also leads to potentially invalid combinations of logic - what does it mean if getProfilingMode is true but getExecutor is set to false? That combination leads to a bug in specialize_autograd_zero, see: https://github.com/pytorch/pytorch/blob/master/torch%2Fcsrc%2Fjit%2Fpasses%2Fspecialize_autogradzero.cpp#L93.
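For reference, the Python-level bindings for the surviving settings look roughly like this (a sketch; binding names are the ones used in the JIT test utilities and may vary by version):
```
import torch

# getExecutor: choose the profiling executor over the legacy executor
torch._C._jit_set_profiling_executor(True)
# getGraphOptimize: when False, fall back to the simple executor (no optimizations)
torch._C._set_graph_executor_optimize(True)
print(torch._C._get_graph_executor_optimize())
```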
The tests here are failing but get fixed with the PR above it, so I'll squash for landing.
Test Plan: Imported from OSS
Reviewed By: cpuhrsch
Differential Revision: D34938130
Pulled By: eellison
fbshipit-source-id: 1a9c0ae7f6d1cfddc2ed3499a5af611053ae5e1b
(cherry picked from commit cf69ce3d155ba7d334022c42fb2cee54bb088c23)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73762
TestCase.setUp() controls slowTest behavior, so tests need to call super().setUp() to prevent fast tests from running in the slow-test CI jobs.
Example: https://github.com/pytorch/pytorch/runs/5413135014?check_suite_focus=true: despite PYTORCH_TEST_SKIP_FAST=1, TestTEFuserStatic tests are still running
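A hedged sketch of the pattern the fix enforces (the class body is illustrative):
```
from torch.testing._internal.common_utils import TestCase

class TestTEFuserStatic(TestCase):
    def setUp(self):
        super().setUp()  # applies the slowTest / PYTORCH_TEST_SKIP_FAST gating
        # ... fuser-specific setup goes here ...
```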
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D34628769
Pulled By: davidberard98
fbshipit-source-id: 84311ec1db2ac60fcafb7b77f377e9ae2ef792e3
(cherry picked from commit 67fdba7fb9b73ce2b9119f4c4bc84e5b38041e21)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72478
`aten::_autocast_to_reduced_precision` and `aten::_autocast_to_full_precision` are essentially just `aten::to` operations, so they can be fused the same way `aten::to` is fused.
Test Plan: Imported from OSS
Reviewed By: bdhirsh
Differential Revision: D34057522
Pulled By: davidberard98
fbshipit-source-id: f3b53641415702a4ac56460587801b9c76d81b3c
(cherry picked from commit 838ce5542e)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70465
These tests check to ensure that
(a) the result after nnc fusion (of a single op) is the same as the
unfused op
(b) for certain ops where fusion is expected to occur, ensure that
fusion does actually occur
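A hedged sketch of what such a check might look like (the helper and its usage are illustrative, not the actual test code):
```
import torch

def check_single_op_fusion(op, example_input):
    scripted = torch.jit.script(op)
    for _ in range(2):            # warm up so the profiling executor can fuse
        fused_out = scripted(example_input)
    # (a) fused result matches the unfused op
    torch.testing.assert_close(fused_out, op(example_input))
    # (b) fusion actually occurred: a TensorExpr group appears in the optimized graph
    graph = torch.jit.last_executed_optimized_graph()
    assert any(n.kind() == 'prim::TensorExprGroup' for n in graph.nodes())
```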
Test Plan: Imported from OSS
Reviewed By: wenleix
Differential Revision: D33595240
Pulled By: davidberard98
fbshipit-source-id: e2e17a921bc30c313e92e8e5bbc6c1b5fcd14bc1
(cherry picked from commit b1ba221acc)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72266
Within the kernel, we may manipulate `Value *` in `OptimizeCat`, which would invalidate the input `Value *` -> Stride mapping.
Fix for https://github.com/pytorch/pytorch/issues/72173
Test Plan: Imported from OSS
Reviewed By: dagitses, davidberard98
Differential Revision: D33986306
Pulled By: eellison
fbshipit-source-id: dc33cd2b545e49e90d1e46b9fcf1e6dbb4b829db
(cherry picked from commit 5e4555968a)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72032
This contains a few channels-last changes from benchmarking:
- don't permute back to channels last in the dynamic-shape CPU case; perf is not good, and use cases for it are exotic atm
- remove the conditional one handling in permuting a channels-last symbolic tensor on CUDA; it's not needed in the permutation case, as tests show
- remove logic in torch/csrc/jit/tensorexpr/loopnest.cpp preventing inlining; the condition it checks is always valid given valid construction of the IR
I can split this up as needed.
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D33864652
Pulled By: eellison
fbshipit-source-id: f16674fb02dfff22670d8a2f856c5a317fd15717
(cherry picked from commit a9a0697839)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71651
The only tests that regress do so because chunk is NYI; the other tests that I touched were passing only because `assertAllFused` wasn't working correctly. That, and we're no longer compiling conv/matmul with dynamic shapes.
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D33801500
Pulled By: eellison
fbshipit-source-id: 074118ab4a975b7db876a4fcdfb9483afb879e79
(cherry picked from commit abaa7948c1)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71650
Refactors PE so there is a current fusion strategy, set as a vector of e.g. [(STATIC, 2), (DYNAMIC, 10)], which means: fuse two static invocations, then fuse 10 dynamic ones, then stop specializing.
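This strategy is surfaced to Python roughly as follows; a hedged sketch assuming the torch.jit.set_fusion_strategy binding that exposes it:
```
import torch

# fuse 2 static specializations, then 10 dynamic ones, then stop specializing
torch.jit.set_fusion_strategy([("STATIC", 2), ("DYNAMIC", 10)])
```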
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D33801501
Pulled By: eellison
fbshipit-source-id: ebc7ac3c57e35a3b9bb15ab751f0aa1d25cc9bd5
(cherry picked from commit 8dd89088d3)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71642
A missing comma was causing implicit string concatenation in a list of strings.
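An illustrative example of the bug class (not the actual list from the change):
```
ops = [
    "aten::add",
    "aten::mul"    # missing trailing comma ...
    "aten::relu",  # ... so Python concatenates these into "aten::mulaten::relu"
]
```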
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D33713185
Pulled By: davidberard98
fbshipit-source-id: a2458629d78202713a5bb2f8c720ff9b81939c31
(cherry picked from commit b077598f1d)
Summary:
The block and thread extent calculations in `cuda_codegen` should be using `int64_t` instead of `int`. The updated test, `test_dynamic_shapes`, fails without this change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71428
Reviewed By: samdow
Differential Revision: D33640374
Pulled By: navahgar
fbshipit-source-id: 64c340ad2a9a1fa1fe066cf1c5dfc3b546b7be6d
(cherry picked from commit 6ea546ce11)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70464
Add handling of strided input tensors to dynamic fusion. This is done with the same set of input striding specializations as https://github.com/pytorch/pytorch/pull/60684/:
```
S_ONE, // STRIDE_ONE: packed
S_CONT, // STRIDE_CONTIGUOUS: stride[i + 1] * sizes[i + 1]
S_TRAN_CONT, // STRIDE_TRANSPOSED_CONTIGUOUS: stride[i-1] * sizes[i-1]
S_AS_ARG, // STRIDE_AS_ARG: stride passed in as runtime value
```
and two additional specializations for (a) contiguous tensors and (b) channels-last tensors. Channels-last is a common case and we should optimize for it. Additionally, tensors natively store whether they are contiguous/channels-last contiguous, which makes it faster to check whether tensors follow this pattern.
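A hedged illustration of checking for these two fast-path layouts (shapes are illustrative):
```
import torch

x = torch.rand(10, 11, 12, 13)
x.is_contiguous()                                     # the contiguous specialization
y = x.to(memory_format=torch.channels_last)
y.is_contiguous(memory_format=torch.channels_last)    # the channels-last specialization
```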
Output striding will be done in a follow up.
The striding is stored on both the TensorGroup node and on the guard node. The striding descriptors are stored as a vector of strings on the node for debugability and to make use of storing ivalues as attributes on nodes.
As an example:
```
%8 : Double(10, 11, 12, 13, strides=[1716, 1, 143, 11], requires_grad=0, device=cpu) = prim::TensorExprGroup_0[symbolic_shape_inputs=[-37, -36, -35, -34], striding_inputs_desc=[["TENSOR_CONT_CHANNELS_LAST"]](%x, %24, %23, %22, %21)
```
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D33458649
Pulled By: eellison
fbshipit-source-id: c42616d3c683d70f6258180d23d3841a31a6030d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70463
Fix for https://github.com/pytorch/pytorch/issues/52940
When we call inlining on a fallback function, insert the runtime optimized version of its graph.
Test Plan: Imported from OSS
Reviewed By: jbschlosser, davidberard98
Differential Revision: D33458651
Pulled By: eellison
fbshipit-source-id: fd7e5e2b5273a1677014ba1a766538c3ee9cad76
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70410
Trying again after #70174 was reverted. Earlier, the env variable was read into a static var in C++, causing state to be retained and causing test failures. The static qualifier is removed in this PR.
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D33321435
fbshipit-source-id: 6d108eb00cac9150a142ccc3c9a65a1867dd7de4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67368
This PR adds an additional test variant for the tensor conversion
functions (bfloat16, char, long, ...) that tests channels_last. This is
because some backends (mostly just functorch right now) don't have
channels last handling and may want to test that separately from the
more general case of these operations.
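A hedged sketch of what the channels_last variant exercises (the shape and conversion op are illustrative):
```
import torch

x = torch.rand(2, 3, 4, 5).to(memory_format=torch.channels_last)
# the conversion method is applied to a channels_last input; values must match
# the result on a contiguous copy regardless of memory format
torch.testing.assert_close(x.double(), x.contiguous().double())
```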
Test Plan: - wait for tests
Reviewed By: mruberry
Differential Revision: D31972959
Pulled By: zou3519
fbshipit-source-id: 68fea46908b2cdfeb0607908898bb8f9ef25b264
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66990
NNC fusion groups currently show up as "TensorExpr" in the profiler,
which is true but not super useful since it obscures what's actually happening
in the fusion group. This change will log them as `fused_XXX` where XXX is a
(length-limited) series of ops describing the subgraph, for instance
`fused_mul_add` to represent a group containing `aten::mul`, `aten::add`.
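For example, profiling a warmed-up scripted function would now report the fusion group under such a name (a hedged sketch; whether fusion actually triggers depends on the device and fuser settings):
```
import torch

@torch.jit.script
def f(x, y):
    return x * y + 1.0

f(torch.rand(8), torch.rand(8))  # warm-up runs so the fuser can specialize
f(torch.rand(8), torch.rand(8))
with torch.autograd.profiler.profile() as prof:
    f(torch.rand(8), torch.rand(8))
print(prof.key_averages().table())
```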
Test Plan: New unit test to check the output of autograd profiler.
Reviewed By: dzhulgakov
Differential Revision: D31762087
fbshipit-source-id: 3fadbdc67b054faa01aa42e5b6ea2c4a6bc3481f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64282
OpInfos for:
- Tensor.bfloat16, Tensor.bool, Tensor.byte, Tensor.char
- Tensor.double, Tensor.float, Tensor.half, Tensor.int
- Tensor.short, Tensor.long
None of these are supported by TorchScript. Also, the OpInfo autograd
test runner assumes that the operation is not allowed to change the
dtype of the argument, so only Tensor.double has
`supports_autograd=True` (in theory Tensor.bfloat16, Tensor.float,
Tensor.half should be differentiable).
Test Plan: - run tests
Reviewed By: dagitses
Differential Revision: D31452627
Pulled By: zou3519
fbshipit-source-id: b7f272e558558412c47aefe947af7f060dfb45c5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65014
ghstack-source-id: 138656948
Test Plan:
```
(pytorch) [maxren@devvm3115.atn0 ~/pytorch] python3 test/test_jit.py TestPeephole
CUDA not available, skipping tests
monkeytype is not installed. Skipping tests for Profile-Directed Typing
........s......................
----------------------------------------------------------------------
Ran 31 tests in 0.393s
OK (skipped=1)
(pytorch) [maxren@devvm3115.atn0 ~/pytorch] python3 test/test_jit.py TestPeephole.test_normalized_rsub
CUDA not available, skipping tests
monkeytype is not installed. Skipping tests for Profile-Directed Typing
.
----------------------------------------------------------------------
Ran 1 test in 0.015s
OK
```
Reviewed By: eellison
Differential Revision: D30941389
fbshipit-source-id: 03f0416d99090845c9bfb1e5fcf771d5f1d7a050
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64589
Adds a softplus operator lowering for NNC, enabling elementwise fusion for it as well.
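A hedged sketch of the kind of elementwise chain this lowering allows NNC to fuse (whether fusion occurs depends on fuser settings and device):
```
import torch
import torch.nn.functional as F

@torch.jit.script
def f(x):
    return F.softplus(x) * 2.0 + 1.0

f(torch.rand(1024))  # warm-up; later calls can use the fused kernel
f(torch.rand(1024))
```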
Test Plan: Added a test in test_jit_fuser.py
Reviewed By: bertmaher
Differential Revision: D30736449
fbshipit-source-id: 6c5fc3bceb5cef2322ecd4449f827e4af018ea93
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63516
How to review: check that the generated inputs are a good representation of the op semantics; that should be sufficient for correctness. As a bonus, you can also double-check the op size semantics by going to https://codebrowser.bddppq.com/pytorch/pytorch/, typing in native::{op_name}, and looking at the op implementation.
Test Plan: Imported from OSS
Reviewed By: driazati
Differential Revision: D30738143
Pulled By: eellison
fbshipit-source-id: c7cd01cb2c8a13cb2664415f3d98aedec19a8e07