Commit Graph

75 Commits

Author SHA1 Message Date
lezcano
bef61225c3 [decompositions] add decomposition for RNN with packed sequence (#91281)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91281
Approved by: https://github.com/zou3519
2023-02-08 14:16:30 +00:00
lezcano
e5f6e1f660 [decompositions] add LSTM decomp (#91124)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91124
Approved by: https://github.com/zou3519
2023-02-08 14:16:30 +00:00
lezcano
c2a92687e0 [decompositions] add RNN decomp and testing (#91123)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91123
Approved by: https://github.com/zou3519
2023-02-08 14:16:30 +00:00
Peter Bell
cee5174d44 Add test tracking operators without decompositions (#90887)
This test inspects the dispatcher directly, so it captures operators without
an `OpInfo`, including internal helper operators and backward operators that
might appear in a trace.
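For context, a rough sketch of the kind of coverage check described (not the test's actual code, which queries the dispatcher; this approximation walks the registered schemas and the Python decomposition table):

```python
# Hedged sketch: list aten overloads that have no entry in the Python
# decomposition table. The real test inspects the C++ dispatcher directly.
import torch
from torch._decomp import decomposition_table

# Collect (name, overload_name) pairs that already have a Python decomposition.
decomposed = set()
for op in decomposition_table:
    schema = getattr(op, "_schema", None)
    if schema is not None:
        decomposed.add((schema.name, schema.overload_name))

missing = []
for schema in torch._C._jit_get_all_schemas():
    if not schema.name.startswith("aten::"):
        continue
    if (schema.name, schema.overload_name) not in decomposed:
        missing.append(f"{schema.name}.{schema.overload_name or 'default'}")

print(f"{len(missing)} aten overloads have no Python decomposition")
```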
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90887
Approved by: https://github.com/ezyang
2023-01-26 01:44:42 +00:00
PyTorch MergeBot
a2da0a0b02 Revert "Add test tracking operators without decompositions (#90887)"
This reverts commit 2740daf701.

Reverted https://github.com/pytorch/pytorch/pull/90887 on behalf of https://github.com/huydhn due to Sorry for reverting your PR. We reverted https://github.com/pytorch/pytorch/pull/70988 in acdd462b1a and this test started to fail. There is probably a dependency between the two
2023-01-24 21:56:58 +00:00
Peter Bell
2740daf701 Add test tracking operators without decompositions (#90887)
This test inspects the dispatcher directly, so it captures operators without
an `OpInfo`, including internal helper operators and backward operators that
might appear in a trace.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90887
Approved by: https://github.com/ezyang
2023-01-24 17:38:27 +00:00
lezcano
66e498626c Perform first the decomposition and then the ATen function to catch in-place modifications (#92243)
Addresses https://github.com/pytorch/pytorch/pull/91672#discussion_r1070412867

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92243
Approved by: https://github.com/ezyang
2023-01-17 16:53:36 +00:00
lezcano
ea8b14f27e Add a test for decompositions that decomposes all the operations as much as possible (#87182)
This will enable more thorough testing of the decompositions than OpInfos
alone provide.
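A minimal sketch of the cross-referencing idea: a `TorchDispatchMode` that, for every aten op it sees, also runs the registered Python decomposition and compares results. This is illustrative only; the real test runs the decomposition before the ATen function to catch in-place modifications (see the commit above) and handles views, strides, and tolerances much more carefully.

```python
import torch
from torch._decomp import decomposition_table
from torch.utils._python_dispatch import TorchDispatchMode

class CrossRefDecompMode(TorchDispatchMode):
    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        real = func(*args, **kwargs)          # the actual ATen kernel
        decomp = decomposition_table.get(func)
        if decomp is not None:
            ref = decomp(*args, **kwargs)     # the Python decomposition
            torch.testing.assert_close(ref, real, check_stride=False)
        return real

with CrossRefDecompMode():
    torch.nn.functional.gelu(torch.randn(8))
```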

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87182
Approved by: https://github.com/ezyang
2023-01-17 16:53:34 +00:00
lezcano
d162c8f92b Assorted decomposition fixes (#87183)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87183
Approved by: https://github.com/ngimel
2023-01-17 16:53:31 +00:00
Yanbo Liang
25f39c1bce Fix uniform ref implementation (#90094)
Fixes https://github.com/pytorch/torchdynamo/issues/1954

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90094
Approved by: https://github.com/ngimel
2022-12-06 21:28:17 +00:00
Animesh Jain
c1950620c5 [decomp] Fix native_batch_norm_backward dtype of dweight and dbias (#89740)
Discovered while debugging an accuracy issue for Inductor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89740
Approved by: https://github.com/soumith, https://github.com/ngimel
2022-11-29 03:15:20 +00:00
Jane Xu
8695f0cced Rectify native_batch_norm schema by splitting it into two legit schemas (#88697)
Using the same repro from the issue (but with BatchNorm2D)

Rectifies native_batch_norm schema by splitting the schema into 2:
1. one will have NON-optional alias-able running_mean and running_var inputs
2. the other will just not have those parameters at all (no_stats variation)

**Calling for name suggestions!**

## test plan
I've added tests in test_functionalization.py as well as an entry in common_method_invocations.py for `native_batch_norm_legit`.
CI should pass.

## next steps
For backward/forward compatibility (BC/FC) reasons, we reroute native_batch_norm to call our new schemas ONLY through the Python dispatcher, but in 2 weeks or so, we should make `native_batch_norm_legit` the official batch_norm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88697
Approved by: https://github.com/albanD
2022-11-23 23:23:17 +00:00
lezcano
1d6a188d08 Reland Dispatch torch.norm to linalg.vector_norm and linalg.matrix_norm (#81761) (#84624)
Reland https://github.com/pytorch/pytorch/pull/81761

Differential Revision: [D39332292](https://our.internmc.facebook.com/intern/diff/D39332292)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84624
Approved by: https://github.com/kit1980
2022-11-22 07:53:24 +00:00
lezcano
3320915303 Fix decomp for embedding_backward and simplify the decomposition of embedding_dense and embedding_dense_backward (#87204)
See the title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87204
Approved by: https://github.com/Chillee
2022-11-16 17:46:54 +00:00
lezcano
e1ecf53d84 Simplify linspace decomp and increase its tolerance (#87203)
This is an interesting one.

Since this is an operation that's intrinsically defined on the reals,
we should perform the ops on that dtype always, and just cast to
the desired dtype at the end. This simplifies the decomposition.

Now, I started looking at this one when I saw failures in a test that's added
in a later PR. What's going on here is that, by upcasting to a higher dtype
and then casting down to integers, there is sometimes an off-by-one error. I
think this is fine, as the decomposition is more accurate than the original
function, which is in line with the whole PrimTorch effort.
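To make the idea concrete, a hedged sketch of the approach (names and details are illustrative, not the actual `_refs` decomposition): do the arithmetic on a wide real dtype and cast to the requested dtype only at the end.

```python
import torch

def linspace_sketch(start, end, steps, *, dtype=torch.float32):
    if steps == 1:
        return torch.full((1,), start, dtype=dtype)
    # Compute on float64 regardless of the requested dtype...
    rg = torch.arange(steps, dtype=torch.float64)
    out = start + rg * ((end - start) / (steps - 1))
    # ...and cast once at the end. For integer dtypes this truncates, which is
    # where the off-by-one differences mentioned above can show up.
    return out.to(dtype)

print(linspace_sketch(0, 10, 5, dtype=torch.int64))  # tensor([ 0,  2,  5,  7, 10])
```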

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87203
Approved by: https://github.com/mruberry
2022-11-16 17:46:54 +00:00
Sherlock Huang
5faa2792fa Symintify decomps for split and upsample_bilinear; Fix decomp for _softmax_backward_data and native_dropout_backward (#88761)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88761
Approved by: https://github.com/ezyang
2022-11-15 13:34:45 +00:00
PyTorch MergeBot
eea506aee1 Revert "Symintify decomps for split and upsample_bilinear; Fix decomp for _softmax_backward_data and native_dropout_backward (#88761)"
This reverts commit 9eabcc370f.

Reverted https://github.com/pytorch/pytorch/pull/88761 on behalf of https://github.com/suo due to much broken 9eabcc370f
2022-11-14 01:58:47 +00:00
Sherlock Huang
9eabcc370f Symintify decomps for split and upsample_bilinear; Fix decomp for _softmax_backward_data and native_dropout_backward (#88761)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88761
Approved by: https://github.com/ezyang
2022-11-13 21:30:53 +00:00
Fabio Rocha
652af5ec15 upsample_*.vec ops are now CompositeImplicit (#85638)
They were previously CompositeExplicit, but that was not really necessary.
See discussion in https://github.com/pytorch/pytorch/issues/85405

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85638
Approved by: https://github.com/ezyang, https://github.com/lezcano, https://github.com/malfet, https://github.com/jansel
2022-11-09 09:58:04 +00:00
soulitzer
4c20c0509d Split out forward AD tests from test_ops_gradients and reenable slow gradcheck CI (#88216)
Fixes: https://github.com/pytorch/pytorch/issues/88010

This PR does a couple things to stop slow gradcheck from timing out:
- Splits out test_ops_fwd_gradients from test_ops_gradients, and factors out TestFwdGradients and TestBwdGradients which both inherit from TestGradients, now situated in common_utils (maybe there is a better place?)
- Skips CompositeCompliance (and several other test files) for slow gradcheck CI since they do not use gradcheck
- because test times for test_ops_fwd_gradients and test_ops_gradients are either unknown or wrong, we hardcode them for now to prevent them from being put together. We can undo the hack after we see actual test times are updated. ("def calculate_shards" randomly divides tests with unknown test times in a round-robin fashion.)
- Updates references to test_ops_gradients and TestGradients
- Test files that are skipped for slow gradcheck CI are now centrally located in run_tests.py. This reduces how fine-grained we can be with the skips, so for some skips (one so far) we still use the old skipping mechanism, e.g. for test_mps

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88216
Approved by: https://github.com/albanD
2022-11-03 00:20:45 +00:00
lezcano
faf9c47abb Simplify a few diagonal-related functions (#87180)
`diag` was unnecessarily implemented as a kernel rather than as a composite
function, which made it needlessly complex (an explicit backward and everything
that entails).

We also change a few uses of `diag` on 2D tensors to `diagonal()`. The
latter returns a view rather than creating a new tensor.

We also upgrade its meta implementation to a fully-fledged decomposition.

I tried implementing the backwards of `diagonal()` via `diag_scatter` (or better `diag_scatter_` to keep the perf) but functionalisation was failing and I was not sure how to fix this, so I moved on. It may be possible to simplify that one as well if @soulitzer or someone knows how to do this.
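A small illustration of the view-vs-copy distinction mentioned above (plain PyTorch, not the decomposition itself):

```python
# diagonal() returns a view of the input, while diag() on a 2D tensor
# materializes a new tensor, so in-place changes to the input only show up
# through the view.
import torch

x = torch.arange(9.0).reshape(3, 3)
d_view = x.diagonal()    # view into x
d_copy = torch.diag(x)   # freshly allocated copy

x[0, 0] = 100.0
print(d_view[0])   # tensor(100.) -- sees the change
print(d_copy[0])   # tensor(0.)   -- does not
```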
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87180
Approved by: https://github.com/ngimel, https://github.com/albanD, https://github.com/mruberry
2022-10-24 06:11:53 +00:00
Peter Bell
6eeeb88172 OpInfo: Sample input cleanup (4/n) (#86324)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86324
Approved by: https://github.com/mruberry
2022-10-19 21:25:45 +00:00
PyTorch MergeBot
317eeb81c3 Revert "OpInfo: Sample input cleanup (4/n) (#86324)"
This reverts commit 2a6d37d23d.

Reverted https://github.com/pytorch/pytorch/pull/86324 on behalf of https://github.com/peterbell10 due to Caused tolerance issues in periodic test
2022-10-17 18:26:59 +00:00
Peter Bell
2a6d37d23d OpInfo: Sample input cleanup (4/n) (#86324)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86324
Approved by: https://github.com/mruberry
2022-10-16 19:12:44 +00:00
Khushi Agrawal
77d29bcee2 [primTorch] special: ndtr, ndtri, log_ndtr, erfcx (#86077)
- Adds prims and _refs for `erfcx` and `ndtri`.
- Adds _refs for `ndtr`, and `log_ndtr`.

cc @kshitij12345 @lezcano @mruberry
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86077
Approved by: https://github.com/mruberry
2022-10-13 01:18:30 +00:00
Jane Xu
6923dc3b59 Add module: decompositions as an owner to test_decomp.py (#86703)
so flaky tests can be attributed to @SherlockNoMad too 😛
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86703
Approved by: https://github.com/albanD
2022-10-11 17:23:36 +00:00
Peter Bell
3ec71fce79 Improve make_tensor performance for float and complex types (#85473)
For floating types, `make_tensor` calls `rand` and then does a linear
interpolation from `low` to `high`. This instead calls `uniform_(low,
high)` to cut out the interpolation step.

For complex types, `make_tensor` does the `rand` + interpolation step
twice and calls `torch.complex(real, imag)` at the end. This instead
uses `view_as_real` and `uniform_(low, high)` to fuse it all into one
operation.
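A minimal sketch of that technique (not the actual `make_tensor` implementation, and only covering float and complex dtypes): fill in place with `uniform_(low, high)`, and for complex dtypes fill the `view_as_real` view so one kernel covers both components.

```python
import torch

def uniform_tensor(shape, dtype, device, low=0.0, high=1.0):
    # Allocate uninitialized storage, then fill it in place instead of
    # rand() followed by a separate interpolation from low to high.
    t = torch.empty(shape, dtype=dtype, device=device)
    if t.is_complex():
        # One uniform_ call over the real view fills real and imaginary parts.
        torch.view_as_real(t).uniform_(low, high)
    else:
        t.uniform_(low, high)
    return t

t = uniform_tensor((4, 4), torch.complex64, "cpu", low=-1.0, high=1.0)
```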

My benchmarks show significant speedups in all cases for float32 and
complex64.

| Device | dtype     | Size  | Master (us) | This PR (us) | Speedup |
|--------|-----------|-------|-------------|--------------|---------|
| CPU    | float32   | 8     | 19.4        | 6.34         | 3.1     |
|        |           | 4096  | 36.8        | 21.3         | 1.7     |
|        |           | 2**24 | 167,000     | 80,500       | 2.1     |
|        | complex32 | 8     | 37.0        | 7.57         | 4.9     |
|        |           | 4096  | 73.1        | 37.6         | 1.9     |
|        |           | 2**24 | 409,000     | 161,000      | 2.5     |
| CUDA   | float32   | 8     | 40.4        | 11.7         | 3.5     |
|        |           | 4096  | 38.7        | 11.7         | 3.3     |
|        |           | 2**24 | 2,300       | 238          | 9.7     |
|        | complex32 | 8     | 78.7        | 14           | 5.6     |
|        |           | 4096  | 82.7        | 13.8         | 6.0     |
|        |           | 2**24 | 5,520       | 489          | 11.3    |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85473
Approved by: https://github.com/mruberry
2022-10-05 17:05:20 +00:00
PyTorch MergeBot
6db3539e70 Revert "Improve make_tensor performance for float and complex types (#85473)"
This reverts commit a76995e584.

Reverted https://github.com/pytorch/pytorch/pull/85473 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but it seems to cause a bunch of flaky tests in pull and periodic
2022-09-29 20:06:52 +00:00
Peter Bell
a76995e584 Improve make_tensor performance for float and complex types (#85473)
For floating types, `make_tensor` calls `rand` and then does a linear
interpolation from `low` to `high`. This instead calls `uniform_(low,
high)` to cut out the interpolation step.

For complex types, `make_tensor` does the `rand` + interpolation step
twice and calls `torch.complex(real, imag)` at the end. This instead
uses `view_as_real` and `uniform_(low, high)` to fuse it all into one
operation.

My benchmarks show significant speedups in all cases for float32 and
complex64.

| Device | dtype     | Size  | Master (us) | This PR (us) | Speedup |
|--------|-----------|-------|-------------|--------------|---------|
| CPU    | float32   | 8     | 19.4        | 6.34         | 3.1     |
|        |           | 4096  | 36.8        | 21.3         | 1.7     |
|        |           | 2**24 | 167,000     | 80,500       | 2.1     |
|        | complex32 | 8     | 37.0        | 7.57         | 4.9     |
|        |           | 4096  | 73.1        | 37.6         | 1.9     |
|        |           | 2**24 | 409,000     | 161,000      | 2.5     |
| CUDA   | float32   | 8     | 40.4        | 11.7         | 3.5     |
|        |           | 4096  | 38.7        | 11.7         | 3.3     |
|        |           | 2**24 | 2,300       | 238          | 9.7     |
|        | complex32 | 8     | 78.7        | 14           | 5.6     |
|        |           | 4096  | 82.7        | 13.8         | 6.0     |
|        |           | 2**24 | 5,520       | 489          | 11.3    |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85473
Approved by: https://github.com/mruberry
2022-09-29 11:46:09 +00:00
Animesh Jain
796da4df4d Return contiguous tensor from softmax decomposition (#85788)
Fixes https://github.com/pytorch/torchdynamo/issues/1135

The softmax decomp's output stride does not match the aten softmax output stride. Not sure if that's desirable; opening a PR for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85788
Approved by: https://github.com/ngimel, https://github.com/ezyang
2022-09-28 20:52:45 +00:00
Peter Bell
29c78266c0 test_decomp.py: Skip tests for embedding_backward bf16 (#84554)
`embedding_backward`'s decomposition is less accurate for bf16.
Currently bfloat16 is skipped in both forward and backward, but the
forward decomposition matches 1-to-1 with the ATen implementation, so this
re-enables the test for the forward decomposition.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84554
Approved by: https://github.com/albanD
2022-09-28 19:32:54 +00:00
Edward Z. Yang
793488cda2 Revert "Revert "Symintifying slice ops (#85196)"" (#85746)
This reverts commit 3a171dfb0c.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85746
Approved by: https://github.com/albanD
2022-09-28 04:37:35 +00:00
PyTorch MergeBot
3a171dfb0c Revert "Symintifying slice ops (#85196)"
This reverts commit 4c01c51266.

Reverted https://github.com/pytorch/pytorch/pull/85196 on behalf of https://github.com/atalman due to breaking the internal Exutorch build
2022-09-27 18:01:27 +00:00
samdow
18d8c548f4 [Modes] remove enable and rewrite mode stack (squashed) (#84774)
Based on @ezyang's suggestion, the mode stack now has "one true mode", which is the _only_ mode that can ever be active at the C++ level. That mode's torch dispatch just takes the top mode in the stack, reenables itself (if we aren't at the end of the mode stack), and runs the top mode's torch_{dispatch|function}.

This maintains that, in the middle of a mode's torch dispatch, the mode itself will not be active. It changes the function the user has to call to see what the current mode is (it no longer queries the C++; it's Python-only) but also lets the user see the entire mode stack easily.

Removes `enable_torch_dispatch_mode` and `.restore()` since neither makes sense in this new setup

### Background
Why do we want this? Well, a pretty common pattern that was coming up was that users had to do something like

```python
## PRE-PR UX
def f(mode):
  with mode.restore():  # user needs to understand this restore thing?
    ...

with Mode() as m:
  pass
f(m)
```

Many users were getting errors from forgetting to call `.restore` or from forgetting to add the (tbh weird) "mode instantiation" step where they use the mode as a context manager with an empty body. Really, they wanted to treat modes like context managers and just write
```python
## FROM FEEDBACK, USER DESIRED CODE. POSSIBLE POST-PR
def f(mode):
  with mode:
    ...
f(Mode())
```

**Technical Details**
With the old mode stack, we basically had a linked list, so the mode itself could only be used once and had a fixed parent. In this new design, the mode stack is just a Python list that we push to and pop from. There's only one mode that's ever active at the C++ level, and it runs the next mode in the Python list. The modes don't carry state anymore.
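A toy sketch of that design in plain Python (no relation to the real C++ implementation): the stack is just a list, modes are ordinary context managers, and the currently-running mode is popped off while its handler executes.

```python
_mode_stack = []

class Mode:
    def __enter__(self):
        _mode_stack.append(self)
        return self

    def __exit__(self, *exc):
        _mode_stack.pop()

    def handle(self, func, args):
        raise NotImplementedError

def dispatch(func, *args):
    # The "one true mode": pop the top user mode so it is not active while its
    # own handler runs, then restore it afterwards.
    if not _mode_stack:
        return func(*args)
    top = _mode_stack.pop()
    try:
        return top.handle(func, args)
    finally:
        _mode_stack.append(top)

class LoggingMode(Mode):
    def handle(self, func, args):
        print("intercepted", func.__name__)
        return func(*args)

with LoggingMode():
    dispatch(sum, [1, 2, 3])   # prints "intercepted sum", returns 6
```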
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84774
Approved by: https://github.com/ezyang, https://github.com/zou3519
2022-09-27 01:04:35 +00:00
Fabio Rocha
d5ce2bbed2 [primTorch] decompositions for upsample_bicubic2d (#85403)
FYI, this decomposition seems to be significantly slower than the lowering in torchinductor:

```
------------------------------------- upsample_bicubic2d -------------------------------------]
                                                              |  lowering  |  Inductor  |  Eager
32 threads: ------------------------------------------------------------------------------------
      (torch.Size([16, 4, 128, 256]),), ((512, 1024), True)   |    1.8     |   3.880    |   1.4
      (torch.Size([16, 4, 128, 256]),), ((512, 1024), False)  |    1.9     |   3.887    |   1.4
```

This seems related to the fact that in the lowering we can use int32s as the indices and in the decomp we can only use int64s (see https://github.com/pytorch/torchdynamo/issues/1293).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85403
Approved by: https://github.com/ngimel
2022-09-26 20:11:23 +00:00
Fabio Rocha
ffaff8896a Removed None arg check in test/test_decomp.py (#85402)
It's not clear why this check was necessary; tests seem to run fine without it.
There were definitely tests it was skipping that it shouldn't have, e.g.,
pretty much all of the tests for `torch.nn.functional.interpolate`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85402
Approved by: https://github.com/ezyang
2022-09-24 11:37:27 +00:00
Edward Z. Yang
4c01c51266 Symintifying slice ops (#85196)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85196
Approved by: https://github.com/ezyang
2022-09-23 22:01:32 +00:00
Horace He
2f4a517d67 Ported matmul compositeimplicitautograd impl into core (#85239)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85239
Approved by: https://github.com/ezyang, https://github.com/lezcano
2022-09-21 09:25:24 +00:00
Horace He
4bdc0af53d Added support for symbolic is_contiguous (#84829)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84829
Approved by: https://github.com/ezyang
2022-09-16 04:54:01 +00:00
Horace He
1459a909b4 Added mv, mm, and binary_cross_entropy_with_logits decomps (#84451)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84451
Approved by: https://github.com/ngimel
2022-09-08 17:56:18 +00:00
PyTorch MergeBot
166dec74b5 Revert "Dispatch torch.norm to linalg.vector_norm and linalg.matrix_norm (#81761)"
This reverts commit 65beff5acb.

Reverted https://github.com/pytorch/pytorch/pull/81761 on behalf of https://github.com/mehtanirav due to Breakages in pytorch/glow
2022-09-06 22:31:14 +00:00
Fabio Rocha
91a5f52f51 Decomp for nn.functional.grid_sampler_2d (#84350)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84350
Approved by: https://github.com/jansel, https://github.com/Lezcano
2022-09-05 21:33:26 +00:00
lezcano
65beff5acb Dispatch torch.norm to linalg.vector_norm and linalg.matrix_norm (#81761)
`torch.norm` is very odd. Some notable issues are:

- The default value of `"fro"` in `torch.norm` has an odd behaviour when `dim=None`. This is handled in the new dispatch.
- The treatment of the `dtype` argument in `torch.norm` was completely wrong. This should fix it.
- Some `out=` variants in the previous implementation were also wrong. This should fix those.
- This new dispatch should make some paths much faster. For example, `torch.norm(x)` where `x` is complex.

I'll try to make the changes in these PRs as incremental as possible as this is a tricky one.
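A rough illustration of the mapping (the exact dispatch rules live in the PR; these equivalences are assumptions about the common cases):

```python
import torch

x = torch.randn(3, 4)

# Default torch.norm flattens the input and takes the 2-norm, which matches
# linalg.vector_norm's default.
assert torch.allclose(torch.norm(x), torch.linalg.vector_norm(x))

# An explicit Frobenius norm over two dims matches linalg.matrix_norm.
assert torch.allclose(
    torch.norm(x, p="fro", dim=(0, 1)),
    torch.linalg.matrix_norm(x, ord="fro"),
)
```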
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81761
Approved by: https://github.com/ngimel
2022-09-02 19:12:25 +00:00
Elias Ellison
f701cb04fb Test Dynamo CI w Fake Tensors (#84282)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84282
Approved by: https://github.com/anijain2305
2022-09-01 00:15:05 +00:00
Edward Z. Yang
ad44670fa1 Back out "Revert D38984222: Don't introduce new overload for SymInt (#83628)" (#84173)
Also Back out "Revert D39075159: [acc_tensor] Use SymIntArrayRef for overloaded empty.memory_format's signature"

Original commit changeset: dab4a9dba4fa
Original commit changeset: dcaf16c037a9

Original Phabricator Diff: D38984222
Original Phabricator Diff: D39075159

Also update Metal registrations for C++ registration changes.

Also update NNPI registration to account for tightened schema checking

Differential Revision: [D39084762](https://our.internmc.facebook.com/intern/diff/D39084762/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39084762/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84173
Approved by: https://github.com/Krovatkin
2022-08-29 18:01:07 +00:00
Ivan Yashchuk
3aae6ff1e1 Add nvprims.var_mean (#83508)
This PR adds an nvfuser-specific primitive, `var_mean`.
The interpretation `torch.var_mean` -> `torch.ops.nvprims.var_mean` is handled by the `TorchRefsNvfuserCapabilityMode` context manager.

I moved some helper code from `_prims/__init__.py` to `_prims_common`. Correctness is tested with OpInfo tests (see `PythonRefInfo("ops.nvprims.var_mean"`).

Layer norm reference now uses `torch.var_mean` instead of `torch._refs.var_mean` to allow interception. Here's a simple comparison of performance with this PR and master (on 3080ti):
```py
import torch
from torch._prims.context import TorchRefsNvfuserCapabilityMode
from torch.fx.experimental.proxy_tensor import make_fx
from torch._prims.executor import execute

def func(a):
    return torch.native_layer_norm(a, (1024,), None, None, 1e-6)

a = torch.randn(10, 512, 1024, dtype=torch.float16, device="cuda")

with TorchRefsNvfuserCapabilityMode():
    gm = make_fx(func)(a)

for _ in range(10):
    execute(gm, a, executor="strictly_nvfuser");
```
run with `PYTORCH_NVFUSER_DUMP=dump_eff_bandwidth python script.py`
```py
# WITH THIS PR
# kernel1 run in 0.032768 ms, achieved: 641.25 GB/s
# kernel1 run in 0.033792 ms, achieved: 621.818 GB/s
# kernel1 run in 0.032768 ms, achieved: 641.25 GB/s
# kernel1 run in 0.032608 ms, achieved: 644.396 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s
# kernel1 run in 0.032768 ms, achieved: 641.25 GB/s
# kernel1 run in 0.03072 ms, achieved: 684 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s

# ON MASTER
# kernel1 run in 0.05632 ms, achieved: 373.091 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.043808 ms, achieved: 479.649 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
```
So this PR gives about a 35% performance improvement using the nvfuser executor with this specific normalized shape.

Also this PR fixes https://github.com/pytorch/pytorch/issues/83506 (see the change in `torch/csrc/jit/python/pybind_utils.cpp`).

Ref. https://github.com/pytorch/pytorch/issues/80187

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83508
Approved by: https://github.com/ngimel
2022-08-28 18:45:25 +00:00
PyTorch MergeBot
b159a5230f Revert "Add nvprims.var_mean (#83508)"
This reverts commit 7e7694b661.

Reverted https://github.com/pytorch/pytorch/pull/83508 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally
2022-08-28 11:30:27 +00:00
Ivan Yashchuk
7e7694b661 Add nvprims.var_mean (#83508)
This PR adds an nvfuser-specific primitive, `var_mean`.
The interpretation `torch.var_mean` -> `torch.ops.nvprims.var_mean` is handled by the `TorchRefsNvfuserCapabilityMode` context manager.

I moved some helper code from `_prims/__init__.py` to `_prims_common`. Correctness is tested with OpInfo tests (see `PythonRefInfo("ops.nvprims.var_mean"`).

Layer norm reference now uses `torch.var_mean` instead of `torch._refs.var_mean` to allow interception. Here's a simple comparison of performance with this PR and master (on 3080ti):
```py
import torch
from torch._prims.context import TorchRefsNvfuserCapabilityMode
from torch.fx.experimental.proxy_tensor import make_fx
from torch._prims.executor import execute

def func(a):
    return torch.native_layer_norm(a, (1024,), None, None, 1e-6)

a = torch.randn(10, 512, 1024, dtype=torch.float16, device="cuda")

with TorchRefsNvfuserCapabilityMode():
    gm = make_fx(func)(a)

for _ in range(10):
    execute(gm, a, executor="strictly_nvfuser");
```
run with `PYTORCH_NVFUSER_DUMP=dump_eff_bandwidth python script.py`
```py
# WITH THIS PR
# kernel1 run in 0.032768 ms, achieved: 641.25 GB/s
# kernel1 run in 0.033792 ms, achieved: 621.818 GB/s
# kernel1 run in 0.032768 ms, achieved: 641.25 GB/s
# kernel1 run in 0.032608 ms, achieved: 644.396 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s
# kernel1 run in 0.032768 ms, achieved: 641.25 GB/s
# kernel1 run in 0.03072 ms, achieved: 684 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s

# ON MASTER
# kernel1 run in 0.05632 ms, achieved: 373.091 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.043808 ms, achieved: 479.649 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
```
So this PR gives about a 35% performance improvement using the nvfuser executor with this specific normalized shape.

Also this PR fixes https://github.com/pytorch/pytorch/issues/83506 (see the change in `torch/csrc/jit/python/pybind_utils.cpp`).

Ref. https://github.com/pytorch/pytorch/issues/80187

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83508
Approved by: https://github.com/ngimel
2022-08-27 09:05:20 +00:00
PyTorch MergeBot
c7edcd6968 Revert "Don't introduce new overload for SymInt (#83628)"
This reverts commit 9790d90e4b.

Reverted https://github.com/pytorch/pytorch/pull/83628 on behalf of https://github.com/malfet due to Breaks internal builds, see D39076487
2022-08-27 01:23:17 +00:00
Edward Z. Yang
9790d90e4b Don't introduce new overload for SymInt (#83628)
Previously, we introduced new SymInt overloads for every function we wanted.  This led to a lot of boilerplate, and also a lot of confusion about how the overloads needed to be implemented.

This PR takes a simpler but more risky approach: just take the original function and changes its ints to SymInts.

This is BC-breaking in the following ways:

* The C++ API for registering implementations for aten operators will change from int64_t to SymInt whenever you make this change. Code-generated registrations in PyTorch do not change, as codegen handles the translation automatically, but manual registrations will need to follow the change. Typically, if you now accept a SymInt where you previously only took int64_t, you have to convert it back manually. This will definitely break XLA; see companion PR https://github.com/pytorch/xla/pull/3914. Note that not all dispatch keys get the automatic translation; all the composite keys and Meta keys are modified to take SymInt directly (because they should handle them directly), and so there are adjustments for this.

This is not BC-breaking in the following ways:

* The user-facing C++ API remains compatible. Even if a function changes from int to SymInt, the default C++ binding still takes only ints (e.g., at::empty(IntArrayRef, ...)). To call with SymInts, you must call at::empty_symint instead. This involved adding two more signatures to CppSignatureGroup; in many cases I refactored code to iterate over all signatures in the group instead of hard-coding the two that previously existed.
* This is TorchScript compatible; internally we treat SymInts as ints so there is no change to what happens at runtime in TorchScript. In particular, it's OK to reference an empty schema by its old type (using int types): as long as you're not doing string equality (which you shouldn't be), these parse to the same underlying type.

Structure of the PR:

* The general strategy of this PR is that, even when you write `SymInt` inside `native_functions.yaml`, sometimes, we will treat it *as if* it were an `int`. This idea pervades the codegen changes, where we have a translation from SymInt to c10::SymInt or int64_t, and this is controlled by a symint kwarg which I added and then audited all call sites to decide which I wanted. Here are some of the major places where we pick one or the other:
  * The C++ FunctionSchema representation represents `SymInt` as `int`. There are a few places we do need to know that we actually have a SymInt and we consult `real_type()` to get the real type in this case. In particular:
    * When we do schema validation of C++ operator registration, we must compare against the true schema (as the C++ API will provide `c10::SymInt`, and this will only be accepted if the schema is `SymInt`). This is handled with cloneWithRealTypes before we check for schema differences.
    * In `toIValue` argument parsing, we parse against the true schema value. For backwards compatibility reasons, I do still accept ints in many places where Layout/SymInt/etc were expected. (Well, accepting int where SymInt is expected is not BC, it's just the right logic!)
  * In particular, because SymInt never shows up as type() in FunctionSchema, this means that we no longer need a dedicated Tag::SymInt. This is good, because SymInts never show up in mobile anyway.
* Changes to functorch/aten are mostly about tracking changes to the C++ API registration convention. Additionally, since SymInt overloads no longer exist, registrations for SymInt implementations are deleted. In many cases, the old implementations did not properly support SymInts; I did not add any new functionality with this PR, but I did try to annotate with TODOs where there is work to do. Finally, because the signature of the `native::` API changed from int to SymInt, I need to find alternative APIs for people who were directly calling these functions. Typically, I insert a new dispatch call when perf doesn't matter, or use the `at::compositeexplicitautograd` namespace to handle other cases.
* The change to `make_boxed_from_unboxed_functor.h` is so that we accept a plain IntList IValue anywhere a SymIntList is expected; these are read-only arguments so covariant typing is OK.
* I change how unboxing logic works slightly. Previously, we interpret the C++ type for Layout/etc directly as IntType JIT type, which works well because the incoming IValue is tagged as an integer. Now, we interpret the C++ type for Layout as its true type, e.g., LayoutType (change to `jit_type.h`), but then we accept an int IValue for it anyway. This makes it symmetric with SymInt, where we interpret the C++ type as SymIntType, and then accept SymInt and int IValues for it.
* I renamed the `empty.names` overload to `empty_names` to make it less confusing (I kept mixing it up with the real empty overload)
* I deleted the `empty.SymInt` overload, which ended up killing a pile of functions. (This was originally a separate PR but the profiler expect test was giving me grief so I folded it in.)
* I deleted the LazyDynamicOpsTest tests. These were failing after these changes, and I couldn't figure out why they used to be passing: they make use of `narrow_copy` which didn't actually support SymInts; they were immediately converted to ints.
* I bashed LTC into working. The patches made here are not the end of the story. The big problem is that SymInt translates into Value, but what if you have a list of SymInt? This cannot be conveniently represented in the IR today, since variadic Values are not supported. To work around this, I translate SymInt[] into plain int[] (this is fine for tests because LTC dynamic shapes never actually worked); but this will need to be fixed for proper LTC SymInt support. The LTC codegen also looked somewhat questionable; I added comments based on my code reading.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83628
Approved by: https://github.com/albanD, https://github.com/bdhirsh
2022-08-26 01:35:40 +00:00