Commit Graph

412 Commits

Author SHA1 Message Date
PyTorch MergeBot
c0996866f4 Revert "Change ATEN generator argument type to const std::optional<Generator>& (#120076)"
This reverts commit 4305c64fea.

Reverted https://github.com/pytorch/pytorch/pull/120076 on behalf of https://github.com/izaitsevfb due to breaking internal builds(take 3) ([comment](https://github.com/pytorch/pytorch/pull/120076#issuecomment-1986338164))
2024-03-08 20:01:03 +00:00
cyy
4305c64fea Change ATEN generator argument type to const std::optional<Generator>& (#120076)
This PR proposes to use std::optional<Generator>& for underlying functions to avoid unnecessary copy and move operations. The torchgen code was changed to generate the new type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120076
Approved by: https://github.com/malfet
2024-03-07 09:52:21 +00:00
Bin Bao
bd19d6d822 [AOTI] Use torchgen to generate C shim functions (#120513)
Summary: The current C shim layer manually implements a C interface for a handful of ops. Obviously that's not scalable if we want to extend it to cover all aten ops. This new torchgen script automatically generates C shim interfaces for CPU and CUDA backends. The interface follows the same parameter passing rules as the current C shim layer, such as

* Use plain C data types to pass parameters
* Use AtenTensorHandle to pass at::Tensor
* Use pointer type to pass optional parameter
* Use pointer+length to pass list
* Use device_type+device_index to pass device
* When a parameter is a pointer of pointer, e.g. AtenTensorHandle**, the script generates either a list of optional values or an optional list of values

https://gist.github.com/desertfire/83701532b126c6d34dae6ba68a1b074a is an example of the generated torch/csrc/inductor/aoti_torch/generated/c_shim_cuda.cpp file. The current version doesn't generate C shim wrappers for all aten ops, and probably generates more wrappers than needed on the other hand, but it should serve as a good basis.

This PR by itself won't change AOTI codegen and thus won't introduce any FC breakage. The actual wrapper codegen changes will come in another PR with some version control flag to avoid FC breakage.

Differential Revision: [D54258087](https://our.internmc.facebook.com/intern/diff/D54258087)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120513
Approved by: https://github.com/jansel
2024-03-05 04:28:44 +00:00
Jacob Szwejbka
a7c799fb85 [executorch] Add support for method variants in aten executorch code gen (#121016)
Summary: Title.

Test Plan: The added unittest

Differential Revision: D54423028

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121016
Approved by: https://github.com/larryliu0820
2024-03-01 20:33:02 +00:00
Pearu Peterson
70d4d109f2 Make SparseCsr a functionality dispatch key (#120703)
As in the title.

To enable meta and fake tensor support for sparse compressed tensors in compliance with the meta/fake tensor support for sparse COO tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120703
Approved by: https://github.com/ezyang
2024-03-01 13:28:46 +00:00
angelayi
f064dec7e0 Add torch.ops.aten.print (#120295)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120295
Approved by: https://github.com/zou3519
2024-02-27 01:34:59 +00:00
PyTorch MergeBot
b01bd1f7a1 Revert "Add torch.ops.aten.print (#120295)"
This reverts commit 3b944113c8.

Reverted https://github.com/pytorch/pytorch/pull/120295 on behalf of https://github.com/kit1980 due to breaking internal builds, see D54123688 ([comment](https://github.com/pytorch/pytorch/pull/120295#issuecomment-1965618191))
2024-02-27 01:18:48 +00:00
PyTorch MergeBot
8a32a07856 Revert "Add meta device support to sparse compressed tensors (#120498)"
This reverts commit 5d71ba6885.

Reverted https://github.com/pytorch/pytorch/pull/120498 on behalf of https://github.com/zou3519 due to broke CI ([comment](https://github.com/pytorch/pytorch/pull/120498#issuecomment-1964491999))
2024-02-26 15:59:36 +00:00
Pearu Peterson
5d71ba6885 Add meta device support to sparse compressed tensors (#120498)
As in the title.

Unblocks https://github.com/pytorch/pytorch/pull/117907#discussion_r1499251745

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120498
Approved by: https://github.com/ezyang
2024-02-25 16:50:17 +00:00
Aaron Gokaslan
33938cfddd [BE][Ez] Update ruff to 0.2.2 (#120517)
Updates ruff to 0.2.2. This updates the config and handles some of the new rules that have come out of preview.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120517
Approved by: https://github.com/albanD
2024-02-24 07:13:53 +00:00
Isuru Fernando
c3496d50f0 Fix torch.return_types init signature (#119284)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119284
Approved by: https://github.com/peterbell10, https://github.com/XuehaiPan
2024-02-23 21:52:34 +00:00
angelayi
3b944113c8 Add torch.ops.aten.print (#120295)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120295
Approved by: https://github.com/zou3519
2024-02-23 17:01:22 +00:00
Yu, Guangye
5c46600f84 [RELAND] refactor lazy init to device-agnostic (#119248)
# Motivation
This PR intends to extend `cuda_lazy_init` to `device_lazy_init` which is a device-agnostic API that can support any backend. And change `maybe_initialize_cuda` to `maybe_initialize_device` to support lazy initialization for CUDA while maintaining scalability.

# Design
We maintain a flag for each backend to manage the lazy initialization state separately.

# Additional Context
No need more UTs.
This is a reland PR, the original PR is [refactor lazy init to device-agnostic](https://github.com/pytorch/pytorch/pull/118846).
This is a common PR, and does not trigger xpu ciflow.

Differential Revision: [D53478332](https://our.internmc.facebook.com/intern/diff/D53478332)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119248
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/atalman
2024-02-07 15:58:51 +00:00
PyTorch MergeBot
ab613a4019 Revert "refactor lazy init to device-agnostic (#118846)"
This reverts commit 520771d7b3.

Reverted https://github.com/pytorch/pytorch/pull/118846 on behalf of https://github.com/atalman due to Failing, tests https://github.com/pytorch/torchdistx/blob/main/src/python/torchdistx/_C/fake.cc#L11  ([comment](https://github.com/pytorch/pytorch/pull/118846#issuecomment-1927651305))
2024-02-05 18:06:30 +00:00
Yu, Guangye
520771d7b3 refactor lazy init to device-agnostic (#118846)
# Motivation
This PR intends to extend `cuda_lazy_init` to `device_lazy_init` which is a device-agnostic API that can support any backend. And change `maybe_initialize_cuda` to `maybe_initialize_device` to support lazy initialization for CUDA while maintaining scalability.

# Design
We maintain a flag for each backend to manage the lazy initialization state separately.

# Additional Context
No need more UTs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118846
Approved by: https://github.com/malfet
2024-02-02 12:10:39 +00:00
Catherine Lee
4f5785b6b3 Enable possibly-undefined error code (#118533)
Fixes https://github.com/pytorch/pytorch/issues/118129

Suppressions automatically added with

```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Co-authored-by: Catherine Lee <csl@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2024-01-30 21:07:01 +00:00
Aaron Gokaslan
1562dae62c [BE]: Apply RUF025 dict.fromkeys preview rule (#118637)
Simplifies and optimizes dict construction using the `fromkeys` classmethod ctor. This also makes it really obvious when all the keys will have the same static value, which could be a bug if unintentional. It is also significantly faster than using a dict comprehension. The rule is in preview, but I am adding a forward fix for when it becomes stable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118637
Approved by: https://github.com/albanD
2024-01-30 20:46:54 +00:00
PyTorch MergeBot
40ece2e579 Revert "Enable possibly-undefined error code (#118533)"
This reverts commit 4f13f69a45.

Reverted https://github.com/pytorch/pytorch/pull/118533 on behalf of https://github.com/clee2000 due to sorry i'm trying to figure out a codev merge conflict, if this works i'll be back to rebase and merge ([comment](https://github.com/pytorch/pytorch/pull/118533#issuecomment-1917695185))
2024-01-30 19:00:34 +00:00
Edward Z. Yang
4f13f69a45 Enable possibly-undefined error code (#118533)
Fixes https://github.com/pytorch/pytorch/issues/118129

Suppressions automatically added with

```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2024-01-30 05:08:10 +00:00
Masaki Kozuki
67d8db9252 Remove semicolon after return_from_mutable_noop_redispatch (#118538)
[`return_from_mutable_noop_redispatch`](65f8276bc6/torchgen/gen_functionalization_type.py (L477)) calls
[`return_str`](65f8276bc6/torchgen/gen_functionalization_type.py (L159-L166)). `return_str`'s output includes `;` so I think the semicolon after the callsite of `return_from_mutable_noop_redispatch` is not needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118538
Approved by: https://github.com/colesbury
2024-01-30 02:22:42 +00:00
Edward Z. Yang
119b66ba16 Use strict to toggle strict options in MYPYSTRICT (#118479)
As we force a specific version of mypy, it's OK to use the agglomerated flag.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118479
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #118414, #118418, #118432, #118467, #118468, #118469, #118475
2024-01-28 19:22:22 +00:00
Edward Z. Yang
46712b019d Enable local_partial_types (#118467)
When using dmypy, this setting is enabled and cannot be turned off. Force it for regular mypy too.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118467
Approved by: https://github.com/Skylion007
ghstack dependencies: #118414, #118418, #118432
2024-01-28 13:38:22 +00:00
albanD
24133e44b1 Fix return type hint for list types (#118238)
All single element list types are `Tensor[]` so they will always be Tuple.
I don't know of any way to easily access the pyi type and compare that to a real run so no testing here :(
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118238
Approved by: https://github.com/ezyang
2024-01-25 23:35:20 +00:00
Sam Larsen
40a6710ad3 Mark set_ as an inplace view op (#115769)
Summary: To be used in https://github.com/pytorch/pytorch/pull/113873. Since set_ is effectively an inplace view op, we'll need to skip caching them.

Test Plan: Built pytorch; specifically this step: `/home/slarsen/local/miniconda3/envs/pytorch-3.10/bin/python -m torchgen.gen --source-path /home/slarsen/local/pytorch/cmake/../aten/src/ATen --install_dir /home/slarsen/local/pytorch/build/aten/src/ATen --per-operator-headers --generate sources --output-dependencies /home/slarsen/local/pytorch/build/aten/src/ATen/generated_sources.cmake`

Differential Revision: [D52814561](https://our.internmc.facebook.com/intern/diff/D52814561)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115769
Approved by: https://github.com/bdhirsh
2024-01-17 15:32:18 +00:00
Joel Schlosser
5aac95c713 Introduce slice_inverse() op (#117041)
Introduces a new op `slice_inverse()`. This is used in the reverse view_func for slice and several other ops (e.g. `split_with_sizes`, `chunk`). It's implemented behind the scenes by a call to `as_strided()`, but it's easier for subclasses to implement the more limited `slice_inverse()` than the full `as_strided()`. This PR:
* Introduces the op itself
* Updates all relevant functional inverses to call `slice_inverse()` instead of `as_strided()` directly
* Makes codegen changes to allow `slice_scatter()` to be the copy variant for `slice_inverse()`
    * Need to avoid view_copy codegen (assumes if view name ends in inverse, we don't need to gen one, which is possibly a bad assumption)

@albanD / @soulitzer / @bdhirsh: I'm most interested in your thoughts on the codegen changes and whether this is the right way to go.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117041
Approved by: https://github.com/bdhirsh
2024-01-16 23:44:54 +00:00
Edward Z. Yang
003c900d5e Add _assert_scalar (#117378)
Peeled off from https://github.com/pytorch/pytorch/pull/114148, because that PR is going to take a while to actually land.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117378
Approved by: https://github.com/jansel
2024-01-14 00:50:36 +00:00
PyTorch MergeBot
1174e82bde Revert "Add _assert_scalar and teach Inductor to codegen it (#114148)"
This reverts commit b6028acfa4.

Reverted https://github.com/pytorch/pytorch/pull/114148 on behalf of https://github.com/osalpekar due to Going to revert this given the broken torchrec PT2 tests internally: [D52648865](https://www.internalfb.com/diff/D52648865). Logs aren't too clear but @dstaay-fb can help debug as well ([comment](https://github.com/pytorch/pytorch/pull/114148#issuecomment-1886100368))
2024-01-11 02:30:22 +00:00
Joel Schlosser
16d69290c6 Use view name instead of view_copy name for functional inverses (#117056)
Ex: `unsqueeze_copy_inverse()` -> `unsqueeze_inverse()`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117056
Approved by: https://github.com/bdhirsh
2024-01-10 00:52:36 +00:00
Edward Z. Yang
b6028acfa4 Add _assert_scalar and teach Inductor to codegen it (#114148)
Inductor codegen for `_assert_async` is currently disabled because we don't really understand how to codegen `scalar_to_tensor` on a Sympy expression. I initially tried to see if I could get this to work, but I got into some weird problem involving stride sorting, so I decided to fix it properly by not going through a tensor.

So we introduce an `_assert_scalar` which takes a scalar as an argument, avoiding needing to turn a SymBool into a tensor before asserting on it. I also add `_functional_assert_scalar` for good luck, although this doesn't do anything right now because https://github.com/pytorch/pytorch/pull/104203 still hasn't been landed.

I need to customize the codegen for this operator, so I decide to directly implement it in Inductor, rather than trying to treat it as a generic ExternKernel. This leads to the new AssertScalar IR node. This is written carefully so that it doesn't get DCE'd by Inductor.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114148
Approved by: https://github.com/jansel
2024-01-09 23:21:26 +00:00
Joel Schlosser
52f0457d7d Support view returns for functional inverses on narrowing views (#115893)
Part 1 of implementation for general [subclass view fake-ification](https://docs.google.com/document/d/1C5taWiplmX7nKiURXDOAZG2W5VNJ2iV0fQFq92H0Cxw).

The following functional inverses are currently implemented scatter-style and thus never return views:
* `as_strided_copy_inverse()`
* `diagonal_copy_inverse()`
* `expand_copy_inverse()`
* `select_copy_int_inverse()`
* `slice_copy_Tensor_inverse()`
* `split_copy_Tensor_inverse()`
* `split_with_sizes_copy_inverse()`
* `unbind_copy_int_inverse()`
* `unfold_copy_inverse()`

We need to get actual views for the introduction of reverse view funcs coming next.

Details:
* Use `as_strided()` to implement actual view inverses for the above
    * Assumes we're given a mutated_view that is actually part of a bigger storage; this isn't really the case for functionalization
* Introduce `InverseReturnMode` enum for customization of functional inverses
    * `AlwaysView` - always return an actual view; needed for reverse view_funcs()
    * `NeverView` - always do a copy; useful for certain functionalization use cases (e.g. XLA, executorch)
    * `ViewOrScatterInverse` - return an actual view in most cases, but prefer scatter inverses when they exist. this avoids the need to implement `as_strided()` for subclasses, which can be difficult or impossible
* Make sure functionalization works as before
    * Use `ViewOrScatterInverse` when reapply_views TLS is True or `NeverView` otherwise
    * Adds tests to ensure old behavior for above inverses **in functionalization**
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115893
Approved by: https://github.com/bdhirsh
2023-12-21 21:39:22 +00:00
PyTorch MergeBot
497777e302 Revert "Mark set_ as an inplace view op (#115769)"
This reverts commit cd449e260c.

Reverted https://github.com/pytorch/pytorch/pull/115769 on behalf of https://github.com/jeanschmidt due to breaking landing signals internally, more details on the diff, author is tagged ([comment](https://github.com/pytorch/pytorch/pull/115769#issuecomment-1866846607))
2023-12-21 19:53:32 +00:00
Aaron Gokaslan
ee5d981249 [BE]: Enable RUFF PERF402 and apply fixes (#115505)
* Enable PERF402. Makes code more efficient and succinct by removing useless list copies that could be accomplished either via a list constructor or extend call. All test cases have noqa added since performance is not as sensitive in that folder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115505
Approved by: https://github.com/malfet
2023-12-20 18:01:24 +00:00
Sam Larsen
cd449e260c Mark set_ as an inplace view op (#115769)
Summary: To be used in https://github.com/pytorch/pytorch/pull/113873. Since set_ is effectively an inplace view op, we'll need to skip caching them.

Test Plan: Built pytorch; specifically this step: `/home/slarsen/local/miniconda3/envs/pytorch-3.10/bin/python -m torchgen.gen --source-path /home/slarsen/local/pytorch/cmake/../aten/src/ATen --install_dir /home/slarsen/local/pytorch/build/aten/src/ATen --per-operator-headers --generate sources --output-dependencies /home/slarsen/local/pytorch/build/aten/src/ATen/generated_sources.cmake`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115769
Approved by: https://github.com/bdhirsh
2023-12-19 23:08:05 +00:00
Tugsbayasgalan Manlaibaatar
d85314c95c Support Predispatch functionalization (#113728)
In this PR, we are implementing Functionalization on pre-dispatch graph. Today, every dispatch key except for Dispatchkey.Python has a dedicated mode stack in python. PreDispatch tracing relies on this behaviour by pushing ProxyTorchDispatchMode to Dispatchkey.PreDispatch mode stack and handle the dispatching logic in python. To make pre-dispatch functionalization work, we now need to push FunctionalTensorMode on DispatchKey.PreDispatch mode stack and make sure it runs before ProxyTorchDispatchMode. (this is very similar to how post-dispatch tracing work). Here are some design decisions we made for this flow to work:

1. FunctionalTensorMode internally calls C++ functionalize key. Since C++ functionalization goes after PreDispatch, if we are not careful, we will keep re-entering into PreDispatch key. We solve this by directly dispatching to C++ Functionalize key.

2. We delete mode_stack_per_key logic because the only realistic time it is exercised is for PreDispatch and it is in general not safe to have a plain list because FunctionalTensorMode and ProxyTorchDispatchMode ordering matter and it is hard to enforce it on plain list. Instead, now we have a private class that tracks PreDispatch mode stack.

3.  We will still run CompositeImplicitAutograd decomps in this PR, and disable this logic later as a followup.

Some missing bits after this PR:
1. Preserving autograd ops in a functional form. Right now they still show up in the graph but in a "non-functional" way.
2. Turn off CompositeImplicitAutograd decomps
3. Functionalizing HOO

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113728
Approved by: https://github.com/bdhirsh
2023-12-19 20:28:35 +00:00
cdzhan
99554112d3 [pytorch] add namespace for optTypeMetaToScalarType in codegen to avoid not declared when compile (#115623)
Fixes compilation failure in some environment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115623
Approved by: https://github.com/albanD
2023-12-13 00:59:01 +00:00
Jesse Cai
4471fe6c39 [sparse][semi-structured] add alg_id to _cslt_sparse_mm and _cslt_sparse_mm_search (#115178)
Summary:

cuSPARSELt has support for different alg_id, which are set via

`cusparseLTMatmulAlgSetAttribute`, in total there are 4 different
alg_ids, 0 - 3.

Previously we were just using the default alg_id, as from our initial
experiments we found that for most shapes the default alg_id is the
fastest and that they made no difference on numerical correctness, just
performance. From our previous experiments the fastest alg_id seemed to
differ only on small matmul shapes.

danthe3rd found a performance regression when running with
cuSPARSELt v0.4.0 vs v0.5.0, on LLM shapes, which match these
characteristics (activations are small, weights are large).

However it's likely that this is due to the alg_id ordering changing, as
mentioned in the release notes for v0.5.0.
```
cusparseLtMatmulAlgSelectionInit() does not ensure the same ordering of
algorithm id alg as in v0.4.0.
```

This PR adds in the following:
- support for passing in alg_id to _cslt_sparse_mm
- a new op, _cslt_sparse_mm_search, which returns the optimal alg_id for
  a given matmul

_cslt_sparse_mm_search has the same function signature as
_cslt_sparse_mm, minus the alg_id parameter.
We are able to achieve v0.4.0 performance with alg_id=1 on the shapes
that daniel provided.

We will address autoselecting the best alg_id in a future PR, possibly
with torch.compile.

Test Plan:
```
python test/test_sparse_semi_structured -k cslt
```

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115178
Approved by: https://github.com/cpuhrsch
2023-12-11 23:08:51 +00:00
PyTorch MergeBot
40a14e07ef Revert "[sparse][semi-structured] add alg_id to _cslt_sparse_mm and _cslt_spasre_mm_search (#115178)"
This reverts commit 1e5636f791.

Reverted https://github.com/pytorch/pytorch/pull/115178 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the Window build failure looks legit 1e5636f791 ([comment](https://github.com/pytorch/pytorch/pull/115178#issuecomment-1850605711))
2023-12-11 18:07:17 +00:00
Jesse Cai
1e5636f791 [sparse][semi-structured] add alg_id to _cslt_sparse_mm and _cslt_spasre_mm_search (#115178)
Summary:

cuSPARSELt has support for different alg_id, which are set via

`cusparseLTMatmulAlgSetAttribute`, in total there are 4 different
alg_ids, 0 - 3.

Previously we were just using the default alg_id, as from our initial
experiments we found that for most shapes the default alg_id is the
fastest and that they made no difference on numerical correctness, just
performance. From our previous experiments the fastest alg_id seemed to
differ only on small matmul shapes.

danthe3rd found a performance regression when running with
cuSPARSELt v0.4.0 vs v0.5.0, on LLM shapes, which match these
characteristics (activations are small, weights are large).

However it's likely that this is due to the alg_id ordering changing, as
mentioned in the release notes for v0.5.0.
```
cusparseLtMatmulAlgSelectionInit() does not ensure the same ordering of
algorithm id alg as in v0.4.0.
```

This PR adds in the following:
- support for passing in alg_id to _cslt_sparse_mm
- a new op, _cslt_sparse_mm_search, which returns the optimal alg_id for
  a given matmul

_cslt_sparse_mm_search has the same function signature as
_cslt_sparse_mm, minus the alg_id parameter.
We are able to achieve v0.4.0 performance with alg_id=1 on the shapes
that daniel provided.

We will address autoselecting the best alg_id in a future PR, possibly
with torch.compile.

Test Plan:
```
python test/test_sparse_semi_structured -k cslt
```

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115178
Approved by: https://github.com/cpuhrsch
2023-12-11 15:47:28 +00:00
Mengwei Liu
898554a3a3 [torchgen] Add logic in custom ops to return empty tensor (#114143)
Summary: Add two logic:

1. If the custom op is returning a `Tensor` but also doesn't have an out tensor as input, return an empty tensor.
2. If the custom op is returning more than one Tensor and the number of out tensors is not the same as return Tensor, return a tuple of empty tensors.

Test Plan: Rely on new unit tests

Differential Revision: D51471651

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114143
Approved by: https://github.com/cccclai
2023-12-08 17:03:44 +00:00
voznesenskym
ddf1cb7870 AOTAutograd: handle set_(), detect metadata mutations that cancel out (#111554)
This should be enough to get @voznesenskym 's FSDP branch to plumb `set_()` through AOTAutograd properly and have everything properly no-op out. Main changes are:

(1) graph break on `aten::set_.source_Tensor_storage_offset` (we could support it but it isn't needed, seems safer to graph break)

(2) Functionalization: add a "proper" functionalization kernel for `aten::set_.source_Tensor`. The previous one we had was codegen'd and it was wrong (it would just clone() and call set_(), which does not do the right thing). I also manually mark on the `FunctionalTensorWrapper` when a given tensor has been mutated by a `set_()` call.

(3) AOTAutograd: I added a new field, `InputAliasInfo.mutates_storage_metadata`, so we can distinguish between "regular" metadata mutations, and metadata mutations due to `set_()` calls. This is mainly because at runtime, one requires calling `as_strided_()` to fix up metadata, while the other requires calling `set_()`.

(4) Made AOTAutograd's detection for metadata mutations / set_() mutations smarter and detect no-ops (if the storage and metadata are all the same).

I also killed `was_updated()` and `was_metadata_updated()`, and replaced them with (existing) `has_data_mutation() ` and (new) `has_data_mutation()`, which can more accurately distinguish between data-mutation vs. `set_()` calls vs. metadata-mutation

**This PR is still silently correct in one case though**, which I'd like to discuss more. In particular, this example:
```
def f(x):
    x_view = x.view(-1)
    x.set_(torch.ones(2))
    x_view.mul_(2)
    return
```

If you have an input that experiences both a data-mutation **and** a `x_old.set_(x_new)` call, there are two cases:

(a) the data mutation happened on the storage of `x_new`. This case should be handled automatically: if x_new is a graph intermediate then we will functionalize the mutation. If x_new is a different graph input, then we will perform the usual `copy_()` on that other graph input

(b) the data mutation happened on the storage of `x_old`. This is more of a pain to handle, and doesn't currently work. At runtime, the right thing to do is probably something like:
```

def functionalized_f(x):
    x_view = x.view(-1)
    # set_() desugars into a no-op; later usages of x will use x_output
    x_output = torch.ones(2)
    # functionalize the mutation on x_view
    x_view_updated = x.mul(2)
    x_updated = x_view_updated.view(x.shape)
    # x experienced TWO TYPES of mutations; a data mutation and a metatadata mutation
    # We need to return both updated tensors in our graph
    return x_updated, x_output
def runtime_wrapper(x):
    x_data_mutation_result, x_set_mutation_result = compiled_graph(x)
    # First, perform the data mutation on x's old storage
    x.copy_(x_data_mutation_result)
    # Then, swap out the storage of x with the new storage
    x.set_(x_set_mutation_result)
```

There are two things that make this difficult to do though:

(1) Functionalization: the functionalization rule for `set_()` will fully throw away the old `FunctionalStorageImpl` on the graph input. So if there are any mutations to that `FunctionalStorageImpl` later on in the graph, the current graph input won't know about it. Maybe we can have a given `FunctionalTensorWrapper` remember all previous storages that it had, and track mutations on all of them - although this feels pretty complicated.

(2) AOTAutograd now needs to know that we might have *two* graph outputs that correspond to a single "mutated input", which is annoying.

It's worth pointing out that this issue is probably extremely unlikely for anyone to run into - can we just detect it and error? This feels slightly easier than solving it, although not significantly easier. We would still need `FunctionalTensorWrapper` to keep track of mutations on any of its "previous" storages, so it can report this info back to AOTAutograd so we can raise an error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111554
Approved by: https://github.com/ezyang
ghstack dependencies: #113926
2023-11-28 19:33:35 +00:00
Tarun Karuturi
39f16c221e Adding event_tracer evalue logging calls in codegen (#114584)
Summary:
This diff adds support in the ExecuTorch codegen layer to log the outputs of kernels to event_tracer. It does this by calling the `event_tracer_log_evalue` API.

When the `ET_EVENT_TRACER_ENABLED` flag is disabled this is essentially a no-op and will add no overhead.

Test Plan: CI

Reviewed By: larryliu0820

Differential Revision: D51534590

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114584
Approved by: https://github.com/larryliu0820
2023-11-28 18:32:05 +00:00
Nikita Shulga
7c98bac4a0 [BE] Speedup register schema compilation (#114438)
For some reason, inlining initializer list into a std::vector takes a lot of time using clang-15. But considering that there are only dozen or so distrinct tags, creating them once and pass as def argument should not affect runtime speed at all, but this significantly improves compilation time. On Mac M1 it reduces time needed to compiler RegisterSchema.cpp from 50 to 3 seconds.

Special case empty tags, to keep torch_gen tests happy

Before
```
% /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ -ftime-report -DAT_PER_OPERATOR_HEADERS -DCAFFE2_BUILD_MAIN_LIB -DCPUINFO_SUPPORTED_PLATFORM=1 -DFMT_HEADER_ONLY=1 -DFXDIV_USE_INLINE_ASSEMBLY=0 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DNNP_CONVOLUTION_ONLY=0 -DNNP_INFERENCE_ONLY=0 -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DUSE_EXTERNAL_MZCRC -D_FILE_OFFSET_BITS=64 -Dtorch_cpu_EXPORTS -I/Users/nshulga/git/pytorch/pytorch/build/aten/src -I/Users/nshulga/git/pytorch/pytorch/aten/src -I/Users/nshulga/git/pytorch/pytorch/build -I/Users/nshulga/git/pytorch/pytorch -I/Users/nshulga/git/pytorch/pytorch/cmake/../third_party/benchmark/include -I/Users/nshulga/git/pytorch/pytorch/third_party/onnx -I/Users/nshulga/git/pytorch/pytorch/build/third_party/onnx -I/Users/nshulga/git/pytorch/pytorch/third_party/foxi -I/Users/nshulga/git/pytorch/pytorch/build/third_party/foxi -I/Users/nshulga/git/pytorch/pytorch/torch/csrc/api -I/Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include -I/Users/nshulga/git/pytorch/pytorch/caffe2/aten/src/TH -I/Users/nshulga/git/pytorch/pytorch/build/caffe2/aten/src/TH -I/Users/nshulga/git/pytorch/pytorch/build/caffe2/aten/src -I/Users/nshulga/git/pytorch/pytorch/build/caffe2/../aten/src -I/Users/nshulga/git/pytorch/pytorch/torch/csrc -I/Users/nshulga/git/pytorch/pytorch/third_party/miniz-2.1.0 -I/Users/nshulga/git/pytorch/pytorch/third_party/kineto/libkineto/include -I/Users/nshulga/git/pytorch/pytorch/third_party/kineto/libkineto/src -I/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/.. -I/Users/nshulga/git/pytorch/pytorch/third_party/FXdiv/include -I/Users/nshulga/git/pytorch/pytorch/c10/.. -I/Users/nshulga/git/pytorch/pytorch/third_party/pthreadpool/include -I/Users/nshulga/git/pytorch/pytorch/third_party/cpuinfo/include -I/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/quantized/cpu/qnnpack/include -I/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/quantized/cpu/qnnpack/src -I/Users/nshulga/git/pytorch/pytorch/third_party/cpuinfo/deps/clog/include -I/Users/nshulga/git/pytorch/pytorch/third_party/NNPACK/include -I/Users/nshulga/git/pytorch/pytorch/third_party/FP16/include -I/Users/nshulga/git/pytorch/pytorch/third_party/fmt/include -I/Users/nshulga/git/pytorch/pytorch/third_party/flatbuffers/include -isystem /Users/nshulga/git/pytorch/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /Users/nshulga/git/pytorch/pytorch/cmake/../third_party/googletest/googletest/include -isystem /Users/nshulga/git/pytorch/pytorch/third_party/protobuf/src -isystem /Users/nshulga/git/pytorch/pytorch/third_party/XNNPACK/include -isystem /Users/nshulga/git/pytorch/pytorch/cmake/../third_party/eigen -isystem /Users/nshulga/git/pytorch/pytorch/build/include  -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=braced-scalar-init -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wvla-extension -Wsuggest-override -Wnewline-eof -Winconsistent-missing-override -Winconsistent-missing-destructor-override -Wno-pass-failed -Wno-error=pedantic -Wno-error=old-style-cast -Wno-error=inconsistent-missing-override -Wno-error=inconsistent-missing-destructor-override -Wconstant-conversion -Wno-invalid-partial-specialization -Wno-missing-braces -Qunused-arguments -fcolor-diagnostics -faligned-new -Werror -Wno-unused-but-set-variable -fno-math-errno -fno-trapping-math -Werror=format -DUSE_MPS -Wno-unused-private-field -Wno-missing-braces -O3 -DNDEBUG -DNDEBUG -arch arm64 -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.0.sdk -fPIC -D__NEON__ -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-unused-function -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-type-limits -Wno-array-bounds -Wno-strict-overflow -Wno-strict-aliasing -fvisibility=hidden -O2 -Wmissing-prototypes -Werror=missing-prototypes -Xpreprocessor -fopenmp -I/Users/nshulga/miniforge3/include -std=gnu++17 -Wno-missing-prototypes -Wno-error=missing-prototypes -o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/RegisterSchema.cpp.o -c /Users/nshulga/git/pytorch/pytorch/build/aten/src/ATen/RegisterSchema.cpp
===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 131.8054 seconds (132.5540 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  ---Instr---  --- Name ---
  43.6364 ( 33.2%)   0.0919 ( 30.1%)  43.7282 ( 33.2%)  43.9658 ( 33.2%)  536345245380  ModuleInlinerWrapperPass
  43.6291 ( 33.2%)   0.0891 ( 29.2%)  43.7182 ( 33.2%)  43.9549 ( 33.2%)  536264096394  DevirtSCCRepeatedPass
  42.3766 ( 32.2%)   0.0185 (  6.1%)  42.3951 ( 32.2%)  42.6198 ( 32.2%)  523040901767  GVNPass
   0.4085 (  0.3%)   0.0040 (  1.3%)   0.4125 (  0.3%)   0.4195 (  0.3%)  4106085945  SimplifyCFGPass
   0.3611 (  0.3%)   0.0115 (  3.8%)   0.3726 (  0.3%)   0.3779 (  0.3%)  4864696407  InstCombinePass
   0.1607 (  0.1%)   0.0088 (  2.9%)   0.1695 (  0.1%)   0.1720 (  0.1%)  1780986175  InlinerPass
   0.0865 (  0.1%)   0.0024 (  0.8%)   0.0889 (  0.1%)   0.0914 (  0.1%)  1489982961  SROAPass
   0.0750 (  0.1%)   0.0013 (  0.4%)   0.0763 (  0.1%)   0.0764 (  0.1%)  620016338  SCCPPass
   0.0661 (  0.1%)   0.0040 (  1.3%)   0.0701 (  0.1%)   0.0735 (  0.1%)  592027163  EarlyCSEPass
...
===-------------------------------------------------------------------------===
                          Clang front-end time report
===-------------------------------------------------------------------------===
  Total Execution Time: 48.2802 seconds (48.8638 wall clock)
...
 ```

After
```
% /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ -ftime-report -DAT_PER_OPERATOR_HEADERS -DCAFFE2_BUILD_MAIN_LIB -DCPUINFO_SUPPORTED_PLATFORM=1 -DFMT_HEADER_ONLY=1 -DFXDIV_USE_INLINE_ASSEMBLY=0 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DNNP_CONVOLUTION_ONLY=0 -DNNP_INFERENCE_ONLY=0 -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DUSE_EXTERNAL_MZCRC -D_FILE_OFFSET_BITS=64 -Dtorch_cpu_EXPORTS -I/Users/nshulga/git/pytorch/pytorch/build/aten/src -I/Users/nshulga/git/pytorch/pytorch/aten/src -I/Users/nshulga/git/pytorch/pytorch/build -I/Users/nshulga/git/pytorch/pytorch -I/Users/nshulga/git/pytorch/pytorch/cmake/../third_party/benchmark/include -I/Users/nshulga/git/pytorch/pytorch/third_party/onnx -I/Users/nshulga/git/pytorch/pytorch/build/third_party/onnx -I/Users/nshulga/git/pytorch/pytorch/third_party/foxi -I/Users/nshulga/git/pytorch/pytorch/build/third_party/foxi -I/Users/nshulga/git/pytorch/pytorch/torch/csrc/api -I/Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include -I/Users/nshulga/git/pytorch/pytorch/caffe2/aten/src/TH -I/Users/nshulga/git/pytorch/pytorch/build/caffe2/aten/src/TH -I/Users/nshulga/git/pytorch/pytorch/build/caffe2/aten/src -I/Users/nshulga/git/pytorch/pytorch/build/caffe2/../aten/src -I/Users/nshulga/git/pytorch/pytorch/torch/csrc -I/Users/nshulga/git/pytorch/pytorch/third_party/miniz-2.1.0 -I/Users/nshulga/git/pytorch/pytorch/third_party/kineto/libkineto/include -I/Users/nshulga/git/pytorch/pytorch/third_party/kineto/libkineto/src -I/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/.. -I/Users/nshulga/git/pytorch/pytorch/third_party/FXdiv/include -I/Users/nshulga/git/pytorch/pytorch/c10/.. -I/Users/nshulga/git/pytorch/pytorch/third_party/pthreadpool/include -I/Users/nshulga/git/pytorch/pytorch/third_party/cpuinfo/include -I/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/quantized/cpu/qnnpack/include -I/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/quantized/cpu/qnnpack/src -I/Users/nshulga/git/pytorch/pytorch/third_party/cpuinfo/deps/clog/include -I/Users/nshulga/git/pytorch/pytorch/third_party/NNPACK/include -I/Users/nshulga/git/pytorch/pytorch/third_party/FP16/include -I/Users/nshulga/git/pytorch/pytorch/third_party/fmt/include -I/Users/nshulga/git/pytorch/pytorch/third_party/flatbuffers/include -isystem /Users/nshulga/git/pytorch/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /Users/nshulga/git/pytorch/pytorch/cmake/../third_party/googletest/googletest/include -isystem /Users/nshulga/git/pytorch/pytorch/third_party/protobuf/src -isystem /Users/nshulga/git/pytorch/pytorch/third_party/XNNPACK/include -isystem /Users/nshulga/git/pytorch/pytorch/cmake/../third_party/eigen -isystem /Users/nshulga/git/pytorch/pytorch/build/include  -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=braced-scalar-init -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wvla-extension -Wsuggest-override -Wnewline-eof -Winconsistent-missing-override -Winconsistent-missing-destructor-override -Wno-pass-failed -Wno-error=pedantic -Wno-error=old-style-cast -Wno-error=inconsistent-missing-override -Wno-error=inconsistent-missing-destructor-override -Wconstant-conversion -Wno-invalid-partial-specialization -Wno-missing-braces -Qunused-arguments -fcolor-diagnostics -faligned-new -Werror -Wno-unused-but-set-variable -fno-math-errno -fno-trapping-math -Werror=format -DUSE_MPS -Wno-unused-private-field -Wno-missing-braces -O3 -DNDEBUG -DNDEBUG -arch arm64 -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.0.sdk -fPIC -D__NEON__ -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-unused-function -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-type-limits -Wno-array-bounds -Wno-strict-overflow -Wno-strict-aliasing -fvisibility=hidden -O2 -Wmissing-prototypes -Werror=missing-prototypes -Xpreprocessor -fopenmp -I/Users/nshulga/miniforge3/include -std=gnu++17 -Wno-missing-prototypes -Wno-error=missing-prototypes -o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/RegisterSchema.cpp.o -c /Users/nshulga/git/pytorch/pytorch/build/aten/src/ATen/RegisterSchema.cpp
===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 1.2920 seconds (1.3187 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  ---Instr---  --- Name ---
   0.3070 ( 27.6%)   0.0547 ( 30.2%)   0.3617 ( 28.0%)   0.3654 ( 27.7%)  3719690895  ModuleInlinerWrapperPass
   0.3024 ( 27.2%)   0.0525 ( 29.0%)   0.3549 ( 27.5%)   0.3585 ( 27.2%)  3653363330  DevirtSCCRepeatedPass
   0.0619 (  5.6%)   0.0073 (  4.0%)   0.0692 (  5.4%)   0.0711 (  5.4%)  868136227  InstCombinePass
   0.0601 (  5.4%)   0.0065 (  3.6%)   0.0666 (  5.2%)   0.0679 (  5.1%)  696430647  InlinerPass
   0.0363 (  3.3%)   0.0033 (  1.8%)   0.0396 (  3.1%)   0.0425 (  3.2%)  535426974  SimplifyCFGPass
   0.0280 (  2.5%)   0.0069 (  3.8%)   0.0348 (  2.7%)   0.0358 (  2.7%)  378716394  BlockFrequencyAnalysis
   0.0208 (  1.9%)   0.0049 (  2.7%)   0.0257 (  2.0%)   0.0262 (  2.0%)  283689627  BranchProbabilityAnalysis
   0.0239 (  2.1%)   0.0002 (  0.1%)   0.0241 (  1.9%)   0.0241 (  1.8%)  219122704  OpenMPOptCGSCCPass
   0.0174 (  1.6%)   0.0015 (  0.8%)   0.0189 (  1.5%)   0.0192 (  1.5%)  215583965  GVNPass
   0.0153 (  1.4%)   0.0025 (  1.4%)   0.0178 (  1.4%)   0.0187 (  1.4%)  184232295  EarlyCSEPass
...
===-------------------------------------------------------------------------===
                          Clang front-end time report
===-------------------------------------------------------------------------===
  Total Execution Time: 2.9128 seconds (3.1027 wall clock)
...
```

And the generated schema file looks as follows:
```cpp
TORCH_LIBRARY(aten, m) {
  const std::vector<at::Tag> tags_0 = {at::Tag::pt2_compliant_tag};
  m.def("_cast_Byte(Tensor self, bool non_blocking=False) -> Tensor", tags_0);
  m.def("_cast_Char(Tensor self, bool non_blocking=False) -> Tensor", tags_0);
  m.def("_cast_Double(Tensor self, bool non_blocking=False) -> Tensor", tags_0);
  m.def("_cast_Float(Tensor self, bool non_blocking=False) -> Tensor", tags_0);
  m.def("_cast_Int(Tensor self, bool non_blocking=False) -> Tensor", tags_0);
  m.def("_cast_Long(Tensor self, bool non_blocking=False) -> Tensor", tags_0);
  m.def("_cast_Short(Tensor self, bool non_blocking=False) -> Tensor", tags_0);
  m.def("_cast_Half(Tensor self, bool non_blocking=False) -> Tensor", tags_0);
  m.def("_backward(Tensor self, Tensor[] inputs, Tensor? gradient=None, bool? retain_graph=None, bool create_graph=False) -> ()", tags_0);
  m.def("set_data(Tensor(a!) self, Tensor new_data) -> ()", tags_0);
  m.def("data(Tensor self) -> Tensor", tags_0);
  m.def("is_leaf(Tensor self) -> bool", tags_0);
  m.def("output_nr(Tensor self) -> int", tags_0);
  m.def("_version(Tensor self) -> int", tags_0);
  m.def("requires_grad_(Tensor(a!) self, bool requires_grad=True) -> Tensor(a!)", tags_0);
  m.def("retain_grad(Tensor(a!) self) -> ()", tags_0);
  m.def("retains_grad(Tensor self) -> bool", tags_0);
  m.def("_fw_primal(Tensor(a) self, int level) -> Tensor(a)", tags_0);
  m.def("_make_dual(Tensor(a) primal, Tensor tangent, int level) -> Tensor(a)", tags_0);
  m.def("_unpack_dual(Tensor(a) dual, int level) -> (Tensor(a) primal, Tensor tangent)", tags_0);
  m.def("_new_zeros_with_same_feature_meta(Tensor self, Tensor other, *, int self_num_batch_dims=0) -> Tensor", tags_0);
  m.def("_has_same_storage_numel(Tensor self, Tensor other) -> bool", tags_0);
  const std::vector<at::Tag> tags_1 = {at::Tag::inplace_view, at::Tag::pt2_compliant_tag};
  m.def("rename_(Tensor(a!) self, Dimname[]? names) -> Tensor(a!)", tags_1);
  m.def("rename(Tensor(a) self, Dimname[]? names) -> Tensor(a)", tags_0);
  m.def("align_to(Tensor(a) self, Dimname[] names) -> Tensor(a)", tags_0);
  m.def("align_to.ellipsis_idx(Tensor(a) self, Dimname[] order, int ellipsis_idx) -> Tensor(a)", tags_0);
  m.def("align_as(Tensor self, Tensor other) -> Tensor", tags_0);
...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114438
Approved by: https://github.com/zou3519
2023-11-27 23:33:04 +00:00
PyTorch MergeBot
3e1abde46d Revert "AOTAutograd: handle set_(), detect metadata mutations that cancel out (#111554)"
This reverts commit a911b4db9d.

Reverted https://github.com/pytorch/pytorch/pull/111554 on behalf of https://github.com/DanilBaibak due to The lower PR in the stack #113926 breaks the internal build ([comment](https://github.com/pytorch/pytorch/pull/111554#issuecomment-1822472206))
2023-11-22 10:13:48 +00:00
Antonio Kim
7fc292930c Add support for torch.Generator type in TorchScript (#110413)
- Add support for `torch.Generator` type in TorchScript
- Add `generator` args to all `torch.nn.init` functions that call `uniform_` or `normal_`
- Add support for `torch.Generator` in LTC's TorchScript backend (CC: @wconstab)

CC: @eellison @davidberard98 @GlebKazantaev @behzad-a
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110413
Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/glebk-cerebras, https://github.com/davidberard98
2023-11-21 23:07:21 +00:00
voznesenskym
a911b4db9d AOTAutograd: handle set_(), detect metadata mutations that cancel out (#111554)
This should be enough to get @voznesenskym 's FSDP branch to plumb `set_()` through AOTAutograd properly and have everything properly no-op out. Main changes are:

(1) graph break on `aten::set_.source_Tensor_storage_offset` (we could support it but it isn't needed, seems safer to graph break)

(2) Functionalization: add a "proper" functionalization kernel for `aten::set_.source_Tensor`. The previous one we had was codegen'd and it was wrong (it would just clone() and call set_(), which does not do the right thing). I also manually mark on the `FunctionalTensorWrapper` when a given tensor has been mutated by a `set_()` call.

(3) AOTAutograd: I added a new field, `InputAliasInfo.mutates_storage_metadata`, so we can distinguish between "regular" metadata mutations, and metadata mutations due to `set_()` calls. This is mainly because at runtime, one requires calling `as_strided_()` to fix up metadata, while the other requires calling `set_()`.

(4) Made AOTAutograd's detection for metadata mutations / set_() mutations smarter and detect no-ops (if the storage and metadata are all the same).

I also killed `was_updated()` and `was_metadata_updated()`, and replaced them with (existing) `has_data_mutation() ` and (new) `has_data_mutation()`, which can more accurately distinguish between data-mutation vs. `set_()` calls vs. metadata-mutation

**This PR is still silently correct in one case though**, which I'd like to discuss more. In particular, this example:
```
def f(x):
    x_view = x.view(-1)
    x.set_(torch.ones(2))
    x_view.mul_(2)
    return
```

If you have an input that experiences both a data-mutation **and** a `x_old.set_(x_new)` call, there are two cases:

(a) the data mutation happened on the storage of `x_new`. This case should be handled automatically: if x_new is a graph intermediate then we will functionalize the mutation. If x_new is a different graph input, then we will perform the usual `copy_()` on that other graph input

(b) the data mutation happened on the storage of `x_old`. This is more of a pain to handle, and doesn't currently work. At runtime, the right thing to do is probably something like:
```

def functionalized_f(x):
    x_view = x.view(-1)
    # set_() desugars into a no-op; later usages of x will use x_output
    x_output = torch.ones(2)
    # functionalize the mutation on x_view
    x_view_updated = x.mul(2)
    x_updated = x_view_updated.view(x.shape)
    # x experienced TWO TYPES of mutations; a data mutation and a metatadata mutation
    # We need to return both updated tensors in our graph
    return x_updated, x_output
def runtime_wrapper(x):
    x_data_mutation_result, x_set_mutation_result = compiled_graph(x)
    # First, perform the data mutation on x's old storage
    x.copy_(x_data_mutation_result)
    # Then, swap out the storage of x with the new storage
    x.set_(x_set_mutation_result)
```

There are two things that make this difficult to do though:

(1) Functionalization: the functionalization rule for `set_()` will fully throw away the old `FunctionalStorageImpl` on the graph input. So if there are any mutations to that `FunctionalStorageImpl` later on in the graph, the current graph input won't know about it. Maybe we can have a given `FunctionalTensorWrapper` remember all previous storages that it had, and track mutations on all of them - although this feels pretty complicated.

(2) AOTAutograd now needs to know that we might have *two* graph outputs that correspond to a single "mutated input", which is annoying.

It's worth pointing out that this issue is probably extremely unlikely for anyone to run into - can we just detect it and error? This feels slightly easier than solving it, although not significantly easier. We would still need `FunctionalTensorWrapper` to keep track of mutations on any of its "previous" storages, so it can report this info back to AOTAutograd so we can raise an error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111554
Approved by: https://github.com/ezyang
ghstack dependencies: #113926
2023-11-21 01:52:46 +00:00
Edward Z. Yang
8c4812be80 Replace expect_int with guard_int (#113921)
The idea is that instead of erroring, we will just specialize at these sites.

Fixes https://github.com/pytorch/pytorch/issues/113142

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113921
Approved by: https://github.com/zou3519
2023-11-20 21:27:48 +00:00
Brian Vaughan
dbb96ef30d improve annotation device parameters where a device ordinal is allowed (#113647)
Using mypy in code that depends on pytorch, I noticed that the type annotation doesn't allow a device ordinal.

`error: Argument "device" to "to_empty" of "Module" has incompatible type "int"; expected "str | device"  [arg-type]`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113647
Approved by: https://github.com/albanD
2023-11-17 14:41:22 +00:00
Jane Xu
deec2380c7 Add 0dim Tensor overload for _foreach_div (#113688)
This PR is ALMOST basically just following the steps from #106677 EXCEPT! We do add one feature. Similar to fused_adam(w), for the CUDA dispatches: when the scalar tensor is on CPU, we .item and redispatch to the normal scalar overload. Otherwise, the cuda kernel will complain about mismatch in devices between the scalar and the tensors.

Why do we add this feature? Our optimizers want to allow lr as a tensor, and lr could be a CPU tensor. lr is used with foreach_div_ in Adam, so our CI will break otherwise.

After this PR, `_foreach_mul` and `_foreach_div` will accept either a CPU or a GPU tensor for the scalar tensor (vs only a GPU tensor). They join the ranks of `fused_adam(w)` in this characteristic. I did not yet do the same thing for foreach_add (the only other foreach op with a .Tensor overload) because there is no use case and will be more involved.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113688
Approved by: https://github.com/mlazos, https://github.com/albanD
2023-11-15 20:59:32 +00:00
George White
6c187246d6 Add support for float8_e4m3fnuz and _e5m2fnuz (#107586)
This PR relates to the feature in [this feature submission](https://docs.google.com/document/d/1pF2T1xz54IPg1jG7FhykbrpbcJZVelQw0v8vBaoLkfs/edit). It has been based on #104242 which adds similar float8 types.

These new types added in this PR are described in the paper at https://arxiv.org/abs/2206.02915. A brief description and comparison of the types with other float8 types can be also found in the [OpenXLA RFC](https://github.com/openxla/stablehlo/blob/main/rfcs/20230321-fp8_fnuz.md).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107586
Approved by: https://github.com/seemethere, https://github.com/malfet
2023-11-15 15:01:11 +00:00
PyTorch MergeBot
252e68a83b Revert "Add support for torch.Generator type in TorchScript (#110413)"
This reverts commit 54493fe8c4.

Reverted https://github.com/pytorch/pytorch/pull/110413 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is, unfortunately, still breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/110413#issuecomment-1811625557))
2023-11-15 00:51:23 +00:00