343 Commits

Author SHA1 Message Date
Michael Lazos
ce5adc5c05 [Dynamo] add support for torch._C._is_torch_function_all_disabled (#149490)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149490
Approved by: https://github.com/StrongerXi
ghstack dependencies: #149489
2025-03-20 22:19:55 +00:00
Shuai Yang
00a2c68f67 Fix a typo "trochrec" to "torchrec" (#149542)
Summary: As titled, the path is incorrect due to the typo

Test Plan: CI

Differential Revision: D71490709

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149542
Approved by: https://github.com/williamwen42
2025-03-20 10:14:23 +00:00
Marko Radmilac
c65ee728f0 Initial implementation of host memory stats (#147660)
This is an initial attempt to provide some statistics for the pinned host memory allocations flowing through CachingHostAllocator. Many times in the past we have had inexplicable slowdowns that would be much easier to diagnose if we had some host memory characteristics.

This change tries very hard not to disrupt the initial design of the allocator, and it uses the existing locking mechanism, whenever possible, to gather statistics "for free". The only deviation from that is on the "slow path", where we incur CUDA calls anyway, so taking a short lock is not going to hurt performance much, especially in the steady state where most allocations will come from the cache.

As mentioned before, this is the first PR, to introduce the concept and to see if it fits the right paradigm. We can always add more later.

Metrics that would require more involved changes to the code base and locks, like requested memory, have been punted for now. I also tried to reuse the Stat structure used in CUDA caching allocator, in order to maintain symmetry.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147660
Approved by: https://github.com/ngimel
2025-03-05 16:13:19 +00:00
PyTorch MergeBot
a983b2b11a Revert "Initial implementation of host memory stats (#147660)"
This reverts commit 945e359fc1.

Reverted https://github.com/pytorch/pytorch/pull/147660 on behalf of https://github.com/mradmila due to There is an issue with ambiguous definition of Stat structure when different C++ tools are used. Backing out for now. ([comment](https://github.com/pytorch/pytorch/pull/147660#issuecomment-2692346379))
2025-03-01 18:05:45 +00:00
Marko Radmilac
945e359fc1 Initial implementation of host memory stats (#147660)
This is an initial attempt to provide some statistics for the pinned host memory allocations flowing through CachingHostAllocator. Many times in the past we have had inexplicable slowdowns that would be much easier to diagnose if we had some host memory characteristics.

This change tries very hard not to disrupt the initial design of the allocator, and it uses the existing locking mechanism, whenever possible, to gather statistics "for free". The only deviation from that is on the "slow path", where we incur CUDA calls anyway, so taking a short lock is not going to hurt performance much, especially in the steady state where most allocations will come from the cache.

As mentioned before, this is the first PR, to introduce the concept and to see if it fits the right paradigm. We can always add more later.

Metrics that would require more involved changes to the code base and locks, like requested memory, have been punted for now. I also tried to reuse the Stat structure used in CUDA caching allocator, in order to maintain symmetry.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147660
Approved by: https://github.com/ngimel
2025-02-28 18:36:44 +00:00
Yuanhao Ji
0a948f705b [Dynamo] Fix AssertionError when dynamo traces torch.functional.xxx() functions (#148075)
Fixes #147840

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148075
Approved by: https://github.com/yanboliang
2025-02-28 15:09:11 +00:00
Xuehai Pan
3ce352e389 [BE][PYFMT] migrate PYFMT for torch._dynamo to ruff format (#144549)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144549
Approved by: https://github.com/jansel
2025-02-28 03:03:53 +00:00
Ryan Guo
f46f0e465c [dynamo] Initial support for nonstrict_trace (#146367)
## Context
> **Note:** `mark_traceable` got renamed to `nonstrict_trace` after
> offline discussion. The reasons are (1) it aligns with `torch.export`'s
> `nonstrict` notion, and (2) it's more definitive in behavior suggestion.

1. [Overall Design](https://docs.google.com/document/d/1O-dR2ZQaJQVt_v67AVcDCw2yJLtqgkZFwoXK0buEWRg/edit?tab=t.0)
2. [Dynamo graph representation with `torch._higher_order_ops.flat_apply`](https://docs.google.com/document/d/1YHl5nPTJvYeCPE5TO9uA18DPWNgUYGE4gCn6bFvXcBM/edit?tab=t.0#heading=h.xtw3hhbro4gn)

## Summary
This patch adds a `torch._dynamo.nonstrict_trace` decorator, which
currently is an enhanced version of `torch._dynamo.allow_in_graph` (see
docstring for their differences). Specifically, this patch focuses on
the UI and functionality prototyping/plumbing.

The main enhancement is supporting more input types, and the
implementation challenge lies in reconstructing the input objects from
Dynamo `VariableTracker` (while accounting for buffered side-effects and
guards).  This patch takes a middle-ground (simple implementation with a
bit of user labor), by
1. asking the user to provide pytree registration for non-proxy-able
   input types,
2. letting Dynamo trace through `pytree_flatten` (which accounts for
   buffered side-effects and guards automatically),
3. and passing in the TreeSpec as a graph attribute constant into
   `torch._higher_order_ops.flat_apply` (which unflattens the inputs and
   invokes the underlying function).
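
The three steps above can be pictured with a toy, pure-Python flatten/unflatten pair. This is only a sketch of the idea: `Point`, `flatten`, `unflatten`, `flat_apply`, and `norm_sq` below are hypothetical stand-ins written for illustration, not the `torch.utils._pytree` or `torch._higher_order_ops.flat_apply` implementations.

```python
from dataclasses import dataclass


@dataclass
class Point:
    # Hypothetical user type that a tracer could not proxy directly.
    x: float
    y: float


def flatten(obj):
    """Split a Point into plain leaves plus a spec describing how to rebuild it."""
    return [obj.x, obj.y], ("Point", 2)


def unflatten(leaves, spec):
    """Rebuild the Point from its leaves using the spec."""
    kind, count = spec
    assert kind == "Point" and len(leaves) == count
    return Point(*leaves)


def flat_apply(fn, spec, *leaves):
    """Toy analogue: unflatten the inputs, then invoke the underlying function."""
    return fn(unflatten(list(leaves), spec))


def norm_sq(p):
    return p.x * p.x + p.y * p.y


leaves, spec = flatten(Point(3.0, 4.0))
result = flat_apply(norm_sq, spec, *leaves)
print(result)  # 25.0
```

In this picture only the flat leaves cross the call boundary, while the spec travels alongside as a constant.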

## Next Steps
In subsequent patches, we will try to support the following:
- annotating on class method
- reads to global tensors
- inputs that contain `pytree.register_constant`-ed instances
- function as input
- more output types (e.g., any pytree-registered type)
- `torch.nn.Module` as inputs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146367
Approved by: https://github.com/zou3519
ghstack dependencies: #146714
2025-02-26 19:47:39 +00:00
Luca Wehrstedt
60d94ea22b Add option to limit number of SMs used by matmul kernels (#147966)
Resubmission of #144974 which was reverted for unrelated reasons.

Newer matmul kernels, e.g. those targeting Hopper GPUs, sometimes use a "persistent" schedule, which consists of launching as many CUDA blocks as there are SMs on the GPU, with each such block then working on multiple output tiles in a row. This eliminates the overhead of starting and finishing each tile, effectively doing cross-tile pipelining. In previous generations these latencies could be hidden by having multiple CUDA blocks per SM but, with blocks becoming larger, only one can run at a time per SM and thus this needs to be taken care of in software.

Persistent kernels become an issue when other kernels are running concurrently. The classic example is an NCCL communication kernel running in the background. In such cases the matmul expects to be able to use all the SMs but is prevented from doing so because some of them are busy. This can lead to its blocks being scheduled as two separate waves on the available SMs. This "wave quantization" can double the latency of the matmul kernels.

While we wait for smarter solutions, such as automatic load balancing among the blocks, an easy way to unblock ourselves is to tell the matmuls to only use a subset of the GPU's SMs. For this, I am introducing a global `sm_carveout` flag which can be used to specify how many SMs should be left available for other kernels.
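
The wave-quantization effect is simple arithmetic, sketched below. The SM counts are assumed values chosen for illustration, `num_waves` is a hypothetical helper, and the model assumes one block occupies one SM at a time; nothing here calls into PyTorch or reads the real `sm_carveout` flag.

```python
import math


def num_waves(num_blocks, available_sms):
    """Waves needed when each SM can host only one block at a time."""
    return math.ceil(num_blocks / available_sms)


TOTAL_SMS = 132  # assumed SM count for the GPU
BUSY_SMS = 8     # assumed SMs held by a background communication kernel
free_sms = TOTAL_SMS - BUSY_SMS

# A persistent kernel launches one block per SM, but only 124 SMs are free,
# so the last 8 blocks spill into a second wave.
waves_contended = num_waves(TOTAL_SMS, free_sms)

# With a carveout of 8 SMs the kernel launches only 124 blocks,
# which fit into a single wave.
waves_carveout = num_waves(TOTAL_SMS - BUSY_SMS, free_sms)

print(waves_contended, waves_carveout)  # 2 1
```

Going from one wave to two is what can roughly double the kernel's latency, even though only a handful of SMs are unavailable.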

For now I only change the cuBLAS kernels and the scaled-mm CUTLASS kernel. More kernels can be opted-in later.

I tested this change manually, by using the Kineto profiler to look up the grid size of a scaled-mm kernel with different values of `sm_carveout`, and making sure it changed. Suggestions are welcome for a more automated test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147966
Approved by: https://github.com/danthe3rd
2025-02-26 12:01:12 +00:00
PyTorch MergeBot
1e894d2635 Revert "Add option to limit number of SMs used by matmul kernels (#144974)"
This reverts commit af2d63637e.

Reverted https://github.com/pytorch/pytorch/pull/144974 on behalf of https://github.com/wdvr due to reverting in order to revert #147548 that causes a merge conflict ([comment](https://github.com/pytorch/pytorch/pull/144974#issuecomment-2683461733))
2025-02-25 22:46:38 +00:00
Luca Wehrstedt
af2d63637e Add option to limit number of SMs used by matmul kernels (#144974)
Newer matmul kernels, e.g. those targeting Hopper GPUs, sometimes use a "persistent" schedule, which consists of launching as many CUDA blocks as there are SMs on the GPU, with each such block then working on multiple output tiles in a row. This eliminates the overhead of starting and finishing each tile, effectively doing cross-tile pipelining. In previous generations these latencies could be hidden by having multiple CUDA blocks per SM but, with blocks becoming larger, only one can run at a time per SM and thus this needs to be taken care of in software.

Persistent kernels become an issue when other kernels are running concurrently. The classic example is an NCCL communication kernel running in the background. In such cases the matmul expects to be able to use all the SMs but is prevented from doing so because some of them are busy. This can lead to its blocks being scheduled as two separate waves on the available SMs. This "wave quantization" can double the latency of the matmul kernels.

While we wait for smarter solutions, such as automatic load balancing among the blocks, an easy way to unblock ourselves is to tell the matmuls to only use a subset of the GPU's SMs. For this, I am introducing a global `sm_carveout` flag which can be used to specify how many SMs should be left available for other kernels.

For now I only change the cuBLAS kernels and the scaled-mm CUTLASS kernel. More kernels can be opted-in later.

I tested this change manually, by using the Kineto profiler to look up the grid size of a scaled-mm kernel with different values of `sm_carveout`, and making sure it changed. Suggestions are welcome for a more automated test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144974
Approved by: https://github.com/eqy, https://github.com/albanD
2025-02-25 10:19:19 +00:00
clr
166419b9c1 dynamo: Don't crash when encountering a object with no __name__ (#147246)
This was triggering on ScriptFunctions. Note that other than badly implemented C functions, this seems to be almost impossible to trigger, so I wrote a smaller unit test rather than a full repro. Let me know if people feel strongly and want a full reproduction.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147246
Approved by: https://github.com/anijain2305, https://github.com/jansel, https://github.com/Skylion007
2025-02-18 20:35:49 +00:00
Aaron Gokaslan
6344ca1dd4 [BE][Ez]: Apply FURB188: use str remove(pre|suf)fix (#146997)
Since we are on 3.9, we can use this nice str builtin which is more readable and more efficient.
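
As a small illustration of the kind of rewrite FURB188 suggests (the strings below are made-up examples, not code taken from this PR):

```python
name = "torch._dynamo.utils"
prefix = "torch."

# Pre-3.9 idiom: check, then slice by the prefix length.
old_style = name[len(prefix):] if name.startswith(prefix) else name

# Python 3.9+: str.removeprefix does the check and the slice in one call.
new_style = name.removeprefix(prefix)

# str.removesuffix is the mirror image for trailing text.
stem = "trace_rules.py".removesuffix(".py")

print(old_style, new_style, stem)
```

Both methods return the string unchanged when the prefix or suffix is absent, so no explicit `startswith`/`endswith` guard is needed.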

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146997
Approved by: https://github.com/XuehaiPan, https://github.com/cyyever, https://github.com/jansel
2025-02-14 03:38:07 +00:00
rzou
5dab0aeef0 [SkipFiles] Some more cleanup (#147013)
This isn't a no-op but I think it's fine. It changes the case where a
function f1 in a module in MOD_SKIPFILES calls a function f2 in one of
the deleted modules. Previously f2 would have been skipped, now f2 gets
inlined.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147013
Approved by: https://github.com/yanboliang
ghstack dependencies: #147016, #147012
2025-02-13 01:18:47 +00:00
rzou
fddaa2958b [SkipFiles] Some more cleanup (#147012)
I think these are all no-ops.

Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147012
Approved by: https://github.com/yanboliang
ghstack dependencies: #147016
2025-02-13 01:18:47 +00:00
rzou
87ebd77b34 Add some more docs to trace_rules.py (#147016)
After discussing with Yanbo we wanted to write the behavior down so we
don't need to rederive it in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147016
Approved by: https://github.com/yanboliang
2025-02-13 01:18:39 +00:00
Animesh Jain
d6513f3246 [dynamo] Support list subclasses and fix dict subclasses mutation bugs (#146819)
This PR adds support for list subclasses. Among other things, it includes:

1) Tracking the mutations on internal vts like `_dict_vt` and `_list_vt` using sources. This helps identify if there was a mutation in the underlying data structures, and we need to reconstruct it.
2) `UserDefinedObjectVariable` now has a new method - `is_modified` which `side_effect` infra relies upon to check mutations in the underlying vts (like `_dict_vt`).
3) `reconstruction` logic ensures that we use `dict.__getitem__` and `list.__getitem__` methods. This is super important because we don't want to call the overridden `__getitem__` methods.
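
Point 3 rests on a plain Python fact: calling the base-class method directly bypasses a subclass override. A minimal sketch, with `LoudDict` and `LoudList` as hypothetical classes invented for this example:

```python
class LoudDict(dict):
    # Overridden accessor that changes what callers observe.
    def __getitem__(self, key):
        return f"wrapped:{super().__getitem__(key)}"


class LoudList(list):
    # Overridden accessor that negates the stored value.
    def __getitem__(self, index):
        return -super().__getitem__(index)


d = LoudDict(a=1)
items = LoudList([10, 20])

via_override_dict = d["a"]                  # goes through LoudDict.__getitem__
raw_dict = dict.__getitem__(d, "a")         # reads the underlying storage

via_override_list = items[1]                # goes through LoudList.__getitem__
raw_list = list.__getitem__(items, 1)       # reads the underlying storage

print(via_override_dict, raw_dict, via_override_list, raw_list)
```

Reading through `dict.__getitem__` and `list.__getitem__` returns the stored values, which is what reconstruction needs, rather than whatever user code in the override decides to return.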

If this PR is hard to review, please let me know. I can break it into several small PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146819
Approved by: https://github.com/StrongerXi, https://github.com/jansel
2025-02-12 17:46:02 +00:00
rzou
5235a18cd6 [SkipFiles] remove some more stuff from MOD_SKIPLIST (#146876)
Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146876
Approved by: https://github.com/anijain2305
ghstack dependencies: #146854
2025-02-11 15:00:56 +00:00
rzou
a7fe384d0e Remove torch._higher_order_ops from MOD_SKIPLIST (#146853)
Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146853
Approved by: https://github.com/williamwen42
2025-02-11 04:38:26 +00:00
rzou
275c034b16 [SkipFiles] remove some stuff from MOD_SKIPLIST (#146854)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146854
Approved by: https://github.com/yanboliang, https://github.com/anijain2305
2025-02-11 01:34:46 +00:00
Guilherme Leobas
8603a1c870 Suport generators (#141055)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141055
Approved by: https://github.com/zou3519
2025-02-08 22:42:12 +00:00
eellison
92b7e610ab [Inductor changes] Invoke Quant (#139102)
Adds an `invoke_quant` higher order operator as proposed [here](https://docs.google.com/document/d/1s2PfJlq6Q1F8l11CkTIC69BW1rEnGEgs6YmBC7hu8rA/edit?tab=t.0).

The primary motivations are

- Unifying scattered reasoning for quant operators throughout the code base

- Ease of pattern matching - see this very large pattern match expression [here](949fdd2997/torch/_inductor/fx_passes/post_grad.py (L390-L426). Compared to the pattern I have in the tests:

```
        @register_graph_pattern(
            CallFunction(
                torch.ops.aten.mm,
                CallFunction(
                    torch.ops.higher_order.invoke_quant,
                    Ignored(),
                    Ignored(),
                    Ignored(),
                    scheme="nf4",
                ),
                Arg(),
            ),
            pass_dict=test_pass,
        )
```

- Ability to specify inductor specific logic, like codegen'ing the operators in lower precision, or forcing fusion to a matmul.

Example graph:

``` Python
 ===== AFTER POST GRAD =====
 /data/users/eellison/pytorch/torch/fx/_lazy_graph_module.py class <lambda>(torch.nn.Module):
    def forward(self, arg0_1: "f32[8][1]cpu", arg1_1: "f32[8][1]cpu"):
         # File: /data/users/eellison/pytorch/torch/_higher_order_ops/invoke_quant.py:87 in __call__, code: return invoke_quant_tracer(*args, **kwargs, quant_options=self)  # type: ignore[call-arg]
        repeated_subgraph0 = self.repeated_subgraph0
        invoke_quant: "f32[8][1]cpu" = torch.ops.higher_order.invoke_quant(repeated_subgraph0, arg0_1, arg1_1, scheme = 'nf4');  repeated_subgraph0 = arg0_1 = arg1_1 = None
        return (invoke_quant,)

    class repeated_subgraph0(torch.nn.Module):
        def forward(self, arg0_1: "f32[8][1]cpu", arg1_1: "f32[8][1]cpu"):
             # File: /data/users/eellison/pytorch/torch/_higher_order_ops/invoke_quant.py:87 in __call__, code: return invoke_quant_tracer(*args, **kwargs, quant_options=self)  # type: ignore[call-arg]
            mul: "f32[8][1]cpu" = torch.ops.aten.mul.Tensor(arg0_1, arg1_1);  arg0_1 = None
            add: "f32[8][1]cpu" = torch.ops.aten.add.Tensor(mul, arg1_1);  mul = arg1_1 = None
            return add
```

The schema for `invoke_quant` is `torch.ops.higher_order.invoke_quant(subgraph, *args, scheme=None)` where the scheme will not always be present.

I wasn't sure exactly how the inductor specific configurations like `codgen_in_low_precision` should be passed through. I didn't want to stuff them all in as kwargs, and I didn't want to have them affect pattern matching. So they will be stored as meta of the node itself. Following that, I wanted the invocation of the hop to match how it will show up in the graph, so I decided to have it be an object that is then invoked for the tracing.

```
invoke_quant = InvokeQuant(codegen_low_precision=True)
invoke_quant(gn, (x, y), scheme="nf4")
```
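
To make the calling convention concrete without depending on PyTorch, here is a toy stand-in. `ToyInvokeQuant` is a hypothetical class written for this note; it only mirrors the shape of the call (options on the object, `scheme` as a keyword, args packed in a tuple) and does none of the tracing or metadata handling of the real operator. The body of `gn` mirrors the mul-then-add subgraph in the example graph.

```python
class ToyInvokeQuant:
    """Toy object that carries options and forwards the call to a subgraph."""

    def __init__(self, codegen_low_precision=False):
        # Options live on the object, not in the call's kwargs.
        self.codegen_low_precision = codegen_low_precision

    def __call__(self, subgraph, args, scheme=None):
        # `scheme` is accepted but unused here; this toy simply runs the subgraph.
        return subgraph(*args)


def gn(x, y):
    # Same arithmetic as the example subgraph: (x * y) + y.
    return x * y + y


invoke_quant = ToyInvokeQuant(codegen_low_precision=True)
out = invoke_quant(gn, (2.0, 3.0), scheme="nf4")
print(out)  # 9.0
```

Keeping the options on the object means two calls with different options still look identical to a pattern that matches only on the subgraph, the args, and `scheme`.
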
Todo - not require the packing of args in a tuple, will do following https://github.com/pytorch/pytorch/pull/139162.

Feedback welcome.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139102
Approved by: https://github.com/Chillee
2025-02-08 19:30:19 +00:00
Animesh Jain
5f53889850 [dynamo][builtin-skipfiles-cleanup] Remove inspect (#146116)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146116
Approved by: https://github.com/williamwen42, https://github.com/zou3519, https://github.com/jansel
ghstack dependencies: #146322
2025-02-04 03:36:07 +00:00
Nikita Shulga
e56dcf2772 [CPUInductor] Fix SVE256 detection (#146207)
This PR removes `torch.cpu._is_arm_sve_supported()` and replaces it with the stable `torch.backends.cpu.get_cpu_capability()`.

I should have reviewed https://github.com/pytorch/pytorch/pull/134672 more thoroughly, because it introduced a duplicate but slightly different API for detecting CPU architectures, which resulted in runtime crashes on systems that support SVE128 rather than SVE256.

Fixes https://github.com/pytorch/pytorch/issues/145441

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146207
Approved by: https://github.com/angelayi
2025-02-01 18:51:34 +00:00
Animesh Jain
781aceee9c [dynamo] Revert abc change due to internal failures (#146177)
xref - https://www.internalfb.com/tasks/?t=191383874

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146177
Approved by: https://github.com/StrongerXi
ghstack dependencies: #146141
2025-01-31 21:28:06 +00:00
Animesh Jain
667b94d1c2 [hotfix][dynamo] Skip linecache due to a flaky issue (#146141)
A large number of jit + dynamo wrapped tests fail in linecache tracing.
We need further debugging. Skipping for now to stem the bleeding.

https://github.com/pytorch/pytorch/issues/146076

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146141
Approved by: https://github.com/StrongerXi
2025-01-31 17:45:06 +00:00
Animesh Jain
4499d60d56 [dynamo][builin-skipfiles-cleanup] Remove types (#145909)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145909
Approved by: https://github.com/zou3519
ghstack dependencies: #145856, #145875, #145878, #145892
2025-01-29 16:47:02 +00:00
Animesh Jain
3f77002b96 [dynamo][builtin-skipfiles-cleanup] remove abc, enum, importlib (#145892)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145892
Approved by: https://github.com/williamwen42, https://github.com/StrongerXi
ghstack dependencies: #145856, #145875, #145878
2025-01-29 05:30:06 +00:00
Animesh Jain
236793684d [dynamo][builtin-skipfiles-cleanup] Remove threading, _collections_abc, _weakrefset, threading (#145878)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145878
Approved by: https://github.com/williamwen42, https://github.com/StrongerXi
ghstack dependencies: #145856, #145875
2025-01-29 05:30:06 +00:00
Animesh Jain
a479656cd2 [dynamo][builtin-skipfiles-removal] Remove logging (#145875)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145875
Approved by: https://github.com/williamwen42
ghstack dependencies: #145856
2025-01-29 05:29:58 +00:00
Animesh Jain
64ee57847b [dynamo][builtin-skipfiles-cleanup] Remove some builtins (#145856)
[dynamo][builtin-skipfiles-cleanup] Remove more builtins

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145856
Approved by: https://github.com/zou3519
2025-01-29 05:29:47 +00:00
Animesh Jain
80a0412b76 [dynamo][builtin-skipfiles-cleanup] Remove posixpath (#145828)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145828
Approved by: https://github.com/zou3519
ghstack dependencies: #145744, #145753, #145826
2025-01-28 16:14:34 +00:00
Animesh Jain
6824a4a75d [dynamo][builtin-skipfiles-cleanup] Remove re (#145826)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145826
Approved by: https://github.com/zou3519
ghstack dependencies: #145744, #145753
2025-01-28 16:14:34 +00:00
Animesh Jain
4307e6c008 [dynamo][builtin-skipfile-cleanup] Remove signal (#145753)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145753
Approved by: https://github.com/zou3519
ghstack dependencies: #145744
2025-01-28 16:14:23 +00:00
Animesh Jain
5c5306e8bc [dynamo][builtin-skiplist-cleanup] Remove weakref (#145744)
WeakKeyDictionary already works very nicely with the UserDefinedObject Variable Tracker.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145744
Approved by: https://github.com/jansel
2025-01-28 07:55:12 +00:00
rzou
ea141d8134 functional compiled autograd (#144707)
This PR squashes together the following commits:

https://github.com/pytorch/pytorch/pull/144115
https://github.com/pytorch/pytorch/pull/143417
https://github.com/pytorch/pytorch/pull/143405
https://github.com/pytorch/pytorch/pull/143387
https://github.com/pytorch/pytorch/pull/143304
https://github.com/pytorch/pytorch/pull/143296

This is a refactor of compiled autograd to use "functional autograd". The end goal is that it gets compiled autograd's initial capture to stop specializing on Tensor metadata, therefore allowing compiled autograd to better handle Tensor subclasses.

For more information, please read the commit messages for each PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144707
Approved by: https://github.com/bdhirsh, https://github.com/xmfan, https://github.com/jansel
2025-01-27 05:20:56 +00:00
Animesh Jain
53fc921ce2 [dynamo][trace-rules-cleanup] Remove functools from the Builtins skiplist (#145519)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145519
Approved by: https://github.com/yanboliang, https://github.com/zou3519
2025-01-24 06:02:03 +00:00
PyTorch MergeBot
3f6cfd0156 Revert "[compiled autograd] stop specializing on metadata during initial trace (#143417)"
This reverts commit 99dd1bf1b9.

Reverted https://github.com/pytorch/pytorch/pull/143417 on behalf of https://github.com/izaitsevfb due to breaking internal tests T213390054 ([comment](https://github.com/pytorch/pytorch/pull/143296#issuecomment-2611224926))
2025-01-23 23:34:12 +00:00
Nikhil Gupta
41b38f755c Revert "Reverting the PR adding Kleidiai-based int4 kernels (#145392)" (#145505)
https://github.com/pytorch/pytorch/pull/134124 was reverted by https://github.com/pytorch/pytorch/pull/145392 due to KleidiAI clone issue.

1. This reverts commit 0940eb6d44 (https://github.com/pytorch/pytorch/pull/145392) and fixes the KleidiAI mirror issue.
2. KleidiAI is now cloned from the GitHub mirror instead of Arm's GitLab.

Change-Id: I7d6eee7214cd117d3057d615936fcc3ee6052fa2

Fixes https://github.com/pytorch/pytorch/issues/145273

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145505
Approved by: https://github.com/malfet
2025-01-23 18:50:59 +00:00
Animesh Jain
5a18f1e1eb [dynamo] Support fx map_aggregate (#145351)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145351
Approved by: https://github.com/zou3519
2025-01-23 03:19:30 +00:00
rzou
99dd1bf1b9 [compiled autograd] stop specializing on metadata during initial trace (#143417)
The previous PRs built up to this. We change compiled autograd's initial
trace to stop baking in metadata.

While tracing, we allocate some weirdly shaped tensors that we can put
proxies on. The initial trace should not be accessing any metadata of
these tensors (it will likely error out if it does because of how weird
the shapes are).

This involved fixing some various sites where we do specialize on the
metadata, like:
- we change CopySlices's apply_with_saved to proxy some calls
  into the graph (this change is fairly hard to split out by itself).
- we stop calling InputBuffer::add
- we delete the weird metadata from the graph so that no graph passes
  can make use of it.

Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143417
Approved by: https://github.com/jansel, https://github.com/xmfan
ghstack dependencies: #143296, #143304, #143387, #143405
2025-01-22 21:51:07 +00:00
albanD
0940eb6d44 Reverting the PR adding Kleidiai-based int4 kernels (#145392)
Mitigation for https://github.com/pytorch/pytorch/issues/145273
Reverting https://github.com/pytorch/pytorch/pull/134124 and https://github.com/pytorch/pytorch/pull/144074

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145392
Approved by: https://github.com/ZainRizvi, https://github.com/malfet, https://github.com/atalman, https://github.com/digantdesai
2025-01-22 20:11:49 +00:00
rzou
1e8d6d6f0e [SkipFiles] New modules added to torch.* are inlined by default (#145279)
This PR:
- makes it so that new modules added to torch are inlined by default
- adds a list of the previously "skipped by default" modules to avoid
  regressing anything. This is a new MOD_SKIPLIST list that is consulted
  in trace_rules.check_file.
- Follow-up work will go through this list, one-by-one, and try to delete
  modules. I think we should be able to delete almost everything,
  except for torch._dynamo.
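
The "inline by default, skip only if listed" rule can be sketched as a simple lookup. Everything below is a toy under stated assumptions: `TOY_MOD_SKIPLIST` and `is_skipped` are hypothetical names, the entries are illustrative, and the check is done on dotted module names for brevity, which may differ from how `trace_rules.check_file` actually decides.

```python
# Illustrative entries only; "torch._example_legacy" is a made-up module name.
TOY_MOD_SKIPLIST = {"torch._dynamo", "torch._example_legacy"}


def is_skipped(module_name):
    """Return True if the module, or one of its parent packages, is listed."""
    return any(
        module_name == entry or module_name.startswith(entry + ".")
        for entry in TOY_MOD_SKIPLIST
    )


# A listed package and its submodules are skipped.
print(is_skipped("torch._dynamo.utils"))      # True
# A newly added module is not listed, so it is inlined by default.
print(is_skipped("torch.brand_new_module"))   # False
```

Under this scheme, deleting an entry from the list is all it takes to switch a module from skipped to inlined, which is what the follow-up cleanup commits do one module at a time.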

Test Plan
- existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145279
Approved by: https://github.com/yanboliang
2025-01-21 23:24:12 +00:00
Aaron Orenstein
a79100ab11 PEP585 update - torch/_dynamo (#145105)
See #145101 for details.
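
For readers unfamiliar with PEP 585: since Python 3.9 the builtin collection types can be used directly as generic annotations. The functions below are a made-up before/after pair for illustration, not code from this PR.

```python
from typing import Dict, List


# Before: generic aliases imported from typing.
def count_old(names: List[str]) -> Dict[str, int]:
    counts: Dict[str, int] = {}
    for name in names:
        counts[name] = counts.get(name, 0) + 1
    return counts


# After (PEP 585): the builtin types are subscripted directly.
def count_new(names: list[str]) -> dict[str, int]:
    counts: dict[str, int] = {}
    for name in names:
        counts[name] = counts.get(name, 0) + 1
    return counts


print(count_new(["a", "b", "a"]))  # {'a': 2, 'b': 1}
```

The runtime behavior is identical; only the annotations change, and the `typing` imports become unnecessary.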

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145105
Approved by: https://github.com/bobrenjc93
2025-01-18 20:47:11 +00:00
Yanbo Liang
43a00d73b3 [Trace Python Dispatcher] Support FuncTorchInterpreter (#144444)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144444
Approved by: https://github.com/williamwen42, https://github.com/zou3519
ghstack dependencies: #144439
2025-01-17 02:26:37 +00:00
James Wu
7d71ddbe5d Add non_c_binding torch functions to allowlist for AOTAutogradCache, confirm no special handlers for them (#144802)
Differential Revision: [D68173093](https://our.internmc.facebook.com/intern/diff/D68173093/)

This diff allows any function in torch_non_c_binding_in_graph_functions to be safe to cache. These functions should be safe to cache because they are part of the torch API, and do not save global state (or if they do, dynamo creates unique guards around the constants they return).
A function that's allowed in a dynamo graph is safe to cache for AOTAutograd purposes as long as:
- It's functional (i.e. does not access global state);
- or its value is constant folded away (and guarded against by dynamo)

The tricky cases are functions that dynamo uses special handlers to track. These special handlers can sometimes close over stuff that's safe for dynamo locally, but isn't encoded anywhere when cached across processes. An example of this is `DTensor.from_local`, where various DeviceMesh information doesn't change in the same dynamo process, but can change across multiple processes. The handler for `DTensor.from_local` closes over these and dynamo creates a proxy for the function call. This is not safe to cache.

That said, most special handlers are in fact functional and safe. So I add a unit test to test_trace_rules.py that confirms that any function with special handlers in dynamo added to this list needs to be audited to be safe to cache.

The list of safe handlers there either:
- Don't access global state;
- Guard on global state; or
- Always return a constant that never changes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144802
Approved by: https://github.com/bdhirsh
2025-01-15 05:41:36 +00:00
Yanbo Liang
430d54ee20 [Dynamo] Add functorch C++ bindings as in graph functions (#144309)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144309
Approved by: https://github.com/williamwen42
ghstack dependencies: #144306, #144307, #144308
2025-01-07 22:25:01 +00:00
Yanbo Liang
d146763f6f [Dynamo] Inline functions in torch._ops (#144308)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144308
Approved by: https://github.com/williamwen42
ghstack dependencies: #144306, #144307
2025-01-07 22:25:01 +00:00
Yanbo Liang
242a4a3f83 [Dynamo] Inline functions in torch._functorch.pyfunctorch (#144307)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144307
Approved by: https://github.com/williamwen42
ghstack dependencies: #144306
2025-01-07 22:24:53 +00:00
Yanbo Liang
4417be65e5 [Dynamo] Inline functions in torch._functorch.autograd_function (#144306)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144306
Approved by: https://github.com/williamwen42
2025-01-07 22:24:46 +00:00