Commit Graph

224 Commits

Author SHA1 Message Date
William Wen
95e976a63f [dynamo] recursively skip frames when Dynamo cache limit is hit (#135144)
Fixes https://github.com/pytorch/pytorch/pull/135144 and [T197117723](https://www.internalfb.com/intern/tasks/?t=197117723).

In general, this adds `SkipCodeRecursiveException` to Dynamo: when it is raised, `convert_frame` returns a `skip_code_recursive_flag` back to C Dynamo, signaling it to skip the current frame and all recursive calls.
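A minimal conceptual sketch of that control flow (the exception name comes from this PR; the surrounding function and the sentinel shown here are simplified placeholders, not the real `convert_frame` code):

```python
# Conceptual sketch only; everything except SkipCodeRecursiveException is a
# simplified stand-in for the actual convert_frame implementation.
class SkipCodeRecursiveException(Exception):
    """Raised inside Dynamo to request skipping this frame and all recursive calls."""

SKIP_CODE_RECURSIVE_FLAG = object()  # sentinel handed back to C Dynamo

def convert_frame(frame, compile_inner):
    try:
        return compile_inner(frame)
    except SkipCodeRecursiveException:
        # C Dynamo sees this flag and marks the code object (and everything it
        # calls) to run eagerly, without further compilation attempts.
        return SKIP_CODE_RECURSIVE_FLAG
```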

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135144
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-09-06 21:38:53 +00:00
Animesh Jain
058a69d91a [fbcode][dynamo] Turn on guard_nn_modules using justknobs_check (#134928)
As Title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134928
Approved by: https://github.com/ezyang
2024-09-05 22:05:54 +00:00
Laith Sakka
d6091c8726 Add compile time instruction count metric (#133834)
```
PYTHONPATH=$(pwd) python benchmarks/update_hint_benchmark.py out
```
As of this diff, `compile_time_instruction_count` counts the number of instructions executed within `convert_frame.compile_inner`:
```
update_hint_regression,compile_time_instruction_count,10522459165
```
Will add results from CI once populated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133834
Approved by: https://github.com/aorenste
2024-08-27 23:29:02 +00:00
Edward Z. Yang
66d6d8b1b9 Support TORCH_COMPILER_COLLECTIVES envvar (#133696)
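A hedged usage sketch; the env var name is from the commit title, while the value `1` and the mapping onto the config flag are assumptions:

```python
import torch

# Assumed mapping (not stated in this commit message): setting
#   TORCH_COMPILER_COLLECTIVES=1
# in the environment is expected to behave like flipping the config in code:
torch._dynamo.config.enable_compiler_collectives = True
```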
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133696
Approved by: https://github.com/Skylion007, https://github.com/c-p-i-o
2024-08-19 20:13:04 +00:00
Will Feng
6790eb52f9 [Traceable FSDP2] Set torch._dynamo.config.skip_fsdp_hooks to True by default (#133531)
Setting `torch._dynamo.config.skip_fsdp_hooks = True` is required for graph-break compiled FSDP2, so making it the default will ease adoption. Users who want Traceable FSDP2 can set it to False manually (which allows FSDP2 hooks to be traced through).
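For reference, a minimal sketch of opting back into traced hooks (the config name is from this PR; the rest is illustrative):

```python
import torch

# Default after this PR: FSDP2 hooks are skipped (graph-break compiled FSDP2).
# Opt back in to tracing the hooks for Traceable FSDP2:
torch._dynamo.config.skip_fsdp_hooks = False
```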

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133531
Approved by: https://github.com/awgu
ghstack dependencies: #133532
2024-08-16 17:18:42 +00:00
Aart Bik
a8490a0762 [traced-graph][sparse] propagate sparsity in fx graph (#131920)
This PR continues implementing feature request #117188 by generalizing more cases that already work with COO to also work with the compressed sparse formats.

Feature request:
https://github.com/pytorch/pytorch/issues/117188

Rebranch of older PRs (for history):
https://github.com/pytorch/pytorch/pull/131474
https://github.com/pytorch/pytorch/pull/128549

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131920
Approved by: https://github.com/ezyang
2024-08-05 15:49:53 +00:00
Xuehai Pan
e74ba1b34a [BE][Easy][15/19] enforce style for empty lines in import segments in torch/_d*/ (#129767)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by the linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129767
Approved by: https://github.com/anijain2305
2024-07-31 21:18:11 +00:00
Animesh Jain
f44446e851 [dynamo] Turn on inline_inbuilt_nn_modules (#131275)
Known issues that are deliberately kept open and will be fixed later are tracked here - https://github.com/pytorch/pytorch/issues/131696

Training dashboard ([link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2018%20Jul%202024%2000%3A03%3A50%20GMT&stopTime=Thu%2C%2025%20Jul%202024%2000%3A03%3A50%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=gh/anijain2305/435/head&lCommit=408b9358b8fca3a5d08b39741419fe8a596941aa&rBranch=gh/anijain2305/435/base&rCommit=d31f2ae904ba2cf0884bf24413ba2109c3585d51))

![image](https://github.com/user-attachments/assets/08ef081c-37d7-436d-905b-4b9e2b470644)

Inference dashboard ([link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2018%20Jul%202024%2000%3A03%3A50%20GMT&stopTime=Thu%2C%2025%20Jul%202024%2000%3A03%3A50%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&lBranch=gh/anijain2305/435/head&lCommit=914244fa2fe0055917e039e35183b21fa90afdc6&rBranch=gh/anijain2305/435/base&rCommit=d31f2ae904ba2cf0884bf24413ba2109c3585d51))
![image](https://github.com/user-attachments/assets/32136eff-a39e-4cde-a438-e51a665bc3c9)

Inference sees a little bit more perf degradation but we are ok with that.
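For anyone needing the previous behavior, a minimal sketch (the flag name is from this PR):

```python
import torch

# inline_inbuilt_nn_modules is now on by default; to fall back to the previous
# specialized handling of builtin nn.Module calls, turn it off explicitly:
torch._dynamo.config.inline_inbuilt_nn_modules = False
```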

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131275
Approved by: https://github.com/ezyang, https://github.com/jansel
ghstack dependencies: #132053
2024-07-29 20:01:51 +00:00
PyTorch MergeBot
7ef927da15 Revert "[dynamo] Turn on inline_inbuilt_nn_modules (#131275)"
This reverts commit 6de65d5dd4.

Reverted https://github.com/pytorch/pytorch/pull/131275 on behalf of https://github.com/atalman due to Broke CI: dynamo/test_structured_trace.py::StructuredTraceTest::test_ddp_graphs [GH job link](https://github.com/pytorch/pytorch/actions/runs/10132084288/job/28016215101) [HUD commit link](6de65d5dd4) ([comment](https://github.com/pytorch/pytorch/pull/131275#issuecomment-2255839646))
2024-07-29 12:48:27 +00:00
Animesh Jain
6de65d5dd4 [dynamo] Turn on inline_inbuilt_nn_modules (#131275)
Known issues that are deliberately kept open and will be fixed later are tracked here - https://github.com/pytorch/pytorch/issues/131696

Training dashboard ([link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2018%20Jul%202024%2000%3A03%3A50%20GMT&stopTime=Thu%2C%2025%20Jul%202024%2000%3A03%3A50%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=gh/anijain2305/435/head&lCommit=408b9358b8fca3a5d08b39741419fe8a596941aa&rBranch=gh/anijain2305/435/base&rCommit=d31f2ae904ba2cf0884bf24413ba2109c3585d51))

![image](https://github.com/user-attachments/assets/08ef081c-37d7-436d-905b-4b9e2b470644)

Inference dashboard ([link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2018%20Jul%202024%2000%3A03%3A50%20GMT&stopTime=Thu%2C%2025%20Jul%202024%2000%3A03%3A50%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&lBranch=gh/anijain2305/435/head&lCommit=914244fa2fe0055917e039e35183b21fa90afdc6&rBranch=gh/anijain2305/435/base&rCommit=d31f2ae904ba2cf0884bf24413ba2109c3585d51))
![image](https://github.com/user-attachments/assets/32136eff-a39e-4cde-a438-e51a665bc3c9)

Inference sees a little bit more perf degradation but we are ok with that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131275
Approved by: https://github.com/ezyang, https://github.com/jansel
ghstack dependencies: #131744, #131928, #131948
2024-07-28 13:23:00 +00:00
PyTorch MergeBot
14920c149b Revert "[dynamo] Turn on inline_inbuilt_nn_modules (#131275)"
This reverts commit 0455344777.

Reverted https://github.com/pytorch/pytorch/pull/131275 on behalf of https://github.com/clee2000 due to I think this broke inductor/test_cpu_select_algorithm.py::TestSelectAlgorithmDynamicShapesCPU::test_quantized_linear_amx_dynamic_shapes_batch_size_16_in_features_4_out_features_64_bias_True_cpu [GH job link](https://github.com/pytorch/pytorch/actions/runs/10102272826/job/27938970118) [HUD commit link](0455344777) not run on PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/131275#issuecomment-2251609554))
2024-07-26 00:12:40 +00:00
Animesh Jain
0455344777 [dynamo] Turn on inline_inbuilt_nn_modules (#131275)
Known issues that are deliberately kept open and will be fixed later are tracked here - https://github.com/pytorch/pytorch/issues/131696

Training dashboard ([link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2018%20Jul%202024%2000%3A03%3A50%20GMT&stopTime=Thu%2C%2025%20Jul%202024%2000%3A03%3A50%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=gh/anijain2305/435/head&lCommit=408b9358b8fca3a5d08b39741419fe8a596941aa&rBranch=gh/anijain2305/435/base&rCommit=d31f2ae904ba2cf0884bf24413ba2109c3585d51))

![image](https://github.com/user-attachments/assets/08ef081c-37d7-436d-905b-4b9e2b470644)

Inference dashboard ([link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2018%20Jul%202024%2000%3A03%3A50%20GMT&stopTime=Thu%2C%2025%20Jul%202024%2000%3A03%3A50%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&lBranch=gh/anijain2305/435/head&lCommit=914244fa2fe0055917e039e35183b21fa90afdc6&rBranch=gh/anijain2305/435/base&rCommit=d31f2ae904ba2cf0884bf24413ba2109c3585d51))
![image](https://github.com/user-attachments/assets/32136eff-a39e-4cde-a438-e51a665bc3c9)

Inference sees a little bit more perf degradation but we are ok with that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131275
Approved by: https://github.com/ezyang
ghstack dependencies: #131744
2024-07-25 22:14:17 +00:00
Edward Z. Yang
0c6f1ca064 Introduce torch._dynamo.config.enable_compiler_collectives for syncing compilation across ranks (#130935)
This PR implements an opt-in configuration option for synchronizing compilation across all ranks at the end of Dynamo tracing (and potentially, other places in the future). There are two pieces to this PR:

1. Implementing infrastructure for compiler collectives (DistributedState/LocalState, the actual collective)
2. Using this infrastructure to synchronize automatic dynamic choices across all ranks

The infrastructure in part one can be used for other purposes; just add more (serializable) fields to LocalState.

Here is how automatic dynamic synchronization works:

1. Preflight in "torch/_dynamo/variables/builder.py": On the first Dynamo trace run, we trace without automatic dynamic at all; we assume all Tensor inputs that are not otherwise marked are static. This run is purely to collect all Tensor input sizes in the program.
2. "torch/_dynamo/output_graph.py": At the end of the first Dynamo trace run, we perform a compiler collective to distribute all Tensor input sizes to all ranks. Then we restart Dynamo.
3. Apply the updates in "torch/_dynamo/variables/builder.py": Now that we have all sizes for every rank, we now update frame state with the observed sizes for all ranks, in rank order. Under the assumption that frame state is consistent on all ranks, this series of updates will preserve consistency.

For future work, it would be safer if we force a consistent hint on all ranks; this is more involved as we have to interpose in fakification.
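A hedged end-to-end sketch of turning the option on (single-rank gloo group only so the snippet is self-contained; real use would be under torchrun with multiple ranks):

```python
import os
import torch
import torch.distributed as dist

# Self-contained single-rank setup; in practice this runs under torchrun.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

# Opt in: at the end of Dynamo tracing, ranks exchange observed tensor input
# sizes so automatic-dynamic decisions stay consistent across the job.
torch._dynamo.config.enable_compiler_collectives = True

@torch.compile(backend="eager")
def step(x):
    return (x * 2).sum()

print(step(torch.randn(8)))
dist.destroy_process_group()
```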

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130935
Approved by: https://github.com/jansel
2024-07-24 11:24:11 +00:00
Pian Pawakapan
988ed4d5db [export] clean up allow_complex_guards_as_runtime_asserts flag (#130596)
Summary: removes underscore, cleans up dead code in DimConstraints

Test Plan: existing export tests

Reviewed By: angelayi

Differential Revision: D59612746

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130596
Approved by: https://github.com/angelayi
2024-07-12 17:17:11 +00:00
Yanbo Liang
111f9b5d44 [Dynamo] Add config to skip/inline torchrec (#129912)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129912
Approved by: https://github.com/anijain2305
2024-07-03 00:14:51 +00:00
Elias Ellison
b8e5678ad2 Delete lazy ddp optimizer (#120727)
This is no longer necessary now that the normal ddp optimizer works correctly with inductor strides.

Differential Revision: [D54858819](https://our.internmc.facebook.com/intern/diff/D54858819)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120727
Approved by: https://github.com/jansel, https://github.com/yf225
2024-06-26 21:53:54 +00:00
Will Feng
575bc1e3af [Reopen #114036] Allow "must recompute" in torch.compile + selective checkpointing (SAC) (#129295)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129295
Approved by: https://github.com/Chillee
2024-06-25 23:47:08 +00:00
Will Feng
dadc0ed4c8 [Traceable FSDP2] Add aot_eager backend E2E tests for transformer model (#129157)
This PR adds Traceable FSDP2 `aot_eager` backend E2E tests for a simple MLP as well as a transformer model.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129157
Approved by: https://github.com/awgu
ghstack dependencies: #129203
2024-06-23 06:11:11 +00:00
Aaron Orenstein
dcfa7702c3 Flip default value for mypy disallow_untyped_defs [1/11] (#127838)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127838
Approved by: https://github.com/oulgen
2024-06-08 18:16:33 +00:00
Michael Lazos
2129903aa3 Properly detect nested torch function args (#127496)
Dynamo was not detecting nested torch function classes in containers. This was because pytree compatibility for variable trackers had been removed.
Fixes https://github.com/pytorch/pytorch/issues/127174
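An illustrative repro shape (not the PR's test; the subclass here is a made-up stand-in): a class with a `__torch_function__` override nested inside a list argument, which Dynamo must notice despite the container.

```python
import torch

class Logged(torch.Tensor):
    # Made-up torch-function override; the point is only that it exists and is
    # nested inside a container argument rather than passed directly.
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        return super().__torch_function__(func, types, args, kwargs)

@torch.compile(backend="eager")
def f(pair):
    return pair[0] + pair[1]

x = torch.randn(3).as_subclass(Logged)
print(f([x, torch.randn(3)]))
```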

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127496
Approved by: https://github.com/anijain2305
2024-06-02 03:43:22 +00:00
Simon Fan
ec098b88b6 [compiled autograd] torch.compile API (#125880)
- enter existing compiled autograd ctx manager before entering torch.compile frames
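For context, a sketch of the pre-existing context-manager entry point referenced above (illustrative usage; the exact API surface this PR adds is not shown here):

```python
import torch
import torch._dynamo.compiled_autograd as compiled_autograd

model = torch.nn.Linear(8, 4)
x = torch.randn(2, 8)

# Enter the compiled-autograd context so the backward graph is captured and
# compiled as well; torch.compile(backend="eager") serves as the compiler_fn.
with compiled_autograd.enable(torch.compile(backend="eager")):
    loss = model(x).sum()
    loss.backward()
```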

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125880
Approved by: https://github.com/jansel
2024-05-31 04:38:20 +00:00
PyTorch MergeBot
ce63b676f3 Revert "[compiled autograd] torch.compile API (#125880)"
This reverts commit e1c322112a.

Reverted https://github.com/pytorch/pytorch/pull/125880 on behalf of https://github.com/atalman due to sorry your PR broke lint, need to revert ([comment](https://github.com/pytorch/pytorch/pull/125880#issuecomment-2139605376))
2024-05-30 13:53:31 +00:00
Simon Fan
e1c322112a [compiled autograd] torch.compile API (#125880)
- enter existing compiled autograd ctx manager before entering torch.compile frames

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125880
Approved by: https://github.com/jansel
2024-05-30 02:10:06 +00:00
Pian Pawakapan
8a31c2aa84 [export] allow complex guards as runtime asserts (#127129)
With the current state of export's dynamic shapes, we struggle with guards and constraints that fall outside the current dynamic shapes language, which is expressed with dims and derived dims. While we can compile and guarantee correctness for guards within that language (e.g. min/max ranges, linear relationships, integer divisibility), we have trouble dynamically compiling guards that extend beyond it.

For these "complex" guards, we typically do either of the following: 1) raise a constraint violation error, along the lines of "not all values of <symbol> in the specified range satisfy <guard>", with or without suggested fixes, 2) specialize to the provided static values and suggest removing dynamism, or 3) fail compilation due to some arbitrary unsupported case. Previous [work](https://github.com/pytorch/pytorch/pull/124949) went towards resolving this by disabling forced specializations, instead allowing the user to fail at runtime with incorrect inputs.

In this PR, relying on [hybrid backed-unbacked symints](https://github.com/pytorch/pytorch/issues/121749), [deferred runtime asserts](https://github.com/pytorch/pytorch/blob/main/torch/fx/passes/runtime_assert.py), and the function [_is_supported_equivalence()](d7de4c9d80/torch/fx/experimental/symbolic_shapes.py (L1824)), we add a flag `_allow_complex_guards_as_runtime_asserts` which allows the user to compile exported programs containing these guards and maintain dynamism, while adding correctness checks as runtime assertions in the graph.

Hybrid backed-unbacked symints allow us to easily bypass "implicit" guards emitted from computation - guards that we ~expect to be true. Popular examples revolve around reshapes:
```
# reshape
def forward(self, x, y):  # x: [s0, s1], y: [s2]
    return x.reshape([-1]) + y  # guard s0 * s1 = s2
```

This leads to the following exported program:

```
class GraphModule(torch.nn.Module):
    def forward(self, x: "f32[s0, s1]", y: "f32[s2]"):
        sym_size_int: "Sym(s2)" = torch.ops.aten.sym_size.int(y, 0)
        mul: "Sym(-s2)" = -1 * sym_size_int;  sym_size_int = None
        sym_size_int_1: "Sym(s0)" = torch.ops.aten.sym_size.int(x, 0)
        sym_size_int_2: "Sym(s1)" = torch.ops.aten.sym_size.int(x, 1)
        mul_1: "Sym(s0*s1)" = sym_size_int_1 * sym_size_int_2;  sym_size_int_1 = sym_size_int_2 = None
        add: "Sym(s0*s1 - s2)" = mul + mul_1;  mul = mul_1 = None
        eq: "Sym(Eq(s0*s1 - s2, 0))" = add == 0;  add = None
        _assert_scalar = torch.ops.aten._assert_scalar.default(eq, "Runtime assertion failed for expression Eq(s0*s1 - s2, 0) on node 'eq'");  eq = None

        view: "f32[s0*s1]" = torch.ops.aten.view.default(x, [-1]);  x = None
        add_1: "f32[s0*s1]" = torch.ops.aten.add.Tensor(view, y);  view = y = None
        return (add_1,)
```
Another case is symbol divisibility:
```
def forward(self, x):  # x: [s0, s1]
    return x.reshape([-1, x.shape[0] - 1])  # Eq(Mod(s0 * s1, s0 - 1), 0)
```

Applying deferred runtime asserts also helps dynamic compilation for "explicit" complex guards that typically cause problems for export. For example we can generate runtime asserts for not-equal guards, and complex conditions like the following:
```
class Foo(torch.nn.Module):
    def forward(self, x, y):
        # check that negation of first guard also shows up as runtime assertion
        if x.shape[0] == y.shape[0]:  # False
            return x + y
        elif x.shape[0] == y.shape[0] ** 3:  # False
            return x + 2, y + 3
        elif x.shape[0] ** 2 == y.shape[0] * 3:  # True
            return x * 2.0, y * 3.0
```
For the above graph we will generate 3 runtime assertions: the negations of the first 2 conditions, and the 3rd condition itself as a guard.

One additional benefit here over the current state of exported programs is that this adds further correctness guarantees - previously with explicit complex guards, if compilation succeeded, the guards would be ignored at runtime, treated as given.

As shown above, the runtime asserts appear as math ops in the graph, generated by the sympy interpreter, resulting in an _assert_scalar call. There is an option to avoid adding these asserts into the graph, by setting `TORCH_DYNAMO_DO_NOT_EMIT_RUNTIME_ASSERTS=1`. This results in the "original" computation graph, with dynamism, and any incorrect inputs will fail on ops during runtime. Further work could go into prettifying the printer, so the majority of the graph isn't guard-related.

Ideally this PR would subsume and remove the recently added [_disable_forced_specializations](https://github.com/pytorch/pytorch/pull/124949) flag, but that flag still handles one additional case of specialization: single-variable equalities where the symbol is solvable for a concrete value; see this [PR](https://github.com/pytorch/pytorch/pull/126925).

This PR doesn't change any behavior around data-dependent errors/unbacked symints yet; that could be further work.

NOTE: will take naming change suggestions for the flag :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127129
Approved by: https://github.com/avikchaudhuri
2024-05-29 17:15:25 +00:00
dshi7
0f67d38f0f add TORCHDYNAMO_CAPTURE_DYNAMIC_OUTPUT_SHAPE_OPS (#127017)
tlparse prints a failure description like this:

> dynamic shape operator: aten._unique2.default; to enable, set torch._dynamo.config.capture_dynamic_output_shape_ops = True

This adds an OS environment variable to make the option easier to set for testing.
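A hedged sketch of the two spellings (the config name appears in the tlparse message above; the env-var value `1` is assumed):

```python
# Environment-variable spelling (value "1" assumed):
#   TORCHDYNAMO_CAPTURE_DYNAMIC_OUTPUT_SHAPE_OPS=1 python repro.py
# In-code spelling:
import torch

torch._dynamo.config.capture_dynamic_output_shape_ops = True

@torch.compile(fullgraph=True)
def f(x):
    return torch.unique(x)  # aten._unique2 has a data-dependent output shape

print(f(torch.tensor([1, 2, 2, 3])))
```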

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127017
Approved by: https://github.com/jackiexu1992
2024-05-25 05:42:41 +00:00
Alexandre Ghelfi
b3a8a3cbab Fix typos in torch._dynamo.config.py (#126150)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126150
Approved by: https://github.com/Skylion007
2024-05-14 14:27:35 +00:00
Edward Z. Yang
2ba102f689 Implement native support for float inputs in Dynamo and ShapeEnv (#125325)
The big idea is that floats are treated as Tensors on input/output to the FX graph, but on the inside, we immediately call item() on the synthetic Tensor and record regular float operations on it. Canonicalization to Tensor operations will happen in a standalone FX pass. This behavior is controlled by the `specialize_float` config variable and kicks in when it is set to False.

The generated graph looks like this for the test `test_unspec_float_output`:

```
 def forward(self, L_x_: "f32[3]", L_y_: "f32[]"):
     l_x_ = L_x_
     l_y_ = L_y_

     # File: /data/users/ezyang/a/pytorch/test/dynamo/test_unspec.py:511 in f, code: return x + 1, y * 2
     add: "f32[3]" = l_x_ + 1;  l_x_ = None
     item: "Sym(zf0)" = l_y_.item();  l_y_ = None
     mul: "Sym(2*zf0)" = item * 2;  item = None
     scalar_tensor: "f32[]" = torch.scalar_tensor(mul);  mul = None
     return (add, scalar_tensor)
```

The ingredients:

* **torch/_dynamo/variables/builder.py** When `specialize_float` is False, we wrap float literals with `wrap_symfloat`. This is an unholy mashup of `wrap_symint` and `wrap_unspecialized_primitive`. The overall strategy is that we first generate a tensor argument (because that's what we want to show up into the FX graph), but then immediately call item() on the tensor argument to get a SymNodeVariable, which we will do the rest of the tracing with.  Importantly, this SymNodeVariable is backed with the source of the original float: this means we can guard on the resulting value (something we could NOT do with UnspecializedPythonVariable). This has to be done manually, because if you literally call item() on the tensor, you will end up with an unbacked float. There is a bit of copy paste from wrap_symint and wrap_unspecialized_primitive which we can try to factor out, but this really is its own thing and you should review every line of code in the function.
* **torch/fx/experimental/symbolic_shapes.py** We now can generate guards on float inputs, and these guards are handled inside of ShapeEnv. So we need to be able to allocate (backed!) float symbols, and produce guards for them. Fairly straightforward generalization.
* **torch/_dynamo/codegen.py** I also need to maintain the invariant that there are no float outputs to the FX graph. I chose to do this at codegen time. When we detect a SymNodeVariable on the return stack for a float, we convert it on the fly (via `as_tensor`) to a TensorVariable, which is the true output. We then special case the output bytecode to call item() on it again. The tensor conversion is memoized on SymNodeVariable since we typically run the code generation process twice.
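A small usage sketch (the config name is from this PR; the function below is illustrative):

```python
import torch

# Treat Python float inputs symbolically instead of specializing on their value.
torch._dynamo.config.specialize_float = False

@torch.compile(backend="eager")
def f(x, y):
    return x + 1, y * 2  # y stays a symbolic float rather than being burned in

print(f(torch.randn(3), 5.0))
```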

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125325
Approved by: https://github.com/lezcano, https://github.com/jansel
2024-05-14 04:10:01 +00:00
Animesh Jain
0935b3d794 [dynamo] Turn on guard_nn_modules (#125202)
Turning on guard_nn_modules adds a large number of guards, so we are bound to take a perf hit, but the hit is small. These are the numbers:

![image](https://github.com/pytorch/pytorch/assets/13822661/c8793906-c8c7-432b-9af4-4594713067be)

First, we observe that compared to Python guards, C++ guards give around a 6x speedup, which reduces the total time spent in guards. This is shown in the last column (cpp_guards/inductor_optimized_latency). The worst model is around 1.61%, with most of the models below 1%. I think this is a good enough signal to turn the config on.

One might also wonder how much guard slowdown occurs with `guard_nn_modules=True`. This is the table
![image](https://github.com/pytorch/pytorch/assets/13822661/932a885b-1c03-424b-8405-5bc8fd35dd39)

For most models, the guard overhead with nn module guards is under 2x. There are a few outliers where the slowdown is really high; for those models we spend 1%-2% of the time in C++ guards, as shown in the first table.
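A small illustration of what the flag buys (a sketch, not from this PR's tests; the module and attribute names are made up): with nn-module guards on, attribute changes on a module are guarded, so the compiled function recompiles instead of reusing a stale graph.

```python
import torch

torch._dynamo.config.guard_nn_modules = True  # the new default per this PR

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.scale = 2.0

    def forward(self, x):
        return x * self.scale

m = M()
opt = torch.compile(lambda x: m(x), backend="eager")
print(opt(torch.randn(4)))
m.scale = 3.0               # guarded attribute changes are detected...
print(opt(torch.randn(4)))  # ...and trigger a recompile with the new value
```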

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125202
Approved by: https://github.com/ezyang
2024-05-11 19:28:24 +00:00
Animesh Jain
b62e89c1b8 [dynamo] Do not turn on record relay with TORCH_COMPILE_DEBUG (#125488)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125488
Approved by: https://github.com/yanboliang, https://github.com/mlazos
2024-05-04 05:10:31 +00:00
Boyuan Feng
b91f83f181 [cudagraph] add config for cudagraph managed input mutation support (#124754)
Summary: [#123231](https://github.com/pytorch/pytorch/pull/123231) adds cudagraph support for more types of functions (i.e., cudagraph-managed input mutation). These newly supported functions may have mutated static inputs, leading to assertion errors in some workloads that previously skipped cudagraphs. This diff adds a config to opt in to the new feature.

Test Plan: ci

Differential Revision: D56481353

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124754
Approved by: https://github.com/eellison
2024-04-24 04:23:53 +00:00
Animesh Jain
704fac5618 [dynamo][cpp-guard] Reland Attempt 1 - Enable cpp guard manager (#124231)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124231
Approved by: https://github.com/jansel
ghstack dependencies: #124230, #124237
2024-04-18 06:36:20 +00:00
PyTorch MergeBot
530bf391cc Revert "[dynamo] Turn on CPP guard manager (#123547)"
This reverts commit 3e98bdd66d.

Reverted https://github.com/pytorch/pytorch/pull/123547 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/123547#issuecomment-2058337419))
2024-04-16 06:38:15 +00:00
willfengg
f1654fd4b0 [PT2D][FSDP] skip FSDP hooks base on dynamo config (#123021)
unit test: `pytest test/distributed/_composable/fsdp/test_fully_shard_compile.py`

For FSDP, we turn compiling of the hooks on/off based on `torch._dynamo.config.skip_fsdp_hooks`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123021
Approved by: https://github.com/yf225, https://github.com/anijain2305
2024-04-13 01:47:25 +00:00
Animesh Jain
3e98bdd66d [dynamo] Turn on CPP guard manager (#123547)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123547
Approved by: https://github.com/jansel
2024-04-12 23:30:56 +00:00
Jason Ansel
70b8c58f84 [dynamo] Emit warning to turn on capture_scalar_outputs (#123896)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123896
Approved by: https://github.com/anijain2305
ghstack dependencies: #123700, #123705, #123786, #123790, #123803, #123804
2024-04-12 19:03:13 +00:00
Oguz Ulgen
57a2032c7a Delete Lark (#123689)
Now that we are using the MLIR bindings inside Triton, let's delete the Lark parser.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123689
Approved by: https://github.com/jansel
2024-04-11 05:51:06 +00:00
PyTorch MergeBot
6b18daf205 Revert "Delete Lark (#123689)"
This reverts commit a631461eef.

Reverted https://github.com/pytorch/pytorch/pull/123689 on behalf of https://github.com/PaliC due to This PR seems to be breaking  test_binary_ufuncs.py ([comment](https://github.com/pytorch/pytorch/pull/123689#issuecomment-2048489549))
2024-04-10 21:48:04 +00:00
Oguz Ulgen
a631461eef Delete Lark (#123689)
Now that we are using the MLIR bindings inside Triton, let's delete the Lark parser.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123689
Approved by: https://github.com/jansel
2024-04-10 19:41:54 +00:00
Peter Bell
8865425ff7 [minifier] Add config flag to ignore non-fp values (#123006)
When minifying, the after-aot minifier ignores non-floating values by
default but does check them when running the initial graph dump step.
This means we may capture a graph that doesn't fail the tester and doesn't have
any meaningful divergence.

For example, the derivative of `elu(x)` depends on `x > 0` so this value is
saved for backwards and so becomes a graph output. However, the difference
between `FLT_MIN` and `0` in `x` is now enough to trigger an accuracy failure.

I fix this by adding a config variable and environment variable to ignore these
non-floating-point values.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123006
Approved by: https://github.com/ezyang
ghstack dependencies: #123005
2024-04-09 03:34:09 +00:00
Guilherme Leobas
84658d9c4f Enable capture_func_transforms by default (#122211)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122211
Approved by: https://github.com/zou3519
2024-04-05 03:29:11 +00:00
eellison
5f46312dbb Reapply "Switch cudagraph backend to cudagraph trees (#121019)" and "Add Cudagraphs disable checking (#121018)" (#121864) (#122713)
This reverts commit 92ed8553a6.

No longer importing codecache or boxed_nop at top level, both of which caused issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122713
Approved by: https://github.com/anijain2305
2024-04-02 16:11:00 +00:00
Animesh Jain
60f3c092d4 [dynamo] Config option to Inline builtin nn module forward (#122725)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122725
Approved by: https://github.com/jansel
ghstack dependencies: #122646, #122647, #122716, #122769, #122818
2024-03-28 03:01:27 +00:00
Animesh Jain
c108696228 [dynamo][guards-cpp-refactor][easy] Env variable to turn on cpp manager (#122646)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122646
Approved by: https://github.com/jansel
2024-03-27 19:40:37 +00:00
IvanKobzarev
9b095c3fe6 [dynamo] Config to not emit runtime asserts (#122603)
Repetition on squashed & merged by mistake https://github.com/pytorch/pytorch/pull/122406

Differential Revision: [D55312394](https://our.internmc.facebook.com/intern/diff/D55312394)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122603
Approved by: https://github.com/ezyang
2024-03-25 21:17:44 +00:00
Animesh Jain
8860c625ea [dynamo][guards-cpp-refactor] Integrate cpp guard manager with CheckFnManager (#120726)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120726
Approved by: https://github.com/jansel
2024-03-19 03:11:31 +00:00
Animesh Jain
f84d560236 [dynamo] Raise accumulated cache size limit (#122130)
Fixes #114511

This was raised by IBM folks, where an LLM compile was failing because the model had more than 64 layers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122130
Approved by: https://github.com/Chillee, https://github.com/jansel
ghstack dependencies: #121954, #122005
2024-03-19 02:35:48 +00:00
Animesh Jain
92ed8553a6 Revert "Switch cudagraph backend to cudagraph trees (#121019)" and "Add Cudagraphs disable checking (#121018)" (#121864)
This reverts commit 9373ad0bb8.

Revert "Add Cudagraphs disable checking (#121018)"

This reverts commit 4af0e634bf.

Causes compilation time increase.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121864
Approved by: https://github.com/eellison
2024-03-15 00:03:09 +00:00
eellison
4af0e634bf Add Cudagraphs disable checking (#121018)
Adds the same cudagraph disable checking from inductor's cudagraph trees to the cudagraphs backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121018
Approved by: https://github.com/ezyang
ghstack dependencies: #121017
2024-03-08 22:47:24 +00:00
PyTorch MergeBot
3d7cf8f392 Revert "Limit loop unrolling (#120023)"
This reverts commit 6cc7f9a2e6.

Reverted https://github.com/pytorch/pytorch/pull/120023 on behalf of https://github.com/anijain2305 due to breaks llms export ([comment](https://github.com/pytorch/pytorch/pull/120023#issuecomment-1974104633))
2024-03-02 00:04:08 +00:00
angelayi
c844b377fa [dynamo] Reorder logs (#116106)
Currently, when there is a print/warning in the graph, Dynamo graph-breaks, causing export to fail. However, export would like to just skip over these print/warning calls: https://github.com/pytorch/pytorch/issues/113792.

Additionally, there's a torch.compile feature request to "reorder prints": instead of graph-breaking when hitting prints/logging, we can skip over these prints to create larger compiled graphs, and then print the results out after those compiled graphs run: https://github.com/pytorch/pytorch/issues/93739. This PR also adds the `reorderable_logging_functions` config for users to register logging functions (like `print` or a custom logging function) to be reordered. The printout of the bytecode after reordering the prints looks like the following: P914736600
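A hedged usage sketch (the config name is from this PR; the set-style registration is assumed):

```python
import torch

# Register print so Dynamo reorders it to run after the compiled region
# instead of graph-breaking on it (set-style registration assumed).
torch._dynamo.config.reorderable_logging_functions.add(print)

@torch.compile(backend="eager")
def f(x):
    y = x + 1
    print("intermediate:", y)  # intended to execute after the compiled graph runs
    return y * 2

print(f(torch.randn(3)))
```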

There are some limitations to the printing right now:
* You can only register logging functions, not methods
* Inputs to the logging functions can only be tensors, constants, and format strings
* Inputs to the logging functions which will later be mutated in-place will not be printed correctly

TODO: Add the following tests
* print function with argument of nested data structure;
* print function with argument of nested data structure being updated inside of compile region (this would test if we handle side effect correctly);
* custom defined logging functions with nn.Module or nn.Module attribute arguments;
* custom defined logging functions with submodule input/output as arguments (we need to handle the mapping and fused-out value);
* custom defined logging functions with tensor argument and mutation inside of the function (TBD: this may increase memory usage);

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116106
Approved by: https://github.com/yanboliang
2024-03-01 17:04:24 +00:00