Commit Graph

211 Commits

Author SHA1 Message Date
Pian Pawakapan
988ed4d5db [export] clean up allow_complex_guards_as_runtime_asserts flag (#130596)
Summary: removes underscore, cleans up dead code in DimConstraints

Test Plan: existing export tests

Reviewed By: angelayi

Differential Revision: D59612746

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130596
Approved by: https://github.com/angelayi
2024-07-12 17:17:11 +00:00
Yanbo Liang
111f9b5d44 [Dynamo] Add config to skip/inline torchrec (#129912)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129912
Approved by: https://github.com/anijain2305
2024-07-03 00:14:51 +00:00
Elias Ellison
b8e5678ad2 Delete lazy ddp optimizer (#120727)
This is no longer necessary now that the normal ddp optimizer works correctly with inductor strides.

Differential Revision: [D54858819](https://our.internmc.facebook.com/intern/diff/D54858819)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120727
Approved by: https://github.com/jansel, https://github.com/yf225
2024-06-26 21:53:54 +00:00
Will Feng
575bc1e3af [Reopen #114036] Allow "must recompute" in torch.compile + selective checkpointing (SAC) (#129295)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129295
Approved by: https://github.com/Chillee
2024-06-25 23:47:08 +00:00
Will Feng
dadc0ed4c8 [Traceable FSDP2] Add aot_eager backend E2E tests for transformer model (#129157)
This PR adds Traceable FSDP2 `aot_eager` backend E2E tests for simple MLP as well as transformer model.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129157
Approved by: https://github.com/awgu
ghstack dependencies: #129203
2024-06-23 06:11:11 +00:00
Aaron Orenstein
dcfa7702c3 Flip default value for mypy disallow_untyped_defs [1/11] (#127838)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127838
Approved by: https://github.com/oulgen
2024-06-08 18:16:33 +00:00
Michael Lazos
2129903aa3 Properly detect nested torch function args (#127496)
Dynamo was not detecting nested torch function classes in containers. This was due to pytree compatibility for variable trackers being removed.
Fixes https://github.com/pytorch/pytorch/issues/127174
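A torch-free sketch of the idea behind the fix (the function and class names here are illustrative, not dynamo's real helpers): recurse into common containers and flag any argument whose type defines `__torch_function__`, instead of only checking top-level arguments.

```python
# Illustrative sketch: detect torch-function overrides nested in containers.
def contains_torch_function_override(obj):
    if isinstance(obj, (list, tuple, set, frozenset)):
        return any(contains_torch_function_override(x) for x in obj)
    if isinstance(obj, dict):
        return any(contains_torch_function_override(v) for v in obj.values())
    # A "torch function class" is one whose type defines __torch_function__.
    return hasattr(type(obj), "__torch_function__")

class MySubclass:
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        return func(*args, **(kwargs or {}))

print(contains_torch_function_override([1, {"k": (MySubclass(),)}]))  # True
print(contains_torch_function_override([1, 2, 3]))                    # False
```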

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127496
Approved by: https://github.com/anijain2305
2024-06-02 03:43:22 +00:00
Simon Fan
ec098b88b6 [compiled autograd] torch.compile API (#125880)
- enter existing compiled autograd ctx manager before entering torch.compile frames

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125880
Approved by: https://github.com/jansel
2024-05-31 04:38:20 +00:00
PyTorch MergeBot
ce63b676f3 Revert "[compiled autograd] torch.compile API (#125880)"
This reverts commit e1c322112a.

Reverted https://github.com/pytorch/pytorch/pull/125880 on behalf of https://github.com/atalman due to sorry your PR broke lint, need to revert ([comment](https://github.com/pytorch/pytorch/pull/125880#issuecomment-2139605376))
2024-05-30 13:53:31 +00:00
Simon Fan
e1c322112a [compiled autograd] torch.compile API (#125880)
- enter existing compiled autograd ctx manager before entering torch.compile frames

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125880
Approved by: https://github.com/jansel
2024-05-30 02:10:06 +00:00
Pian Pawakapan
8a31c2aa84 [export] allow complex guards as runtime asserts (#127129)
With the current state of export's dynamic shapes, we struggle with guards and constraints that are beyond the current dynamic shapes language, expressed with dims and derived dims. While we can compile and guarantee correctness for guards within the current language (e.g. min/max ranges, linear relationships, integer divisibility) we struggle to dynamically compile guards which extend beyond that.

For these "complex" guards, we typically do one of the following: 1) raise a constraint violation error, along the lines of "not all values of <symbol> in the specified range satisfy <guard>", with or without suggested fixes, 2) specialize to the provided static values and suggest removing dynamism, or 3) fail compilation due to some arbitrary unsupported case. Previous [work](https://github.com/pytorch/pytorch/pull/124949) went towards resolving this by disabling forced specializations, instead allowing the user to fail at runtime with incorrect inputs.

In this PR, relying on [hybrid backed-unbacked symints](https://github.com/pytorch/pytorch/issues/121749), [deferred runtime asserts](https://github.com/pytorch/pytorch/blob/main/torch/fx/passes/runtime_assert.py), and the function [_is_supported_equivalence()](d7de4c9d80/torch/fx/experimental/symbolic_shapes.py (L1824)), we add a flag `_allow_complex_guards_as_runtime_asserts` which allows the user to compile exported programs containing these guards and maintain dynamism, while adding correctness checks as runtime assertions in the graph.

Hybrid backed-unbacked symints allow us to easily bypass "implicit" guards emitted from computation - guards that we ~expect to be true. Popular examples revolve around reshapes:
```
# reshape
def forward(self, x, y):  # x: [s0, s1], y: [s2]
    return x.reshape([-1]) + y  # guard s0 * s1 = s2
```

This leads to the following exported program:

```
class GraphModule(torch.nn.Module):
    def forward(self, x: "f32[s0, s1]", y: "f32[s2]"):
        sym_size_int: "Sym(s2)" = torch.ops.aten.sym_size.int(y, 0)
        mul: "Sym(-s2)" = -1 * sym_size_int;  sym_size_int = None
        sym_size_int_1: "Sym(s0)" = torch.ops.aten.sym_size.int(x, 0)
        sym_size_int_2: "Sym(s1)" = torch.ops.aten.sym_size.int(x, 1)
        mul_1: "Sym(s0*s1)" = sym_size_int_1 * sym_size_int_2;  sym_size_int_1 = sym_size_int_2 = None
        add: "Sym(s0*s1 - s2)" = mul + mul_1;  mul = mul_1 = None
        eq: "Sym(Eq(s0*s1 - s2, 0))" = add == 0;  add = None
        _assert_scalar = torch.ops.aten._assert_scalar.default(eq, "Runtime assertion failed for expression Eq(s0*s1 - s2, 0) on node 'eq'");  eq = None

        view: "f32[s0*s1]" = torch.ops.aten.view.default(x, [-1]);  x = None
        add_1: "f32[s0*s1]" = torch.ops.aten.add.Tensor(view, y);  view = y = None
        return (add_1,)
```
Another case is symbol divisibility:
```
def forward(self, x):  # x: [s0, s1]
    return x.reshape([-1, x.shape[0] - 1])  # Eq(Mod(s0 * s1, s0 - 1), 0)
```

Applying deferred runtime asserts also helps dynamic compilation for "explicit" complex guards that typically cause problems for export. For example we can generate runtime asserts for not-equal guards, and complex conditions like the following:
```
class Foo(torch.nn.Module):
    def forward(self, x, y):
        # check that negation of first guard also shows up as runtime assertion
        if x.shape[0] == y.shape[0]:  # False
            return x + y
        elif x.shape[0] == y.shape[0] ** 3:  # False
            return x + 2, y + 3
        elif x.shape[0] ** 2 == y.shape[0] * 3:  # True
            return x * 2.0, y * 3.0
```
For the above graph we will generate 3 runtime assertions: the negation of the first 2, and the 3rd condition as a guard.
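A torch-free sketch of which checks those three assertions amount to at runtime, using plain Python in place of sympy expressions (the symbol names are illustrative):

```python
# s0 = x.shape[0], s1 = y.shape[0]. The first two branches evaluated False
# during tracing, so their negations become runtime asserts; the third
# evaluated True, so it is asserted directly.
def runtime_asserts_hold(s0, s1):
    checks = [
        ("Ne(s0, s1)",       s0 != s1),           # negation of branch 1
        ("Ne(s0, s1**3)",    s0 != s1 ** 3),      # negation of branch 2
        ("Eq(s0**2, 3*s1)",  s0 ** 2 == 3 * s1),  # branch 3 guard
    ]
    return all(ok for _, ok in checks)

print(runtime_asserts_hold(6, 12))  # True: 6 != 12, 6 != 1728, 36 == 36
print(runtime_asserts_hold(4, 4))   # False: the first assert fails
```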

One additional benefit here over the current state of exported programs is that this adds further correctness guarantees - previously with explicit complex guards, if compilation succeeded, the guards would be ignored at runtime, treated as given.

As shown above, the runtime asserts appear as math ops in the graph, generated by the sympy interpreter, resulting in an _assert_scalar call. There is an option to avoid adding these asserts into the graph, by setting `TORCH_DYNAMO_DO_NOT_EMIT_RUNTIME_ASSERTS=1`. This results in the "original" computation graph, with dynamism, and any incorrect inputs will fail on ops during runtime. Further work could go into prettifying the printer, so the majority of the graph isn't guard-related.

Ideally this PR would subsume and remove the recently added [_disable_forced_specializations](https://github.com/pytorch/pytorch/pull/124949) flag, but that flag still handles one additional case of specialization: single-variable equalities where the symbol is solvable for a concrete value: see this [PR](https://github.com/pytorch/pytorch/pull/126925)

This PR doesn't change any behavior around data-dependent errors/unbacked symints yet, that could be further work.

NOTE: will take naming change suggestions for the flag :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127129
Approved by: https://github.com/avikchaudhuri
2024-05-29 17:15:25 +00:00
dshi7
0f67d38f0f add TORCHDYNAMO_CAPTURE_DYNAMIC_OUTPUT_SHAPE_OPS (#127017)
tlparse prints a failure description like this:

> dynamic shape operator: aten._unique2.default; to enable, set torch._dynamo.config.capture_dynamic_output_shape_ops = True

This adds an OS environment variable to make the config easier to set for testing.
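A minimal sketch of the intended usage, assuming the env var is read when dynamo's config is imported (mirroring the in-code knob quoted above):

```python
import os

# Setting this before torch._dynamo is imported should have the same effect
# as torch._dynamo.config.capture_dynamic_output_shape_ops = True.
os.environ["TORCHDYNAMO_CAPTURE_DYNAMIC_OUTPUT_SHAPE_OPS"] = "1"
print(os.environ["TORCHDYNAMO_CAPTURE_DYNAMIC_OUTPUT_SHAPE_OPS"])  # 1
```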

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127017
Approved by: https://github.com/jackiexu1992
2024-05-25 05:42:41 +00:00
Alexandre Ghelfi
b3a8a3cbab Fix typos in torch._dynamo.config.py (#126150)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126150
Approved by: https://github.com/Skylion007
2024-05-14 14:27:35 +00:00
Edward Z. Yang
2ba102f689 Implement native support for float inputs in Dynamo and ShapeEnv (#125325)
The big idea is that floats are treated as Tensors on input/output to the FX graph, but on the inside, we immediately call item() on the synthetic Tensor and record regular float operations on it. Canonicalization to Tensor operations will happen in a standalone FX pass. This behavior is controlled by `specialize_float` config variable when set to False.

The generated graph looks like this for the test `test_unspec_float_output`:

```
 def forward(self, L_x_: "f32[3]", L_y_: "f32[]"):
     l_x_ = L_x_
     l_y_ = L_y_

     # File: /data/users/ezyang/a/pytorch/test/dynamo/test_unspec.py:511 in f, code: return x + 1, y * 2
     add: "f32[3]" = l_x_ + 1;  l_x_ = None
     item: "Sym(zf0)" = l_y_.item();  l_y_ = None
     mul: "Sym(2*zf0)" = item * 2;  item = None
     scalar_tensor: "f32[]" = torch.scalar_tensor(mul);  mul = None
     return (add, scalar_tensor)
```

The ingredients:

* **torch/_dynamo/variables/builder.py** When `specialize_float` is False, we wrap float literals with `wrap_symfloat`. This is an unholy mashup of `wrap_symint` and `wrap_unspecialized_primitive`. The overall strategy is that we first generate a tensor argument (because that's what we want to show up into the FX graph), but then immediately call item() on the tensor argument to get a SymNodeVariable, which we will do the rest of the tracing with.  Importantly, this SymNodeVariable is backed with the source of the original float: this means we can guard on the resulting value (something we could NOT do with UnspecializedPythonVariable). This has to be done manually, because if you literally call item() on the tensor, you will end up with an unbacked float. There is a bit of copy paste from wrap_symint and wrap_unspecialized_primitive which we can try to factor out, but this really is its own thing and you should review every line of code in the function.
* **torch/fx/experimental/symbolic_shapes.py** We now can generate guards on float inputs, and these guards are handled inside of ShapeEnv. So we need to be able to allocate (backed!) float symbols, and produce guards for them. Fairly straightforward generalization.
* **torch/_dynamo/codegen.py** I also need to maintain the invariant that there are no float outputs to the FX graph. I chose to do this at codegen time. When we detect a SymNodeVariable on the return stack for a float, we on the fly convert it (via `as_tensor`) to a TensorVariable, which is the true output. We then special case the output bytecode to call item() on it again. The tensor conversion is memoized on SymNodeVariable since we typically run the code generation process twice.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125325
Approved by: https://github.com/lezcano, https://github.com/jansel
2024-05-14 04:10:01 +00:00
Animesh Jain
0935b3d794 [dynamo] Turn on guard_nn_modules (#125202)
Turning on guard_nn_modules adds a large number of guards, so we are bound to take a perf hit. But the perf hit is small. These are the numbers:

![image](https://github.com/pytorch/pytorch/assets/13822661/c8793906-c8c7-432b-9af4-4594713067be)

First we observe that compared to Python guards, C++ guards give around 6x speedup. This reduces the total time spent in guards. This is shown in the last column (cpp_guards/inductor_optimized_latency). The worst model is around 1.61%, with most of the models below 1%. I think this is good enough signal to turn the config on.

One might also wonder how much guard slowdown occurs with `guard_nn_modules=True`. This is the table
![image](https://github.com/pytorch/pytorch/assets/13822661/932a885b-1c03-424b-8405-5bc8fd35dd39)

For most models, the guard overhead with nn module guards is under 2x. There are a few outliers, where the slowdown is really high and for those models we spend 1%-2% time in C++ guards as shown in first table.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125202
Approved by: https://github.com/ezyang
2024-05-11 19:28:24 +00:00
Animesh Jain
b62e89c1b8 [dynamo] Do not turn on record relay with TORCH_COMPILE_DEBUG (#125488)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125488
Approved by: https://github.com/yanboliang, https://github.com/mlazos
2024-05-04 05:10:31 +00:00
Boyuan Feng
b91f83f181 [cudagraph] add config for cudagraph managed input mutation support (#124754)
Summary: [#123231](https://github.com/pytorch/pytorch/pull/123231) adds cudagraph support for more types of functions (i.e., cudagraph managed input mutation). These newly supported functions may have mutated static inputs, leading to assertion errors in some workloads that previously skipped cudagraphs. This diff adds a config to opt in to the new feature.

Test Plan: ci

Differential Revision: D56481353

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124754
Approved by: https://github.com/eellison
2024-04-24 04:23:53 +00:00
Animesh Jain
704fac5618 [dynamo][cpp-guard] Reland Attempt 1 - Enable cpp guard manager (#124231)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124231
Approved by: https://github.com/jansel
ghstack dependencies: #124230, #124237
2024-04-18 06:36:20 +00:00
PyTorch MergeBot
530bf391cc Revert "[dynamo] Turn on CPP guard manager (#123547)"
This reverts commit 3e98bdd66d.

Reverted https://github.com/pytorch/pytorch/pull/123547 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/123547#issuecomment-2058337419))
2024-04-16 06:38:15 +00:00
willfengg
f1654fd4b0 [PT2D][FSDP] skip FSDP hooks base on dynamo config (#123021)
unit test: `pytest test/distributed/_composable/fsdp/test_fully_shard_compile.py`

For FSDP, we turn compiling hooks on/off based on `torch._dynamo.config.skip_fsdp_hooks`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123021
Approved by: https://github.com/yf225, https://github.com/anijain2305
2024-04-13 01:47:25 +00:00
Animesh Jain
3e98bdd66d [dynamo] Turn on CPP guard manager (#123547)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123547
Approved by: https://github.com/jansel
2024-04-12 23:30:56 +00:00
Jason Ansel
70b8c58f84 [dynamo] Emit warning to turn on capture_scalar_outputs (#123896)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123896
Approved by: https://github.com/anijain2305
ghstack dependencies: #123700, #123705, #123786, #123790, #123803, #123804
2024-04-12 19:03:13 +00:00
Oguz Ulgen
57a2032c7a Delete Lark (#123689)
Now that we are using MLIR bindings inside triton, let's delete the Lark parser.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123689
Approved by: https://github.com/jansel
2024-04-11 05:51:06 +00:00
PyTorch MergeBot
6b18daf205 Revert "Delete Lark (#123689)"
This reverts commit a631461eef.

Reverted https://github.com/pytorch/pytorch/pull/123689 on behalf of https://github.com/PaliC due to This PR seems to be breaking  test_binary_ufuncs.py ([comment](https://github.com/pytorch/pytorch/pull/123689#issuecomment-2048489549))
2024-04-10 21:48:04 +00:00
Oguz Ulgen
a631461eef Delete Lark (#123689)
Now that we are using MLIR bindings inside triton, let's delete the Lark parser.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123689
Approved by: https://github.com/jansel
2024-04-10 19:41:54 +00:00
Peter Bell
8865425ff7 [minifier] Add config flag to ignore non-fp values (#123006)
When minifying, the after-aot minifier ignores non-floating values by
default but does check them when running the initial graph dump step.
This means we may capture a graph that doesn't fail the tester and doesn't have
any meaningful divergence.

For example, the derivative of `elu(x)` depends on `x > 0` so this value is
saved for backwards and so becomes a graph output. However, the difference
between `FLT_MIN` and `0` in `x` is now enough to trigger an accuracy failure.

I fix this by adding a config variable and environment variable to ignore these
non-floating-point values.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123006
Approved by: https://github.com/ezyang
ghstack dependencies: #123005
2024-04-09 03:34:09 +00:00
Guilherme Leobas
84658d9c4f Enable capture_func_transforms by default (#122211)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122211
Approved by: https://github.com/zou3519
2024-04-05 03:29:11 +00:00
eellison
5f46312dbb Reapply "Switch cudagraph backend to cudagraph trees (#121019)" and "Add Cudagraphs disable checking (#121018)" (#121864) (#122713)
This reverts commit 92ed8553a6.

No longer importing codecache or boxed_nop at top level, both of which caused issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122713
Approved by: https://github.com/anijain2305
2024-04-02 16:11:00 +00:00
Animesh Jain
60f3c092d4 [dynamo] Config option to Inline builtin nn module forward (#122725)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122725
Approved by: https://github.com/jansel
ghstack dependencies: #122646, #122647, #122716, #122769, #122818
2024-03-28 03:01:27 +00:00
Animesh Jain
c108696228 [dynamo][guards-cpp-refactor][easy] Env variable to turn on cpp manager (#122646)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122646
Approved by: https://github.com/jansel
2024-03-27 19:40:37 +00:00
IvanKobzarev
9b095c3fe6 [dynamo] Config to not emit runtime asserts (#122603)
Repetition on squashed & merged by mistake https://github.com/pytorch/pytorch/pull/122406

Differential Revision: [D55312394](https://our.internmc.facebook.com/intern/diff/D55312394)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122603
Approved by: https://github.com/ezyang
2024-03-25 21:17:44 +00:00
Animesh Jain
8860c625ea [dynamo][guards-cpp-refactor] Integrate cpp guard manager with CheckFnManager (#120726)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120726
Approved by: https://github.com/jansel
2024-03-19 03:11:31 +00:00
Animesh Jain
f84d560236 [dynamo] Raise accumulated cache size limit (#122130)
Fixes #114511

This was raised by IBM folks, where an LLM compile was failing because the model had more than 64 layers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122130
Approved by: https://github.com/Chillee, https://github.com/jansel
ghstack dependencies: #121954, #122005
2024-03-19 02:35:48 +00:00
Animesh Jain
92ed8553a6 Revert "Switch cudagraph backend to cudagraph trees (#121019)" and "Add Cudagraphs disable checking (#121018)" (#121864)
This reverts commit 9373ad0bb8.

Revert "Add Cudagraphs disable checking (#121018)"

This reverts commit 4af0e634bf.

Causes compilation time increase.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121864
Approved by: https://github.com/eellison
2024-03-15 00:03:09 +00:00
eellison
4af0e634bf Add Cudagraphs disable checking (#121018)
Adds the same cudagraphs disable checking from inductor - cudagraph trees to cudagraphs backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121018
Approved by: https://github.com/ezyang
ghstack dependencies: #121017
2024-03-08 22:47:24 +00:00
PyTorch MergeBot
3d7cf8f392 Revert "Limit loop unrolling (#120023)"
This reverts commit 6cc7f9a2e6.

Reverted https://github.com/pytorch/pytorch/pull/120023 on behalf of https://github.com/anijain2305 due to breaks llms export ([comment](https://github.com/pytorch/pytorch/pull/120023#issuecomment-1974104633))
2024-03-02 00:04:08 +00:00
angelayi
c844b377fa [dynamo] Reorder logs (#116106)
Currently when there is a print/warning in the graph, dynamo graph breaks causing export to fail. However export would like to just skip over these print/warning calls: https://github.com/pytorch/pytorch/issues/113792.

Additionally there's a torch.compile feature request to "reorder prints" so that instead of graph breaking when hitting prints/logging, we can skip over these prints to create larger compiled graphs, and then print the results out after those compiled graphs: https://github.com/pytorch/pytorch/issues/93739. This PR also adds the `reorderable_logging_functions` config for users to register logging functions to be reordered (like `print` or a custom logging function). Printout of the bytecode after reordering the prints looks like the following: P914736600

There are some limitations to the printing right now:
* You can only register logging functions, not methods
* Inputs to the logging functions can only be tensors, constants, and format strings
* Inputs to the logging functions which will later be mutated in-place will not be printed correctly

TODO: Add the following tests
* print function with argument of nested data structure;
* print function with argument of nested data structure being updated inside of compile region (this would test if we handle side effect correctly);
* custom defined logging functions with nn.Module or nn.Module attribute arguments;
* custom defined logging functions with submodule input/output as arguments (we need to handle the mapping and fused-out value);
* custom defined logging functions with tensor argument and mutation inside of the function (TBD: this may increase memory usage);
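A torch-free sketch of the reordering idea described above (all names here are illustrative, not dynamo's actual machinery): instead of executing a logging call mid-graph, which would force a graph break, record the call and replay it after the compiled region finishes.

```python
# Record-and-replay sketch of log reordering.
deferred = []

def reorderable_print(*args):
    deferred.append(args)  # record instead of executing immediately

def compiled_region(x):
    y = x * 2
    reorderable_print("intermediate:", y)  # would normally graph-break
    return y + 1

out = compiled_region(10)
for args in deferred:  # replay the recorded logs after the region
    print(*args)
print(out)
```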

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116106
Approved by: https://github.com/yanboliang
2024-03-01 17:04:24 +00:00
PyTorch MergeBot
63b259492a Revert "[dynamo] Reorder logs (#116106)"
This reverts commit c5472628ff.

Reverted https://github.com/pytorch/pytorch/pull/116106 on behalf of https://github.com/clee2000 due to landrace with 342e7929b8, which removed the import for warnings.  Should be an easy fix after rebase c5472628ff ([comment](https://github.com/pytorch/pytorch/pull/116106#issuecomment-1972586180))
2024-03-01 06:25:46 +00:00
Angela Yi
c5472628ff [dynamo] Reorder logs (#116106)
Currently when there is a print/warning in the graph, dynamo graph breaks causing export to fail. However export would like to just skip over these print/warning calls: https://github.com/pytorch/pytorch/issues/113792.

Additionally there's a torch.compile feature request to "reorder prints" so that instead of graph breaking when hitting prints/logging, we can skip over these prints to create larger compiled graphs, and then print the results out after those compiled graphs: https://github.com/pytorch/pytorch/issues/93739. This PR also adds the `reorderable_logging_functions` config for users to register logging functions to be reordered (like `print` or a custom logging function). Printout of the bytecode after reordering the prints looks like the following: P914736600

There are some limitations to the printing right now:
* You can only register logging functions, not methods
* Inputs to the logging functions can only be tensors, constants, and format strings
* Inputs to the logging functions which will later be mutated in-place will not be printed correctly

TODO: Add the following tests
* print function with argument of nested data structure;
* print function with argument of nested data structure being updated inside of compile region (this would test if we handle side effect correctly);
* custom defined logging functions with nn.Module or nn.Module attribute arguments;
* custom defined logging functions with submodule input/output as arguments (we need to handle the mapping and fused-out value);
* custom defined logging functions with tensor argument and mutation inside of the function (TBD: this may increase memory usage);

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116106
Approved by: https://github.com/yanboliang
2024-03-01 04:48:44 +00:00
Aaron Orenstein
6cc7f9a2e6 Limit loop unrolling (#120023)
Tacotron2 causes massive loop unrolling resulting in very large graphs (26k nodes) which was causing inductor (and tracing itself) to choke.

The unrolling size is controlled by the environment variable TORCHDYNAMO_MAX_LOOP_UNROLL_NODES which defaults to the arbitrary value 5000.

This updates the tacotron2 timings as follows:
eager timing: 3m:23s -> 35s
aot_eager timing: 4m:12s -> 39s
inductor timing: 22m:24s -> 1m

For reference the big loop in tacotron2 was this one (model.py[405]):
```
        while len(mel_outputs) < decoder_inputs.size(0) - 1:
            decoder_input = decoder_inputs[len(mel_outputs)]
            mel_output, gate_output, attention_weights = self.decode(decoder_input)
            mel_outputs += [mel_output.squeeze(1)]
            gate_outputs += [gate_output.squeeze(1)]
            alignments += [attention_weights]
```
which gets unrolled and inlined adding about 36 nodes to the graph per iteration.
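A back-of-the-envelope sketch of how the cap kicks in for this loop, assuming the default limit of 5000 nodes and the ~36 nodes per iteration quoted above:

```python
# Stop inlining further iterations once the accumulated node count would
# exceed the cap (mirroring the default TORCHDYNAMO_MAX_LOOP_UNROLL_NODES).
MAX_LOOP_UNROLL_NODES = 5000
NODES_PER_ITER = 36  # approximate cost of one tacotron2 decode iteration

nodes = 0
unrolled_iters = 0
while nodes + NODES_PER_ITER <= MAX_LOOP_UNROLL_NODES:
    nodes += NODES_PER_ITER
    unrolled_iters += 1

print(unrolled_iters, nodes)  # 138 4968
```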

Fixes #98467
Relates to #102839 which hopefully will result in a better fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120023
Approved by: https://github.com/yanboliang
2024-02-27 20:44:21 +00:00
Sam Larsen
2fb32a5f07 Enable fake tensor caching in fbcode by default (#118555)
Summary: Enabled by default in OSS; this switches the default to "on" in fbcode too.

Test Plan: Ran torchbench benchmarks in fbcode

Differential Revision: [D53771626](https://our.internmc.facebook.com/intern/diff/D53771626)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118555
Approved by: https://github.com/eellison
2024-02-26 17:35:23 +00:00
PyTorch MergeBot
7d780ff86f Revert "Enable fake tensor caching in fbcode by default (#118555)"
This reverts commit 0f2fbbff10.

Reverted https://github.com/pytorch/pytorch/pull/118555 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing one model test internally. Please take a look at the diff for more info D53189048 ([comment](https://github.com/pytorch/pytorch/pull/118555#issuecomment-1939550273))
2024-02-12 20:51:23 +00:00
Sam Larsen
0f2fbbff10 Enable fake tensor caching in fbcode by default (#118555)
Summary: Enabled by default in OSS; this switches the default to "on" in fbcode too.

Test Plan: Ran torchbench benchmarks in fbcode

Differential Revision: [D53189048](https://our.internmc.facebook.com/intern/diff/D53189048)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118555
Approved by: https://github.com/eellison
2024-02-09 05:42:16 +00:00
Chien-Chin Huang
1d2382f141 [DDP] Use compiled_autograd to trace DDP backward allreduce (#110662)
**Summary**
The reducer of `DistributedDataParallel` is implemented in C++, and it is not easy to trace the allreduce launched in the reducer. This PR modifies `DistributedDataParallel` to launch one allreduce per gradient when `compiled_autograd` is enabled. The changes allow us to use `compiled_autograd` to trace the allreduces, which can later be optimized (fused) by Inductor.

**Key Logic**
1. If `ddp_python_hook` is True, we assume `compiled_autograd` is used. `DistributedDataParallel` registers `compiled_accum_grad_hook` for all parameters.
2. In the first forward() call, if `DistributedDataParallel` is not compiled, all `compiled_accum_grad_hook` hooks are deregistered. If `DistributedDataParallel` is compiled, all `compiled_accum_grad_hook` hooks will be compiled by `compiled_autograd`.
3. `compiled_accum_grad_hook` launches an allreduce to reduce the gradient of the parameter.

**Bucketing**
The compiled backward is slow because there is no bucketing for the allreduces. We rely on Inductor to bucket the allreduces.

The bucketing is done in a separate PR.

Differential Revision: [D49428482](https://our.internmc.facebook.com/intern/diff/D49428482/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110662
Approved by: https://github.com/wconstab
2024-02-08 03:03:15 +00:00
Will Constable
abe3c55a6a Update DDP dynamo debug docs (#118295)
Refreshes https://github.com/pytorch/pytorch/pull/114201 and updates it to include other log names that also include ddp_optimizer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118295
Approved by: https://github.com/LucasLLC, https://github.com/wanchaol
2024-01-29 14:58:26 +00:00
Oguz Ulgen
47b5a6b05d [Dynamo] Analyze triton kernels via tracing to determine mutations (#117300)
This PR adds TTIR lexing and parsing in order to analyze which of the user defined triton kernel inputs are mutated.

Differential Revision: [D53165999](https://our.internmc.facebook.com/intern/diff/D53165999)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117300
Approved by: https://github.com/jansel
2024-01-29 06:37:08 +00:00
Edward Z. Yang
46712b019d Enable local_partial_types (#118467)
When using dmypy, this setting is enabled and cannot be turned off. Force it for regular mypy too.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118467
Approved by: https://github.com/Skylion007
ghstack dependencies: #118414, #118418, #118432
2024-01-28 13:38:22 +00:00
Shunting Zhang
fe10b1800f LazyGraphModule (#117911)
I feel it's easier to open a new PR rather than iterating on the previous PR (https://github.com/pytorch/pytorch/pull/105257 ) since this is more like a rewrite.

In this PR, instead of changing GraphModule directly, which can easily cause BC issues, I create a LazyGraphModule class as Zachary & Jason suggested in comments on the previous PR.

The difference between LazyGraphModule and GraphModule is mainly about when recompilation of the graph module happens. In GraphModule the recompilation happens "eagerly": constructing a GraphModule causes the recompilation. In LazyGraphModule, we just mark the module as needing recompilation. The real recompilation only happens when absolutely required (e.g., calling the forward method, accessing the code property, etc.). In a lot of cases in torch.compile, the real recompilation is never triggered at all. This can save a few seconds of compilation time.

By default, GraphModule rather than LazyGraphModule is used. `use_lazy_graph_module(True)` context manager can be used to pick LazyGraphModule instead. This has been applied to the torch.compile stack.
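A torch-free sketch of the eager-vs-lazy difference described above (the class and method names are illustrative): the lazy variant only marks itself dirty at construction, and pays the expensive codegen cost on first real use.

```python
# Lazy-recompile sketch: defer "codegen" until forward() is first called.
class LazyModuleSketch:
    def __init__(self):
        self.recompiles = 0
        self._needs_recompile = True  # mark dirty; do not compile yet

    def _real_recompile(self):
        self.recompiles += 1          # stands in for expensive codegen
        self._needs_recompile = False

    def forward(self, x):
        if self._needs_recompile:     # compile only when actually needed
            self._real_recompile()
        return x + 1

m = LazyModuleSketch()
assert m.recompiles == 0  # construction triggered no codegen
m.forward(1)
m.forward(2)
print(m.recompiles)       # 1
```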

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117911
Approved by: https://github.com/jansel
2024-01-27 04:10:18 +00:00
Jason Ansel
e5e9f390be [dynamo] Optimize overheads from _TorchDynamoContext (#118070)
Based on `python benchmarks/dynamo/microbenchmarks/overheads.py`:
- Before `18.1us`
- After `12.2us`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118070
Approved by: https://github.com/yanboliang, https://github.com/anijain2305
ghstack dependencies: #118065
2024-01-25 05:04:56 +00:00
Quinn Zhu
12662f4d95 [dynamo] add username in debug path (#117820)
Summary: Having no user name may cause conflicts and permission errors when people share a dev server.

bypass-github-pytorch-ci-checks

Test Plan: ci

Differential Revision: D52895486

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117820
Approved by: https://github.com/kflu, https://github.com/DanilBaibak
2024-01-24 10:14:20 +00:00