Commit Graph

51 Commits

Author SHA1 Message Date
Brian Hirsh
0d71a9dd5b fix incorrect interaction between DDPOptimizer and donated buffers (#160745)
This should fix https://x.com/wightmanr/status/1953147089518772254?t=ng_R4t0-tRhO_qQE8NqOhw&s=19. Still working on adding a reasonable test.

You can see more of a description of the problem in the code comments. But the TLDR is that:

* When using DDPOptimizer, we partition the graph and compile several subgraphs. So 1 dynamo graphs becomes N AOT/inductor artifacts
* We have some existing logic to stash graph metadata (`fw_metadata`) in dynamo's TracingContext. When using DDPOptimizer, we generate one `fw_metadata` per **AOT** graph, and we stash it on the 1 TracingContext from dynamo. So we end up clobbering the `fw_metadata` for graph i-1 when AOT and inductor start compiling graph i
* This is normally ok, but it becomes a problem if inductor ever wants to read from this `fw_metadata` during **backward compilation**. Why? We (by default) compile the backwards lazily. So when using DDPOptimizer, we will compile backward graph N, then bw graph N-1, etc. But... at the time that we have stated compiling bw graph N-1, its corresponding fw_metadata has already been clobbered! So we end up reusing graph N's metadata for all of our backward graph compilations. With donated buffer metadata, that means we end up donated and writing into incorrect input buffers

The fix that I added was to add more dedicated DDPOptimizer metadata into the TracingContext, so we can properly switch between these N different `fw_metadata` objects in the backward.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160745
Approved by: https://github.com/ezyang, https://github.com/zou3519
2025-09-04 21:57:27 +00:00
Lucas Kabela
453cfa5153 typing distributed.py (#160365)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160365
Approved by: https://github.com/StrongerXi
ghstack dependencies: #160362, #160363, #160364
2025-08-15 02:09:31 +00:00
Boyuan Feng
38410cf9b5 Fix DDPOptimizer issue on static tensor index (#155746)
We rely on `_try_get_metadata_from_dynamo()` to get static input indices. When the meta info is missing, it just returns an empty list of static input indices. This wrong list of static input indices lead to repeated cudagraph re-recording, which looks like a hang from the user perspective. bc3972b80a/torch/_functorch/aot_autograd.py (L1025-L1031)

The root cause is `split_module` in DDP Optimizer loses meta info and gm attributes. This PR fixes the issue by propagating these metadata from original module to submodules.
bc3972b80a/torch/_dynamo/backends/distributed.py (L515-L517)

Fixes #140395

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155746
Approved by: https://github.com/xmfan, https://github.com/bdhirsh
2025-06-14 00:15:58 +00:00
Xuehai Pan
3ce352e389 [BE][PYFMT] migrate PYFMT for torch._dynamo to ruff format (#144549)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144549
Approved by: https://github.com/jansel
2025-02-28 03:03:53 +00:00
Raymond Li
21c2565f35 Document dynamo (#146736)
Many files in dynamo are currently lacking file/module-level documentation, which makes it hard to know what they do at a glance and without digging into the code. This fixes that.

Note: documentation was AI-generated and could be incorrect, please review carefully.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146736
Approved by: https://github.com/jansel, https://github.com/StrongerXi, https://github.com/anijain2305, https://github.com/zou3519
2025-02-13 00:02:21 +00:00
Aaron Orenstein
a79100ab11 PEP585 update - torch/_dynamo (#145105)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145105
Approved by: https://github.com/bobrenjc93
2025-01-18 20:47:11 +00:00
Simon Fan
f4969c8235 fix torch.compile + ddp + non-reentrant AC pack hook firing count (#144271)
FIXES https://github.com/pytorch/pytorch/issues/144035

In order to preserve hook firing semantics, we disabled pack/unpack hooks for torch.compile: https://github.com/pytorch/pytorch/pull/123196. In DDP under torch.compile, there's this other callsite that we need to disable hooks for

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144271
Approved by: https://github.com/bdhirsh, https://github.com/soulitzer
2025-01-07 21:08:52 +00:00
Tom Ritchford
dc23f1944a Remove unused Python variables in torch/[_-a]* (#133492)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133492
Approved by: https://github.com/albanD
2024-12-12 17:39:14 +00:00
PyTorch MergeBot
5c97ac9721 Revert "Remove unused Python variables in torch/[_-a]* (#133492)"
This reverts commit fda975a7b3.

Reverted https://github.com/pytorch/pytorch/pull/133492 on behalf of https://github.com/clee2000 due to Sorry, I need to revert this in order to revert something else.  The only thing you need to do is rebase and remerge ([comment](https://github.com/pytorch/pytorch/pull/133492#issuecomment-2536635516))
2024-12-11 17:29:12 +00:00
Tom Ritchford
fda975a7b3 Remove unused Python variables in torch/[_-a]* (#133492)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133492
Approved by: https://github.com/albanD
2024-12-10 21:48:44 +00:00
Aaron Gokaslan
12e95aa4ee [BE]: Apply PERF401 autofixes from ruff (#140980)
* Automatically applies ruff rule 401. Turns loops into equivalent list comprehensions which are faster and do not leak the scope of the loop variables.
* list comprehensions not only often have better typing, but are 50+% faster than for loops on overhead. They also preserve length information etc and are better for the interpreter to optimize.
* Manually went back and made mypy happy after the change.
* Also fixed style lints in files covered by flake8 but not by pyfmt

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140980
Approved by: https://github.com/justinchuby, https://github.com/malfet
2024-11-20 17:52:07 +00:00
chilli
392221b390 Made DDPOptimizer work with HOPs (#138787)
Fixes https://github.com/pytorch/pytorch/issues/137481

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138787
Approved by: https://github.com/yf225
ghstack dependencies: #138733, #138794, #138881
2024-10-25 18:59:01 +00:00
Oguz Ulgen
6e79932543 Add basic mypy annotations to dynamo (#132415)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132415
Approved by: https://github.com/XuehaiPan, https://github.com/jamesjwu
2024-08-04 18:43:36 +00:00
PyTorch MergeBot
3558a8cf4a Revert "Add basic mypy annotations to dynamo (#132415)"
This reverts commit 71e22e0959.

Reverted https://github.com/pytorch/pytorch/pull/132415 on behalf of https://github.com/ZainRizvi due to Sorry, this PR has entered a weird state in the diff train. Trying to revert it to skip it, and then we can try relanding it ([comment](https://github.com/pytorch/pytorch/pull/132415#issuecomment-2267631785))
2024-08-04 18:39:29 +00:00
Oguz Ulgen
71e22e0959 Add basic mypy annotations to dynamo (#132415)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132415
Approved by: https://github.com/XuehaiPan, https://github.com/jamesjwu
2024-08-01 20:14:25 +00:00
Oguz Ulgen
72d2dba992 Add None return type to init (#132335)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132335
Approved by: https://github.com/albanD
2024-08-01 15:26:45 +00:00
Xuehai Pan
e74ba1b34a [BE][Easy][15/19] enforce style for empty lines in import segments in torch/_d*/ (#129767)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129767
Approved by: https://github.com/anijain2305
2024-07-31 21:18:11 +00:00
Elias Ellison
b8e5678ad2 Delete lazy ddp optimizer (#120727)
This is no longer necessary now that the normal ddp optimizer works correctly with inductor strides.

Differential Revision: [D54858819](https://our.internmc.facebook.com/intern/diff/D54858819)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120727
Approved by: https://github.com/jansel, https://github.com/yf225
2024-06-26 21:53:54 +00:00
Animesh Jain
a0604193a2 handle call_function with Parameter args in DDPOptimizer splitting (#128034)
When nn module inlining is enabled, modules are replaced with the underlying function calls in the output fx graph.
example:
```
class GraphModule(torch.nn.Module):
  def forward(self, L_x_: "f32[1024, 1024]"):
      l_x_ = L_x_

      # File: /data/users/lsakka/pytorch/pytorch/test/dynamo/test_structured_trace.py:284 in forward, code: return self.layers(x)
      l__self___layers_0: "f32[1024, 1024]" = self.L__self___layers_0(l_x_);  l_x_ = None
      l__self___layers_1: "f32[1024, 1024]" = self.L__self___layers_1(l__self___layers_0);  l__self___layers_0 = None
      return (l__self___layers_1,)
```

will be
```
class GraphModule(torch.nn.Module):
    def forward(self, L_self_layers_0_weight: "f32[1024, 1024]", L_self_layers_0_bias: "f32[1024]", L_x_: "f32[1024, 1024]", L_self_layers_1_weight: "f32[1024, 1024]", L_self_layers_1_bias: "f32[1024]"):
        l_self_layers_0_weight = L_self_layers_0_weight
        l_self_layers_0_bias = L_self_layers_0_bias
        l_x_ = L_x_
        l_self_layers_1_weight = L_self_layers_1_weight
        l_self_layers_1_bias = L_self_layers_1_bias

        # File: /data/users/lsakka/pytorch/pytorch/torch/nn/modules/linear.py:116 in forward, code: return F.linear(input, self.weight, self.bias)
        input_1: "f32[1024, 1024]" = torch._C._nn.linear(l_x_, l_self_layers_0_weight, l_self_layers_0_bias);  l_x_ = l_self_layers_0_weight = l_self_layers_0_bias = None
        input_2: "f32[1024, 1024]" = torch._C._nn.linear(input_1, l_self_layers_1_weight, l_self_layers_1_bias);  input_1 = l_self_layers_1_weight = l_self_layers_1_bias = None
        return (input_2,)
```
The DDP optimizer when performing splitting, does not handle the inlined graph since it does not handle function calls since earlier we did not have function calls with params as inputs. (but calls to modules instead).

This diff addresses that, it uses the example_value in the arguments to determine Parameter arguments of a function call
and the Parameter properties.
This address #https://github.com/pytorch/pytorch/issues/127552

running the optimizer on the code above with inlining yields to the following splitting:
```
---submod_0 graph---
graph():
    %l_x_ : torch.Tensor [num_users=1] = placeholder[target=l_x_]
    %l_self_layers_0_weight : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=l_self_layers_0_weight]
    %l_self_layers_0_bias : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=l_self_layers_0_bias]
    %linear : [num_users=1] = call_function[target=torch._C._nn.linear](args = (%l_x_, %l_self_layers_0_weight, %l_self_layers_0_bias), kwargs = {})
    return linear

---submod_1 graph---
graph():
    %input_1 : [num_users=1] = placeholder[target=input_1]
    %l_self_layers_1_weight : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=l_self_layers_1_weight]
    %l_self_layers_1_bias : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=l_self_layers_1_bias]
    %linear : [num_users=1] = call_function[target=torch._C._nn.linear](args = (%input_1, %l_self_layers_1_weight, %l_self_layers_1_bias), kwargs = {})
    return linear

---final graph---
graph():
    %l_self_layers_0_weight : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=L_self_layers_0_weight]
    %l_self_layers_0_bias : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=L_self_layers_0_bias]
    %l_x_ : torch.Tensor [num_users=1] = placeholder[target=L_x_]
    %l_self_layers_1_weight : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=L_self_layers_1_weight]
    %l_self_layers_1_bias : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=L_self_layers_1_bias]
    %submod_0 : [num_users=1] = call_module[target=compiled_submod_0](args = (%l_x_, %l_self_layers_0_weight, %l_self_layers_0_bias), kwargs = {})
    %submod_1 : [num_users=1] = call_module[target=compiled_submod_1](args = (%submod_0, %l_self_layers_1_weight, %l_self_layers_1_bias), kwargs = {})
    return (submod_1,)
---------------

```
where as without inlining it uses to be
```
---submod_0 graph---
graph():
    %l_x_ : torch.Tensor [num_users=1] = placeholder[target=l_x_]
    %l__self___layers_0 : [num_users=1] = call_module[target=L__self___layers_0](args = (%l_x_,), kwargs = {})
    return l__self___layers_0
/data/users/lsakka/pytorch/pytorch/torch/_inductor/compile_fx.py:133: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
  warnings.warn(

---submod_1 graph---
graph():
    %l__self___layers_0 : [num_users=1] = placeholder[target=l__self___layers_0]
    %l__self___layers_1 : [num_users=1] = call_module[target=L__self___layers_1](args = (%l__self___layers_0,), kwargs = {})
    return l__self___layers_1

---final graph---
graph():
    %l_x_ : torch.Tensor [num_users=1] = placeholder[target=L_x_]
    %submod_0 : [num_users=1] = call_module[target=compiled_submod_0](args = (%l_x_,), kwargs = {})
    %submod_1 : [num_users=1] = call_module[target=compiled_submod_1](args = (%submod_0,), kwargs = {})
    return (submod_1,)
---------------
```

TESTING:

(1) running
``` TORCHDYNAMO_INLINE_INBUILT_NN_MODULES=1   pytest test/distributed/test_dynamo_distributed.py -k ```
result in reduction in failures from 6 to 2 with this PR.

The two remaining are FSDP related which does not sounds trivial and have so many details. will leave them for future work.

Co-authored-by: Animesh Jain <anijain@umich.edu>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128034
Approved by: https://github.com/anijain2305, https://github.com/wconstab
2024-06-13 17:07:27 +00:00
Boyuan Feng
0c1ac4484d Support call_method in DDPOptimizer (#121771)
This PR fixes Issue #111279.

While #111279 reported the issue with `MultiheadAttention`, a minimal reproduction would be:
```python
class ToyModel(nn.Module):
    def __init__(self,):
        super().__init__()
        self.linear = nn.Linear(128, 10)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear.forward(x) # Error
        # return self.linear(x) # OK
```

Dynamo treats `self.linear(x)` as `call_module` while treating `self.linear.forward(x)` as a [`get_attr` and a `call_method`](https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/variables/nn_module.py#L358-L378). However, existing DDPOptimizer assumes, for a `get_attr` node, `getattr(gm, node.target)` gives a tensor with the `requires_grad` attribute. Existing DDPOptimizer also does not support `call_method` nodes.

This PR adds support for `call_method` and check on `get_attr`. It also checks if a module's parameters have been added to a bucket to support multiple method calls from the same module.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121771
Approved by: https://github.com/yf225
2024-03-13 20:03:15 +00:00
Elias Ellison
d03b11ad5b Pass inductor strides forward in ddp optimizer (#120523)
# Note: Returning Fake Tensors on First AOT Autograd Call
            #
            # Inductor will optimize strides of outputs when it deems it profitable.
            # For instance, converting to channels last. When we split the graph here
            # into multiple inductor compilations, we need to make sure that the
            # output strides of one compilation is appropriately passed to the subsequent
            # compilations. However, the mapping from inductor output to dynamo output
            # is non-trivial due to aot_autograd's deduping, de-aliasing, mutation, re-writing,
            # subclass handling, etc. In order to replay all this logic we set a flag such that
            # the first invocation of inductor in aot_autograd will return Fake Tensors with
            # appropriate strides. Then, all of aot autograd's runtime logic is replayed.
            # This gives us the appropriately strided outputs here which will reflect runtime strides.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120523
Approved by: https://github.com/yf225, https://github.com/bdhirsh
2024-02-29 22:25:00 +00:00
Elias Ellison
9c9bde515c Factor out Submod compilers (#120527)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120527
Approved by: https://github.com/kadeng
2024-02-28 22:11:47 +00:00
Edward Z. Yang
1a1fc1047d Add structured trace logs (#120289)
Overall design: https://docs.google.com/document/d/1CX_hJ0PNy9f3R1y8TJrfkSeLkvGjjjLU84BSXgS2AZ8/edit

How to read the diff:
* Most files are me augmenting pre-existing logging with structured variants. For the most part it's simple (esp FX graphs, which have a canonical string representation); it gets more complicated when I decided to JSON-ify some data structure instead of keeping the ad hoc printing (notably, guards and dynamo output graph sizes)
* torch/_functorch/_aot_autograd/collect_metadata_analysis.py is some unrelated fixes I noticed while auditing artifact logs
* torch/_logging/_internal.py has the actual trace log implementation. The trace logger is implement as a logger named torch.__trace which is disconnected from the logging hierarchy. It gets its own handler and formatter (TorchLogsFormatter with _is_trace True). `trace_structured` is the main way to emit a trace log. Unusually, there's a separate "metadata" and "payload" field. The metadata field should not be too long (as it is serialized as a single line) and is always JSON (we put contextual things like compile id in it); the payload field can be long and is emitted after the metadata log line and can span multiple lines.
* torch/_logging/structured.py contains some helpers for converting Python data structures into JSON form. Notably, we have a string interning implementation here, which helps reduce the cost of serializing filenames into the log.
* test/dynamo/test_structured_trace.py the tests are cribbed from test_logging.py, but all rewritten to use expect tests on munged versions of what we'd actually output. Payloads are never tested, since they tend not be very stable.

https://github.com/ezyang/tlparse is a POC Rust program that can interpret these logs.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120289
Approved by: https://github.com/Skylion007
ghstack dependencies: #120712
2024-02-28 01:01:41 +00:00
PyTorch MergeBot
f3dd2a544c Revert "Add structured trace logs (#120289)"
This reverts commit 9dfaef962c.

Reverted https://github.com/pytorch/pytorch/pull/120289 on behalf of https://github.com/kit1980 due to breaking internal builds, see D54230697 ([comment](https://github.com/pytorch/pytorch/pull/120289#issuecomment-1967477120))
2024-02-27 19:49:05 +00:00
Edward Z. Yang
9dfaef962c Add structured trace logs (#120289)
Overall design: https://docs.google.com/document/d/1CX_hJ0PNy9f3R1y8TJrfkSeLkvGjjjLU84BSXgS2AZ8/edit

How to read the diff:
* Most files are me augmenting pre-existing logging with structured variants. For the most part it's simple (esp FX graphs, which have a canonical string representation); it gets more complicated when I decided to JSON-ify some data structure instead of keeping the ad hoc printing (notably, guards and dynamo output graph sizes)
* torch/_functorch/_aot_autograd/collect_metadata_analysis.py is some unrelated fixes I noticed while auditing artifact logs
* torch/_logging/_internal.py has the actual trace log implementation. The trace logger is implement as a logger named torch.__trace which is disconnected from the logging hierarchy. It gets its own handler and formatter (TorchLogsFormatter with _is_trace True). There's a teensy bit of FB specific code to automatically enable trace logging if a /logs directory exists. `trace_structured` is the main way to emit a trace log. Unusually, there's a separate "metadata" and "payload" field. The metadata field should not be too long (as it is serialized as a single line) and is always JSON (we put contextual things like compile id in it); the payload field can be long and is emitted after the metadata log line and can span multiple lines.
* torch/_logging/structured.py contains some helpers for converting Python data structures into JSON form. Notably, we have a string interning implementation here, which helps reduce the cost of serializing filenames into the log.
* test/dynamo/test_structured_trace.py the tests are cribbed from test_logging.py, but all rewritten to use expect tests on munged versions of what we'd actually output. Payloads are never tested, since they tend not be very stable.

https://github.com/ezyang/tlparse is a POC Rust program that can interpret these logs.

Testing that the fbcode detection works at https://www.internalfb.com/mlhub/pipelines/runs/fblearner/534553450 (Meta-only)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120289
Approved by: https://github.com/Skylion007
2024-02-27 00:04:23 +00:00
Will Constable
abe3c55a6a Update DDP dynamo debug docs (#118295)
Refreshes https://github.com/pytorch/pytorch/pull/114201 and updates it to include other log names that also include ddp_optimizer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118295
Approved by: https://github.com/LucasLLC, https://github.com/wanchaol
2024-01-29 14:58:26 +00:00
Edward Z. Yang
d03173e88c Unify MYPYINDUCTOR and MYPY (#118432)
The original motivation for MYPYINDUCTOR was a faster type checking configuration that only checked a subset of files. With the removal of `follow_imports = ignore`, we are now able to use dmypy to do fast incremental typechecking, eliminating the need for this.

Perhaps erroneously, when I tee'ed up this PR I elected to delete the `follow_imports = skip` designations in the mypy-inductor.ini. This lead to a number of extra type error suppressions that I manually edited. You will need to review.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118432
Approved by: https://github.com/Skylion007
ghstack dependencies: #118414, #118418
2024-01-27 17:23:20 +00:00
Will Feng
a27ed4d364 [dynamo / DDP] Add optimize_ddp_lazy_compile config to control lazy compile for DDPOptimizer (False by default) (#116292)
We want to enable `optimize_ddp_lazy_compile` by default as soon as possible, becuase it will fix stride mismatch errors (see motivation: https://github.com/pytorch/pytorch/pull/114154).

However, lazy compile currently causes shape mismatch in other cases (`test_graph_split_inductor_transpose`) and we need to fix them before we can enable it by default.

Differential Revision: D52373445

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116292
Approved by: https://github.com/williamwen42, https://github.com/wconstab
2023-12-21 22:34:24 +00:00
Jon Chuang
2cf0cf8137 [dynamo / DDP] - lazily compile submodules - to propagate real tensor strides to backend compiler (#114154)
Fixes https://github.com/pytorch/pytorch/issues/113812, https://github.com/pytorch/pytorch/issues/102591, Probably fixes: https://github.com/pytorch/pytorch/issues/113740, https://github.com/pytorch/pytorch/issues/113786, https://github.com/pytorch/pytorch/issues/113788

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114154
Approved by: https://github.com/wconstab, https://github.com/yf225
2023-12-06 18:50:14 +00:00
PyTorch MergeBot
e38a3a6079 Revert "[dynamo / DDP] - lazily compile submodules - to propagate real tensor strides to backend compiler (#114154)"
This reverts commit 3f574eadb4.

Reverted https://github.com/pytorch/pytorch/pull/114154 on behalf of https://github.com/clee2000 due to reverted internally, broke internal builds, not sure why bot isn't working ([comment](https://github.com/pytorch/pytorch/pull/114154#issuecomment-1832496040))
2023-11-29 18:43:17 +00:00
Jon Chuang
3f574eadb4 [dynamo / DDP] - lazily compile submodules - to propagate real tensor strides to backend compiler (#114154)
Fixes https://github.com/pytorch/pytorch/issues/113812, https://github.com/pytorch/pytorch/issues/102591, Probably fixes: https://github.com/pytorch/pytorch/issues/113740, https://github.com/pytorch/pytorch/issues/113786, https://github.com/pytorch/pytorch/issues/113788

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114154
Approved by: https://github.com/wconstab
2023-11-28 06:29:43 +00:00
Will Constable
2333d381b2 Make 'distributed' TORCH_LOGS include ddpoptimizer (#114376)
There are now 3 ways to see logs from ddpoptimzer.
1) TORCH_LOGS="distributed"
2) TORCH_LOGS="dynamo"
3) TORCH_LOGS="torch._dynamo.backends.distributed"

(1 and 2 are different supersets of 3 that also include other content)

Note: ddp_graphs is still a separate 'artifact' logger, which just
includes graph dumps from the graph-splitting process.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114376
Approved by: https://github.com/wanchaol
2023-11-28 02:39:28 +00:00
voznesenskym
081c5b3adc Add Stateful/Stateless symbolic contexts, use fresh fake mode for dynamo backends (#113926) (#114526)
Summary:

The primary problem we are setting out to solve here is fake tensor freshness. Before this PR, fake tensors after dynamo represented fake tensors *at the end* of trace, so subsequent retraces like aot_autograd would start off with fake tensors in the wrong (end result) state, rather than their expected fresh state. The solution here is to start a fresh fake mode, and re-fakify the tensors. The nuance comes from ensuring that symbols are uniformly created for the symbolic sizes and strides of the tensor.

This PR is the result of *a lot* of back and forth with ezyang and eellison. Initially, the first pass at this was not super different from what we have in the PR - the broad strokes were the same:

1) We cache source->symbol in shape_env
2) We pass policy objects around, stored at dynamo fakificaiton time, and reused for later fakification
3) We create a new fake mode for backends
(from https://github.com/pytorch/pytorch/pull/113605/files)

This is ugly, and has some layering violations. We detoured our decision making through a few other alternatives. Immutable/mutable fake tensor mode was the most interesting alternative, https://github.com/pytorch/pytorch/pull/113653, and was struck down on concerns of complexity in fake mode combined with it not covering all edge cases. We also detoured on what to do about tensor memoization returning back potentially different tensors than requested, and if that was an anti pattern (it is) we want to hack in with the symbol cache (we don't).

We went back to the drawing board here, but with a few concessions:
1) the cache for source->symbol must live outside of shape_env, for both lifecycle, and layering reasons
2) A good amount of work needs to be done to pipe policy around fake_mode and meta_utils correctly, to cover all the cases (ezyang did this)

cc penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 aakhundov kadeng

imported-using-ghimport

Test Plan: Imported from OSS

Reviewed By: huydhn, Chillee

Differential Revision: D51566250

Pulled By: voznesenskym

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114526
Approved by: https://github.com/Chillee, https://github.com/huydhn
2023-11-26 23:40:32 +00:00
PyTorch MergeBot
2f3beb715c Revert "Add Stateful/Stateless symbolic contexts, use fresh fake mode for dynamo backends (#113926)"
This reverts commit 2ca1119d53.

Reverted https://github.com/pytorch/pytorch/pull/113926 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/113926#issuecomment-1822713852))
2023-11-22 12:52:33 +00:00
PyTorch MergeBot
e239a2b2d7 Revert "[dynamo / DDP] - lazily compile submodules - to propagate real tensor strides to backend compiler (#114154)"
This reverts commit 266054c3ca.

Reverted https://github.com/pytorch/pytorch/pull/114154 on behalf of https://github.com/DanilBaibak due to The lower PR in the stack https://github.com/pytorch/pytorch/pull/113926 breaks the internal build ([comment](https://github.com/pytorch/pytorch/pull/114154#issuecomment-1822704476))
2023-11-22 12:46:15 +00:00
Jon Chuang
266054c3ca [dynamo / DDP] - lazily compile submodules - to propagate real tensor strides to backend compiler (#114154)
Fixes https://github.com/pytorch/pytorch/issues/113812, https://github.com/pytorch/pytorch/issues/102591, Probably fixes: https://github.com/pytorch/pytorch/issues/113740, https://github.com/pytorch/pytorch/issues/113786, https://github.com/pytorch/pytorch/issues/113788

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114154
Approved by: https://github.com/wconstab
2023-11-21 22:40:08 +00:00
voznesenskym
2ca1119d53 Add Stateful/Stateless symbolic contexts, use fresh fake mode for dynamo backends (#113926)
The primary problem we are setting out to solve here is fake tensor freshness. Before this PR, fake tensors after dynamo represented fake tensors *at the end* of trace, so subsequent retraces like aot_autograd would start off with fake tensors in the wrong (end result) state, rather than their expected fresh state. The solution here is to start a fresh fake mode, and re-fakify the tensors. The nuance comes from ensuring that symbols are uniformly created for the symbolic sizes and strides of the tensor.

This PR is the result of *a lot* of back and forth with @ezyang and @eellison. Initially, the first pass at this was not super different from what we have in the PR - the broad strokes were the same:

1) We cache source->symbol in shape_env
2) We pass policy objects around, stored at dynamo fakificaiton time, and reused for later fakification
3) We create a new fake mode for backends
(from https://github.com/pytorch/pytorch/pull/113605/files)

This is ugly, and has some layering violations. We detoured our decision making through a few other alternatives. Immutable/mutable fake tensor mode was the most interesting alternative, https://github.com/pytorch/pytorch/pull/113653, and was struck down on concerns of complexity in fake mode combined with it not covering all edge cases. We also detoured on what to do about tensor memoization returning back potentially different tensors than requested, and if that was an anti pattern (it is) we want to hack in with the symbol cache (we don't).

We went back to the drawing board here, but with a few concessions:
1) the cache for source->symbol must live outside of shape_env, for both lifecycle, and layering reasons
2) A good amount of work needs to be done to pipe policy around fake_mode and meta_utils correctly, to cover all the cases (@ezyang did this)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113926
Approved by: https://github.com/ezyang, https://github.com/eellison
2023-11-20 23:06:37 +00:00
Brian Hirsh
da914aed21 error when using _dynamo.optimize_ddp=True and _inductor.keep_output_stride=False together (#108235)
From talking to @wconstab, we agreed that because of the way DDPOptimizer is written, it is (sort of) incompatible with inductor's `keep_output_stride=False` optimizations (and will cause silent correctness problems if you use them ogether). Added an assertion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108235
Approved by: https://github.com/wconstab
ghstack dependencies: #108081
2023-09-05 20:02:35 +00:00
Animesh Jain
d0e5c681f5 [dynamo][ddp][ac] Fallback to single bucket when higher order op (#104639)
This helps unblock an internal model. The real fix requires lot of work, which might question the alternate approach of partitioning AOT graphs instead of Dynamo graphs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104639
Approved by: https://github.com/wconstab
2023-07-06 02:20:15 +00:00
Will Constable
55cf5c00fa Improve DDPOptimizer Logging (#103489)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103489
Approved by: https://github.com/ezyang
2023-06-14 22:24:44 +00:00
Will Constable
fee01640df Make DDPOptimizer handle subgraphs without outputs (#103488)
Subgraphs are partitions cut out of a whole graph. Outputs of a subgraph are either global outputs of the original graph, or can be outputs of a partition that feed inputs of the subsequent partition.  Subgraphs are created using the fx utility 'passes.split_module', which requires that each partition
have at least one output node.

In cases where DDPOptimizer asked the partitioner to cut the graph around a set of nodes which only
performed inplace mutation, the partitioner could be left trying to create a subgraph with no output nodes, violating its assumptions.

To circumvent this, DDPOptimizer can expand the set of nodes marked for inclusion in a subgraph that has no outputs until it includes a node that is an output for that subgraph. It still traverses nodes of the original graph in reverse order and only considers widening a subgraph by iterating further in reverse order than it would have ordinarily done (past the cut point dictated by paramter count). It may still be possible the subgraph reaches the input node of the graph without satisfying the subgraph-output condition, in which case an error would still be raised by the partitioner.

Fixes #103385

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103488
Approved by: https://github.com/anijain2305
2023-06-14 01:16:04 +00:00
Edward Z. Yang
fa40195fac Don't set_current_node in DDP. (#101046)
Fixes https://github.com/pytorch/pytorch/issues/101045

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101046
Approved by: https://github.com/wconstab, https://github.com/malfet
2023-05-12 14:37:22 +00:00
Bert Maher
e0bf51d3bf [dynamo] Add ddp_graphs artifact (#100021)
I want to be able to decouple DDP graph printing from the rest of
dynamo DEBUG-level logging, since frequently these logs are particularly
enlightening.

Differential Revision: [D45290919](https://our.internmc.facebook.com/intern/diff/D45290919/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100021
Approved by: https://github.com/wconstab, https://github.com/mlazos
2023-04-27 03:53:23 +00:00
Edward Z. Yang
b09722f540 Convert logging f-strings to use % format, part two (#98700)
This hits multi-line logging strings

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98700
Approved by: https://github.com/voznesenskym
2023-04-10 12:19:31 +00:00
Edward Z. Yang
9a8f71f23e Convert logging f-strings to use % format (#98697)
Codemod done with
https://gist.github.com/ezyang/2e8b0463cdc6be278478495b23ff0530 with
assistance from ChatGPT.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98697
Approved by: https://github.com/voznesenskym
2023-04-10 12:19:31 +00:00
Edward Z. Yang
d01ee10b25 Add detect_fake_mode (#98321)
This replaces fake_mode_from_tensors but it preferentially looks for
fake_mode in TracingContext and also if there is an active fake mode
on the dispatch stack, before groveling in tensors to find it.

This advances PegasusForCausalLM, which was previously failing because
we generated a graph that had a parameter (non-fake) and a SymInt,
and thus previously we failed to detect the correct fake mode.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98321
Approved by: https://github.com/voznesenskym
2023-04-05 22:15:16 +00:00
Edward Z. Yang
5df59f957f Fix G001,G002,G003 in logs to % syntax (#97812)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97812
Approved by: https://github.com/Skylion007, https://github.com/kiukchung, https://github.com/malfet, https://github.com/mlazos
2023-04-01 01:43:33 +00:00
Andrew Gu
d9cd9a13bc [BE][DDPOptimizer] De-dup p and param (#95654)
The `param` from `param = target.get_parameter(name)` should be the same as `p` from `target.named_parameters()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95654
Approved by: https://github.com/wconstab
2023-03-01 01:17:09 +00:00
Kazuaki Ishizaki
46385b3e48 Fix typos under torch/_dynamo directory (#95599)
This PR fixes typos in comments and messages of `.py` files under `torch/_dynamo` directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95599
Approved by: https://github.com/ezyang
2023-02-28 03:44:24 +00:00
Will Constable
9fb9219478 Make DDPOptimizer work with torch._dynamo.explain() (#94749)
GraphModules that were created during DDPOptimizer graph breaking
lacked `compile_subgraph_reason`, which caused an exception when
running .explain().

Now the reason is provided and users can use .explain() to find out
that DDPOptimizer is causing graph breaks.

Fixes #94579

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94749
Approved by: https://github.com/voznesenskym
2023-02-14 01:33:47 +00:00