This PR implements an opt-in configuration option for synchronizing compilation across all ranks at the end of Dynamo tracing (and potentially, other places in the future). There are two pieces to this PR:
1. Implementing infrastructure for compiler collectives (DistributedState/LocalState, the actual collective)
2. Using this infrastructure to synchronize automatic dynamic choices across all ranks
The infrastructure in part one can be used for other purposes; to do so, add more (serializable) fields to LocalState.
Here is how automatic dynamic synchronization works:
1. Preflight in "torch/_dynamo/variables/builder.py": On the first Dynamo trace run, we trace without automatic dynamic at all; we assume all Tensor inputs that are not otherwise marked are static. This run is purely to collect all Tensor input sizes in the program.
2. torch/_dynamo/output_graph.py: At the end of the first Dynamo trace run, we perform a compiler collective to distribute all Tensor input sizes to all ranks. Then, we restart Dynamo.
3. Apply the updates in "torch/_dynamo/variables/builder.py": Now that we have all sizes for every rank, we update frame state with the observed sizes for all ranks, in rank order. Under the assumption that frame state is consistent on all ranks, this series of updates will preserve consistency.
For future work, it would be safer if we force a consistent hint on all ranks; this is more involved as we have to interpose in fakification.
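The three steps above can be sketched in plain Python. This is only an illustration under stated assumptions: `LocalState` here is a simplified stand-in with a single hypothetical `input_sizes` field, `all_gather` simulates the compiler collective in-process instead of using a real distributed backend, and "frame state" is reduced to a dict in which `None` marks a dimension as dynamic.

```python
from dataclasses import dataclass, field


@dataclass
class LocalState:
    # Hypothetical field: input name -> size observed on this rank in the preflight run.
    input_sizes: dict = field(default_factory=dict)


def all_gather(local_states):
    # Stand-in for the compiler collective: every rank receives every
    # rank's LocalState, ordered by rank.
    return list(local_states)


def update_frame_state(frame_state, gathered):
    # Apply the observed sizes of all ranks in rank order. A dimension whose
    # size differs from what was seen before becomes dynamic (None).
    for state in gathered:
        for name, size in state.input_sizes.items():
            if name not in frame_state:
                frame_state[name] = list(size)
            else:
                seen = frame_state[name]
                frame_state[name] = [a if a == b else None for a, b in zip(seen, size)]
    return frame_state


# Two simulated ranks that saw different batch sizes for the same input.
ranks = [LocalState({"x": (8, 16)}), LocalState({"x": (4, 16)})]
gathered = all_gather(ranks)

# Every rank starts from the same (empty) frame state and applies the same
# updates in the same order, so all ranks reach the same decision.
per_rank = [update_frame_state({}, gathered) for _ in ranks]
```

Because each rank applies identical updates in identical order, both ranks end up treating dimension 0 of `x` as dynamic and dimension 1 as static, which is the consistency property the description relies on.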
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130935
Approved by: https://github.com/jansel
Summary:
- Log export errors to Scuba and mark each one as "classified" or "unclassified"
- Classify errors by exception type (ALLOW_LIST) and a `case_name` attribute
- Add `case_name` for some exceptions.
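A minimal sketch of the classification rule follows. The contents of `ALLOW_LIST`, the `ExportError` class, and the way the two criteria are combined (either one is enough) are assumptions made for illustration; the summary only states that both the exception type and a `case_name` attribute are considered.

```python
# Hypothetical allow list; the real contents are not given in this summary.
ALLOW_LIST = (ValueError, TypeError)


class ExportError(Exception):
    """Illustrative exception that can carry a case_name."""

    def __init__(self, msg, case_name=None):
        super().__init__(msg)
        self.case_name = case_name


def classify(exc):
    # Assumption: an error counts as classified if its type is allow-listed
    # or if it carries a non-empty case_name attribute.
    if isinstance(exc, ALLOW_LIST):
        return "classified"
    if getattr(exc, "case_name", None):
        return "classified"
    return "unclassified"


labels = [
    classify(ExportError("sym_min is not supported", case_name="torch_sym_min")),
    classify(ValueError("bad input")),
    classify(RuntimeError("unexpected failure")),
]
```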
Test Plan:
Running the code below logs a classified error to the `torch_export_usage` table in Scuba.
```
import torch
from torch._export.db.case import SupportLevel


class TorchSymMin(torch.nn.Module):
    """
    torch.sym_min operator is not supported in export.
    """

    def forward(self, x):
        return x.sum() + torch.sym_min(x.size(0), 100)


example_args = (torch.randn(3, 2),)
tags = {"torch.operator"}
support_level = SupportLevel.NOT_SUPPORTED_YET
model = TorchSymMin()

# Export is expected to fail here, which is what gets logged.
torch.export.export(model, example_args)
```
Differential Revision: D59981459
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131327
Approved by: https://github.com/zhxchen17
`unimplemented` is a function that raises an error, so
`raise unimplemented(...)` never reaches the `raise`.
Another related issue is that `raise unimplemented(...) from e`
doesn't attach the exception cause correctly. I fix this by adding
a `from_exc` argument to `unimplemented`.
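The behavior can be reproduced without Dynamo. The sketch below uses a simplified `unimplemented` and `Unsupported` defined locally; they are illustrative stand-ins and not the actual signatures in `torch/_dynamo/exc.py`.

```python
class Unsupported(RuntimeError):
    pass


def unimplemented(msg, *, from_exc=None):
    # This function always raises. When from_exc is given, it becomes the
    # __cause__ of the raised exception.
    if from_exc is not None:
        raise Unsupported(msg) from from_exc
    raise Unsupported(msg)


def old_style(d, key):
    try:
        return d[key]
    except KeyError as e:
        # unimplemented() raises while the expression is being evaluated, so
        # the surrounding "raise ... from e" never runs and no cause is set.
        raise unimplemented(f"missing key {key!r}") from e


def new_style(d, key):
    try:
        return d[key]
    except KeyError as e:
        unimplemented(f"missing key {key!r}", from_exc=e)


def capture(fn):
    try:
        fn({}, "a")
    except Unsupported as err:
        return err


old_err = capture(old_style)
new_err = capture(new_style)
```

In the old style the `KeyError` is only visible as the implicit `__context__`; with `from_exc` it is recorded as the explicit `__cause__`.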
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122136
Approved by: https://github.com/lezcano
Attempts to improve the input/output mismatch error by first checking whether the inputs/outputs can be pytree flattened into supported types (tensors, symints, ...). If a user passes in a data structure that has no pytree flatten registration, this will error with the message "It looks like one of the inputs is with type CustomType is not supported or pytree flatten-able.... please register a pytree flatten/unflatten function using the pytree.register_pytree_node API".
The check inside of produce_matching should now only error if something unexpected happens (Dynamo accidentally adds an input or removes an output), and should be considered an internal error.
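The idea of the up-front check can be illustrated with a small hand-rolled flattener. Everything here is a stand-in: `FLATTEN_REGISTRY`, `SUPPORTED_LEAVES`, and `check_flattenable` are hypothetical names, and plain Python scalars play the role of tensors and symints so that the sketch runs without torch.

```python
# Stand-ins for the supported leaf types (tensors, symints, ...).
SUPPORTED_LEAVES = (int, float, bool, type(None))

# Hypothetical registry: container type -> function returning its children.
FLATTEN_REGISTRY = {
    list: lambda x: list(x),
    tuple: lambda x: list(x),
    dict: lambda x: list(x.values()),
}


def flatten(obj):
    # Recursively flatten registered containers; anything else is a leaf.
    children_of = FLATTEN_REGISTRY.get(type(obj))
    if children_of is None:
        return [obj]
    leaves = []
    for child in children_of(obj):
        leaves.extend(flatten(child))
    return leaves


def check_flattenable(inputs):
    # Fail early, with a user-facing message, if any leaf is an unregistered type.
    for leaf in flatten(inputs):
        if not isinstance(leaf, SUPPORTED_LEAVES):
            raise TypeError(
                f"input of type {type(leaf).__name__} is not supported or "
                "pytree flatten-able; register a flatten/unflatten function for it"
            )


class CustomType:
    pass


check_flattenable(([1, 2], {"a": 3.0}))  # passes: every leaf is supported

try:
    check_flattenable((1, CustomType()))
    message = None
except TypeError as err:
    message = str(err)
```

Registering a flatten function for `CustomType` in the registry would make the second call pass, which mirrors the advice in the error message.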
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117598
Approved by: https://github.com/avikchaudhuri, https://github.com/BowenBao
In my work on making guards installed eagerly (look up the stack), I found that our checkpoint/restore mechanism is very broken. There is lots of state (especially in shape_env) which we don't checkpoint and restore properly. We also have lots of mutable state on variable trackers already which is not checkpointed/restored. (See other PRs in this stack for some spot fixes.)
Since we wanted to get rid of this anyway for making VariableTracker mutable, I figured I would just switch to restarting analysis.
For other usages of copy_graphstate/restore_graphstate:
1) Many usages were unnecessary; these are removed in PRs below this.
2) Some other usage (similar to this one) is removed in PRs above this.
3) The tricky one I am not handling is higher_order_ops, which uses checkpoint/restore a lot. There might be some cases there where this speculate/restart trick won't work.
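The speculate/restart approach can be sketched as follows. This is a toy model, not Dynamo's implementation: `RestartAnalysis`, `analyze`, and the string "instructions" are illustrative. The point is that no state is snapshotted; a failed speculation is recorded and the analysis simply starts over from the beginning.

```python
class RestartAnalysis(Exception):
    """Raised to abandon the current analysis and start over."""


def analyze(instructions, failed_speculations):
    graph = []
    for index, inst in enumerate(instructions):
        if index in failed_speculations:
            # A previous attempt failed here, so do not speculate again.
            graph.append(("graph_break", inst))
            continue
        if inst == "unsupported":
            # Remember the failure and restart, instead of restoring a checkpoint.
            failed_speculations.add(index)
            raise RestartAnalysis(index)
        graph.append(("traced", inst))
    return graph


def run(instructions):
    failed_speculations = set()
    while True:
        try:
            return analyze(instructions, failed_speculations)
        except RestartAnalysis:
            # Each restart records one new index, so the loop terminates.
            continue


result = run(["add", "unsupported", "mul"])
```

The cost of this design is that work before the failure point is redone on every restart; the benefit is that no mutable state has to be copied or rolled back.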
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112902
Approved by: https://github.com/voznesenskym
We want to get to a point where most UserErrors link to exportdb examples. This PR makes passing case names non-optional to make this intent clearer and encourage developers who raise UserErrors to make or point to examples that make fixing such errors more obvious for users.
In addition, sometimes there are multiple examples that are relevant to an error. Thus this PR also enables passing multiple case names.
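As a rough illustration of the intent, the sketch below defines a hypothetical `SketchUserError` whose `case_names` argument is required and accepts several names. The class name, the message format, and the case names used are all made up for this example and are not the actual `UserError` signature.

```python
class SketchUserError(Exception):
    """Illustrative error type that must be given one or more case names."""

    def __init__(self, msg, case_names):
        # case_names has no default, so callers cannot omit it.
        self.case_names = list(case_names)
        super().__init__(f"{msg} (see exportdb cases: {', '.join(self.case_names)})")


# Hypothetical case names, used only for illustration.
err = SketchUserError("predicate must be a tensor", ["case_a", "case_b"])

try:
    SketchUserError("no examples given")
    omitted_ok = True
except TypeError:
    omitted_ok = False
```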
Retry of #110733, which was reverted due to a land race.
Differential Revision: [D50087148](https://our.internmc.facebook.com/intern/diff/D50087148/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110878
Approved by: https://github.com/gmagogsfm, https://github.com/tugsbayasgalan
We want to get to a point where most `UserError`s link to `exportdb` examples. This PR makes passing case names non-optional to make this intent clearer and encourage developers who raise `UserError`s to make or point to examples that make fixing such errors more obvious for users.
In addition, sometimes there are multiple examples that are relevant to an error. Thus this PR also enables passing multiple case names.
Differential Revision: [D50020465](https://our.internmc.facebook.com/intern/diff/D50020465/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110733
Approved by: https://github.com/zhxchen17
Ideally all `_dynamo.exc.UserError`s should have "case names", i.e., link to examples in `exportdb`.
This PR adds case names to several instances of `_dynamo.exc.UserError`. In particular, looking at coverage based on `UserErrorType`:
* `DYNAMIC_CONTROL_FLOW`, `ANTI_PATTERN`, and `STANDARD_LIBRARY` are fully covered.
* `CONSTRAINT_VIOLATION` and `DYNAMIC_DIM` have no coverage. We don't seem to have any dedicated examples of specifying dynamic shapes in `exportdb` (although they are used in some other examples without explanation, to avoid some specialization that would make such examples moot).
* `INVALID_INPUT` is only partly covered. Frankly this is tedious to cover via examples.
Differential Revision: [D49928518](https://our.internmc.facebook.com/intern/diff/D49928518/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110555
Approved by: https://github.com/angelayi, https://github.com/ydwu4
We want cond to always throw errors regardless of the user's torch.compile mode.
The current implementation is to
1. Catch the UserError.GRAPH_BREAK_IN_CONTROL_FLOW and, once we see it, re-raise it directly in [break_graph_if_unsupported](bad3f2db40/torch/_dynamo/symbolic_convert.py (L1250)), which catches and raises for call_function (the entry point of higher order operators) and a few others.
2. The raised exception is caught and raised again in [step](bad3f2db40/torch/_dynamo/symbolic_convert.py (L691)), where all instructions' exceptions are handled.
3. At the top level, we treat it as a hard error and do not suppress it.
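The three layers can be modeled in a few lines. This is a simplified sketch, not the code in `symbolic_convert.py`: `ControlFlowGraphBreak` stands in for a `UserError` with `GRAPH_BREAK_IN_CONTROL_FLOW`, and `compile_frame` with its `suppress_errors` flag is a hypothetical top-level entry point.

```python
class ControlFlowGraphBreak(Exception):
    """Stand-in for UserError with GRAPH_BREAK_IN_CONTROL_FLOW."""


class Unsupported(Exception):
    """Stand-in for an ordinary unsupported-construct error."""


def call_function(name):
    if name == "cond":
        raise ControlFlowGraphBreak("graph break inside cond")
    raise Unsupported(f"cannot trace {name}")


def break_graph_if_unsupported(name):
    try:
        call_function(name)
    except ControlFlowGraphBreak:
        raise  # layer 1: never turned into a graph break
    except Unsupported:
        return "graph_break"


def step(name):
    try:
        return break_graph_if_unsupported(name)
    except ControlFlowGraphBreak:
        raise  # layer 2: passed through the per-instruction handler


def compile_frame(name, suppress_errors):
    try:
        return step(name)
    except ControlFlowGraphBreak:
        raise  # layer 3: hard error even when errors are suppressed
    except Exception:
        if suppress_errors:
            return "eager_fallback"
        raise


ordinary = compile_frame("print", suppress_errors=True)

try:
    compile_frame("cond", suppress_errors=True)
    cond_raised = False
except ControlFlowGraphBreak:
    cond_raised = True
```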
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108027
Approved by: https://github.com/zou3519
ghstack dependencies: #108025, #108026
Previously, when an exception was raised during tracing in torch.export(), Dynamo displayed this suggestion:
“You can suppress this exception and fall back to eager by setting: import torch._dynamo torch._dynamo.config.suppress_errors = True”
Falling back to eager is not viable in torch.export(), so this diff suppresses the suggestion during export.
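A minimal sketch of the resulting behavior, assuming a hypothetical helper `augment_message` with an `export` flag (the actual function and flag names in Dynamo are not given here):

```python
SUGGESTION = (
    "You can suppress this exception and fall back to eager by setting:\n"
    "    import torch._dynamo\n"
    "    torch._dynamo.config.suppress_errors = True"
)


def augment_message(msg, export):
    # During export there is no eager fallback, so the hint is omitted.
    if export:
        return msg
    return f"{msg}\n\n{SUGGESTION}"


compile_msg = augment_message("tracing failed", export=False)
export_msg = augment_message("tracing failed", export=True)
```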
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105767
Approved by: https://github.com/anijain2305
Summary:
We are working toward full model compilation, where a compilation error causes a fallback to eager mode rather than a hard failure.
But at the same time, we should fix these issues if they are bugs. We will:
* 1/ log warnings in OSS;
* 2/ log warnings and write them into Scuba in fbcode;
to prevent us from ignoring these issues.
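The fallback-plus-logging pattern might look like the sketch below. `compile_or_fallback`, the `compiler` callable, and the `log_to_table` hook are hypothetical names; the hook stands in for whatever writes a record to an internal table, and is simply left unset in the OSS case.

```python
import logging

log = logging.getLogger("compile_fallback_sketch")


def compile_or_fallback(fn, compiler, log_to_table=None):
    try:
        return compiler(fn)
    except Exception as e:
        # Always warn, so the failure is not silently ignored.
        log.warning("compilation failed, falling back to eager: %r", e)
        # Optional hook for additionally recording the failure in a table.
        if log_to_table is not None:
            log_to_table({"error": type(e).__name__, "message": str(e)})
        return fn  # eager fallback: run the original function unchanged


def square(x):
    return x * x


def failing_compiler(fn):
    raise RuntimeError("backend bug")


records = []
compiled = compile_or_fallback(square, failing_compiler, log_to_table=records.append)
value = compiled(3)
```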
Test Plan: Manual test
Differential Revision: D47506314
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105307
Approved by: https://github.com/jansel