Summary:
Customize splitter behavior to mark `get_attr` nodes as acc supported.
Currently these nodes are excluded by `FxNetAccNodesFinder` which marks all nodes with op not in `CALLABLE_NODE_OPS` ("call_module", "call_function", "call_method") as unsupported.
Before this change, merge-net is split into an almost empty cpu submodule with a single empty output node:
```
INFO:caffe2.torch.fb.model_transform.experimental.prepare_fx_model:###### debug_print nodes for _run_on_cpu_0
INFO:caffe2.torch.fb.model_transform.experimental.prepare_fx_model:Found output node: n.name='output', n.target='output', n.args=((),), n.kwargs={}, n.meta={}
INFO:caffe2.torch.fb.model_transform.experimental.prepare_fx_model:return ()
INFO:caffe2.torch.fb.model_transform.experimental.prepare_fx_model:
_run_on_cpu_0 stats for merge:
[output] output: 1
```
full log: P1678727348 (generated using same command as below)
Test Plan:
Tested by lowering `ig_organic_feed_cn_v2_mtml` using cmd:
```
buck run mode/opt-split-dwarf //tgif/cli:cli -- --model-name=ig_organic_feed_cn_v2_mtml --model-type ig_organic_feed_cn_v2_mtml --world-size=1 --storage-mode 1 --inference-dtype=FP16 --meta-transform=False --use-random-weights=True --accelerator-arch=3 --enable-input-dist=True --embedding-tables-dtype=FP16 --mtia-use-torch-export=True embedding-quantization-pass torchrec-sharding-pass tgif-split-pass gen-app-graph-pass tgif-mtia-lowering-pass dense-quantization-pass save-torch-package-pass generate-model-package-pass pack-weights-and-save-pass 2>&1 | tee /tmp/publish_ig_organic_feed_cn_v2_mtml_mtia_export_20241114_splitter_2.log
```
Output shows only 1 acc submodule is generated for merge:
```
INFO 18:33:15.951 1735650 utils.py:235: [TGIF] num of acc submodules: 1
INFO 18:33:15.952 1735650 utils.py:236: [TGIF] num of cpu submodules: 0
INFO 18:33:16.534 1735650 logging_utils.py:53: [TGIF] _run_on_acc_0 graph module debug info: https://www.internalfb.com/intern/everpaste/?color=0&handle=GK4VKhWsDKF9VdsDAKxhR6KAlhJ0br0LAAAz
INFO 18:33:16.534 1735650 utils.py:257: [TGIF] Start MTIA lowering _run_on_acc_0 in merge, device ordinal: -1
```
full log: P1679596796
Differential Revision: D65983916
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140794
Approved by: https://github.com/ezyang
Currently, we get all partition id by iterating assignment whose size is same as the number of nodes in graph. But we can reach same results by iterating partitions_by_id whose size is much smaller than the nodes number. Assume the number of nodes is N, the number of partitions is P, the time complexity decrease from O(N * N) to O(N * P) after this patch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136598
Approved by: https://github.com/mcr229
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Summary: output nodes may be eliminated to the input nodes if only partial output nodes are specified. add option to check results for all output nodes in the partitioned graph
Test Plan: see D65367305
Reviewed By: qcyuan
Differential Revision: D65367305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139774
Approved by: https://github.com/jfix71
This fix was a bit more involved:
1) It fixes a item_memo loss place.
2) It updates a test to be eager instead of aot_eager since it reveals a very obscure bug related to replacements that's not worth solving since in practice inductor will regenerate the runtime asserts anyways
3) It updates tensorify to specialize more places now that the aforementioned bug is fixed.
Fixes `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=6 python test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoCPU.test_comprehensive_linalg_norm_cpu_float16` when `specialize_float=False`
while ensuring `python test/dynamo/test_dynamic_shapes.py DynamicShapesMiscTests.test_runtime_assert_replacement_dynamic_shapes` doesn't regress
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139587
Approved by: https://github.com/ezyang
ghstack dependencies: #139569, #139457, #139568, #139572, #139846, #139454, #139896, #139935
Currently, we get all partition id by iterating assignment whose size is same as the number of nodes in graph. But we can reach same results by iterating partitions_by_id whose size is much smaller than the nodes number. Assume the number of nodes is N, the number of partitions is P, the time complexity decrease from O(N * N) to O(N * P) after this patch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136598
Approved by: https://github.com/ezyang
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Summary:
In most cases, we don't need to turn on AttrProxy tracing for two reasons:
1. It's only needed when you have one submodule owning multiple FQNs.
2. AND it will cause model using module identity to be traced incorrectly (because we substitute module objects at tracing time).
Overall after offline discussion with some export folk, we think it's better to turn off AttrProxy if we can make sure every submodule has unique FQN, which tends to be the common case.
Test Plan: buck test mode/opt caffe2/test:test_export -- -r module_dict_key
Differential Revision: D65555919
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139918
Approved by: https://github.com/tugsbayasgalan
Summary:
save around 8% on the torchrec model.
In most case the new implications are not optimizaiton anyway in some case though they are,
but optimizing them is useless.
ex:
```
generating implications for Eq(Mod(s0, 3), 0)
adding Eq(Mod(s0, 3), 0)
adding Eq(0, Mod(s0, 3))
adding Ne(Mod(s0, 3), 0)
adding Ne(0, Mod(s0, 3))
adding Mod(s0, 3) <= 0
adding 0 < Mod(s0, 3)
adding True
adding False
```
VS
```
generating implications for Eq(Mod(s0, 3), 0)
adding Eq(Mod(s0, 3), 0)
adding Eq(0, Mod(s0, 3))
adding Ne(Mod(s0, 3), 0)
adding Ne(0, Mod(s0, 3))
adding Mod(s0, 3) <= 0
adding 0 < Mod(s0, 3)
adding 0 <= Mod(s0, 3)
adding Mod(s0, 3) < 0
```
the main difference is that 0 <= Mod(s0, 3) can be simplified to True and Mod(s0, 3) < 0 to False but with this change
this wont happen. but True:True and False: False are useless anyway lol. so its ok i think
```
buck2 run fbcode//mode/opt fbcode//torchrec/distributed/tests:pt2_compile_benchmark -- --num-features=1000
```
<img width="1082" alt="Screenshot 2024-11-04 at 9 25 51 PM" src="https://github.com/user-attachments/assets/a26e291b-9280-4b55-9275-f3201a36ac51">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139738
Approved by: https://github.com/ezyang
ghstack dependencies: #139703
Summary:
While testing exportability for PT2 Inference models, we found various cases of invalid op inputs during tracing, for example errors like: `a and b must have same reduction dim`, `expected scalar type Long but found Int`, etc. Looking more closely, these happened to due the same few meta kernels & eager kernels producing mismatched outputs upstream (e.g. different output tensor dtype, int output).
Adding checks to catch mismatched outputs in real tensor prop upstream, so errors are raised at the mismatched op, instead of the downstream ops taking them as inputs. Relies a lot on utils from [CrossRefFakeMode](929797dedb/torch/_subclasses/fake_utils.py (L78))
Follow ups: could add more checks, and maybe have a flag to only enable these for cases like draft mode, so perf doesn't suffer?
Test Plan: test_export, test_fake_tensor
Differential Revision: D64210055
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137747
Approved by: https://github.com/zou3519
These are not artificial patterns I come up. They shows up in linear+CrossEntropyLoss graph.
Consider this snippet:
```
class LinearAndCEL(nn.Module):
def __init__(self):
super().__init__()
self.linear = nn.Linear(C, V)
self.ce = nn.CrossEntropyLoss()
def forward(self, x, y):
return self.ce(self.linear(x).view(B * T, V), y.view(-1))
```
`x` passed to `forward` is a 3D tensor of shape [B, T, C].
The `self.linear` will view x as [BxT, C] shape tensor first, do the matmul and produce a [BxT, V] tensor, and then view this output back to a 3D tensor with shape [B, T, V]. User code is gonna add another view op to convert the tensor shape to [B x T, V]. This generates a pair of redundant views . A pair of redundant permute happens in the backward part when we compute gradients.
The view ops makes it hard to chunk linear+CEL. When the view op breaks up the dimension being chunked, what should the chunker do (even if we merge those dimension again later)? Removing these pointless view pairs makes the chunker simpler. And I think it's in general nice to do.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139136
Approved by: https://github.com/Chillee, https://github.com/jansel
Currently, we get all partition id by iterating assignment whose size is same as the number of nodes in graph. But we can reach same results by iterating partitions_by_id whose size is much smaller than the nodes number. Assume the number of nodes is N, the number of partitions is P, the time complexity decrease from O(N * N) to O(N * P) after this patch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136598
Approved by: https://github.com/tarun292
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
As discussed w/ @ezyang offline, one way to de-risk the `specialize_float=False` rollout is to specialize all backed symfloats that we fail to tensorify away. This diff does a few things:
1) It fixes a bug where item_memo gets dropped (due to incorrect epoch invalidation)
2) It updates the tensorify pass to do the backup specialization
This pass was originally part of the [PR](https://github.com/pytorch/pytorch/pull/137782) that flips `specialize_float=False` but we learned that the blast radius is simply too large. We've pivoted to a more milestone driven approach where we learn from the failures of the aforementioned PR and cherry pick fixes into main first. After this current PR lands our strategy is as follows:
1) Integrate turning off specialize float only in the automatic dynamic pass.
2) Put up a canary diff that only turns off specialize float in `backend=eager` mode to sniff out symfloat related bugs in dynamo due to code paths we previously never exercised.
3) Put up a canary diff that only turns off specialize float in `backend=aot_eager` mode to sniff out symfloat related bugs in aotautograd due to code paths we previously never exercised.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138868
Approved by: https://github.com/ezyang
These are not artificial patterns I come up. They shows up in linear+CrossEntropyLoss graph.
Consider this snippet:
```
class LinearAndCEL(nn.Module):
def __init__(self):
super().__init__()
self.linear = nn.Linear(C, V)
self.ce = nn.CrossEntropyLoss()
def forward(self, x, y):
return self.ce(self.linear(x).view(B * T, V), y.view(-1))
```
`x` passed to `forward` is a 3D tensor of shape [B, T, C].
The `self.linear` will view x as [BxT, C] shape tensor first, do the matmul and produce a [BxT, V] tensor, and then view this output back to a 3D tensor with shape [B, T, V]. User code is gonna add another view op to convert the tensor shape to [B x T, V]. This generates a pair of redundant views . A pair of redundant permute happens in the backward part when we compute gradients.
The view ops makes it hard to chunk linear+CEL. When the view op breaks up the dimension being chunked, what should the chunker do (even if we merge those dimension again later)? Removing these pointless view pairs makes the chunker simpler. And I think it's in general nice to do.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139136
Approved by: https://github.com/Chillee, https://github.com/jansel
Fixes https://github.com/pytorch/pytorch/issues/138920. See comments there for details.
I still need to try to get a smaller repro to write an actual test. But suppressing the guards, I now no longer see the specilization in the CA graph in the linked example:
```
aot1_view_3: ... = torch.ops.aten.view.default(aot1_tangents_1, [aot1_sym_size_int, 48, 1])
aot1_view_4: ... = torch.ops.aten.view.default(aot1_view_3, [aot1_sym_size_int, 48])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138968
Approved by: https://github.com/yf225, https://github.com/xmfan
As discussed w/ @ezyang offline, one way to de-risk the `specialize_float=False` rollout is to specialize all backed symfloats that we fail to tensorify away. This diff does a few things:
1) It fixes a bug where item_memo gets dropped (due to incorrect epoch invalidation)
2) It updates the tensorify pass to do the backup specialization
This pass was originally part of the [PR](https://github.com/pytorch/pytorch/pull/137782) that flips `specialize_float=False` but we learned that the blast radius is simply too large. We've pivoted to a more milestone driven approach where we learn from the failures of the aforementioned PR and cherry pick fixes into main first. After this current PR lands our strategy is as follows:
1) Integrate turning off specialize float only in the automatic dynamic pass.
2) Put up a canary diff that only turns off specialize float in `backend=eager` mode to sniff out symfloat related bugs in dynamo due to code paths we previously never exercised.
3) Put up a canary diff that only turns off specialize float in `backend=aot_eager` mode to sniff out symfloat related bugs in aotautograd due to code paths we previously never exercised.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138868
Approved by: https://github.com/ezyang
Summary:
Added where logs are being added to constrain violations in draft export.
Example output:
```
1. Constraint violation error.
The specified input dynamic_shapes spec was found to be incorrect during tracing.
Specifically, this guard was added: Eq(s0, 3), where {'s0': "L['args'][0][0].size()[0]"}.
This occured at the following stacktrace:
File /data/users/angelayi/fbsource/buck-out/v2/gen/fbcode/1beb9df83fd74b9a/scripts/angelayi/draft_export/__test_draft_export__/test_draft_export#link-tree/torch/nn/modules/module.py, lineno 1736, in _wrapped_call_impl
File /data/users/angelayi/fbsource/buck-out/v2/gen/fbcode/1beb9df83fd74b9a/scripts/angelayi/draft_export/__test_draft_export__/test_draft_export#link-tree/torch/nn/modules/module.py, lineno 1747, in _call_impl
File /data/users/angelayi/fbsource/buck-out/v2/gen/fbcode/1beb9df83fd74b9a/scripts/angelayi/draft_export/__test_draft_export__/test_draft_export#link-tree/scripts/angelayi/draft_export/test_draft_export.py, lineno 138, in forward.
Because of this, we have modified the dynamic shapes structure to be the following:
```
dynamic_shapes = {'a': {0: 3}}
```
```
The result of this diff is also that `dynamic` logs are permanently turned on during draft export. Otherwise we cannot capture the `[guard added]` logs from symbolic_shapes.py.
Test Plan: `buck2 run @//mode/dev-nosan scripts/angelayi/draft_export:test_draft_export -- -r "test_shape_failure" `
Differential Revision: D64862374
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138748
Approved by: https://github.com/ezyang
Looking at the function record_shapeenv_event its hard to tell that it does not always run
but we do disable it by setting top level is_recording to True self.should_record_events is false
this makes it more explicit to avoid confusion and overloading is_recording.
alternativley we can rename is_recording to do_no_record.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138965
Approved by: https://github.com/ezyang
ghstack dependencies: #138804
Title + we avoid calling defer_assert when we statically know the guard results.
timing for pnasnet5large
```
TIMING: code_gen:21.79672 inductor_compile:39.57726 backend_compile:65.30649 entire_frame_compile:95.22052 total_wall_time:95.22052
```
matches with out the diff
```
TIMING: code_gen:21.89314 inductor_compile:39.72298 backend_compile:65.38539 entire_frame_compile:95.0854 total_wall_time:95.0854
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138804
Approved by: https://github.com/ezyang
Title + we avoid calling defer_assert when we statically know the guard results.
timing for pnasnet5large
```
TIMING: code_gen:21.79672 inductor_compile:39.57726 backend_compile:65.30649 entire_frame_compile:95.22052 total_wall_time:95.22052
```
matches with out the diff
```
TIMING: code_gen:21.89314 inductor_compile:39.72298 backend_compile:65.38539 entire_frame_compile:95.0854 total_wall_time:95.0854
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138804
Approved by: https://github.com/ezyang
Tested internally here: https://www.internalfb.com/diff/D64057744
This is a reland after previous internal failures.
main change is
```
if min is None and max is None:
torch._check_is_size(size)
return
```
Partially addresses https://github.com/pytorch/pytorch/issues/128150
When you have big sums of values, we end up computing long chains of
binary addition in our FX graph representation. Not only is this ugly,
it also is quadratic, as the sympy.Add constructor is O(N) in number
of arguments. Instead, ensure that we maintain the summation as a
single FX node so we can do the entire addition all in one go.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138660
Approved by: https://github.com/ezyang, https://github.com/bobrenjc93
As discussed with @ezyang, this set of diffs are extracting fixes to problems discovered to flipping `specialize_float=False` in https://github.com/pytorch/pytorch/pull/137782. Since these codepaths are exercised in existing tests, I'm going to bias towards shipping speed and put these up with the primary test plan as the global CI. These code paths are all tested via existing tests when `specialize_float=False` and it feels a bit wonky to add more gated tests that only test behavior when this flag is True, especially since these code paths are already covered. That being said, I'm happy to add individual tests if reviewers insist or have a different POV.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138599
Approved by: https://github.com/ezyang
Previously we'd been raising UserErrors when `Dim()` and DimHints (`Dim.AUTO/Dim.DYNAMIC`) were both specified in `dynamic_shapes`, this PR stops that, and uses `Dim()` objects to guide DimHints.
The key to this was making the `EqualityConstraint` class happy when it checks that inferred equivalence relations were specified in the original `dynamic_shapes` spec, and this introduces a `RelaxedConstraint` object to mark the hinted dimensions, so equality checks between `RelaxedConstraints` and other constraints are treated as valid.
Current behavior is that:
```
class Foo(torch.nn.Module):
def forward(self, x, y):
return x - y
inputs = (torch.randn(4, 4), torch.randn(4, 4))
shapes = {
"x": (Dim.AUTO, Dim("d1", min=3)),
"y": (Dim("d0", max=8), Dim.DYNAMIC),
}
ep = export(Foo(), inputs, dynamic_shapes=shapes)
```
The dimensions marked `AUTO` and `DYNAMIC` will have max & min ranges of 8 & 3 respectively. Note that inferred equality between `Dim()` objects & `Dim.STATIC` will still raise errors - `Dim()` suggests not specializing to a constant.
Differential Revision: D64636101
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138490
Approved by: https://github.com/avikchaudhuri
Currently when tuples values are encountered in dynamo, they are encoded using `repr(arg)`. This causes an issue if one of the values inside of the tuple will not be properly encoded. In this case, if an enum is contained inside of a tuple, it will cause invalid python code to be generated
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133123
Approved by: https://github.com/jansel
Instead of calling `safe_expand` right after symbolic expression construction, we invoke it in `ShapeEnv.simplify`. This enables more simplification with product form, e.g.,
```
(a + b)^2 / (a + b) --> (a + b)
```
which won't happen if we expand eagerly during product construction:
```
(a^2 + 2ab + b^2) / (a + b) --> no change
```
Fixes#136044.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138235
Approved by: https://github.com/ezyang
Summary:
add testing for autocast and set_grad nodes for export_for_training. In export_for_training, we do not wrap the autocast and set_grad node in to HOP, but we should still have the set_grad_enabled/autocast nodes.
add support for autocast in non-strict export. Previously, `_enter_autocast` and `_exit_autocast` nodes don't show up in the export graph when we use `strict=False`.
- In autocast's enter and exit function, we dispatch to `PreDispatchTorchFunctionMode.__torch_function__`.
if we have PreDispatchTorchFunctionMode in our function_mode_stack, the call stack looks like below. This is mostly the same call stack as strict mode, except strict mode enters [here](https://www.internalfb.com/code/fbsource/[0d4f1135cacdb26c6e01d5dce1ce52a15d61ee48]/xplat/caffe2/torch/_dynamo/variables/ctx_manager.py?lines=806).
```
- torch.amp.autocast.__enter__()'s torch.overrides.handle_torch_function
- torch.fx.experimental.proxy_tensor.TorchFunctionMetadataMode.__torch_function__
- torch.amp._enter_autocast()'s torch.overrides.handle_torch_function
- PreDispatchTorchFunctionMode.__torch_function__
```
- in `PreDispatchTorchFunctionMode.__torch_function__`, we create the autocast nodes.
- to match the strict mode behavior, we let the input node to the `_exist_autocast` node be the corresponding `_enter_autocast` node. This requires us to maintain a stack in `PreDispatchTorchFunctionMode`.
Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r test_export_with_autocast
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r test_export_with_set_grad
```
Differential Revision: D64016023
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137495
Approved by: https://github.com/bdhirsh
We use nn_module_stack in unflatten to recognize when module calls begin and end. However the current format is not sufficient to detect module call boundaries when we have successive calls to the same module, because the successive instructions (end of one call, begin of next call) have the same nn_module_stack. This causes us to effectively "unroll" successive calls to a single call. This can cause problems when preserving module call signatures because the outputs of the successive calls might be concatenated in the single call.
Previously we introduced the concept of a "call index" to generate multiple graphs when unflattening, one per call. This PR pushes this concept into nn_module_stack itself. In particular, the keys of nn_module_stack now go from `key` to `key@call_index`. (In a previous attempt, https://github.com/pytorch/pytorch/pull/137457, instead values in nn_module_stack go from (fqn, type) to (fqn, type, call_index), which is BC-breaking.)
Note that we still do not have the ability to preserve module call signatures for multiple calls to the same module. But now instead of randomly crashing we give a proper error. OTOH when not preserving module call signatures we simply generate multiple calls, each with its own graph, possibly deduplicated, matching what we would do for non-successive calls.
Test Plan: Like D64014936
Differential Revision: D64136277
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137646
Approved by: https://github.com/angelayi
This is a utility to aid the torch.compile debugging. You provide a function that returns True on success, False on failure, or do something out of process and run bisect_helper `good | bad`.
The bisector will first go through backends - `eager`, `aot_eager`, `aot_eager_decomp_partition`, `inductor` to find the first failing backend. Then, it will go through subsystems within the backend - currently limited but could be expanded - and try to find the first subsystem for which disabling fixes the problem. Once it has found the failing subsystem, it will find the number of times the subsystem is applied, and then bisect through it.
An example usage of how to hook it up for aot_eager_decomp_partition and decomposition subsystem is :
```
from torch._inductor.bisect_helper import BisectionManager
if op in CURRENT_DECOMPOSITION_TABLE:
if BisectionManager.disable_subsystem("aot_eager_decomp_partition", "decomposition", lambda: repr(op)):
return NotImplemented
```
Once it has discovered the problematic change, it will print out the associated debug info, and you can set the same limits with `TORCH_BISECT_BACKEND` `TORCH_BISECT_SUBSYSTEM` and `TORCH_BISECT_MAX`.
We could add further options as an automated way of going through a check list for checking divergence - e.g., the mode to emulate amp casts.
Fix for https://github.com/pytorch/pytorch/issues/126546
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131936
Approved by: https://github.com/ezyang
Partially addresses https://github.com/pytorch/pytorch/issues/128150
When you have big sums of values, we end up computing long chains of
binary addition in our FX graph representation. Not only is this ugly,
it also is quadratic, as the sympy.Add constructor is O(N) in number
of arguments. Instead, ensure that we maintain the summation as a
single FX node so we can do the entire addition all in one go.
update_hint_regression benchmark, before and after:
```
update_hint_regression,compile_time_instruction_count,2648328980
update_hint_regression,compile_time_instruction_count,2563748678
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136429
Approved by: https://github.com/isuruf
When we populate unlifted graph module, we actually only "unlift" constant tensor inputs which is problematic because export de-duplicates aliasing constants. As a result, we only register one constant instead of two constants. This PR fixes that by querying ep.constants table instead of ep.graph_signature.lifted_tensor_constants.
Differential Revision: [D63743111](https://our.internmc.facebook.com/intern/diff/D63743111)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137162
Approved by: https://github.com/pianpwk
To see the payoff, look at test/dynamo/test_logging.py
The general idea is to refactor produce_guards into produce_guards_verbose which also returns verbose code parts, which have our annotations.
The rest of the logic is plumbing around SLocs to the places they need to be so we can print them. Guards are easy; value ranges and duck sizing take more care.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136917
Approved by: https://github.com/anijain2305
Summary: Previously `_inline_module ` helper function only works with submodules that have args specified. This diff updates the util function to look for input arguments from submodule kwargs first using placeholder node names, then fallback to list of args if node name not found.
Test Plan:
```
buck2 run @//mode/{opt,mtia,inplace} //glow/fb/fx/fba/tests:test_fba_inductor -- -r test_connected_fusions
```
Differential Revision: D63347675
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136631
Approved by: https://github.com/jfix71
By refactoring this way, I can put a non-expiring LRU cache here.
Splitting also will make it easier for me to tell who is using up all
the time.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135107
Approved by: https://github.com/aorenste
For tracing cond/while in eager, we trace the HOP with the eager backend with metadata torchfunction mode enabled. HOPs disallow the mutation that occurs in this torch function mode, so it is not able to be traced. As a result, we use a custom backend which enters this mode for tracing these HOPs. Thanks to @ydwu4 for the help with implementing this
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134732
Approved by: https://github.com/ydwu4
If node is AC region output and has a backward hook on it, we intentionally choose to save it.
This is to work around circular dependencies in Traceable FSDP2+AC.
Example:
```
out = fully_shard(utils.checkpoint(module))(x)
norm_out = layer_norm(out)
```
and there is a circular dependency:
1. In backward, grad_input of layer_norm aka. `out_grad` is actually dependent on `out`.
2. `out` depends on `out`'s backward hook created by FSDP2 (which does all-gather for `module` weights) in order to be recomputed.
3. `out`'s FSDP2 backward hook, as is the case for all eager backward hooks, depends on `out_grad` -> circular dependency with (1)!
Solution: check whether `out` has a backward hook, and if so, intentionally save `out` in forward graph outputs. With this, we can break the above circular dependency.
----
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135727
Approved by: https://github.com/Chillee
For tracing cond/while in eager, we trace the HOP with the eager backend with metadata torchfunction mode enabled. HOPs disallow the mutation that occurs in this torch function mode, so it is not able to be traced. As a result, we use a custom backend which enters this mode for tracing these HOPs. Thanks to @ydwu4 for the help with implementing this
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134732
Approved by: https://github.com/ydwu4
For tracing cond/while in eager, we trace the HOP with the eager backend with metadata torchfunction mode enabled. HOPs disallow the mutation that occurs in this torch function mode, so it is not able to be traced. As a result, we use a custom backend which enters this mode for tracing these HOPs. Thanks to @ydwu4 for the help with implementing this
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134732
Approved by: https://github.com/ydwu4
Summary:
Sometimes we only want to generate a replacement for a matched pattern
once we know some information about the nodes in the pattern.
So far, we have found this the most useful to do matches based on specific
shapes of tensors flowing into functions.
Use a callback function similar to `match_filters`. By default this isn't used.
Had to make `replacement` a None-able parameter because Callable was
already used to detect a case where a graph needed to be traced.
Differential Revision: D62412628
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135553
Approved by: https://github.com/SherlockNoMad
Summary: as title
Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r test_conv_dynamic
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:fx -- -r matcher
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r x86
```
CI
Differential Revision: D62448302
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135623
Approved by: https://github.com/tugsbayasgalan
For tracing cond/while in eager, we trace the HOP with the eager backend with metadata torchfunction mode enabled. HOPs disallow the mutation that occurs in this torch function mode, so it is not able to be traced. As a result, we use a custom backend which enters this mode for tracing these HOPs. Thanks to @ydwu4 for the help with implementing this
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134732
Approved by: https://github.com/ydwu4
Summary:
resnet152 spent about 15 minutes writing warning messages in _unlift
during `to_executorch` because they're all written to unbuffered stderr
by the `warnings` module.
These warnings are almost always about get_attr nodes referencing a
non-existent name:
```lang=py
warnings.warn(f'Node {node} target {node.target} {atom} of {seen_qualname} does '
'not reference an nn.Module, nn.Parameter, or buffer, which is '
'what \'get_attr\' Nodes typically target'
)
```
I'm not aware of a way to configure the warnings module to write this out
at most once, so I'm just going to disable the lint for now.
Test Plan:
Re-ran resnet152 with Executorch and the XNNPackBackend, it is much faster now
Differential Revision: D62156090
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135069
Approved by: https://github.com/yushangdi
Internal xref: https://fb.workplace.com/groups/6829516587176185/posts/7705964779531357/
This now also incorporates a test from https://github.com/pytorch/pytorch/pull/133585 (which it fixes) and the prep PR https://github.com/pytorch/pytorch/pull/134407 Including the PR desc from that:
I am trying to fix a problem reported by user in [fb.workplace.com/groups/6829516587176185/permalink/7705964779531357](https://fb.workplace.com/groups/6829516587176185/permalink/7705964779531357/) The summary of this problem is that when we do collect metadata analysis in AOTAutograd, we accumulate pending unbacked symbols which are going to be discarded at the end of the trace. However, if we do a recursive make_fx inside tracing, as occurs with torch.cond, we end up seeing that there are pending unbacked symbols that aren't associated with a binding, even though it's spurious (they've leaked into the inner make_fx call from the outer AOTAutograd analysis).
In https://github.com/pytorch/pytorch/pull/133588 I tried to just prevent adding the symbols to the pending list at all in the first place. But this itself caused some problems which were fixed in https://github.com/pytorch/pytorch/pull/124785 . The problem fixed in that PR is that when we allocate tangents that have unbacked size, something prevented them from having correct unbacked SymInts when ignore fresh unbacked SymInts was enabled. So I had patched it at the time by just not suppressing pending symbols and clearing them out some other way.
I think... I was wrong in that PR? That is to say, it was OK to avoid putting the fresh unbacked symbols in the pending list; the real problem was suppressing unbacked renamings. But there doesn't seem to be a good reason to suppress these; this PR shows that it doesn't actually fail any tests if you do these anyway. Intuitively, this makes sense, because you can't trigger renamings unless you're actually adding unbacked symbols to the pending set.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135053
Approved by: https://github.com/ydwu4
Adds val, and optionally stack_trace & nn_module_stack metadata back to SymInt compute nodes that we CSE, with a hook on `graph.create_node()`. Not sure if there's other metadata we want to populate here?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134516
Approved by: https://github.com/ezyang
This is a bit twisty and I don't entirely understand the situation, but here's my best explanation.
In https://github.com/pytorch/pytorch/pull/133588 I am trying to fix a problem reported by user in https://fb.workplace.com/groups/6829516587176185/permalink/7705964779531357/ The summary of this problem is that when we do collect metadata analysis in AOTAutograd, we accumulate pending unbacked symbols which are going to be discarded at the end of the trace. However, if we do a recursive make_fx inside tracing, as occurs with torch.cond, we end up seeing that there are pending unbacked symbols that aren't associated with a binding, even though it's spurious (they've leaked into the inner make_fx call from the outer AOTAutograd analysis).
In #133588 I tried to just prevent adding the symbols to the pending list at all in the first place. But this itself caused some problems which were fixed in https://github.com/pytorch/pytorch/pull/124785 . The problem fixed in that PR is that when we allocate tangents that have unbacked size, something prevented them from having correct unbacked SymInts when ignore fresh unbacked SymInts was enabled. So I had patched it at the time by just not suppressing pending symbols and clearing them out some other way.
I think... I was wrong in that PR? That is to say, it was OK to avoid putting the fresh unbacked symbols in the pending list; the real problem was suppressing unbacked renamings. But there doesn't seem to be a good reason to suppress these; this PR shows that it doesn't actually fail any tests if you do these anyway. Intuitively, this makes sense, because you can't trigger renamings unless you're actually adding unbacked symbols to the pending set.
But I don't entirely understand all the interactions. I just know that this seems to not cause tests to fail, and it should fix the internal issue (which I need to add a UT for.)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134407
Approved by: https://github.com/ydwu4
Adds val, and optionally stack_trace & nn_module_stack metadata back to SymInt compute nodes that we CSE, with a hook on `graph.create_node()`. Not sure if there's other metadata we want to populate here?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134516
Approved by: https://github.com/ezyang
TLDR; this PR supports exporting cond x inine_inbuilt nn modules flag by inling into tracing code in proxy_tensor.py _symbolic_trace.py (internally, the pattern is make_fx(record_module_stack)(torch.compile(f))).
We have two special treatments for following cases:
1. _ModuleStackTracer will wrap all the nn modules into _AttrProxy. This _AttrProxy has several subtiles which make it hard to inline in dynamo like overriding _modules with a property method and overrides the `__getattr__`, which mutates captured states when calling `__getattr__`.
Solution to this is that we unwrap the _AttrProxy and get its corresponding nn_module (a 1-1 correspondence). So that dynamo symbolically traces the original nn module instead of tracing _AttrProxy.
2. The tracer applies a bunch of patches the `__getattr__` and `__call__` of nn.Module for tracking reasons. This doesn't work well with dynamo. The immediate error we see is `torch._dynamo.exc.Unsupported: 'inline in skipfiles: WeakKeyDictionary.__contains__ | __contains__ /home/yidi/.conda/envs/pytorch/lib/python3.10/weakref.py` caused by a weakdict in PythonKeyTracer.
Solution to this is that we remove the patches during dynamo symbolic convert temporally. So that dynamo has a clean environment. make_fx will be trace the transformed bytecode of dynamo and patches nn modules there instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133731
Approved by: https://github.com/anijain2305
ghstack dependencies: #134775
Summary:
There's 2 concepts of unsupported sympy.Functions in symbolic_shapes:
1) unsupported by the export solver, meaning the solver doesn't know how to provide useful fixes for those functions
2) unsupported by the sympy interpreter - meaning we can't reify them into FX nodes because the functions aren't present in PythonReferenceAnalysis
This splits the current call into a call for each version, with the Export solver the only user of 1). For 1), we enumerate the functions in _sympy/functions.py, and subtract the functions we know we can support. For 2) there's only 3 functions we've seen pop up in test cases.
cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10
Differential Revision: D61863394
Pulled By: pianpwk
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134598
Approved by: https://github.com/angelayi
There's 2 concepts of unsupported sympy.Functions in symbolic_shapes:
1) unsupported by the export solver, meaning the solver doesn't know how to provide useful fixes for those functions
2) unsupported by the sympy interpreter - meaning we can't reify them into FX nodes because the functions aren't present in PythonReferenceAnalysis
This splits the current call into a call for each version, with the Export solver the only user of 1). For 1), we enumerate the functions in _sympy/functions.py, and subtract the functions we know we can support. For 2) there's only 3 functions we've seen pop up in test cases.
Differential Revision: D61677956
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134271
Approved by: https://github.com/avikchaudhuri
Sympy's implementation of Min/Max displays asymptotically bad behavior on `TORCH_COMPILE_CPROFILE=1 python torchrec/distributed/tests/test_pt2_multiprocess.py TestPt2Train.test_compile_multiprocess`. Evidence profile:

On this test case, we spend 42% of all time compiling the network on ShapeEnv.replace, which in turn spends all of its time in xreplace.
The problem appears to be find_localzeros call. By vendoring the implementations of Min/Max, we can potentially reduce the cost of this operation.
The implementation is copy-pasted sympy/functions/elementary/miscellaneous.py but with some adjustments:
* I deleted logic related to differentatiation, evalf and heaviside, as it's not relevant to PyTorch reasoning
* There's some massaging to appease PyTorch's linters, including a lot of noqa and type: ignore (which I could potentially refactor away with substantive changes, but that's better as its own change)
* I deleted the second loop iteration for is_connected, as an attempt at initial optimization (this also simplifies the port, since I can omit some code). I'll comment at that point what the exact difference is.
Before this change, the test in question takes 100s with 40 features; post this change, afterwards, it takes only 69s.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133319
Approved by: https://github.com/Skylion007
Summary:
Today there is no good mechanism to detect progress of non-strict export line-by-line in user code. This caused some pain recently in trying to find the exact line of user code that was triggering a bug where the process appeared stuck because deep down something was calling some symbolic shapes code that was suffering some exponential blowup.
This PR adds a environment variable for extended debugging that will log the line of user code corresponding to every torch function call. It only works in non-strict export for now. Prefix setting this environment variable with `TORCH_LOGS` enabled for `export` logs at `DEBUG` level (i.e., with a `+` prefix), i.e.,.:
```
TORCHEXPORT_EXTENDED_DEBUG_CURRENT_LOC=1 TORCH_LOGS="+export" ...
```
This will show logs with something like:
```
...
prim::device called at .../example.py:4284 in foo
TensorBase.item called at .../example.py:4277 in bar
...
```
We already have an existing place to intercept torch functions where we process data-dependent errors in non-strict, so parking the logging there. An alternative place we could be doing this is where we add `stack_trace` metadata when generating code, but unfortunately at least the example that motivated this gets stuck before generating code, so that would be too late.
Test Plan: ran it on some sample commands
Differential Revision: D61692156
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134298
Approved by: https://github.com/angelayi
Starter version of automatic dynamic shapes for export.
Creates enums `DIM.AUTO`, `DIM.STATIC`, allowing user to specify `AUTO` for dims in dynamic_shapes specs, meaning that corresponding dims are treated as dynamic, and relevant guards will do what's necessary (e.g. refine ValueRanges, set replacements based on equality, or even set static) without raising ConstraintViolationErrors. Basically allows the user to say, "a bunch of these dims can be dynamic, let export do model analysis and return the program with maximum possible dynamism, without complaining".
The usage for specifying `dynamic_shapes` is now:
```
AUTO -> dynamic by default, return whatever produce_guards() says, even if it's static
None/int/STATIC -> static
Dim/DerivedDim -> same as before - will complain if the min/max range is invalid, or if dims related to this are unspecified.
```
Caveat 1: specifying `AUTO` for a dim won't guarantee it'll be dynamic:
- specifying `AUTO` for a dim will return the maximum possible dynamism given your program and other specified constraints, but this can still mean you'll get a static program. For example, with the program below, x is specified dynamic, but it's equal to y, which is specified static, and with how we currently do things we won't promote y to dynamic, but will demote(?) x to static. So this can be surprising if you don't fully know your model, and/or missed one of your other inputs when specifying auto-dynamic shapes.
```
class Foo(torch.nn.Module):
def forward(self, x, y):
return x + y
inputs = (torch.randn(6), torch.randn(6))
export(Foo(), inputs, dynamic_shapes={"x": (DIM.AUTO,), "y": None})
```
Caveat 2: specifying `AUTO` and Dims in the same spec is still problematic:
- The way Dims/DerivedDims are currently handled is very strict. A Dim represents a symbol, and we require a user to specify the symbol for all dims governed by the symbol - that's why we've seen errors in the past like `The values of x must always be related to y by ...`, asking the user to specify the exact relation as in the program. We also require the specified min/max range to be a subset of the valid range from model analysis. All this doesn't compose well with specifying `AUTO` just yet - for example in the program below, ideal behavior could be to return a dynamic program, where `dx = x.size(0) = y.size(0)` has range (3,6). Unfortunately this crashes, and correct behavior is to specify `dx` for both inputs. So currently we raise a UserError and crash if both Dims + `AUTO` are present in the spec.
```
class Foo(torch.nn.Module):
def forward(self, x, y):
return x + y
inputs = (torch.randn(6), torch.randn(6))
export(Foo(), inputs, dynamic_shapes={"x": (DIM.AUTO,), "y": {0: Dim("dx", min=3, max=6)}}) # this doesn't work, because x & y and related
```
Implementation details:
This is done by setting `assume_static_by_default=False`, and doing a transform on the `dynamic_shapes` spec to preserve semantics. `assume_static_by_default=False` will treat unspecified dims or Nones as dynamic. This is the opposite of what `export.export()` currently does - unspecified Dims/Nones are treated as static. Historically this static-by-default behavior, where the user deals with fewer guards, has been desirable, and we would like to respect that in this implementation. So this internal spec transformation is added, `_transform_shapes_for_default_dynamic()`, does the spec conversion necessary to be compatbile with dynamic by default. Specifically, AUTOs are converted into Nones, and Nones/unspecified dims are filled in with explicitly static constraints.
For example, this would look like, for a 3-d tensor: `{0: DIM.AUTO, 1: None, 2: Dim("dx")} -> {0: None, 1: 32, 2: Dim("dx")}`
This does seem overly complicated, but it's done to preserve dynamic shapes semantics for `torch._dynamo.export()`, which already uses `assume_static_by_default=False`, and follows the same process for generating shape constraints , via `_process_dynamic_shapes`. There the semantics are:
```
None/unspecified: dynamic by default
Dim/DerivedDim: also a strict assertion
```
If we don't care about BC for `_dynamo.export(dynamic_shapes)`, then we can just modify semantics for `_process_dynamic_shapes()` and change all the relevant tests in `test/dynamo/test_export.py`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133620
Approved by: https://github.com/avikchaudhuri
Summary: Currently, for sequential mode, minimizer search terminates after a node is excluded via the user defined exclusion_fn. However, on some occasions we would like the search to continue past that for the remaining nodes. In this diff I am changing the termination criteria to respect the find_all setting, where we continue sequential search if it is set.
Test Plan: CI
Differential Revision: D61720262
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134339
Approved by: https://github.com/jfix71
Part of #134054.
This corresponds to the pytorch mypy changes from D61493706. Updating takes so
long and touches so many files that it's impossible to land as a whole without conflicting with some other intermediate change.
So landing these 'type: ignore' for pytorch in advance of them actually being needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134202
Approved by: https://github.com/Skylion007
Summary: When deepcopy a proxy, we first try the default deepcopy behavior.
Test Plan: buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:fx -- -r proxy_deepcopy
Differential Revision: D61398418
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133706
Approved by: https://github.com/angelayi
Summary: Some symbols (unbacked symints?) can have upper bound that is `sys.maxsize - 1` but our code for runtime assertions assumes that such upper bounds would come in as `sympy.oo` (like backed symints?) in order to drop them. So we weren't dropping them, which this PR fixes.
Test Plan: added test
Differential Revision: D61352056
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133627
Approved by: https://github.com/SherlockNoMad
Updating the source matcher to also accept pattern matching on the torch_fn metadata, which exists in both strict and non-strict export. We want to replace the use of source_fn_stack with torch_fn, as it's not possible for us to get source_fn_stack in non-strict export.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133642
Approved by: https://github.com/ydwu4
Summary: Switch to set_proxy_slot instead of set the proxy directly on the Tensor. We do not want to add Proxy to tensor objects, because Proxy cannot be deepcopied or pickeled and can cause problems when users want to deepcopy or pickle models.
Test Plan: CI
Differential Revision: D61277650
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133470
Approved by: https://github.com/zou3519
Fixes#124550
Also moves `graph.eliminate_dead_code()` call to a few lines after
`_inline_module(...)` in `const_fold.py`
* Test plan:
Add a new test on `test_eager_transforms.py` to ensure the reported
issue was indeed fixed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133364
Approved by: https://github.com/zou3519
Summary:
Fixes https://github.com/pytorch/pytorch/issues/133336
When we fail to suggest fixes for a data dependent error because some symbols couldn't be mapped to sources, we print out those symbols but there was a silly bug in the printing code.
New error:
```
...
raise self._make_data_dependent_error(
torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(u0 + 1, CeilToInt(IntTrueDiv(u0 + 1, 1))) (unhinted: Eq(u0 + 1, CeilToInt(IntTrueDiv(u0 + 1, 1)))). (Size-like symbols: u0)
Potential framework code culprit (scroll up for full backtrace):
File "/data/users/avik/fbsource/buck-out/v2/gen/fbcode/6ef5f323b6193f0f/pyspeech/fb/tools/__export_speech_llama__/export_speech_llama#link-tree/torch/_refs/__init__.py", line 2972, in expand
guard_size_oblivious(requested_length == x)
For more information, run with TORCH_LOGS="dynamic"
For extended logs when we create symbols, also add TORCHDYNAMO_EXTENDED_DEBUG_CREATE_SYMBOL="u0"
If you suspect the guard was triggered from C++, add TORCHDYNAMO_EXTENDED_DEBUG_CPP=1
For more debugging help, see https://docs.google.com/document/d/1HSuTTVvYH1pTew89Rtpeu84Ht3nQEFTYhAX3Ypa_xJs/edit?usp=sharing
For C++ stack trace, run with TORCHDYNAMO_EXTENDED_DEBUG_CPP=1
The following call raised this error:
File "/data/users/avik/fbsource/buck-out/v2/gen/fbcode/6ef5f323b6193f0f/pyspeech/fb/tools/__export_speech_llama__/export_speech_llama#link-tree/pyspeech/nn/utils.py", line 271, in lengths_to_padding_mask
).expand(batch_size, max_length)
```
Test Plan: Repro gets past reported error, hits new error
Differential Revision: D61221994
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133345
Approved by: https://github.com/ezyang
Summary: Model owners can set the lower_settings with max_acc_splits=2, and lowering will fail during model iteration, to alert them of possible performance degradation from increased fragmentation.
Test Plan: Added unit tests
Differential Revision: D60133589
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133041
Approved by: https://github.com/hl475
Summary:
Add a special field in Graph and Node level metadata called "custom" which should be mapped to a json-serializable object, and we guarantee this field should be always preversed across the following transformations:
1. copy/deepcopy
2. run_decompositions()
3. serialization
4. re-exporting
Test Plan: :test_export -- -r custom_tag
Reviewed By: angelayi
Differential Revision: D60291839
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131912
Approved by: https://github.com/angelayi
#### Description
Transform quantized operation properly. Add de/quantization before and after the quantized operation.
#### Test Plan
`pytest test/export/test_converter.py -s -k test_ts2ep_convert_quantized_model`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133026
Approved by: https://github.com/angelayi
The goal of this PR is to avoid stack overflow when we create extremely long chains of thunks, and then evaluate them (e.g., as occurs if you sum(long list of symint)). The basic idea behind this PR is to only thunkify proxies if they're being created in places where they may or may not be used--crucially, symint operations that occur in user code we are tracing are eagerly placed into the graph, even if they may eventually be dead.
I annotated the PR with explanation of changes.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132421
Approved by: https://github.com/Skylion007, https://github.com/zou3519
ghstack dependencies: #132674, #132675
Instead of having a separate context variable for SymDispatchMode, we
now simply delegate to the current active proxy tensor mode when we
need to trace a SymInt. We maintain a separate `__sym_dispatch__` magic
method as the calling convention is different than `__torch_dispatch__`.
Consolidating the modes in this ways means that we can consistently
disable both of these modes in tandem simply by removing the mode
from the proxy mode infra slot.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132674
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
#### Description
Transform quantized operation properly. Add de/quantization before and after the quantized operation.
#### Test Plan
`pytest test/export/test_converter.py -s -k test_ts2ep_convert_quantized_model`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131915
Approved by: https://github.com/angelayi
The goal of this PR is to avoid stack overflow when we create extremely long chains of thunks, and then evaluate them (e.g., as occurs if you sum(long list of symint)). The basic idea behind this PR is to only thunkify proxies if they're being created in places where they may or may not be used--crucially, symint operations that occur in user code we are tracing are eagerly placed into the graph, even if they may eventually be dead.
I annotated the PR with explanation of changes.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132421
Approved by: https://github.com/Skylion007, https://github.com/zou3519
ghstack dependencies: #132674, #132675
Summary:
- make default DCE pass check schema,
- need to rebase onto https://github.com/pytorch/pytorch/pull/131651 after it's in phabricator (for now the change is manually added).
- mark Proxy dump as NotImplemented for better error msg
- Remove Proxy from tensors when dumping models, as Proxy cannot be dumped.
More details in https://docs.google.com/document/d/1G5vmTXjzxoyVGRI2kpA1gQukK_Glyg2NrE0Oh6Nlg9A/edit?usp=sharing.
Test Plan:
CI
```
- buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r qat_conv2d
- test_export.py
- buck2 run 'fbcode//mode/dev-nosan' fbcode//modai/test:test_modai -- -r test_qat_stinson_htp_export
- buck2 run 'fbcode//mode/dev-nosan' fbcode//vizard_projects/ml_depth/tests:test_model -- -r test_qat_model_et
- buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:fx -- -r dce
- buck2 run 'fbcode//mode/dev-nosan' fbcode//bolt/nn/executorch/backends/tests:qnn_test -- -r test_qat_bias=False,use_3d_input=False
- buck2 run 'fbcode//mode/dev-nosan' fbcode//bolt/nn/executorch/backends/tests:qnn_test -- -r test_qat_bias=True,use_3d_input=False
- buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_fold_bn_erases_bn_node
```
Reviewed By: angelayi
Differential Revision: D60319175
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132764
Approved by: https://github.com/angelayi
Combines contributions from https://github.com/pytorch/pytorch/pull/130505
Some context can be found in this large comment block:
a5b64d39fd/test/dynamo/test_subclasses.py (L1667-L1681)
Changes in this PR
- For each tensor fakified, check the nested int registry in eager, and eagerly symbolicize if that tensor has already been associated with nested int in eager.
- Adds a separate counter stored on FakeTensorMode as a fake analog to _tensor_id_counter (which keeps track of unique tensors). This counter is initialized to the global eager tensor id counter upon creation of the FakeTensorMode, and needs to be reset when the same FakeTensorMode is reused to trace again (in this PR, we piggyback on the epoch incrementing logic).
- (refactor) Today, we store FakeTensor -> symbolic nested int in the global registry. With this PR, symbolic nested int is stored directly on the FakeTensor. (Eager still caches nested int in the registry, though we should avoid this at some point.)
Basically unchanged, but worth noting:
- `__tensor_unflatten__` is still responsible for determining whether we should cache for now. The logic is somewhat simplified.
- to_copy is still using the trick of updating two different tensors in the registry to point to the same nested int. This is kind of broken, but we try to leave it as is, and plan a better fix with the UnionFind stack.
Differential Revision: [D60406772](https://our.internmc.facebook.com/intern/diff/D60406772)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130292
Approved by: https://github.com/bdhirsh
ghstack dependencies: #131916, #131803
Instead of having a separate context variable for SymDispatchMode, we
now simply delegate to the current active proxy tensor mode when we
need to trace a SymInt. We maintain a separate `__sym_dispatch__` magic
method as the calling convention is different than `__torch_dispatch__`.
Consolidating the modes in this ways means that we can consistently
disable both of these modes in tandem simply by removing the mode
from the proxy mode infra slot.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132674
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
https://github.com/pytorch/pytorch/pull/130775 recently killed forced specializations for export on complex guards, so the only way we now get a specialized value is if we're able to solve for it. For example, if we have guards `s0 * 2 = s1`, `s0 + 6 = s1`, we specialize `s0 = 6; s1 = 12`.
That might look like this:
```
class Foo(torch.nn.Module):
def forward(self, x, y):
return x.reshape([-1]) + y
dy = Dim("dy", min=6)
x, y = torch.randn(6, 2), torch.randn(12)
dynamic_shapes = {
"x": (dy - 6, 2),
"y": (dy,),
}
```
Our current error message is:
`{symbol} must be specialized to {value} because the guards generated for it are too complex`
This is now misleading, so we change it to:
`solving the guards generated for {symbol} resulted in a specialized value of {value}`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132698
Approved by: https://github.com/avikchaudhuri
This is superior to lru_cache because (1) it's more explicit and (2) it
doesn't leak the original function after it's been forced.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132407
Approved by: https://github.com/albanD
This is superior to lru_cache because (1) it's more explicit and (2) it
doesn't leak the original function after it's been forced.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132407
Approved by: https://github.com/albanD
ghstack dependencies: #131649
Some sympy Functions aren't supported by sympy_interp(); we can't turn them into FX nodes, so currently the runtime asserts CSE pass avoids CSE'ing on any expression containing a sympy Function. https://github.com/pytorch/pytorch/pull/132325 started tracking unsupported functions, so we switch the check to that to be more precise. We also check for and skip unsupported functions when adding asserts - previously we only did the check for CSE, and not adding new expressions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132457
Approved by: https://github.com/avikchaudhuri
python_code(verbose=True) (or print_readable()) generates a string with the code representing the fx graph, with extra annotations indicating the size or stride of the tensor. Currently, it'll only shows sizes/strides for FakeTensors provided in metadata. For subclass tensors like NestedTensor, the outer class (provided in the node metadata) will be a non-FakeTensor and the inner tensors will be fake. This PR expands the conditional to show sizes/strides for all tensors, not just FakeTensors.
Testing: I ran this test script (below), ran it with `TORCH_LOGS=+dynamo` and found in the logs the graph shown below - we see that the input nested tensor has sizes and strides associated with it. Also, I stacked a diff on top of this one that forces the readable graph to be generated whenever PT2 is in use in tests, which should hopefully find any issues; https://github.com/pytorch/pytorch/pull/132195 shows no significant failures except for preexisting failures.
test script:
```python
import torch
def fn(x):
return x.cos()
nt = torch.nested.nested_tensor_from_jagged(
torch.randn(10, 10),
torch.tensor([0, 1, 3, 6, 10]),
)
torch.compile(fn)(nt)
```
logs excerpt:
```
[0/0] [__graph_code] TRACED GRAPH
[0/0] [__graph_code] ===== __compiled_fn_1 =====
[0/0] [__graph_code] /data/users/dberard/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.M
[0/0] [__graph_code] def forward(self, L_x_: "f32[4, zf1, 10][10*zf1, 10, 1]cpu", zf1: "Sym(zf1)"):
[0/0] [__graph_code] l_x_ = L_x_
[0/0] [__graph_code]
[0/0] [__graph_code] # File: /data/users/dberard/scripts/nt_print_graph.py:4 in fn, code: return x.c
[0/0] [__graph_code] cos: "f32[4, zf1, 10][10*zf1, 10, 1]cpu" = l_x_.cos(); l_x_ = None
[0/0] [__graph_code] return (cos,)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132192
Approved by: https://github.com/Chillee
Internally there's a model that's using memory_budget with the partitioner, and using custom triton kernels. The partitioner fails when encountering the triton ops because they don't have `meta["val"]`. This PR adds `meta["val"]` to these fx graph nodes and then adds handling for `meta["val"]` being a dict in the partitioner.
Differential Revision: [D60627813](https://our.internmc.facebook.com/intern/diff/D60627813)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132466
Approved by: https://github.com/zou3519
ghstack dependencies: #132356
This breaks the inference we made that if you cat an N-D tensor with a 1-D tensor of size (u0,), the u0 must be zero, but no one really wanted that anyway...
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132060
Approved by: https://github.com/Skylion007
This breaks the inference we made that if you cat an N-D tensor with a 1-D tensor of size (u0,), the u0 must be zero, but no one really wanted that anyway...
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132060
Approved by: https://github.com/Skylion007
ghstack dependencies: #131649, #132407
This is superior to lru_cache because (1) it's more explicit and (2) it
doesn't leak the original function after it's been forced.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132407
Approved by: https://github.com/albanD
ghstack dependencies: #131649
Summary: Currently suggested fixes pick a map from symbols to user variables. However it is possible that many user variables point to the same symbol, and some may be preferred over others. Thus we dump this info as well.
Test Plan: updated test
Sample error with new format:
```
Could not guard on data-dependent expression u2 >= 0 (unhinted: u2 >= 0). (Size-like symbols: none)
<snip>
The following call raised this error:
File "test/export/test_export.py", line 1950, in forward
return r.view(items[0], items[2])
To fix the error, insert one of the following checks before this call:
1. torch._check(items[2] >= 0)
2. torch._check(items[2] < 0)
(These suggested fixes were derived by replacing `u2` with items[2] in u2 >= 0 and its negation.)
```
Differential Revision: D60574478
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132393
Approved by: https://github.com/BoyuanFeng
Summary:
A bunch of issues around support for sympy functions like `TruncToInt` and `ToFloat` are uncovered by https://github.com/pytorch/pytorch/issues/131897. This PR addresses only one of them (as the title suggests). Another issue is deserialization, filed as a task: T197567691.
However the most important issue is that adding runtime assertions is broken right now: specifically, sympy_interp with `PythonReferenceAnalysis` currently doesn't work because the implementations of some of these sympy functions in `PythonReferenceAnalysis` (or falling through to its base class) does not expect proxies. This means things like `math.trunc`, `math.floor`, `round`, etc. don't work, and can be easily repro'd by using them inside `torch._check`, e.g. According to ezyang these implementations need to point to new torch functions that can expect proxies (see how minimum and maximum are implemented, e.g.).
Test Plan: added test (original repro provided)
Differential Revision: D60540951
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132325
Approved by: https://github.com/ezyang
Taking inspiration from `GraphModule.print_readable` (aka I copied its [code](17b45e905a/torch/fx/graph_module.py (L824))), I added a `print_readable` to the unflattened module, because it's kind of nontrivial to print the contents of this module.
Example print from `python test/export/test_unflatten.py -k test_unflatten_nested`
```
class UnflattenedModule(torch.nn.Module):
def forward(self, x: "f32[2, 3]"):
# No stacktrace found for following nodes
rootparam: "f32[2, 3]" = self.rootparam
# File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:99 in forward, code: x = x * self.rootparam
mul: "f32[2, 3]" = torch.ops.aten.mul.Tensor(x, rootparam); x = rootparam = None
# No stacktrace found for following nodes
foo: "f32[2, 3]" = self.foo(mul); mul = None
bar: "f32[2, 3]" = self.bar(foo); foo = None
return (bar,)
class foo(torch.nn.Module):
def forward(self, mul: "f32[2, 3]"):
# No stacktrace found for following nodes
child1param: "f32[2, 3]" = self.child1param
nested: "f32[2, 3]" = self.nested(mul); mul = None
# File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:79 in forward, code: return x + self.child1param
add: "f32[2, 3]" = torch.ops.aten.add.Tensor(nested, child1param); nested = child1param = None
return add
class nested(torch.nn.Module):
def forward(self, mul: "f32[2, 3]"):
# File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:67 in forward, code: return x / x
div: "f32[2, 3]" = torch.ops.aten.div.Tensor(mul, mul); mul = None
return div
class bar(torch.nn.Module):
def forward(self, add: "f32[2, 3]"):
# No stacktrace found for following nodes
child2buffer: "f32[2, 3]" = self.child2buffer
# File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:87 in forward, code: return x - self.child2buffer
sub: "f32[2, 3]" = torch.ops.aten.sub.Tensor(add, child2buffer); add = child2buffer = None
return sub
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128617
Approved by: https://github.com/zhxchen17, https://github.com/pianpwk
```
# Mode to emulate pytorch eager numerics for lower precision (fp16, bf16)
# Pytorch eager computes bf16/fp16 by upcasting inputs to fp32 and downcasting after
# For multiple, fused pointwise nodes, inductor will elide the intermediary upcasts and downcasts
# Typically this should be closer to fp64 ref numerics. However, it can be useful for debugging
# to emulate the eager numerics.
```
We add extra upcasts and downcasts for pointwise nodes that correspond to casts that existed in the original user program (excluding pointwise nodes that are emitted during decomposition). Since this is mostly for debugging, I added this information in the `meta` so that this mode does not have unintended side effects like changing pattern matching.
in theory there could also be some other casts with fused reduction -> reduction, although i havent seen this in practice as much. could be done as follow up. note: only works with cuda backend right now.
This mode was sufficient to eliminate compile differences from https://fb.workplace.com/groups/385893200869952/posts/464263173032954/?comment_id=465199259606012&reply_comment_id=465676792891592.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131595
Approved by: https://github.com/shunting314, https://github.com/bdhirsh, https://github.com/jansel
This PR fixes a bug in `test_correct_module_names` introduced in #130497. It also addresses post-fix test failures in:
* `torch/ao/quantization/__init__.py` - set the correct `__module__` for several public API helpers
* `torch/library.py` - add `register_vmap` to `__all__`
* `torch/nn/attention/flex_attention.py` - make `round_up_to_multiple` private by prepending an underscore
* `torch/storage.py` - introduce `__all__` to avoid `Self` being re-exported as a public API
* `torch/distributed/pipelining/schedules.py` - add `ZeroBubbleAlgorithm` to `__all__`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131386
Approved by: https://github.com/albanD