Summary:
I am really skeptical about Inductor SizeVars creating an empty ShapeEnv when not provided with one.
I think we should fail there if the graph has dynamic shapes and no ShapeEnv is provided.
However, I wonder whether there are actually use cases that depend on the ShapeEnv not being there?
Reasoning APIs depend on facts in the ShapeEnv and assume that certain entries exist for specific symbols.
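A minimal sketch (not the actual patch) of the kind of check being argued for here; names are illustrative:
```
import sympy
from torch.fx.experimental.symbolic_shapes import ShapeEnv

def resolve_shape_env(shape_env, sizes):
    # Refuse to silently fabricate an empty ShapeEnv when the sizes we are asked
    # to reason about are already symbolic.
    has_dynamic = any(isinstance(s, sympy.Expr) and not s.is_number for s in sizes)
    if shape_env is None and has_dynamic:
        raise RuntimeError("graph has dynamic shapes but no ShapeEnv was provided")
    # Only safe to create a fresh (empty) ShapeEnv when everything is static.
    return shape_env if shape_env is not None else ShapeEnv()
```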
Test Plan:
Fix the bug reported in https://www.internalfb.com/diff/D82337184 (creating a simple e2e unit test is not trivial).
Rollback Plan:
Differential Revision: D82412384
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162927
Approved by: https://github.com/ezyang, https://github.com/eellison, https://github.com/jansel
Summary:
When rewriting sympy expressions in the compiler codebase, we want to generate
FloorDiv(a, b) / CleanDiv(a, b) directly and not a//b, since the latter becomes floor(a*pow(b, -1)).
For SymNodes we handle that conversion automatically in the SymNode op dispatch.
I will follow up with an issue to track all other usages of //.
This was blocking an internal model.
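To illustrate the difference (a small sketch; FloorDiv here is the structured node from `torch.utils._sympy.functions`, assuming that import path):
```
import sympy
from torch.utils._sympy.functions import FloorDiv

a, b = sympy.symbols("a b", integer=True, positive=True)

print(a // b)          # sympy rewrites this to floor(a/b), i.e. floor(a*pow(b, -1))
print(FloorDiv(a, b))  # stays as FloorDiv(a, b), which compiler reasoning can pattern-match
```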
Test Plan:
add test
run existing tests.
dakechen1993 testing on the model.
Rollback Plan:
Differential Revision: D82362241
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162869
Approved by: https://github.com/ezyang
Summary:
Sometimes `ShapeEnv.create_symbol` can return a `sympy.Integer`. This messes up our phantom symbol infra for derived dims.
Fixes #161902
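A minimal illustration of the defensive pattern this implies (hypothetical helper, not the actual fix):
```
import sympy

def as_phantom_symbol(expr):
    # create_symbol may hand back a concrete sympy.Integer (e.g. when the dim gets
    # specialized) rather than a sympy.Symbol, so don't assume a Symbol here.
    return expr if isinstance(expr, sympy.Symbol) else None
```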
Test Plan:
added test based on repro
Rollback Plan:
Differential Revision: D81960709
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162416
Approved by: https://github.com/tugsbayasgalan
Summary:
[reland]
Since `allow_complex_guards_as_runtime_asserts` is now sync'd with `prefer_deferred_runtime_asserts_over_guards`, we can kill the former (especially since it was an export-only concept).
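For reference, a sketch of the surviving knob (kwarg name as it exists on ShapeEnv; usage here is illustrative):
```
from torch.fx.experimental.symbolic_shapes import ShapeEnv

# Callers that used to toggle allow_complex_guards_as_runtime_asserts now rely
# on the deferred-runtime-assert preference alone.
env = ShapeEnv(prefer_deferred_runtime_asserts_over_guards=True)
```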
Test Plan:
updated tests
Rollback Plan:
Differential Revision: D81334984
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161794
Approved by: https://github.com/zhxchen17
Summary: Since `allow_complex_guards_as_runtime_asserts` is now sync'd with `prefer_deferred_runtime_asserts_over_guards`, we can kill the former (especially since it was an export-only concept).
Test Plan:
updated tests
Rollback Plan:
Differential Revision: D79903317
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160198
Approved by: https://github.com/ezyang
The motivation for this change can be seen through the following example:
```
import torch

GPU_TYPE = "cuda"

@torch.compile
def no_override(x):
    return x.sum(dim=0)

@torch.compile
def override(x):
    return x.sum(dim=0)

x_small = torch.randn(4096, 512, device=GPU_TYPE)
no_override(x_small)
torch._dynamo.decorators.mark_dynamic(x_small, 0, hint_override=4096 * 1000)
override(x_small)
```
Previously, when reductions were split, codegen relied only on the first observed shape. With a small input, this resulted in a small split size:
```
def triton_red_fused_sum_0(in_ptr0, out_ptr0, ks0, xnumel, r0_numel, XBLOCK : tl.constexpr, R0_BLOCK : tl.constexpr):
    xnumel = 16384
    rnumel = r0_numel
```
With the new scheme, inductor honors hint_override during codegen, producing larger and more appropriate split sizes:
```
def triton_red_fused_sum_0(in_ptr0, out_ptr0, ks0, xnumel, r0_numel, XBLOCK : tl.constexpr, R0_BLOCK : tl.constexpr):
    xnumel = 1024000
    rnumel = r0_numel
```
This addresses a broader problem with dynamism: performance and numerics previously depended on whichever shape was seen first. For example:
```
f(s0) -> f(s2)
f(s1) -> f(s2)
```
could generate different kernels. With the new approach, an explicit override pins the chosen configuration:
```
f(s0, hint_override=s0) -> f(s2)
f(s1, hint_override=s0) -> f(s2)
```
ensuring consistent kernel generation regardless of input order.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161007
Approved by: https://github.com/jansel
Summary: The ONNX team and a recent transformers upgrade ran into this error, and we also hit it during our export benchmarking. This diff makes it possible to trace through the vmap implementation in pre-dispatch IR. Note that we don't support serializing functorch ops in pre-dispatch IR; in the future, we should desugar them to post-grad ops.
The implementation strategy is:
1. We add Python wrappers around the vmap APIs so that we can attach a custom torch_function handler that is only active during non-strict export. The reason we don't add this to the default torch_function handler is that it would break BC.
2. Some dynamo changes to make sure it picks up the new Python wrapper APIs. The reason is that when we do strict export, we need to re-materialize these APIs in pre-dispatch IR from torch IR. We could avoid this by special-casing in dynamo for export to proxy different API calls, but I feel that is too much chaos, because you need to be able to proxy two different variants of the same vmap API.
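A small sketch of the kind of program this unlocks (module and shapes made up for illustration):
```
import torch

class VmapModule(torch.nn.Module):
    def forward(self, x):
        # torch.vmap is now traced through in pre-dispatch export instead of erroring.
        return torch.vmap(torch.sin)(x)

ep = torch.export.export(VmapModule(), (torch.randn(4, 3),))
print(ep.graph)
```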
Test Plan: CI
Differential Revision: D75623875
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154650
Approved by: https://github.com/ezyang, https://github.com/zou3519
The motivation for this change can be seen through the following example:
```
import torch

GPU_TYPE = "cuda"

@torch.compile
def no_override(x):
    return x.sum(dim=0)

@torch.compile
def override(x):
    return x.sum(dim=0)

x_small = torch.randn(4096, 512, device=GPU_TYPE)
no_override(x_small)
torch._dynamo.decorators.mark_dynamic(x_small, 0, hint_override=4096 * 1000)
override(x_small)
```
Previously, when reductions were split, codegen relied only on the first observed shape. With a small input, this resulted in a small split size:
```
def triton_per_fused_sum_1(in_ptr0, out_ptr0, xnumel, r0_numel, XBLOCK : tl.constexpr):
    xnumel = 512
    r0_numel = 32
```
With the new scheme, inductor honors hint_override during codegen, producing larger and more appropriate split sizes:
```
def triton_red_fused_sum_0(in_ptr0, out_ptr0, xnumel, r0_numel, XBLOCK : tl.constexpr, R0_BLOCK : tl.constexpr):
    xnumel = 16384
    r0_numel = 128
```
This addresses a broader problem with dynamism: performance and numerics previously depended on whichever shape was seen first. For example:
```
f(s0) -> f(s2)
f(s1) -> f(s2)
```
could generate different kernels. With the new approach, an explicit override pins the chosen configuration:
```
f(s0, hint_override=s0) -> f(s2)
f(s1, hint_override=s0) -> f(s2)
```
ensuring consistent kernel generation regardless of input order.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161007
Approved by: https://github.com/jansel
When select has a data-dependent index, we can't tell whether the actual index should be index + size or index.
To avoid throwing a data-dependent error (DDE), we allocate a new unbacked symbol to represent the storage offset of the
output view, and we compute its value dynamically at runtime when Inductor lowers it.
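A hedged sketch of the scenario (illustrative repro, not taken from the PR):
```
import torch

torch._dynamo.config.capture_scalar_outputs = True

@torch.compile(fullgraph=True, dynamic=True)
def f(x, idx):
    i = idx.item()       # unbacked SymInt: sign unknown at trace time
    # The storage offset of the view is i*stride or (i + size)*stride depending on
    # the sign of i; a fresh unbacked symbol now carries it until runtime.
    return x.select(0, i)

print(f(torch.randn(8, 4), torch.tensor(-2)))
```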
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157605
Approved by: https://github.com/ColinPeppler
As part of Better Engineering week, we would like to improve our type support to improve the dev experience in dynamo.
This PR adds strict typing support to a critical set of files for dynamo: `source.py` and the base `_guards.py`.
Running
```
mypy torch/_dynamo/source.py torch/_guards.py --linecount-report /tmp/coverage_log
```
| | Lines Annotated | Lines Total | % lines covered | Funcs Annotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main | 1227 | 2208 | 55.57% | 207 | 362 | 57.18% |
| This PR | 2217 | 2217 | 100.00% | 362 | 362 | 100.00% |
| Delta | +990 | +9 | +44.43% | +155 | 0 | +42.82% |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158397
Approved by: https://github.com/anijain2305
Creating contiguous strides creates an expression max(1, x). Often we know that x >= 1, in
which case we should simplify max(1, x) to x.
This appeared in two situations:
1) An internal user complained about statically_known_true(x == max(1, x)) failing (internal link: https://fb.workplace.com/groups/1028545332188949/permalink/1232958568414290).
With this change, https://github.com/pytorch/pytorch/pull/155938 won't be needed.
2) Not simplifying the above could result in wrong ConstraintViolationErrors,
because we assume non-trivial single-arg guards shall evaporate; see the logic in the function
issue_guard in symbolic_shapes.py.
With this change we no longer throw ConstraintViolationErrors for the program below.
This was blocking this [PR](https://github.com/pytorch/pytorch/pull/155590) from landing
internally, due to internal export tests throwing ConstraintViolationErrors
like:
```
Constraints violated (width)!
- Not all values of width = L['x'].size()[3] in the specified range 224 <= width <= 455 satisfy the generated guard max(1, 1 + (((-1) + L['x'].size()[3]) // 2)) == (1 + (((-1) + L['x'].size()[3]) // 2)).
```
```
import torch

x = torch.rand(10)
torch._dynamo.mark_dynamic(x, 0, max=20, min=5)

@torch.compile(fullgraph=True, dynamic=True)
def func(x):
    if max(1, (-1 + x.size()[0] // 2)) == (-1 + x.size()[0] // 2):
        return x * 400
    else:
        return (x * 10) * 100

func(x)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157189
Approved by: https://github.com/pianpwk
Differential Revision: D77249427
Due to memoization and graph order updates, it can happen that a backed symbol is passed into compute_unbacked_bindings, leading to a failure. An example follows:
- There are 2 boolean indexing operators (e.g. op1 and op2) with the same mask.
- An unbacked symint is generated from op1, and then op2 reuses the unbacked symint due to nonzero_memo in nonzero's fake implementation, so no rebinding is needed for op2.
- Since op1 generated the unbacked symint, its meta has the "unbacked_bindings" field filled, while op2's meta doesn't.
- The outputs of op1 and op2 are later concatenated with other tensors that have backed symints, so the unbacked symint can be replaced by a backed symint.
- In Inductor, during fake tensor prop, there is no memoization because a new fake tensor is always generated (for the same node). op1 generates an unbacked symint, which can be rebound successfully to the backed symint. Since there is no memoization, op2 also generates a new unbacked symint, but no rebinding can happen because op2's meta doesn't have "unbacked_bindings", and "compute_unbacked_bindings/_rename_unbacked_to" fails when asserting that op2's old symbol is unbacked.
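A hedged code sketch of the pattern described above (illustrative, not the exact repro):
```
import torch

torch._dynamo.config.capture_dynamic_output_shape_ops = True

@torch.compile
def f(x, mask, y):
    a = x[mask]                  # op1: allocates an unbacked symint for the result size
    b = x[mask]                  # op2: reuses op1's unbacked symint via nonzero_memo
    return torch.cat([a, b, y])  # concat with backed sizes can replace the unbacked symint

print(f(torch.randn(8), torch.rand(8) > 0.5, torch.randn(3)))
```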
From discussion with [@ezyang](https://www.internalfb.com/intern/profile/?id=503862770), there is no easy way to fix this issue.
- We could try to enable memoization for fake tensor prop in Inductor; however, we would need to ensure that op1 is visited before op2 during Inductor fake tensor prop for this to work (op2's meta doesn't have "unbacked_bindings", so no rebinding can happen there and rebinding must come from op1). But passes such as reorder_for_locality can change the graph order, so this doesn't work.
- A simple hack is to just replace the unbacked symbol in op2 by the backed symbol.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156911
Approved by: https://github.com/ezyang
Summary:
- Consolidate the stack trace recording code in TracerBase and PythonKeyTracer
- Change `make_fx`'s arg name to be consistent with TracerBase member name `record_stack_traces`
We move the stack trace logic from `create_proxy` to `create_node` so that all classes inheriting from TracerBase re-use the same stack trace logic.
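A small usage sketch, assuming the renamed kwarg on `make_fx`:
```
import torch
from torch.fx.experimental.proxy_tensor import make_fx

gm = make_fx(lambda x: x.sin() + 1, record_stack_traces=True)(torch.randn(3))
for node in gm.graph.nodes:
    # Stack traces are now recorded in create_node for all TracerBase subclasses.
    print(node.op, node.target, node.stack_trace)
```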
Test Plan:
```
buck run caffe2/test:test_export -- -r test_stack_trace
```
Rollback Plan:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156257
Approved by: https://github.com/angelayi, https://github.com/zou3519
Previously specialization error messages would render sources that were pretty far from source-code names. E.g., given args named `x, y, zs`, the source for `y.size()[0]` would be rendered as `args[0][1].size()[0]`.
This is because we created artificial local names following `(args, kwargs)` structure instead of reusing signatures. This PR fixes that situation.
Basically we map prefixes of key paths that correspond to original arg names to root sources corresponding to those names; the rest of the key paths hang from these root sources.
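A hedged illustration of the rendering change (model, dims, and messages are made up for this example):
```
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x, y, zs):
        return x + y.sum() + zs[0].sum()

args = (torch.randn(2), torch.randn(5), [torch.randn(3)])
# If the dynamic dim on y later specializes, the violation message used to cite
# args[0][1].size()[0]; with this change it cites y.size()[0].
ep = export(M(), args, dynamic_shapes={"x": None, "y": {0: Dim("dy")}, "zs": None})
```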
Differential Revision: [D76461391](https://our.internmc.facebook.com/intern/diff/D76461391/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155738
Approved by: https://github.com/bobrenjc93
Previously, when processing `sym_and(a, b, c)`, symbolic shapes wouldn't individually process a, b, and c and store their implications. This would lead to data-dependent errors on individual checks: e.g., we stored `u0 >= 0 & u0 <= 10`, but then couldn't figure out `u0 <= 10` on its own.
This handles that, and also makes `sym_and`/`sym_or` user-code friendly, for testing.
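A hedged sketch of what now works (using `sym_and` from `torch.fx.experimental.symbolic_shapes`):
```
import torch
from torch.fx.experimental.symbolic_shapes import sym_and

torch._dynamo.config.capture_scalar_outputs = True

@torch.compile(fullgraph=True)
def f(x):
    u0 = x.item()
    torch._check(sym_and(u0 >= 0, u0 <= 10))
    # Each conjunct is stored individually, so this follow-up check is resolved
    # statically instead of raising a data-dependent error.
    if u0 <= 10:
        return torch.zeros(u0)
    return x

print(f(torch.tensor(7)))
```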
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154737
Approved by: https://github.com/laithsakka
Example of the new error message:
```
torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (L['x'].size()[0])! For more information, run with TORCH_LOGS="+dynamic".
- You marked L['x'].size()[0] as dynamic but your code specialized it to be a constant (5). Either remove the mark_dynamic or use a less strict API such as maybe_mark_dynamic or Dim.AUTO.
Framework stack:
File "??", line 0, in _start
File "", line 0, in __libc_start_main_alias_2
File "??", line 0, in __libc_start_call_main
File "/usr/local/src/conda/python-3.10.16/Modules/main.c", line 1094, in Py_BytesMain
File "/usr/local/src/conda/python-3.10.16/Modules/main.c", line 357, in pymain_run_file_obj
File "/usr/local/src/conda/python-3.10.16/Python/pythonrun.c", line 90, in _PyRun_AnyFileObject
File "/usr/local/src/conda/python-3.10.16/Python/pythonrun.c", line 456, in _PyRun_SimpleFileObject
File "/usr/local/src/conda/python-3.10.16/Python/pythonrun.c", line 1208, in pyrun_file
File "/usr/local/src/conda/python-3.10.16/Python/pythonrun.c", line 1312, in run_mod
File "/usr/local/src/conda/python-3.10.16/Python/pythonrun.c", line 1291, in run_eval_code_obj
File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 1134, in PyEval_EvalCode
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/scratch/repro.py", line 9, in <module>
foo(x)
File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/eval_frame.py", line 699, in compile_wrapper
return fn(*args, **kwargs)
File "offloadstuff.c", line 0, in dynamo__custom_eval_frame
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 305, in _PyObject_Call
File "/usr/local/src/conda/python-3.10.16/Objects/typeobject.c", line 7494, in slot_tp_call
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 431, in _PyObject_Call_Prepend
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 1469, in __call__
return self._torchdynamo_orig_callable(
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 112, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 215, in _PyObject_MakeTpCall
File "/usr/local/src/conda/python-3.10.16/Objects/typeobject.c", line 7494, in slot_tp_call
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 431, in _PyObject_Call_Prepend
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 153, in _PyObject_FastCallDictTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 1248, in __call__
result = self._inner_convert(
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 112, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 215, in _PyObject_MakeTpCall
File "/usr/local/src/conda/python-3.10.16/Objects/typeobject.c", line 7494, in slot_tp_call
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 431, in _PyObject_Call_Prepend
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 153, in _PyObject_FastCallDictTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 625, in __call__
return _compile(
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 1092, in _compile
guarded_code = compile_inner(code, one_graph, hooks, transform)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_utils_internal.py", line 97, in wrapper_function
return function(*args, **kwargs)
File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 779, in compile_inner
return _compile_inner(code, one_graph, hooks, transform)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 818, in _compile_inner
out_code = transform_code_object(code, transform)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/bytecode_transformation.py", line 1424, in transform_code_object
transformations(instructions, code_options)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 265, in _fn
return fn(*args, **kwargs)
File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 743, in transform
tracer.run()
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/symbolic_convert.py", line 3531, in run
super().run()
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1359, in run
while self.step():
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1263, in step
self.dispatch_table[inst.opcode](self, inst)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/symbolic_convert.py", line 422, in impl
self.push(fn_var.call_function(self, self.popn(nargs), {}))
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builtin.py", line 1160, in call_function
return handler(tx, args, kwargs)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builtin.py", line 792, in <lambda>
return lambda tx, args, kwargs: obj.call_function(
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builtin.py", line 1160, in call_function
return handler(tx, args, kwargs)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builtin.py", line 1120, in _handle_insert_op_in_graph
return wrap_fx_proxy(tx, proxy)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builder.py", line 2500, in wrap_fx_proxy
return wrap_fx_proxy_cls(target_cls=TensorVariable, **kwargs)
File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 267, in PyVectorcall_Call
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builder.py", line 2566, in wrap_fx_proxy_cls
return _wrap_fx_proxy(
File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builder.py", line 2664, in _wrap_fx_proxy
example_value = get_fake_value(proxy.node, tx, allow_non_graph_fake=True)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/utils.py", line 3205, in get_fake_value
ret_val = wrap_fake_exception(
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/utils.py", line 2705, in wrap_fake_exception
return fn()
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/utils.py", line 3206, in <lambda>
lambda: run_node(tx.output, node, args, kwargs, nnmodule)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_dynamo/utils.py", line 3373, in run_node
return node.target(*args, **kwargs)
File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5917, in do_call_core
File "/usr/local/src/conda/python-3.10.16/Objects/methodobject.c", line 430, in cfunction_vectorcall_FASTCALL
File "/usr/local/src/conda/python-3.10.16/Objects/abstract.c", line 891, in binary_op1
File "/usr/local/src/conda/python-3.10.16/Objects/typeobject.c", line 7284, in slot_nb_multiply
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Objects/descrobject.c", line 344, in method_vectorcall_VARARGS_KEYWORDS
File "python_variable_methods.cpp", line 0, in _object* torch::autograd::TypeError_to_NotImplemented_<&torch::autograd::THPVariable_mul>(_object*, _object*, _object*)
File "python_variable_methods.cpp", line 0, in torch::autograd::THPVariable_mul(_object*, _object*, _object*)
File "??", line 0, in at::_ops::mul_Tensor::call(at::Tensor const&, at::Tensor const&)
File "offloadstuff.c", line 0, in c10::impl::BoxedKernelWrapper<at::Tensor (at::Tensor const&, at::Tensor const&), void>::call(c10::BoxedKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&)
File "PyInterpreter.cpp", line 0, in torch::detail::(anonymous namespace)::ConcretePyInterpreterVTable::python_dispatcher(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const
File "offloadstuff.c", line 0, in c10::OperatorHandle::callBoxedForDispatchKey(c10::DispatchKey, std::vector<c10::IValue, std::allocator<c10::IValue> >&) const
File "PythonFallbackKernel.cpp", line 0, in void c10::BoxedKernel::make_boxed_function<&(anonymous namespace)::pythonTLSSnapshotFallback>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*)
File "PyInterpreter.cpp", line 0, in torch::detail::(anonymous namespace)::ConcretePyInterpreterVTable::python_dispatcher(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const
File "offloadstuff.c", line 0, in c10::OperatorHandle::callBoxedForDispatchKey(c10::DispatchKey, std::vector<c10::IValue, std::allocator<c10::IValue> >&) const
File "VariableType_0.cpp", line 0, in c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&), &torch::autograd::VariableType::(anonymous namespace)::mul_Tensor>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, at::Tensor const&> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*)
File "VariableType_0.cpp", line 0, in torch::autograd::VariableType::(anonymous namespace)::mul_Tensor(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&)
File "??", line 0, in at::_ops::mul_Tensor::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&)
File "offloadstuff.c", line 0, in c10::impl::BoxedKernelWrapper<at::Tensor (at::Tensor const&, at::Tensor const&), void>::call(c10::BoxedKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&)
File "PyInterpreter.cpp", line 0, in torch::detail::(anonymous namespace)::ConcretePyInterpreterVTable::python_dispatcher(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const
File "offloadstuff.c", line 0, in c10::OperatorHandle::callBoxedForDispatchKey(c10::DispatchKey, std::vector<c10::IValue, std::allocator<c10::IValue> >&) const
File "PythonFallbackKernel.cpp", line 0, in (anonymous namespace)::pythonFallback(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*)
File "PyInterpreter.cpp", line 0, in torch::detail::(anonymous namespace)::ConcretePyInterpreterVTable::dispatch(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const
File "??", line 0, in torch::handle_torch_function_no_python_arg_parser(c10::ArrayRef<_object*>, _object*, _object*, char const*, _object*, char const*, torch::TorchFunctionName)
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 577, in PyObject_CallMethod
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/utils/_stats.py", line 27, in wrapper
return fn(*args, **kwargs)
File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_tensor.py", line 1346, in __torch_dispatch__
return self.dispatch(func, types, args, kwargs)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_tensor.py", line 2029, in dispatch
return self._cached_dispatch_impl(func, types, args, kwargs)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_tensor.py", line 1442, in _cached_dispatch_impl
return self._dispatch_impl(func, types, args, kwargs)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_tensor.py", line 2552, in _dispatch_impl
return maybe_propagate_real_tensors(fast_impl(self, *args, **kwargs))
File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_impls.py", line 956, in fast_binary_impl
final_shape = infer_size(final_shape, shape)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_impls.py", line 916, in infer_size
torch._check(
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/__init__.py", line 1669, in _check
_check_with(RuntimeError, cond, message)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/__init__.py", line 1632, in _check_with
if expect_true(cond):
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 1686, in expect_true
return a.node.expect_true(
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/fx/experimental/sym_node.py", line 552, in expect_true
return self.guard_bool(file, line)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/fx/experimental/sym_node.py", line 536, in guard_bool
r = self.evaluate()
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/fx/experimental/sym_node.py", line 510, in evaluate
return self.shape_env.evaluate_sym_node(self, size_oblivious)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 7113, in evaluate_sym_node
return self.evaluate_expr(
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 112, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 215, in _PyObject_MakeTpCall
File "/usr/local/src/conda/python-3.10.16/Modules/_functoolsmodule.c", line 1020, in bounded_lru_cache_wrapper
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 267, in PyVectorcall_Call
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/fx/experimental/recording.py", line 272, in wrapper
return retlog(fn(*args, **kwargs))
File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 267, in PyVectorcall_Call
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 7215, in evaluate_expr
return self._inner_evaluate_expr(
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 112, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 215, in _PyObject_MakeTpCall
File "/usr/local/src/conda/python-3.10.16/Modules/_functoolsmodule.c", line 1020, in bounded_lru_cache_wrapper
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/fx/experimental/recording.py", line 272, in wrapper
return retlog(fn(*args, **kwargs))
File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 7238, in _inner_evaluate_expr
return self._evaluate_expr(
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 7505, in _evaluate_expr
self._maybe_guard_rel(g)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 112, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 215, in _PyObject_MakeTpCall
File "/usr/local/src/conda/python-3.10.16/Modules/_functoolsmodule.c", line 1020, in bounded_lru_cache_wrapper
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 6758, in _maybe_guard_rel
self._refine_ranges(expr)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 7709, in _refine_ranges
self._set_replacement(
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 6667, in _set_replacement
self.framework_specialization_stacks[source] = CapturedTraceback.extract(cpp=True)
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame
File "/home/bobren/local/a/pytorch/torch/utils/_traceback.py", line 207, in extract
torch._C._profiler.gather_traceback(python=True, script=script, cpp=cpp),
File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 112, in _PyObject_VectorcallTstate
File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 215, in _PyObject_MakeTpCall
File "/usr/local/src/conda/python-3.10.16/Objects/methodobject.c", line 543, in cfunction_call
File "offloadstuff.c", line 0, in pybind11::cpp_function::dispatcher(_object*, _object*, _object*)
File "offloadstuff.c", line 0, in pybind11::cpp_function::initialize<std::shared_ptr<torch::CapturedTraceback> (*&)(bool, bool, bool), std::shared_ptr<torch::CapturedTraceback>, bool, bool, bool, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v>(std::shared_ptr<torch::CapturedTraceback> (*&)(bool, bool, bool), std::shared_ptr<torch::CapturedTraceback> (*)(bool, bool, bool), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&)
File "??", line 0, in torch::CapturedTraceback::gather(bool, bool, bool)
File "??", line 0, in torch::unwind::unwind()
User stack:
File "/home/bobren/local/a/pytorch/scratch/repro.py", line 5, in foo
return torch.randn(5) * x
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155603
Approved by: https://github.com/zou3519, https://github.com/cyyever
ghstack dependencies: #155133
Summary:
This backs out D60320595 which itself turned off FakeTensor caching when a SymInt was present.
There have been a lot of dynamic shape fixes done this year, and tests pass, so I'm assuming some of that work fixed what was breaking previously.
Test Plan: Reran the tests listed in T196779132 and they pass.
## Perf
### Instruction Counter Benchmark:
- 26% win on add_loop_eager_dynamic
- 13% win on add_loop_inductor_dynamic_gpu
### Perf Dashboard
Compilation Latency wins across the board but especially strong on the dynamic tests (like cudagraphs_dynamic) - for example MobileBertForMaskedLM went from 66s -> 50s.
Differential Revision: [D75467694](https://our.internmc.facebook.com/intern/diff/D75467694)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152662
Approved by: https://github.com/anijain2305
The goal of this multigraph work is to enable a compiled region that has a single dynamo trace but multiple backend specializations. This work was inspired by vLLM which does this in a somewhat hacky way where they use a custom backend to capture a dynamo graph and then manually invoke compile_fx multiple times to get specialized graphs.
There's really two parts of this work:
**The frontend changes:**
1) we introduce an optional kwarg `specialize_on` to mark_{dynamic,unbacked} that takes in a list of specializations. I debated other methods including specifying specializations via decorators, but ultimately decided this approach was more harmonious. The big issue with decorators is the difficulty of composing well with the rest of the torch.compile ecosystem including graph breaks, lazy initialization of variable trackers and symbolic variables, etc.
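A hedged sketch of that frontend (kwarg and callable form as described here; details may differ):
```
import torch

@torch.compile
def f(x):
    return x * 2

x = torch.randn(16)
# One dynamo trace; the backend may additionally compile graphs specialized to each
# predicate, dispatching lazily to whichever matches at runtime.
torch._dynamo.mark_dynamic(
    x,
    0,
    specialize_on=[lambda s: s == 16, lambda s: s % 16 == 0],
)
f(x)
```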
**The backend changes (this PR):**
1) We capture the backend_specialization specified in the mark_{dynamic,unbacked} API into a SymbolicContext. See changes in `/_dynamo/variables/builder.py`
2) After we are done dynamo tracing, we will lazily (more on this later) invoke `call_user_compiler` up to N + 1 times for N specializations and 1 generic graph. Under the hood this will call compile_fx, which composes nicely with both Async Compile and AOTAutogradCache. We do this by using a context manager to patch in specialization specific axioms into the ShapeEnv before invoking the user compiler.
3) When we have specializations, we install a lazy specialized dispatch function that checks each specialization and dispatches to the first one that matches. Instead of doing all of the specialization compiles up front, we do the compiles lazily. The first time a specialization is invoked, we will do the compilation and save it in a cache so subsequent invocations are fast. If none of the specializations match, we dispatch to the generic graph. I decided to do this over returning N different GuardedCodes since 1) it doesn't pollute the dynamo cache (eg. if you have 8 specializations, you would hit the cache limit) 2) it naturally incorporates the hierarchical lattice structure of the guards since the specializations are always necessarily stricter than the generic region's guards.
I benchmarked this PR stack with #152596 and found around a 50% reduction when dispatching to the specialized regions:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153449
Approved by: https://github.com/zou3519
ghstack dependencies: #153433