Add dynamic shapes doc (#159428)

This PR adds new Dynamic Shapes documentation and expands on the existing one.
- Adds a new structure with Intro, Core Concepts, Troubleshooting

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159428
Approved by: https://github.com/bobrenjc93

Co-authored-by: bobrenjc93 <bobren@meta.com>
This commit is contained in:
Svetlana Karslioglu 2025-09-22 21:01:24 +00:00 committed by PyTorch MergeBot
parent 8abc2af9b9
commit 8e62d01f7a
21 changed files with 1292 additions and 146 deletions

View File

@ -0,0 +1,239 @@
(dynamic_shapes_advanced_control_options)=
# Advanced Options to Control Dynamic Behavior
PyTorch provides several advanced options to control dynamic behavior.
These options require a deep understanding of PyTorch internals and
may involve setting up additional tooling. These options include:
* Profile-Guided Optimization (PGO) is a technique that allows the compiler
to save automatic dynamic decisions and reuse them across jobs.
* Compiler Collective is a feature that is used to modify automatic dynamic
shapes behavior by inferring if an input is dynamic based on whether
its size varies across ranks.
## Profile-Guided Optimization (PGO)
Profile-Guided Optimization (PGO) enhances automatic dynamic by sharing profiling decisions across runs of your model. Specifically, it serializes all the choices made by automatic dynamic into a file on disk. You can then copy this file—or store it in a centralized metadata service like S3—and reuse it on other machines to ensure consistent behavior across environments.
For the rest of this tutorial, you can turn on PGO locally with the following environment variables: `TORCH_COMPILE_JOB_ID=1 TORCH_DYNAMO_AUTOMATIC_DYNAMIC_LOCAL_PGO=1`.
(identifying-dynamic-elements-marked-by-pgo)=
### Identifying Dynamic Elements Marked by PGO
Use `tlparse` to find line numbers of interest and check for multiple values
seen for inputs.
To determine which elements are marked as dynamic by Profile-Guided Optimization (PGO),
follow these steps using `tlparse`:
1. In the `tlparse` output, identify the line number of the frame of interest. Example:
```{image} ../_static/img/dynamic_shapes/tlparse4_pgo.png
```
2. Open `local_code` using `put_local_code_state_` or `put_remote_code_state_` for the
latest frame (for example, 6/1).
Each `?` indicates that multiple values have been observed for this input.
For instance, the following output shows that the input `L['m']` has been seen with
multiple sizes at `size[0]`, but the stride has consistently been 1:
```
/data/users/bobren/a/pytorch/r2.py:2:func:
L['m']: fully dynamic scalar or tensor
L['x']: tensor size=[?] stride=[1]
L['y']: tensor size=[?] stride=[1]
L['z']: tensor size=[?] stride=[1]
```
```{note}
If an element is marked as dynamic by PGO, it does not guarantee that it will remain dynamic in the graph. Specialization can revert it to a static state.
```
## Compiler Collective
Different ranks can communicate with each other to share observed sizes. In the second
iteration, automatic dynamic uses this information to determine which elements to mark
as dynamic based on inputs seen across all ranks. Check this [PR](https://github.com/pytorch/pytorch/pull/130935) for more details.
To enable this feature, use `enable_compiler_collectives=True` with the `@config.patch`
decorator.
```python
@config.patch(enable_compiler_collectives=True)
```
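A minimal sketch of how this decorator might be applied is shown below; the `config` import alias, the model, and the step function are illustrative, and a distributed job where all ranks reach compilation together is assumed:
```python
import torch
import torch._dynamo.config as config

model = torch.nn.Linear(8, 8)
compiled_model = torch.compile(model)

@config.patch(enable_compiler_collectives=True)
def run_step(batch):
    # Compilation happens on the first call; every rank must reach it at the same time
    # so the compiler collective can exchange observed sizes.
    return compiled_model(batch)
```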
```{note}
This feature enables the use of collectives during compilation to
synchronize behavior across ranks. Currently, it is used to modify
automatic dynamic shapes behavior by inferring if an input is dynamic
based on whether its size varies across ranks. Since this synchronization
uses collectives, all ranks must run compilation simultaneously; ranks must
not diverge with graph breaks. This is most reliably achieved by ensuring
torch is only run on SPMD programs. Violating this invariant may result in
deadlocking NCCL and encountering a NCCL timeout.
```
## Reducing Compilations: Step by Step
If you have a model that you can run on your master job and a `tlparse` output,
here's what you should do next:
### Step 1: Mark Dynamic Elements
The first step is to reduce initial compilations that are eventually optimized away
by automatic dynamic or PGO. This is straightforward because we know it will work
upfront. If, in one run, a frame starts with static graphs and converges to
dynamic graphs, and if you notice a reduction in the number of compiled
frames in a second (warm) PGO-enabled run, it's likely due to this optimization.
This is a two-step process:
1. Find elements marked as dynamic by PGO or automatic dynamic.
2. Mark them as dynamic using one of the {ref}`user_annotations` (see the sketch below).
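The following is a minimal sketch of the second step, assuming PGO showed that dimension 0 of an input tensor varies across runs; the function, tensor name, and sizes are illustrative:
```python
import torch

@torch.compile
def forward(x):
    return x * 2

x = torch.randn(8, 16)
# Mark dim 0 as dynamic up front so the first compilation is already dynamic.
torch._dynamo.mark_dynamic(x, 0)
forward(x)
forward(torch.randn(32, 16))  # reuses the dynamic graph instead of recompiling
```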
#### How to Identify Elements to Mark as Dynamic
Follow these guidelines:
1. **PGO artifact:** Follow the steps in {ref}`identifying-dynamic-elements-marked-by-pgo`.
2. **Dynamic Logs:** If you have a run with `TORCH_LOGS="+dynamic"`, each
time a new dynamic dimension is allocated, a debug line will specify it
along with the input name.
3. **Compare Graphs:** For frames with reduced compilations across runs,
inspect the Dynamo graphs in the second run or the latest runs in the
cold run. Look for elements marked as dynamic in those graphs. Specifically,
find graphs that are similar (once specialized and once dynamic).
Even without a warm run, you can inspect all graphs for a specific frame
to see if some are similar and converge to a dynamic version.
For example, in the following `tlparse` snapshot, Dynamo graphs 20/0,
20/1, and 20/2 are similar except for different sizes (for example,
graph 20/0 vs. graph 20/2). In the Dynamo graph of 20/2, sizes `s0`,
`s1`, and `s5` are used for `rotary_pos_emb_` and `x`.
```{image} ../_static/img/dynamic_shapes/tlparse5_dynamic_shapes.png
```
```{tip}
Two graphs are considered similar if they have the same sequence of calls for
torch operations and the same tensor inputs. Variations may exist in integer
inputs that could be inlined in the specialized version or arithmetic
computations that only exist in the dynamic version due to inlining in the
static version.
```
### Step 2: Debug and Identify Missed Opportunities
The complexity of debugging can vary greatly depending on the issues you
encounter. The end result is often to find a bug, enable a flag, or modify
user/framework code.
#### Finding Similar Graphs
Start by identifying a group of similar graphs that you might want to combine
into one dynamic graph, as discussed in the previous section on comparing
graphs. If you can't find any similar graphs, there's nothing further to do
in this step.
#### Quick Checks: Fail Fast
After finding similar graphs, you want to understand why they recompile.
Check the following:
1. **Check Recompile Reasons:** For graphs you believe are similar, click on
`recompile_reason` in the `tlparse` output for the later graph. Ensure the
reason is size-related and not due to other factors. For example, while
in this screenshot the recompile reason is size-related:
```{image} ../_static/img/dynamic_shapes/tlparse6_size_related_recompilations.png
```
In the one below it is not, which indicates that dynamic shapes won't resolve it:
```{image} ../_static/img/dynamic_shapes/tlparse7_not_size_related_recompilations.png
:width: 500px
:align: center
```
2. **Compare Guards Files:** Ensure there are no guards on non-size-related
elements that exist in one graph but not the others.
3. **Early Check for Custom Triton Kernels:** Check if your model calls custom
Triton kernels with `tl.constexpr` arguments, as these are always
specialized. If your model receives different values for these arguments,
it could be a source of recompilation.
## Identifying and Fixing Recompilation Causes
1. **Is Something Not Marked Dynamic but Should Be?** Determine if an input was
marked dynamic and got specialized or was not marked dynamic at all. You can
identify this by:
* Checking the Dynamo graph - look for `Sym(number)`. For example:
```
Sym(256) vs Sym(s0)
```
* Using dynamic logs:
```
["TORCH_LOGS=+dynamic"]
create_symbol s2 = 2 for L['self']._modules['cle ...
```
* Reviewing guards files. If a tensor size is dynamic, it will be indicated as `None`:
```
TENSOR_MATCH:check_tensor(L['self'].x._parameters['weight']], Parameter, DispatchKeySet(CPU, BackendSelect, ADInplaceOrView, AutogradCPU), torch.float32, device=None, requires_grad=True, size=[None, None], stride=[None, 1])
```
2. **Why Is It Not Marked Dynamic?** If you determine an element is not marked dynamic, consider:
* Checking if it's an `nn` module property, parameter, or field. Verify the settings of these flags:
* `force_parameter_static_shapes = True`
* `force_nn_module_property_static_shapes = True`
* `allow_unspec_int_on_nn_module = False`
* Or using the dynamic allowlist to mark it dynamic, which takes the highest priority.
```{tip}
Marking elements one by one can be time-consuming. Initially, flip the flags to
identify any blocking specializations, then decide how to mark them
dynamic at the end of the process.
```
* If you think it could be a bug, please file a bug report and tag it
with the `module: dynamic shapes` label. Check the list of known issues in
[this list](https://github.com/pytorch/pytorch/issues?q=sort%3Aupdated-desc+state%3Aopen+label%3A%22module%3A+dynamic+shapes%22).
3. **Is a Dynamic Element Getting Specialized?** Determine why it is specialized.
It could be due to user code (such as an `if` condition), framework code, or a
call to a Triton kernel. To identify the reason for specialization:
* **Using tlparse:** Check the `compilation_metrics` for a specialization section, which will indicate what got specialized and the user and framework stack when it happened. Example:
```{image} ../_static/img/dynamic_shapes/tlparse8_compilation_metrics.png
```
The log above indicates that `s0` is specialized to `33` due to the following code:
```
if self.x == 33   # example4.py, line 16
```
* **+Dynamic Logs:** pass `["TORCH_LOGS=+dynamic"]`. Look for the first specialization, as once a variable is specialized, all dependent variables get specialized too.
Example log:
```
torch/fx/experimental/symbolic_shapes.py:6557] [0/2] eval Eq(s0, 33) [guard added] if self.x ==33: # example4.py:16 in forward (_dynamo/variables/tensor.py:1242 in evaluate_expr), for more info run with TORCHDYNAMO_EXTENDED_DEBUG_GUARD_ADDED="Eq(s0, 33)"
V0228 12:04:24.190000 2990033 torch/fx/experimental/symbolic_shapes.py:6000] [0/2] _update_var_to_range s0 = VR[33, 33] (update)
```
The log above indicates that `s0` is specialized to `33` due to the following code:
```
if self.x == 33   # example4.py, line 16
```

View File

@ -0,0 +1,45 @@
(backed-vs-unbacked-symints)=
# Backed vs Unbacked Symints
Backed `SymInts` are symbolic integers that have a concrete value or "hint"
associated with them. This means that torch can use these values to make
decisions about control flow, such as determining which branch of code
to execute. They are typically derived from operations where the size or
value is known or can be inferred.
Unbacked `SymInts` are symbolic integers that do not have a concrete value or
hint. They often arise from data-dependent operations, such as `.nonzero()`
or `.item()`, where the size or value cannot be determined at compile time.
Since they lack a concrete value, they cannot be used for control flow
decisions, and attempting to do so requires a graph break.
Unbacked `SymInts` use *size-oblivious reasoning*, which is particularly
useful when you are dealing with
{ref}`0/1 specialization recompilation problem <zero-one-specialization>`.
In summary, backed `SymInts` have known values that can be used for
decision-making, while unbacked `SymInts` do not, requiring special handling
to avoid graph breaks.
Unbacked symbolic integers can be too restrictive, causing most PyTorch programs
to fail. To address this, you can use the following methods and APIs as a
workaround:
* Use higher-level APIs like `empty` instead of `empty_strided` to create tensors.
This guarantees the tensor is non-overlapping and dense, avoiding unnecessary stride
sorting, guard creation, and recomputation of these properties.
* Modify your code to make precomputed properties *lazy*. This ensures that
guards on unbacked symbolic integers are only applied when necessary,
reducing computational overhead.
## How to Use Unbacked SymInts
To use unbacked APIs, replace `mark_dynamic` with `mark_unbacked` and
`TORCH_COMPILE_DYNAMIC_SOURCES` with `TORCH_COMPILE_UNBACKED_SOURCES`.
This tells the compiler to treat an input as unbacked.
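A minimal sketch of the swap, using an illustrative function and input tensor:
```python
import torch

@torch.compile
def f(x):
    return x * 2

x = torch.randn(5)
# Instead of torch._dynamo.decorators.mark_dynamic(x, 0):
torch._dynamo.decorators.mark_unbacked(x, 0)  # dim 0 has no hint and no 0/1 specialization
f(x)
```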
```{seealso}
* {ref}`dynamic_shapes`
* {ref}`torch.export`
* {ref}`what_is_a_specialization`
```

View File

@ -0,0 +1,10 @@
(dynamic_shapes_beyond_the_basics)=
# Beyond the Basics
This section covers advanced topics related to dynamic shapes, including a deeper look at how dynamic shapes work, the 0/1 specialization problem, and more.
```{toctree}
:maxdepth: 1
dynamic_shapes_zero_one_specialization
dynamic_shapes_backed_unbacked
```

View File

@ -0,0 +1,134 @@
(dynamic_shapes_core_concepts)=
# Dynamic Shapes Core Concepts
This section describes the core concepts of dynamic shapes in PyTorch. It is intended to be a
reference for engineers working on the PyTorch compiler stack and anyone who wants to understand
the inner workings of dynamic shapes.
## Symbolic integers
Symbolic integers (SymInts) are used to represent variables that can span a range. For example:
```python
import torch

x = torch.randn(5, 5)                        # this tensor has shape [5, 5]
torch._dynamo.decorators.mark_dynamic(x, 0)  # dim 0 is now symbolic: shape [s0, 5]
y = torch.cat([x, x], dim=0)                 # this tensor has shape [2*s0, 5]
```
However, `z = x * y` would throw an error, since pointwise operations like multiply must
operate on same-sized tensors, and we know statically that `s0 != 2 * s0`. Astute readers may point out
that this is not true when `s0 == 0`; the reason that does not matter here is described in
{ref}`zero-one-specialization`.
## Guards
In `torch.compile`, a guard is a mechanism used to ensure that a compiled graph remains valid for new inputs.
By default, when you make a variable dynamic, it can range from `[-inf, inf]`. For example:
```python
def foo(x):
    return x / 2
```
This works for any dynamic `x`. But if your code is:
```python
def foo(x):
    if x > 5:
        return x / 2
    return x / 3
```
If you call `foo(6)`, it returns `x / 2` and adds a guard `x > 5`. Calling `foo(4)` later will
require recompilation because the guard is broken.
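As a rough, runnable sketch of this behavior (using a tensor size rather than a plain integer, and a hypothetical function name), running the following with `TORCH_LOGS=recompiles` shows the size guard being added and then broken:
```python
import torch

@torch.compile
def foo(x):
    if x.shape[0] > 5:
        return x / 2
    return x / 3

a = torch.randn(6)
torch._dynamo.mark_dynamic(a, 0)   # compile with a symbolic dim 0
foo(a)                             # adds a guard: size(0) > 5
foo(torch.randn(4))                # guard fails, so this call recompiles
```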
## Runtime Asserts
You can use runtime asserts to provide hints when you know certain facts, like batch size being less than 100:
```python
def foo(batch_size):
    torch._check(batch_size < 100)
    if batch_size < 100:
        return do_something()        # do_something / do_something_else are placeholders
    return do_something_else()
```
## "Hint" Value
A "hint value" in the context of `torch.compile` refers to the actual values known during the compilation process that help the JIT compiler make decisions about expressions. Hint values are particularly useful for handling dynamic shapes, as they provide concrete information that guides the compilation without requiring recompilation for varying dimensions.
## Dynamic Behavior Overview
PyTorch assumes static shapes by default. When a size change is detected, it attempts to
recompile with dynamic input, although this may fail if there are conditional branches
or missing support for dynamic shapes. To diagnose overspecialization, you can set
`TORCH_LOGS=dynamic` to view "eval" entries that indicate when and why guards are added.
If you anticipate a dimension will be dynamic, you can use `torch._dynamo.mark_dynamic(tensor, dim)`
to mark it in advance, specifying `min` and `max` values if known. Using `torch.compile(dynamic=False)`
disables automatic dynamic shapes, leading to recompilation for each unique size. Conversely,
`torch.compile(dynamic=True)` aims to use dynamic shapes as much as possible, which is most useful
for small operators and may not be suitable for large models due to potential crashes or performance issues.
You can whitelist specific sources to be marked as dynamic using the `TORCH_COMPILE_DYNAMIC_SOURCES` environment variable or `torch.compiler.config.dynamic_sources`. This is particularly useful for large
models with graph breaks, as you can maintain dynamism across graph breaks since
source names stay consistent. You can also use this to mark integers as dynamic. The format is a comma-delimited list of source names, for example, `"L['x'], L['y']"`.
You can also use regexes, for example, `"L\['x.*'\], L\['y.*'\]"`.
This whitelist takes precedence over other flags like `dynamic=False`, `force_nn_module_property_static_shapes`, and `force_parameter_static_shapes`.
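A minimal sketch of the allowlist approach (the function and source names are illustrative):
```python
import torch

# Equivalent to setting TORCH_COMPILE_DYNAMIC_SOURCES="L['x'], L['y']"
torch.compiler.config.dynamic_sources = "L['x'], L['y']"

@torch.compile
def f(x, y):
    return x + y

f(torch.randn(4), torch.randn(4))
f(torch.randn(8), torch.randn(8))  # new sizes reuse the dynamic graph
```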
Sometimes it can be cumbersome to find the right inputs to mark as dynamic. If
you're willing to take a performance hit for the first batch, another affordable
option is the `eager_then_compile` stance, which derives dynamism for you.
See {func}`torch.compiler.set_stance` for more details.
## Overall Architecture
Symbolic shapes workflow:
1. When compiling a frame in Dynamo, we allocate a `ShapeEnv` (attached to `FakeTensorMode`) to
track symbolic shapes.
2. We allocate symbolic sizes for tensors on entry, based on policy decisions.
3. We propagate symbolic sizes through operators, maintaining both FX IR for symbolic compute export
and Sympy expressions for reasoning.
4. We add guards based on conditionals during Dynamo tracing or Inductor optimization, induced from both Python and C++.
5. Guards can simplify symbolic variables. For instance, asserting `s0 == 4` allows replacing all occurrences of `s0` with `4`.
6. After tracing and optimizing, we install all guards with the compiled code, ensuring reusability only if all guards evaluate true.
## Internal API Class Hierarchy
### Python Classes
- **`SymInt`/`SymFloat`/`SymBool`**: User-visible classes that simulate their `int`/`float`/`bool` counterparts. Adding two `SymInts` produces a new `SymInt` that symbolically tracks the integer addition.
- **`SymNode`**: Internal structure (accessible via `symint.node`) that holds actual symbolic tracking information. `SymNode` is type-erased, making it convenient to represent mixed-type operations.
- **`ShapeEnv`**: Per-compile context state that tracks all free symbols and guards accumulated so far. Every `SymNode` records its `ShapeEnv` (but not vice versa; `SymNodes` are only used if they participate in a guard).
### C++ Equivalents
- **`c10::SymInt`/`SymFloat`/`SymBool`**: User-visible classes that simulate `int`/`float`/`bool`
- **`c10::SymNode`/`SymNodeImpl`**: Analogous to Python `SymNode`
- **No C++ `ShapeEnv`**: For debugging ease, the entire symbolic reasoning apparatus remains in Python
When writing code traceable with `make_fx`, it must handle `SymInt`/`SymFloat`/`SymBool` flowing through it.
## Value Ranges and Constraints
Symbolic variables maintain **value ranges** that specify the set of possible values. By default:
- Size-like unbacked `SymInts` have value range `[0, Inf]`
- Regular unbacked `SymInts` have value range `[-Inf, Inf]`
When assertions are made (e.g., `torch._check(x == y)`), the system does the following (see the sketch after this list):
1. Attempts to replace unbacked symbols with equivalent expressions
2. Refines value ranges based on the assertion
3. Remembers boolean expressions that are always true
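A minimal runnable sketch of range refinement, assuming `capture_scalar_outputs` is enabled so that `.item()` does not cause a graph break:
```python
import torch

torch._dynamo.config.capture_scalar_outputs = True

@torch.compile
def f(x):
    u = x.sum().item()     # unbacked SymInt; default range [-inf, inf]
    torch._check(u >= 2)   # refines the value range to [2, inf]
    torch._check(u <= 20)  # refines the value range to [2, 20]
    return torch.zeros(u)  # passing u to a factory function also marks it size-like

f(torch.tensor([3, 4]))
```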
Important files:
- C++ SymInt API: `c10/core/SymInt.h`, `SymFloat.h`, `SymBool.h`
- Python SymInt API: `torch/__init__.py` (look for `SymInt/SymFloat/SymBool`)
- C++ plumbing: `c10/core/SymNodeImpl.h`, `torch/csrc/utils/python_symnode.h`, `torch/csrc/jit/python/init.cpp`
- Python infrastructure: `torch/fx/experimental/symbolic_shapes.py`
- Other important files: `torch/_subclasses/fake_tensor.py`, `torch/_meta_registrations.py`, decomps, PrimTorch refs
```{seealso}
* {ref}`dynamic_shapes`
* {ref}`dynamic_shapes_troubleshooting`
```

View File

@ -0,0 +1,101 @@
(debugging-tlparse-torch-logs)=
# Debugging with `tlparse` and `TORCH_LOGS=dynamic`
`tlparse` is a tool used for analyzing and understanding the compilation
process in PyTorch, particularly when dealing with dynamic shapes. It helps
identify where guards and specializations occur in your code.
`TORCH_LOGS=dynamic` is an environment variable setting that enables detailed
logging of dynamic shape operations, providing insights into how symbolic
shapes are handled during execution.
This section will guide you through using `tlparse` and `TORCH_LOGS=dynamic` to
troubleshoot dynamic shape issues in your code, including debugging
specialization, guards, and more.
## Debugging Specialization
In the following example, `x.shape[0]` is dynamic but becomes specialized due to multiplication:
```python
import torch
@torch.compile
def fn(x, y):
return x * y
x = torch.randn(5)
y = torch.randn(5)
torch._dynamo.decorators.mark_dynamic(x, 0)
fn(x, y)
```
By using `TORCH_LOGS=dynamic`, you can observe this specialization in the logs:
```sh
TORCH_LOGS=dynamic python tl.py
I0721 11:10:00.950000 845259 torch/fx/experimental/symbolic_shapes.py:3776] [0/0] create_env
I0721 11:10:01.030000 845259 torch/fx/experimental/symbolic_shapes.py:5117] [0/0] create_symbol s77 = 5 for L['x'].size()[0] [2, int_oo] return x * y # tl.py:5 in fn (_dynamo/variables/builder.py:3466 in <lambda>), for more info run with TORCHDYNAMO_EXTENDED_DEBUG_CREATE_SYMBOL="s77" or to suppress this message run with TORCHDYNAMO_EXTENDED_ADVICE="0"
I0721 11:10:01.038000 845259 torch/fx/experimental/symbolic_shapes.py:7211] [0/0] eval Eq(s77, 5) [guard added] return x * y # tl.py:5 in fn (_subclasses/fake_impls.py:922 in infer_size), for more info run with TORCHDYNAMO_EXTENDED_DEBUG_GUARD_ADDED="Eq(s77, 5)"
```
The line `eval Eq(s77, 5) [guard added] return x * y # tl.py:5` indicates the specialization.
## Debugging Guards
Consider the following code, which may cause recompilations due to dynamic
shapes:
```python
import torch
@torch.compile
def fn(x, y):
if x.shape[0] < 10:
return x * y
x = torch.randn(5)
y = torch.randn(5)
torch._dynamo.decorators.mark_dynamic(x, 0)
torch._dynamo.decorators.mark_dynamic(y, 0)
fn(x, y)
```
To identify where dynamic shape guards originate, use `tlparse`. Here is an example tlparse output:
```{image} ../_static/img/dynamic_shapes/tlparse9_debugging_guards.png
```
By clicking on the `dynamo_cpp_guards` link, you can view all guards from the compilation, including the symbolic shape guard `L['x'].size()[0] <= 9`.
Astute readers will notice the 0/1 specialization where we guard on `L['x'].size()[0] >= 2`. By modifying the code to use unbacked symbols, this guard is removed:
```python
import torch
@torch.compile
def fn(x, y):
# Necessary runtime assert since we can't guard on unbacked
torch._check(x.shape[0] < 10)
if x.shape[0] < 10:
return x * y
x = torch.randn(5)
y = torch.randn(5)
torch._dynamo.decorators.mark_unbacked(x, 0)
torch._dynamo.decorators.mark_unbacked(y, 0)
fn(x, y)
```
Now, this compiled region can be used for inputs of size 0 and 1:
```{image} ../_static/img/dynamic_shapes/tlparse10_debugging_guards_unbacked.png
```
```{seealso}
* {ref}`dynamic_shapes`
* {ref}`troubleshooting_guardondatadependentsymnode_errors`
```

View File

@ -0,0 +1,14 @@
(dynamic_shapes_troubleshooting)=
# Troubleshooting Dynamic Shapes
This section lists common issues that you may encounter when using
dynamic shapes. It describes how to use `TORCH_LOGS` and `tlparse` to
debug them and provides general tips and tricks to help you
resolve them.
```{toctree}
:maxdepth: 1
dynamic_shapes_debugging_tlparse_torch_logs
dynamic_shapes_troubleshooting_guardon_errors
```

View File

@ -0,0 +1,411 @@
(troubleshooting_guardondatadependentsymnode_errors)=
# Troubleshooting GuardOnDataDependentSymNode Errors
When working with PyTorch models that have data-dependent control flow (using functions
like `item()`, `tolist()`, or `nonzero()`), you may encounter `GuardOnDataDependentSymNode` errors.
This section explains what these errors are and how to fix them.
## Common Error Pattern
The following output shows the common pattern of `GuardOnDataDependentSymNode` errors:
```sh
torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(u2, -1) (unhinted: Eq(u2, -1)). (Size-like symbols: none)
Potential framework code culprit (scroll up for full backtrace):
File "/data/users/ezyang/a/pytorch/torch/_prims_common/__init__.py", line 855, in infer_size
if d == -1:
For more information, run with TORCH_LOGS="dynamic"
For extended logs when we create symbols, also add TORCHDYNAMO_EXTENDED_DEBUG_CREATE_SYMBOL="u2"
If you suspect the guard was triggered from C++, add TORCHDYNAMO_EXTENDED_DEBUG_CPP=1
For more debugging help, see https://docs.google.com/document/d/1HSuTTVvYH1pTew89Rtpeu84Ht3nQEFTYhAX3Ypa_xJs/edit?usp=sharing
```
## Root Cause
These errors occur when PyTorch tries to convert a symbolic quantity (for example, `u2 == -1`)
into a concrete value (such as `False`) to make branching decisions. In a typical scenario,
where data-dependent sizes are not involved, PyTorch can determine the concrete value at
compile time and install a guard to ensure the compilation result remains valid. However,
with data-dependent quantities, the true value is unknown at compile time, resulting in errors.
You can often rewrite your model by adding `torch._check` or `torch._check_is_size` to
bypass these issues. This section explains how.
## Debugging Tools
Here are some of the debugging tools available in PyTorch for troubleshooting these errors:
* `TORCH_LOGS="dynamic"` - Shows detailed logs about symbolic operations
* `TORCHDYNAMO_EXTENDED_DEBUG_CREATE_SYMBOL="u2"` - Provides extended logs for specific symbols
* `TORCHDYNAMO_EXTENDED_DEBUG_CPP=1` - Helps when guards are triggered from C++
## Error Variations
Here is a list of error variations that you might encounter:
| Error Variations | Description |
|------------------|-------------|
| "Could not guard on data-dependent expression" | Occurs when trying to extract a concrete boolean from expressions like u0 == 0 or u0 > 10 |
| "Could not extract specialized integer from data-dependent expression" | Occurs when trying to extract a concrete integer value. <br/> **Common causes:** <br/> - Control flow that depends on the integer (such as, looping `u0` times) <br/> - Overspecialization in code that could work symbolically |
## How to Diagnose Your Problem
### Step 1: Examine the Potential Framework Culprit (Python Backtrace)
The exception provides a backtrace, which often indicates the problem.
Given that PT2 backtraces can be lengthy, the error message will also
suggest a potential framework culprit. For example:
```sh
Potential framework code culprit (scroll up for full backtrace):
File "/data/users/ezyang/a/pytorch/torch/_prims_common/__init__.py", line 855, in infer_size
if d == -1:
```
**Consider the Following:**
* Does it make sense that this condition is triggering a guard on a
data-dependent symbol?
* Should we know if the quantity in question is size-like?
(The exception lists size-like symbols; if a symbol is not listed,
it might be an arbitrary integer.)
* If the equation involves two distinct symbols, should we know
they are actually equal?
* If all symbols are size-like but the equation involves 0 or 1,
are we missing a `guard_size_oblivious` wrapper? (Remember, for
`guard_size_oblivious` between two size tuples, use `sym_eq` instead
of regular equality.)
In the example above, testing if `d` (a data-dependent value) is `-1` suggests
that `d` should be non-negative if it were a size. This indicates a missing
`torch._check_is_size`. If `d` is already size-like but `numel() == 0` fails,
consider wrapping it in `guard_size_oblivious`.
Using `TORCH_LOGS=dynamic` and examining the user stack trace is crucial for
understanding how to fix the problem, as they guide you on how to modify the
user program.
```sh
[INFO] create_unbacked_symint u0 [-9223372036854775808, 9223372036854775807] (w.py:40 in custom_op_meta)
```
This log message indicates where (`w.py:40`) the unbacked `SymInt` was
allocated. An unbacked `SymInt` may be allocated multiple times, so track
their equalities:
```sh
[INFO] set_replacement u1 = u0 (trivial_lhs) ValueRanges(lower=0, upper=9223372036854775807, is_bool=False)
```
### Step 2: Examine the C++ Backtrace
If the framework code culprit is uninformative, the guard might be in C++. You can
force a C++ backtrace by running with `TORCHDYNAMO_EXTENDED_DEBUG_CPP=1`. This
provides a detailed C++ backtrace with Python, CPython, and C10/ATen/libtorch
frames interspersed. Look for symbols in the `at::` or `c10::` namespace that
resemble kernel-specific code, likely related to the kernel executed per the Python
backtrace. If using a non-debug build of PyTorch, inlining may cause missing
frames, requiring source code investigation to locate the issue. For example, see https://github.com/pytorch/pytorch/pull/118579.
Here is an example C++ backtrace from a debugging session:
```
[2024-02-08 08:20:45,259] torch.fx.experimental.symbolic_shapes: [INFO] File "../
__gen_aten__/out/RegisterCompositeImplicitAutograd.cpp", line 2025, in at::
(anonymous namespace)::(anonymous namespace)
::wrapper_CompositeImplicitAutograd_Tensor_narrow(at::Tensor const&, long,
at::Tensor const&, c10::SymInt) [2024-02-08 08:20:45,259] torch.fx.experimental.
symbolic_shapes: [INFO] File "../aten/src/ATen/native/TensorShape.cpp", line 1410,
in at::native::narrow_tensor_symint(at::Tensor const&, long, at::Tensor const&,
c10::SymInt) [2024-02-08 08:20:45,259] torch.fx.experimental.symbolic_shapes:
[INFO] File "../__gen_aten__/out/core/TensorMethods.cpp", line 52, in long
at::Tensor::item<long>() const [2024-02-08 08:20:45,259] torch.fx.experimental.
symbolic_shapes: [INFO] File "../ATen/core/TensorBody.h", line 4274, in
at::Tensor::item() const
```
In this example, `at::native::narrow_tensor_symint` calls into `item`, which
triggers the guard on a data-dependent `SymNode`. You can modify the C++ code to
avoid specializing, or verify if you should be in this C++ code (e.g., `start` was
not expected to be a `Tensor`, and modifying this fixed the problem).
## Tools for Fixing Errors
There are a few important functions which you should use to troubleshoot this problem.
### `torch._check(cond, msg_fn)`
`torch._check` is a function used to assert conditions at runtime, particularly when dealing with symbolic integers (`SymInts`) in PyTorch.
**Example Usage:**
```python
torch._check(x.size(0) == y, lambda: f"size mismatch: {x.size(0)} != {y}")
```
The code above does the following:
* Creates a deferred runtime assertion instead of a compile-time guard
* Teaches the symbolic reasoning system facts about your unbacked SymInts
* Can eliminate unbacked symbols by replacing them with equivalent expressions
* Refines value ranges of symbols
* Remembers boolean expressions that are always true
Semantically, the function behaves like a conditional check:
```python
if not cond:
raise RuntimeError(msg_fn())
```
But there are a number of key differences:
* The condition is always assumed true at compile time, even if it involves unbacked `SymInts`. The actual check is deferred to runtime, avoiding
compile-time errors. Instead of setting up a guard, we implement a
deferred runtime assertion to verify the condition at runtime. At compile
time, we assume the condition won't trigger an error, so we don't need
to determine if it evaluates to `True` or `False`.
* If you perform an equality test `u0 == RHS`, we try to replace all instances
of `u0` with RHS. We will ALWAYS do this if RHS has no unbacked symbols,
as removing unbacked symbols is beneficial: eliminating them prevents
the creation of a `GuardOnDataDependentSymNode`. Even if we are not able
to eliminate `u0`, we can refine its value range. The value range specifies
what the set of possible values for a variable is. By default, size-like
unbacked SymInts have a value range of `[0, Inf]`; if you assert it is
equal to an expression with a refined value range, say `[2, 20]`, then
`u0`'s value range will be updated to `[2, 20]`. We also have limited
support for propagating value ranges in reverse.
* If you perform a boolean test `f(u0)`, we will remember that this expression always evaluates to `True`, and if you evaluate an expression that contains it, we will substitute it with `True`. We also support some limited reasoning on logically equivalent statements. For example, if you `torch._check(u0 < 4)`, we will also know that `u0 >= 4` evaluates to `False`, so performing a test like this in a normal non-check conditional will go through fine; the sketch below illustrates this.
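A minimal runnable sketch of that last point, assuming `capture_scalar_outputs` is enabled so `.item()` stays in the graph:
```python
import torch

torch._dynamo.config.capture_scalar_outputs = True

@torch.compile(fullgraph=True)
def f(x):
    u0 = x.sum().item()
    torch._check(u0 < 4)  # remembered as always-true at compile time
    if u0 >= 4:           # statically known to be False, so no data-dependent guard error
        return x * 2
    return x + 1

f(torch.tensor([1, 2]))
```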
### `torch._check_is_size(size)` and `guard_size_oblivious(cond)`
Example:
```python
u0 = y.item()
torch._check_is_size(u0)
```
**Semantic Equivalent:**
```python
if u0 < 0:
raise RuntimeError("u0 is not a size")
```
**Key Differences:**
Like `torch._check`, this test will always succeed at compile time, and it will establish that `u0 >= 0`. This refines the value range of `u0` to `[0, Inf]` instead of `[-Inf, Inf]`.
Marking `u0` as size-like is crucial. Size-like unbacked `SymInts` behave like
their regular counterparts, except when involved in a boolean expression
evaluated with `guard_size_oblivious`. In such cases, they are assumed not to equal zero or one, temporarily setting their value range to `[2, Inf]`. For instance, a conditional check like `u0 == 1` will evaluate to `False` when `u0` is size-like, instead of causing an error.
For example, `guard_size_oblivious(u0 == 1)` will always return `False` when `u0`
is size-like.
Marking unbacked symbols as size-like is essential in contexts where tensor
sizes are expected. PyTorch internals often check if sizes are zero or one to
handle special cases related to empty or single-element tensors. If you pass an
unbacked symbol to a factory function like `torch.empty`, it will automatically
be marked as size-like. However, some quantities, like arguments to `Tensor.view`,
cannot be inferred as size-like because `-1` is a valid argument. In such cases,
you need to explicitly use `torch._check_is_size` on an unbacked `SymInt` before
passing it to `view`.
In PyTorch framework code, if you need to test a size for zero or one, wrap the
test in `guard_size_oblivious` to assume that size-like unbacked `SymInts` will
not pass this test. Generally, most framework code has logic for the `>= 2`
case, which works for the `0/1` case. If using `guard_size_oblivious` in
PyTorch framework code resolves your issue, it's likely acceptable. However,
avoid using `guard_size_oblivious` in user code, especially if different
behavior is required for the `0/1` case at runtime, such as in a
hand-tracking application.
In C++, this can be done with `TORCH_GUARD_SIZE_OBLIVIOUS(u0.sym_eq(0))`, for example.
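As a rough Python sketch of the framework-side pattern described above (the helper name is made up; in user code you would usually branch on the real size instead):
```python
import torch
from torch.fx.experimental.symbolic_shapes import guard_size_oblivious

def mul2_handling_empty(t):
    # With a size-like unbacked dim, this check evaluates to False instead of erroring,
    # and the general path below handles the empty case correctly anyway.
    if guard_size_oblivious(t.shape[0] == 0):
        return t
    return t * 2
```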
### `torch._check_is_size(size, max=upper_bound)` (New)
This function is semantically equivalent to `torch._check(size <= upper_bound)`.
However, under `guard_size_oblivious`, it assumes that `size < upper_bound`.
This functionality only works when the upper bound is an integer constant. If
`upper_bound` is a symbolic expression, normal semantics apply. There is
potential to extend this functionality to symbolic expressions with further
development.
For more details, see the related issue https://github.com/pytorch/pytorch/issues/120288.
### `torch._constrain_as_value` and `torch._constrain_as_size`
These APIs are more specialized and are effectively equivalent to
`torch._check` and `torch._check_is_size`, with the added capability
of adjusting the value range of a variable by specifying minimum and
maximum values. However, in recommendation models, these functions are
unlikely to resolve `GuardOnDataDependentSymNode` errors effectively.
While `constrain_as_value` might seem like a convenient way to ensure a
variable stays within the bounds of another tensor, it is often impractical.
This is because value ranges only support constant bounds, and it's common
for the tensor you want to index into to have a symbolic dimension (for
example, `s0`). Using its size as the maximum value for a value range
will force specialization, which is usually undesirable. Instead, if
necessary, manually handle range checks by using `torch._check()` on
appropriate expressions based on the errors you encounter.
## Common Fix Patterns
There are several common methods to resolve issues like this. Below,
we outline the most frequently used solutions.
### When It's Unfixable
In some cases, the issue is genuinely unfixable due to the nature of the code.
Consider the following example:
```python
i = x.item()
if i > 4:
return x * 2
else:
return x + 3
```
If the user code is branching on a data-dependent value, it is impossible to
trace as is. In such cases, you may need to consider alternative approaches,
such as using `torch.cond`.
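A minimal sketch of rewriting the branch above with `torch.cond`, keeping the predicate as a tensor rather than calling `.item()`:
```python
import torch

@torch.compile(fullgraph=True)
def f(x):
    # torch.cond traces both branches, so the data-dependent choice stays in the graph.
    return torch.cond(x > 4, lambda x: x * 2, lambda x: x + 3, (x,))

f(torch.tensor(5.0))
```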
Another common pattern involves indexing with a data-dependent value:
```python
return self.mlps[x.item()]
```
Here, `self.mlps` is a Python list or `ModuleList`, and the code branches on a data-dependent value. The simplest solution is to induce a graph break before the indexing operation.
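A rough sketch of that workaround (the module and attribute names are illustrative):
```python
import torch

class Router(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.mlps = torch.nn.ModuleList(
            [torch.nn.Linear(8, 8) for _ in range(4)]
        )

    def forward(self, x, idx_tensor):
        idx = idx_tensor.item()
        torch._dynamo.graph_break()  # run the data-dependent indexing eagerly
        return self.mlps[idx](x)
```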
### `u0` is a Size, but We Don't Know It
Some guards fail on tests that essentially ask, "Is this a size?" but we don't know it is a size. These fall into two categories:
1. **Regular Tests:**
These are tests like `u0 >= 0` or `u0 != -1` that are unconditionally true
for sizes. Adding a `torch._check_is_size(...)` on the relevant size will
assert that these tests are true. This is typically uncommon because if
the test is for error checking, we can infer that the condition must be
true, as an error would occur otherwise. An important exception is APIs
that accept both sizes and `-1`; in such cases, the user must indicate that
the input data-dependent quantity cannot be `-1`, as something unusual would
happen otherwise. For an example, see
https://github.com/pytorch/pytorch/pull/107788.
Sometimes, you can refactor an error-checking API to split a logical
disjunction of conditionals into separate conditionals. If you can do so
to achieve a single `torch._check(x == y)` statement, it will enable
the automatic generation of a deferred runtime assertion. For an example,
see https://github.com/pytorch/pytorch/pull/110979.
2. **Edge Case Tests:**
These are tests like `u0 == 0` or `u0 == 1`, which are not always true for
sizes, but where our choice doesn't really matter. These tests handle edge
cases, such as dealing with an empty tensor or testing for broadcasting when
we want to assume broadcasting is not occurring. To resolve these situations,
two steps are needed:
* First, the guard itself must be evaluated via `guard_size_oblivious`,
which assumes that size-like integers cannot equal zero or one, with the
promise that if they do, something reasonable will happen.
* Second, the symbols themselves must be marked as size-like, either
inferred because they were passed to tensor factory functions or explicitly
specified with `torch._check_is_size(...)`. For examples of making guards
size-oblivious, see https://github.com/pytorch/pytorch/pull/118579.
Sometimes, these tests can occur in C++. While there are corresponding
C++ APIs for these tests, it can be more challenging to localize the problem,
as you do not get a useful backtrace by default.
### `u0` is Actually Equal to `u1`, but We Don't Know It
Multiple unbacked `SymInts` can be known to be equal at compile time:
```python
i0 = x.sum().item()
i1 = x.sum().item()
return torch.randn(i0) + torch.randn(i1)
```
If there is a `torch._check(i0 == i1)` somewhere (in the example above, this
check would occur inside the shape-checking rule for addition), we will
automatically unify the two unbacked `SymInts` and recognize them as equal.
However, if such an assertion is missing, you may need to explicitly add an
assertion to achieve this unification. For an example, see
https://github.com/pytorch/pytorch/issues/111950.
```{note}
If we allocate an unbacked `SymInt` and
immediately set it equal to another, these instances are benign and not easily
eliminated entirely from the framework.
```
### `u0` is a Tensor
Another reason you might be overallocating unbacked `SymInts` is due to passing
around a `Tensor` and relying on its implicit conversion to an integer. Many
functions that accept an integer will also accept a `Tensor` and automatically
call `item()` on it. It's beneficial to examine
`TORCH_LOGS=dynamic` to determine whether the number of unbacked `SymInts` is
as expected or excessive. When this occurs, a new `SymInt` will be allocated at
the line where a PyTorch function is invoked.
This issue is less likely to cause problems now because the return value of
`t.item()` is memoized, ensuring that you consistently receive the same unbacked
`SymInt` if you call it multiple times.
### Overspecialization Issue
In non-strict export mode, consider the following code:
```python
u0 = x.sum().item()
return y[:u0]
```
This code will fail when trying to evaluate `u0` because, when a `SymInt` is
used directly inside a Python slice (without using Dynamo), Python forces the
integer to be specialized and fails if it is unbacked.
To resolve this, you can rewrite the program to avoid specialization.
For the example above, you can fix it by not using slices:
```python
u0 = x.sum().item()
return y.narrow(0, 0, u0)
```
For more details, see the related issue
https://github.com/pytorch/pytorch/issues/111950.
### Use Lengths Instead of Offsets
When working with variable sequence lengths, it's common to have tensors
representing either the lengths or offsets of the sequences. For example, given
`values = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]`, you might have `lengths = [3, 2, 4]`
and `offsets = [0, 3, 5, 9]`. While these representations are interconvertible,
it's better to work with lengths when dealing with them as integers (by calling
`lengths.tolist()`), rather than offsets.
The reason is that when you perform a `torch.split()` on your `values` tensor, you
need to create tensors for each sub-sequence, such as tensors of sizes 3, 2, and 4.
If you have unbacked `SymInts` for sizes, they become `u0`, `u1`, and `u2`. You can
easily indicate that they are size-like, and you're done. However, if you have
unbacked `SymInts` for offsets, they become `u1 - u0`, `u2 - u1`, `u3 - u2`, which
complicates matters. These quantities cannot be conveniently marked as size-like,
leading to potential issues. Since it's relatively straightforward to write code
using either lengths or offsets, you should prefer using lengths.
```{seealso}
* {ref}`dynamic_shapes`
* {ref}`debugging-tlparse-torch-logs`
```

View File

@ -0,0 +1,33 @@
(zero-one-specialization)=
# The Zero-One Specialization Problem
Before you read this section, you should understand the basics of
dynamic shapes. Make sure you have read the following sections:
* {ref}`dynamic_shapes`
* {ref}`torch.export`
* {ref}`what_is_a_specialization`
In `torch.compile`, we specialize automatically on inputs with sizes
0 or 1 and assume that any remaining inputs cannot be 0 or 1. This
simplifies tasks like contiguity and broadcasting checks, as it
avoids adding extra guards. However, this can cause problems for
sparse models with many symbolic integers that in practice have
tensors of size 0, 1, or 2. For example, consider a task such as
collecting likes on a page.
While it's possible to stop specializing on 0/1 upfront, executing
normal PyTorch code often reintroduces 0/1 guards, as many conditions
in PyTorch check for values being 0 or 1. Although models that work
for `N > 2` often generalize to `N = 1`, this isn't guaranteed, especially
with symbolic variables. For example, in hand tracking, a dimension
size of `N = 0`, `1`, or `2` may lead to different graph behaviors.
Simply hoping that the `N > 2` model generalizes can expose soundness issues.
```{seealso}
* {ref}`dynamic_shapes`
* {ref}`torch.export`
* {ref}`what_is_a_specialization`
* {ref}`backed-vs-unbacked-symints`
```

View File

@ -82,55 +82,48 @@ Some of the most commonly used backends include:
## Read More
```{eval-rst}
.. toctree::
:caption: Getting Started for PyTorch Users
:maxdepth: 1
```{toctree}
:caption: Getting Started for PyTorch Users
:maxdepth: 2
torch.compiler_get_started
torch.compiler_api
torch.compiler.config
torch.compiler_fine_grain_apis
torch.compiler_backward
torch.compiler_aot_inductor
torch.compiler_inductor_profiling
torch.compiler_profiling_torch_compile
torch.compiler_faq
torch.compiler_troubleshooting
torch.compiler_performance_dashboard
torch.compiler_inductor_provenance
torch.compiler_get_started
torch.compiler_api
torch.compiler.config
torch.compiler_dynamic_shapes
torch.compiler_fine_grain_apis
torch.compiler_backward
torch.compiler_aot_inductor
torch.compiler_inductor_profiling
torch.compiler_profiling_torch_compile
torch.compiler_faq
torch.compiler_troubleshooting
torch.compiler_performance_dashboard
torch.compiler_inductor_provenance
```
```{eval-rst}
.. toctree::
:caption: `torch.compile` Programming Model
```{toctree}
:caption: torch.compile Programming Model
:maxdepth: 2
compile/programming_model
compile/programming_model
```
% _If you want to contribute a developer-level topic
% that provides in-depth overview of a torch._dynamo feature,
% add in the below toc.
```{toctree}
:caption: Deep Dive for PyTorch Developers
:maxdepth: 1
```{eval-rst}
.. toctree::
:caption: Deep Dive for PyTorch Developers
:maxdepth: 1
torch.compiler_dynamo_overview
torch.compiler_dynamo_deepdive
torch.compiler_dynamic_shapes
torch.compiler_nn_module
torch.compiler_cudagraph_trees
torch.compiler_fake_tensor
torch.compiler_dynamo_overview
torch.compiler_dynamo_deepdive
torch.compiler_nn_module
torch.compiler_cudagraph_trees
torch.compiler_fake_tensor
```
```{eval-rst}
.. toctree::
:caption: HowTo for PyTorch Backend Vendors
:maxdepth: 1
```{toctree}
:caption: HowTo for PyTorch Backend Vendors
:maxdepth: 1
torch.compiler_custom_backends
torch.compiler_transformations
torch.compiler_ir
torch.compiler_custom_backends
torch.compiler_transformations
torch.compiler_ir
```

View File

@ -1,129 +1,295 @@
# Dynamic Shapes
---
file_format: mystnb
kernelspec:
name: python3
mystnb:
execution_timeout: 30
execution_show_tb: True
merge_streams: True
---
Code: [symbolic_shapes.py](https://github.com/pytorch/pytorch/blob/db4572dbf18f1cf50cf662547e272d3117063747/torch/fx/experimental/symbolic_shapes.py)
```{code-cell}
:tags: [remove-cell]
import torch
from compile import header_code
See also: [The dynamic shapes manual](https://docs.google.com/document/d/1GgvOe7C8_NVOMLOCwDaYV1mXXyHMXY7ExoewHqooxrs/edit#heading=h.fh8zzonyw8ng)
## Motivation
Deep learning compilers commonly only work for static shapes, that is to say, they produced compiled programs which only work for a single specific configuration of input shapes, and must recompile if any input shape changes. This assumption works great for the majority of commonly run deep learning models today, but there are a few situations where it is insufficient:
- Some dimensions, such as batch size or sequence length, may vary. For example, an inference service performing adaptive batching will execute inference requests with varying batch sizes depending on how many requests it received within its batching window. We may also want to consider padding out variable size sequences only to the maximum sequence length within a batch, which may vary from batch-to-batch.
- Some models exhibit data-dependent output shapes, that is to say, the size of their outputs and intermediates may depend on the actual input data which may vary across runs. For example, detection models may first generate a variable number of potential bounding boxes before running a more expensive image recognition model to identify if the subject is in a bounding box. The number of bounding boxes is data dependent.
- One particularly important case of data-dependent shapes occurs when dealing with sparse representations, such as sparse tensors, jagged tensors, and graph neural networks. In all of these cases, the amount of data to be processed depends on the sparse structure of the problem, which will typically vary in a data-dependent way.
In supporting dynamic shapes, we chose not to support dynamic rank programs, e.g., programs whose input tensors change in dimensionality, as this pattern rarely occurs in real-world deep learning programs, and it avoids the need to reason inductively over symbolic lists of shapes.
## Abridged public API
The default dynamic behavior in PyTorch 2.1 is:
- PT2 assumes everything is static by default
- If we recompile because a size changed, we will instead attempt to recompile
that size as being dynamic (sizes that have changed are likely to change in
the future). This generalization may fail (e.g., because user code does a
conditional branch on the size in question or missing dynamic shapes support
in PT2). If you are trying to understand why PT2 has overspecialized some
code, run with `TORCH_LOGS=dynamic` and look for "eval" entries that say
when guards are added and why.
- If you know ahead of time something will be dynamic, you can skip the first
recompile with `torch._dynamo.mark_dynamic(tensor, dim)`. If you know ahead of time
the `min` and `max` value this dimension can take, you can specify `torch._dynamo.mark_dynamic(tensor, dim, min=min, max=max)`
- If you say `torch.compile(dynamic=False)`, we will turn off automatic
dynamic shapes on recompiles and always recompile for each distinct size.
Conversely, if you say `torch.compile(dynamic=True)`, we will try to make
everything as dynamic as possible. This is mostly useful for small
operators; if you try it on a big model it will (1) probably crash PT2 and (2) run slow for no good reason.
- You can whitelist specific sources to be marked as dynamic using the
`TORCH_COMPILE_DYNAMIC_SOURCES` environment variable or by setting
`torch.compiler.config.dynamic_sources`. This is particularly useful for large
models with graph breaks, as you can maintain dynamism across graph breaks since
source names stay consistent. You can also use this to mark integers as dynamic.
The format is a comma-delimited list of source names, e.g., `"L['x'], L['y']"`.
You can also use regexes, e.g., `"L\['x.*'\], L\['y.*'\]"`.
This whitelist takes precedence over other flags like `dynamic=False`,
`force_nn_module_property_static_shapes`, and `force_parameter_static_shapes`.
- Sometimes it can be cumbersome to find the right inputs to mark as dynamic. If
you're willing to take a performance hit for the first batch, one other affordable
option we have are the eager_then_compile stances which derive dynamism for you.
See [torch.compiler.set_stance](https://docs.pytorch.org/docs/stable/generated/torch.compiler.set_stance.html) for more details.
## The Guard Model
When considering how to add support for dynamic shapes to TorchDynamo and TorchInductor, we made a major design decision: in order to reuse decompositions and other preexisting code written in Python/C++ targeting the PyTorch API, we must be able to trace through dynamic shapes. Unlike a fully symbolic system which might capture both branches of a conditional, we always pick one branch and specialize our trace under the assumption that we only use this trace when we would have made the same choice for that branch in the future. To do this, we maintain a "hint" for every symbolic size saying what its concrete value is at compile time (as TorchDynamo is a just-in-time compiler, it always knows what the actual input sizes are.) When we perform a condition on a tensor, we simply consult the hint to find out which branch to take.
This greatly simplifies the symbolic shape formulas we produce, but means we have a much more involved system for managing guards. Consider, for example, the following program:
```python
def f(x, y):
z = torch.cat([x, y])
if z.size(0) > 2:
return z.mul(2)
else:
return z.add(2)
torch._logging.set_logs(graph_breaks=True, graph_code=True)
```
The final IR we will compile with TorchInductor will either be `torch.cat([x, y]).add(2)` or `torch.cat([x, y]).mul(2)` (with the condition flattened away), but to determine which branch we are in, we would need to know the size of `z`, an intermediate. Because TorchDynamo must know upfront if a compiled trace is valid (we do not support bailouts, like some JIT compilers), we must be able to reduce `z.size(0)` as an expression in terms of the inputs, `x.size(0) + y.size(0)`. This is done by writing meta functions for all operators in PyTorch which can propagate size information to the output of a tensor without actually performing computation on the node.
(dynamic_shapes)=
# Dynamic Shapes
## Overall architecture
This section explains how to work with dynamic shapes in PyTorch, including how
to debug and fix common errors, implement support for dynamic shapes in
operators, and understand the underlying mechanisms.
Symbolic shapes workflow:
Dynamic shapes allow PyTorch models to handle inputs with varying dimensions
without recompilation. This enables more flexible models that can process
different batch sizes, sequence lengths, or image dimensions in a single
compiled artifact. Dynamic shapes work by symbolically tracing tensor
dimensions rather than using concrete values, creating a computation
graph that adapts to different input shapes at runtime. By default,
PyTorch assumes all input shapes to be static.
1. When we start compiling a frame in Dynamo, we allocate a ShapeEnv (attached to FakeTensorMode) which keeps track of symbolic shapes state.
2. We allocate symbolic sizes for tensors on entry (what is static or dynamic is a policy decision, with some knobs).
3. We propagate the symbolic sizes through operators, maintaining both (1) FX IR so that we can faithfully export symbolic compute, and (2) Sympy expressions representing the size vars, so we can reason about them.
4. When we condition on symbolic sizes, either in Dynamo tracing or in Inductor optimization, we add guards based on the conditional. These can be induced from both Python and C++.
5. These guards can induce further simplifications on symbolic variables. For example, if you assert `s0 == 4`, we can now replace all occurrences of `s0` with `4`.
6. When we're done tracing and optimizing, we install all of these guards with the compiled code; the compiled code is only reusable if all the guards evaluate true.
Typically, deep learning compilers only support static shapes, requiring
recompilation for input shape changes. While this approach covers many use cases,
there are situations where this is insufficient:
Important files:
- **Variable Dimensions** - Batch sizes or sequence lengths vary, such as in
adaptive batching.
- **Data-Dependent Outputs** - Models produce outputs based on input data,
like variable bounding boxes in detection models.
- **Sparse Representations** - Processing depends on data-varying sparse structures,
such as in sparse tensors, jagged tensors, and graph neural networks.
- C++ SymInt API: `c10/core/SymInt.h`, `SymFloat.h`, `SymBool.h`
- Python SymInt API: `torch/__init__.py` (look for `SymInt/SymFloat/SymBool`)
- C++ plumbing: `c10/core/SymNodeImpl.h`, `torch/csrc/utils/python_symnode.h`, `torch/csrc/jit/python/init.cpp`
- Python infrastructure: `torch/fx/experimental/symbolic_shapes.py`
- Other important files: `torch/_subclasses/fake_tensor.py`, `torch/_meta_registrations.py`, decomps, PrimTorch refs
Dynamic shapes do not support dynamic rank programs, that is, programs whose input tensors
change in dimensionality, as this is uncommon and unnecessarily complex.
## Abridged internal API
Understanding the Python class hierarchy:
- SymInt/SymFloat/SymBool: these are user-visible classes that simulate their int/float/bool counterparts. If you add two SymInts, we give you a new SymInt that symbolically tracks that the integer addition occurred.
- SymNode: this is the internal structure (accessible via, for example, `symint.node`) which holds the actual symbolic tracking info. SymNode is type erased; this makes it more convenient to represent mixed-type operations. Note that technically you don't have to call into Python SymNode from SymInt; for example, XLA's C++ `SymNodeImpl` would take the place of SymNode.
- ShapeEnv: per-compile context state which keeps track of all the free symbols and guards we have accumulated so far. Every SymNode records its ShapeEnv (but not vice versa; SymNodes only get used if they participate in a guard).

C++ is fairly similar:

- c10::SymInt/SymFloat/SymBool: user-visible classes that simulate int/float/bool.
- c10::SymNode/SymNodeImpl: analogous to SymNode.
- There is no ShapeEnv in C++; for ease of debugging, the entire symbolic reasoning apparatus is in Python.

When you write code that is traceable with `make_fx`, it must be able to deal with SymInt/SymFloat/SymBool flowing through it. [The dynamic shapes manual](https://docs.google.com/document/d/1GgvOe7C8_NVOMLOCwDaYV1mXXyHMXY7ExoewHqooxrs/edit#heading=h.fh8zzonyw8ng) gives some guidance for how to do this.

## DimDynamic policy

Symbolic reasoning:

- Value ranges
- Sympy usage notes
- Constraints
- DimDynamic/Constraint

## Unbacked SymInts

To resolve control flow, we check the hint, that is, the actual value, of a symbolic integer to determine which branch to take. However, in some cases we may not have a hint: so-called unbacked symbolic integers arise when a size variable emerges from a data-dependent operation like `.nonzero()` or `.item()`. It is illegal to perform control flow on these symbolic integers, so we must graph break on these operations.

Naively implemented, this is too restrictive: most PyTorch programs will immediately fail if you try to do anything with unbacked symbolic integers. Here are the most important enhancements to make this actually work:

- On tensor creation, PyTorch precomputes a lot of data about a tensor; for example, if you use `empty_strided` to create a tensor, we will eagerly sort the strides and determine if the tensor is non-overlapping and dense. Sorts produce a lot of guards. However, it is more common to produce a tensor directly with a higher-level API like `empty`, which is guaranteed to produce a non-overlapping and dense tensor. We modified PyTorch to avoid needlessly recomputing these properties.
- Even if nontrivial compute is needed, sometimes a property is never actually queried at all. Making these precomputed properties lazy allows us to avoid guarding on an unbacked symbolic integer unless it is actually needed.
- The data in an integer tensor is generally not known to be non-negative. However, we provide an API, `constrain_range`, whereby a user can specify that a size is bounded above and below by known limits.

Similar to the dynamic APIs, there are corresponding unbacked APIs: namely, you can use `mark_unbacked` instead of `mark_dynamic` and `TORCH_COMPILE_UNBACKED_SOURCES` instead of `TORCH_COMPILE_DYNAMIC_SOURCES` to tell the compiler to mark an input as unbacked.

In future versions of PT2 (beyond PT2.1), we will extend our reasoning system
to infer that an unbacked symbolic integer is size-like based on usage. For
example, if you pass the result of an `.item()` call to a factory function
like `torch.empty`, we will automatically infer that the result is a size
(because if it were not, it would fail). This assumption is validated
at runtime, raising an error if it is not fulfilled.

## What does it mean for a size/integer to be dynamic?

Dynamic shapes allow avoiding recompilations by making certain dimensions or integers
dynamic. For example, if a function `f(x)` is compiled with a static size, it will need
recompilation for different sizes:

```{code-cell}
import torch

@torch.compile(dynamic=False)
def f(x):
    return x * x.size()[0]

f(torch.rand(10))
f(torch.rand(20))
f(torch.rand(30))
f(torch.rand(40))
```

In the produced output, you can see that four graphs were generated.
See the corresponding <a href="_static/img/dynamic_shapes/tlparse1_dynamic_shapes_false.png" target="_blank">tlparse output</a>.

By making the size dynamic, the function can handle various sizes without recompilation:

```{note}
For simplicity, the following example uses `@torch.compile(dynamic=True)`. Note that
this option is not recommended because it is error prone.
For a recommended way of enabling dynamic shapes, see {ref}`enable-dynamic-behavior`.
```

```{code-cell}
import torch

@torch.compile(dynamic=True)
def f(x):
    return x * x.size()[0]

f(torch.rand(10))
f(torch.rand(20))
f(torch.rand(30))
f(torch.rand(40))
```

With dynamic shapes enabled, only one graph is created. See the
corresponding <a href="_static/img/dynamic_shapes/tlparse2_dynamic_shapes_true.png" target="_blank">tlparse output</a>.
While compilation time differences
are minimal for this small example, more complex use cases would show significant
performance improvements.

(what_is_a_specialization)=
## What is a specialization?

**Specialization** refers to optimizing a computational graph for specific input shapes
by examining shape conditions during control flow. If a branch is taken based on a
shape condition, the graph is tailored for that condition. If a new input doesn't meet
this condition, the system will recompile the graph.

Specialization allows you to create optimized computational graphs for specific input
shapes, which can significantly improve execution speed.
```{code-cell}
import torch

@torch.compile(dynamic=True)
def f(x):
    if x.size()[0] == 10:
        return x * 10
    if x.size()[0] <= 30:
        return x * 200
    return x * x.size()[0]

f(torch.rand(10))
f(torch.rand(20))
f(torch.rand(30))
f(torch.rand(40))
f(torch.rand(50))
```
In the code above, the first branch specializes the graph to an input size of exactly 10, in which
case it returns `x * 10`. If the input size is less than or equal to 30, it returns `x * 200`;
otherwise, it returns `x * x.size()[0]`.
In the output, you can see that this creates three graphs.
See the corresponding <a href="_static/img/dynamic_shapes/tlparse3_specialization.png" target="_blank">tlparse output</a>.
These are the graphs created for the above function:
```{image} _static/img/dynamic_shapes/dynamic_shapes_example_specialization.png
```
(enable-dynamic-behavior)=
## Enabling Dynamic Behavior
There are several ways to make shapes and integers dynamic:
* {ref}`automatic_dynamic`
* {ref}`user_annotations` (preferred)
* {ref}`torch_compile_dynamic_true` (for testing only)
* {ref}`dynamic_shapes_advanced_control_options` (for advanced use cases)
Each of these options is described below.
(automatic_dynamic)=
### Automatic dynamic
**Automatic dynamic** is the default behavior where {func}`torch.compile` performs
the initial compilation assuming static shapes are used, while tracking the
input sizes from that first compilation. When a recompile is triggered, it
uses this information to identify which dimensions have changed and marks
those as dynamic for the second compilation.
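A minimal sketch of this behavior (the function and sizes are illustrative):

```python
import torch

@torch.compile  # no dynamic= argument: automatic dynamic is the default
def f(x):
    return x * x.size(0)

f(torch.randn(10))  # first compilation assumes size 10 is static
f(torch.randn(20))  # the size changed, so dimension 0 is recompiled as dynamic
f(torch.randn(30))  # reuses the dynamic graph; no further recompilation
f(torch.randn(40))
```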
(user_annotations)=
### User Annotations
Several APIs allow users to explicitly mark specific inputs, by name or in code, as dynamic.
This is useful for avoiding the initial static compilations that automatic dynamic would
otherwise recompile as dynamic. It is also used to mark elements that are not automatically
marked as dynamic, such as neural network module parameters. User annotations are the
preferred way to enable dynamic shapes.
#### `mark_dynamic(tensor, dim, min=min, max=max)`
The {func}`torch._dynamo.mark_dynamic` function marks a tensor dimension as dynamic and will fail if it
gets specialized. It does not work for integers. Use this function only if you know
all graphs in the frame using this input converge to a single dynamic graph.
Otherwise, you may encounter a misleading constraint violation error.
In such cases, consider using {func}`torch._dynamo.maybe_mark_dynamic`. Currently,
{func}`torch._dynamo.mark_dynamic`
does not have precedence over `force_parameter_static_shapes = True` or `force_nn_module_property_static_shapes = True`.
If you know in advance that a particular dimension will be dynamic, you
can avoid the initial recompilation by using {func}`torch._dynamo.mark_dynamic(tensor, dim)`.
Additionally, if you already know the minimum and maximum possible
values for this dimension, you can specify them with
{func}`torch._dynamo.mark_dynamic(tensor, dim, min=min, max=max)`.
Here is a quick example:
```{code-cell}
import torch

@torch.compile()
def f(x):
    return x * x.size()[0]

x = torch.randn(10)
torch._dynamo.mark_dynamic(x, 0)
# The first invocation compiles with dimension 0 already marked as dynamic
f(x)
# Subsequent invocations reuse the dynamically compiled code
f(torch.randn(20))
f(torch.randn(30))
f(torch.randn(40))
```
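If you know the valid range in advance, here is a minimal sketch with explicit `min`/`max`
bounds (the bounds below are illustrative):

```python
import torch

@torch.compile()
def f(x):
    return x * x.size(0)

x = torch.randn(16)
# Dimension 0 is dynamic and assumed to stay within [8, 1024].
torch._dynamo.mark_dynamic(x, 0, min=8, max=1024)
f(x)
f(torch.randn(32))  # within bounds, reuses the compiled code
```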
#### `maybe_mark_dynamic(tensor, dim)`
The {func}`torch._dynamo.maybe_mark_dynamic` function shares all properties
with {func}`torch._dynamo.mark_dynamic`
but does not fail if the size gets specialized. Use it for inputs shared by
multiple graphs, or when the number of graphs for a specific frame does not converge
to one. For instance, in the example above, {func}`torch._dynamo.maybe_mark_dynamic` is the
safer choice because the graphs for sizes 0 and 1 will specialize; use
{func}`torch._dynamo.mark_dynamic` instead if you want a hard error whenever specialization occurs.
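A minimal sketch (the sizes are illustrative):

```python
import torch

@torch.compile()
def f(x):
    return x * x.size(0)

x = torch.randn(10)
# Prefer a dynamic dimension 0, but do not raise an error if the compiler
# decides to specialize this dimension anyway.
torch._dynamo.maybe_mark_dynamic(x, 0)
f(x)
f(torch.randn(20))  # reuses the dynamically compiled code
```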
#### `mark_unbacked(tensor, dim)`
The {func}`torch._dynamo.mark_unbacked` function marks a tensor dimension as unbacked. It is
unlikely to be the tool you need, but it can be useful if the specialization occurs inside a
`guard_size_oblivious` condition and marking the dimension as unbacked removes the specialization.
Make sure it actually fixes the specialization and does not introduce a data-dependent error
that turns into a graph break at or before the location of the specialization
you are trying to avoid. It might be better to use the next option.
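For completeness, a minimal sketch of the API (the function is illustrative and does not branch
on the size, so it compiles cleanly with an unbacked dimension):

```python
import torch

@torch.compile(fullgraph=True)
def f(x):
    return x + 1

x = torch.randn(8)
# Treat dimension 0 as unbacked: the compiler gets no hint for it and
# cannot specialize on special values such as 0 or 1.
torch._dynamo.mark_unbacked(x, 0)
f(x)
```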
(dynamic_sources_allow_list)=
#### Dynamic Allow List (`DYNAMIC_SOURCES`)
Use the environment variable `TORCH_COMPILE_DYNAMIC_SOURCES` to pass a comma-separated
list of source names to be marked as dynamic. For example:
`TORCH_COMPILE_DYNAMIC_SOURCES="L['x'],L['y']"`
It is easiest to find these dynamic source names using the PGO artifact in `tlparse`;
you can copy and paste them directly from the PGO artifact. This method works
for integers and tensor sizes, and it takes precedence over all other flags
that force static shapes. It will not throw an error if something marked dynamic
gets specialized or if the provided source does not exist.
Here is an example:
```{code-cell}
import torch

@torch.compile()
def f(x):
    return x * x.size()[0]

with torch.compiler.config.patch(dynamic_sources="L['x']"):
    f(torch.rand(10))
    f(torch.rand(20))
    f(torch.rand(30))
    f(torch.rand(40))
```
(torch.compiler.set_stance_eager_then_compile)=
#### `torch.compiler.set_stance("eager_then_compile")`
At times, identifying the appropriate inputs to mark as dynamic can
be challenging. If you are willing to accept a performance cost for
the first batch, another convenient option is to use the
`eager_then_compile` stances, which automatically determine dynamic
inputs for you. For more information, see {func}`torch.compiler.set_stance` and [Dynamic Compilation Control with torch.compiler.set_stance](https://docs.pytorch.org/tutorials/recipes/torch_compiler_set_stance_tutorial.html).
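A minimal sketch, assuming a PyTorch version where the `eager_then_compile` stance is available:

```python
import torch

@torch.compile
def f(x):
    return x * x.size(0)

# Run the first invocation eagerly, then compile later invocations, using the
# shapes observed during the eager run to decide which dimensions are dynamic.
torch.compiler.set_stance("eager_then_compile")
f(torch.randn(10))  # runs eagerly
f(torch.randn(20))  # compiled
torch.compiler.set_stance("default")
```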
(torch_compile_dynamic_true)=
### `torch.compile(dynamic=True)` (Not recommended)
This setting forces all sizes and integers to be dynamic, increasing the
chance of encountering dynamic shape bugs. It makes every input size dynamic,
which may result in performance regressions and can ultimately increase
compilation time. For these reasons, this option is error prone and not
recommended outside of testing.
PyTorch also provides advanced options for controlling dynamic behavior; see
{ref}`dynamic_shapes_advanced_control_options`.
## Where Do I Go From Here?
If you encounter a framework code bug or an issue with specialization,
file an issue so it can be reviewed and potentially improved. If the issue
is within your user code, consider whether you are willing to rewrite your
code to avoid it. Determine if it affects correctness or if it's a redundant
check. If the issue involves a Triton custom kernel with a `constexpr`
argument, evaluate whether you can rewrite it to address the problem.
```{toctree}
:maxdepth: 1
compile/dynamic_shapes_core_concepts
compile/dynamic_shapes_troubleshooting
compile/dynamic_shapes_advanced_control_options
compile/dynamic_shapes_beyond_the_basics
```
```{seealso}
* [tlparse documentation](https://github.com/pytorch/tlparse)
* [The dynamic shapes manual](https://docs.google.com/document/d/1GgvOe7C8_NVOMLOCwDaYV1mXXyHMXY7ExoewHqooxrs/edit?tab=t.0#heading=h.fh8zzonyw8ng)
```