Commit Graph

242 Commits

Author SHA1 Message Date
Edward Z. Yang
f6be44c74e Profile guided optimization for automatic_dynamic (#139001)
Previously: https://github.com/pytorch/pytorch/pull/138052, but the implementation here is done from scratch, so I opened a new PR.

This implements the ability to save and load profiles of automatic dynamic decisions, so on subsequent runs we can directly make something automatically dynamic. Unlike the previous implementation, this cache is never enabled by default; instead, you have to specify a "job id" that says it's OK to share results. We will be able to automatically populate this id for internal MAST jobs, but generic OSS users will have to explicitly opt into it.
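
A minimal sketch of opting in. The commit message does not name the config surface, so the `torch.compiler.config.job_id` attribute and its value below are assumptions for illustration only:

```python
import torch

# Assumed opt-in surface: sharing a stable job id across runs lets the saved
# profile of automatic-dynamic decisions be reloaded on the next run.
torch.compiler.config.job_id = "my_training_job"  # hypothetical config/value

@torch.compile
def step(x):
    return x * 2

# First run: the batch-size changes are recorded in the profile; a later run
# with the same job id can mark the batch dimension dynamic up front.
for batch in (2, 4, 8):
    step(torch.randn(batch, 16))
```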

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Differential Revision: [D65065497](https://our.internmc.facebook.com/intern/diff/D65065497)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139001
Approved by: https://github.com/oulgen
2024-11-02 11:50:11 +00:00
PyTorch MergeBot
8d1eaa3da6 Revert "Profile guided optimization for automatic_dynamic (#139001)"
This reverts commit a6630bcf87.

Reverted https://github.com/pytorch/pytorch/pull/139001 on behalf of https://github.com/ezyang due to internal code triggers import cycle ([comment](https://github.com/pytorch/pytorch/pull/139001#issuecomment-2452833882))
2024-11-02 03:38:15 +00:00
Edward Z. Yang
a6630bcf87 Profile guided optimization for automatic_dynamic (#139001)
Previously: https://github.com/pytorch/pytorch/pull/138052, but the implementation here is done from scratch, so I opened a new PR.

This implements the ability to save and load profiles of automatic dynamic decisions, so on subsequent runs we can directly make something automatically dynamic. Unlike the previous implementation, this cache is never enabled by default; instead, you have to specify a "job id" that says it's OK to share results. We will be able to automatically populate this id for internal MAST jobs, but generic OSS users will have to explicitly opt into it.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Differential Revision: [D65065497](https://our.internmc.facebook.com/intern/diff/D65065497)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139001
Approved by: https://github.com/oulgen
2024-11-01 21:43:25 +00:00
Animesh Jain
2aa5348356 [dynamo][guards] Skip no tensor aliasing guards on parameters (#138954)
This is another unsound guard-eval optimization. It's rare in practice to compile a function with two different parameters as inputs and then later call the function with a single parameter passed as two different inputs (aliasing). This further reduces guard overhead from 280 us to 240 us for the model in https://github.com/pytorch/pytorch/issues/138386
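
A minimal sketch of the scenario being traded off, assuming plain `torch.compile` usage (names are illustrative, not from the PR):

```python
import torch
import torch.nn as nn

p1 = nn.Parameter(torch.randn(8, 8))
p2 = nn.Parameter(torch.randn(8, 8))

@torch.compile
def f(a, b, x):
    return x @ a + x @ b

x = torch.randn(4, 8)
f(p1, p2, x)   # compiled with two distinct parameters as inputs

# The rare case this PR stops guarding against: later passing one parameter
# as both inputs (aliasing). Skipping the no-tensor-aliasing guard on
# parameters makes guard evaluation cheaper but will not catch this.
f(p1, p1, x)
```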

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138954
Approved by: https://github.com/jansel
ghstack dependencies: #139040
2024-10-29 02:11:47 +00:00
Simon Fan
fd9f4e6770 Back out "[compiled autograd] tls access helpers (#138061)" and Back out "[compiled autograd] Compiled autograd configs in TLS (#137821)" (#139086)
Summary:
Original commit changeset: 9bf80c1492d7

Original Phabricator Diff: D64796226

Original commit changeset: aa1d9ef8f6e6

Original Phabricator Diff: D64796212

Differential Revision: D65072644

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139086
Approved by: https://github.com/malfet
2024-10-28 23:37:05 +00:00
Animesh Jain
817b4988e4 [dynamo][config-cleanup] Remove enable_cpp_guard_manager=False codepath (#138512)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138512
Approved by: https://github.com/williamwen42, https://github.com/jansel
2024-10-25 16:41:55 +00:00
Mark Kim-Mulgrew
c7a20939b4 Remove unused enforce_cond_guards_match Dynamo feature flag. (#138589)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138589
Approved by: https://github.com/clee2000
2024-10-22 19:36:01 +00:00
Simon Fan
49fa437097 [compiled autograd] Compiled autograd configs in TLS (#137821)
Multithreading doesn't work yet; this adds Python-side TLS only for the Python-side state.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137821
Approved by: https://github.com/jansel, https://github.com/yf225
ghstack dependencies: #137953
2024-10-22 08:03:52 +00:00
PyTorch MergeBot
361f42bc42 Revert "[compiled autograd] Compiled autograd configs in TLS (#137821)"
This reverts commit 9aba0b91c8.

Reverted https://github.com/pytorch/pytorch/pull/137821 on behalf of https://github.com/wdvr due to Reverting this for now, it is failing test_public_bindings in trunk ([comment](https://github.com/pytorch/pytorch/pull/137821#issuecomment-2417351788))
2024-10-16 16:38:29 +00:00
Simon Fan
9aba0b91c8 [compiled autograd] Compiled autograd configs in TLS (#137821)
Multithreading doesn't work yet; this adds Python-side TLS only for the Python-side state.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137821
Approved by: https://github.com/jansel, https://github.com/yf225
ghstack dependencies: #137953
2024-10-16 09:28:32 +00:00
Angela Yi
f80ed0b831 [export] Custom op meta kernel generation (two pass) (#137277)
Summary: Prototyping the custom op meta kernel generation. The rest of the changes are in fbcode/scripts/angelayi

Test Plan: followup diff (D63837739)

Differential Revision: D63837740

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137277
Approved by: https://github.com/zou3519
2024-10-07 15:34:19 +00:00
Bob Ren
f0ef7fddde Add ignored/unmaintained comment for capture_autograd_function flag (#137309)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137309
Approved by: https://github.com/jansel
ghstack dependencies: #136961
2024-10-04 20:02:37 +00:00
Bob Ren
a1f1f585ab clean up error_on_nested_jit_trace flag (#136961)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136961
Approved by: https://github.com/jansel
2024-10-04 02:07:54 +00:00
Bob Ren
13ec343afe clean up capture_func_transforms flag (#136960)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136960
Approved by: https://github.com/guilhermeleobas, https://github.com/jansel
2024-10-04 01:10:52 +00:00
Will Feng
a89e3c2490 Add compiled_autograd_kwargs_override Dynamo config (#136967)
For Traceable FSDP2, the most common use case is to have `fullgraph=False` for forward pass (to allow user-level graph breaks), and `fullgraph=True` for compiled autograd backward pass (required for queue_callback support).

With `torch._dynamo.compiled_autograd=True`, we previously were not able to set a different `fullgraph` config value for the forward vs. the backward pass, since `rebuild_ctx` just reuses the forward compile config as-is. This PR adds the `torch._dynamo.config.compiled_autograd_kwargs_override` config to allow forcing `fullgraph=True` for CA Dynamo tracing.

With this PR, we can remove standalone compiled autograd ctx manager usage in Traceable FSDP2 unit tests, and consolidate on using `torch._dynamo.compiled_autograd=True`.
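
A minimal sketch of the intended usage. The config names come from this commit message; the dict-of-`torch.compile`-kwargs value format is an assumption:

```python
import torch

torch._dynamo.config.compiled_autograd = True
# Force fullgraph=True only for the compiled autograd backward trace, while
# the forward keeps fullgraph=False to allow user-level graph breaks.
torch._dynamo.config.compiled_autograd_kwargs_override = {"fullgraph": True}

model = torch.nn.Linear(16, 16)
opt_model = torch.compile(model, fullgraph=False)
opt_model(torch.randn(4, 16)).sum().backward()
```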

Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor_fullgraph_True`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136967
Approved by: https://github.com/xmfan
2024-10-02 06:23:59 +00:00
Yuanhao Ji
be169f743b [Dynamo] Mark config.dead_code_elimination as deprecated (#136933)
part of #136862

For reviewers, all call sites are here: https://github.com/search?q=repo%3Apytorch%2Fpytorch+dead_code_elimination+language%3APython&type=code&l=Python

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136933
Approved by: https://github.com/williamwen42, https://github.com/anijain2305
2024-10-01 03:51:59 +00:00
Oguz Ulgen
a28b40fa74 Improve is_fbcode functionality (#136871)
Summary: Previously is_fbcode just checked whether the checkout was git or not. This is extremely error-prone. Let's make it fool-proof.

Test Plan: unit tests

Reviewed By: masnesral

Differential Revision: D63545169

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136871
Approved by: https://github.com/masnesral
2024-09-27 21:19:01 +00:00
Edward Z. Yang
a2d2a30311 Add torch._dynamo.config.fail_on_cache_limit_hit (#136767)
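
A minimal sketch of what this flag is for, assuming only that it makes a cache-limit hit fail loudly instead of silently falling back to eager; the trigger below uses the existing `cache_size_limit` config:

```python
import torch

torch._dynamo.config.fail_on_cache_limit_hit = True
torch._dynamo.config.cache_size_limit = 2  # deliberately tiny for the demo

@torch.compile
def f(x, tag):
    # Dynamo specializes on the string value, so each new `tag` recompiles.
    return x + len(tag)

x = torch.randn(4)
for tag in ("a", "bb", "ccc", "dddd"):
    f(x, tag)  # expected to error once the cache limit is exceeded
```
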
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136767
Approved by: https://github.com/albanD, https://github.com/jansel
ghstack dependencies: #136533
2024-09-27 03:58:00 +00:00
William Wen
95e976a63f [dynamo] recursively skip frames when Dynamo cache limit is hit (#135144)
Fixes https://github.com/pytorch/pytorch/pull/135144 and [T197117723](https://www.internalfb.com/intern/tasks/?t=197117723).

In general, adds `SkipCodeRecursiveException` to Dynamo - when raised in Dynamo, convert_frame will return a `skip_code_recursive_flag` back to C Dynamo, signaling it to skip the current frame and all recursive calls.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135144
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-09-06 21:38:53 +00:00
Animesh Jain
058a69d91a [fbcode][dynamo] Turn on guard_nn_modules using justknobs_check (#134928)
As Title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134928
Approved by: https://github.com/ezyang
2024-09-05 22:05:54 +00:00
Laith Sakka
d6091c8726 Add compile time instruction count metric (#133834)
PYTHONPATH=$(pwd) python benchmarks/update_hint_benchmark.py out

As of this diff, compile_time_instruction_count counts the number of instructions from within convert_frame.compile_inner:
```
update_hint_regression,compile_time_instruction_count,10522459165
```
Will add results from CI once populated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133834
Approved by: https://github.com/aorenste
2024-08-27 23:29:02 +00:00
Edward Z. Yang
66d6d8b1b9 Support TORCH_COMPILER_COLLECTIVES envvar (#133696)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133696
Approved by: https://github.com/Skylion007, https://github.com/c-p-i-o
2024-08-19 20:13:04 +00:00
Will Feng
6790eb52f9 [Traceable FSDP2] Set torch._dynamo.config.skip_fsdp_hooks to True by default (#133531)
Setting `torch._dynamo.config.skip_fsdp_hooks = True` is required for graph-break compiled FSDP2, so making it the default will make adoption easier. If users want to use Traceable FSDP2, they can set it to False manually (which will allow FSDP2 hooks to be traced through).
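
A minimal sketch of opting back into Traceable FSDP2, per the description above:

```python
import torch

# Default (True): FSDP2 hooks are skipped by Dynamo (graph-break compiled FSDP2).
# Flip to False to let Dynamo trace through the FSDP2 hooks (Traceable FSDP2).
torch._dynamo.config.skip_fsdp_hooks = False
```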

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133531
Approved by: https://github.com/awgu
ghstack dependencies: #133532
2024-08-16 17:18:42 +00:00
Aart Bik
a8490a0762 [traced-graph][sparse] propagate sparsity in fx graph (#131920)
This PR continues implementing feature request #117188 by generalizing more cases that already work with COO to also work with the compressed sparse formats.

Feature request:
https://github.com/pytorch/pytorch/issues/117188

Rebranch of older PRs (for history):
https://github.com/pytorch/pytorch/pull/131474
https://github.com/pytorch/pytorch/pull/128549

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131920
Approved by: https://github.com/ezyang
2024-08-05 15:49:53 +00:00
Xuehai Pan
e74ba1b34a [BE][Easy][15/19] enforce style for empty lines in import segments in torch/_d*/ (#129767)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by the linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129767
Approved by: https://github.com/anijain2305
2024-07-31 21:18:11 +00:00
Animesh Jain
f44446e851 [dynamo] Turn on inline_inbuilt_nn_modules (#131275)
Known issues that are deliberately kept open and will be fixed later are tracked here - https://github.com/pytorch/pytorch/issues/131696

Training dashboard ([link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2018%20Jul%202024%2000%3A03%3A50%20GMT&stopTime=Thu%2C%2025%20Jul%202024%2000%3A03%3A50%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=gh/anijain2305/435/head&lCommit=408b9358b8fca3a5d08b39741419fe8a596941aa&rBranch=gh/anijain2305/435/base&rCommit=d31f2ae904ba2cf0884bf24413ba2109c3585d51))

![image](https://github.com/user-attachments/assets/08ef081c-37d7-436d-905b-4b9e2b470644)

Inference dashboard ([link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2018%20Jul%202024%2000%3A03%3A50%20GMT&stopTime=Thu%2C%2025%20Jul%202024%2000%3A03%3A50%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&lBranch=gh/anijain2305/435/head&lCommit=914244fa2fe0055917e039e35183b21fa90afdc6&rBranch=gh/anijain2305/435/base&rCommit=d31f2ae904ba2cf0884bf24413ba2109c3585d51))
![image](https://github.com/user-attachments/assets/32136eff-a39e-4cde-a438-e51a665bc3c9)

Inference sees a little bit more perf degradation but we are ok with that.
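
A minimal sketch of what the flag changes, assuming the main user-visible effect is graph reuse across instances of built-in modules rather than per-instance specialization:

```python
import torch

torch._dynamo.config.inline_inbuilt_nn_modules = True  # new default per this PR

@torch.compile
def run(layer, x):
    return layer(x)

x = torch.randn(4, 8)
run(torch.nn.Linear(8, 8), x)
# With inlining on, a second Linear instance with the same shapes can reuse
# the compiled graph (parameters become graph inputs) instead of recompiling.
run(torch.nn.Linear(8, 8), x)
```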

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131275
Approved by: https://github.com/ezyang, https://github.com/jansel
ghstack dependencies: #132053
2024-07-29 20:01:51 +00:00
PyTorch MergeBot
7ef927da15 Revert "[dynamo] Turn on inline_inbuilt_nn_modules (#131275)"
This reverts commit 6de65d5dd4.

Reverted https://github.com/pytorch/pytorch/pull/131275 on behalf of https://github.com/atalman due to Broke CI: dynamo/test_structured_trace.py::StructuredTraceTest::test_ddp_graphs [GH job link](https://github.com/pytorch/pytorch/actions/runs/10132084288/job/28016215101) [HUD commit link](6de65d5dd4) ([comment](https://github.com/pytorch/pytorch/pull/131275#issuecomment-2255839646))
2024-07-29 12:48:27 +00:00
Animesh Jain
6de65d5dd4 [dynamo] Turn on inline_inbuilt_nn_modules (#131275)
Known issues that are deliberately kept open and will be fixed later are tracked here - https://github.com/pytorch/pytorch/issues/131696

Training dashboard ([link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2018%20Jul%202024%2000%3A03%3A50%20GMT&stopTime=Thu%2C%2025%20Jul%202024%2000%3A03%3A50%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=gh/anijain2305/435/head&lCommit=408b9358b8fca3a5d08b39741419fe8a596941aa&rBranch=gh/anijain2305/435/base&rCommit=d31f2ae904ba2cf0884bf24413ba2109c3585d51))

![image](https://github.com/user-attachments/assets/08ef081c-37d7-436d-905b-4b9e2b470644)

Inference dashboard ([link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2018%20Jul%202024%2000%3A03%3A50%20GMT&stopTime=Thu%2C%2025%20Jul%202024%2000%3A03%3A50%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&lBranch=gh/anijain2305/435/head&lCommit=914244fa2fe0055917e039e35183b21fa90afdc6&rBranch=gh/anijain2305/435/base&rCommit=d31f2ae904ba2cf0884bf24413ba2109c3585d51))
![image](https://github.com/user-attachments/assets/32136eff-a39e-4cde-a438-e51a665bc3c9)

Inference sees a little bit more perf degradation but we are ok with that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131275
Approved by: https://github.com/ezyang, https://github.com/jansel
ghstack dependencies: #131744, #131928, #131948
2024-07-28 13:23:00 +00:00
PyTorch MergeBot
14920c149b Revert "[dynamo] Turn on inline_inbuilt_nn_modules (#131275)"
This reverts commit 0455344777.

Reverted https://github.com/pytorch/pytorch/pull/131275 on behalf of https://github.com/clee2000 due to I think this broke inductor/test_cpu_select_algorithm.py::TestSelectAlgorithmDynamicShapesCPU::test_quantized_linear_amx_dynamic_shapes_batch_size_16_in_features_4_out_features_64_bias_True_cpu [GH job link](https://github.com/pytorch/pytorch/actions/runs/10102272826/job/27938970118) [HUD commit link](0455344777) not run on PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/131275#issuecomment-2251609554))
2024-07-26 00:12:40 +00:00
Animesh Jain
0455344777 [dynamo] Turn on inline_inbuilt_nn_modules (#131275)
Known issues that are deliberately kept open and will be fixed later are tracked here - https://github.com/pytorch/pytorch/issues/131696

Training dashboard ([link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2018%20Jul%202024%2000%3A03%3A50%20GMT&stopTime=Thu%2C%2025%20Jul%202024%2000%3A03%3A50%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=gh/anijain2305/435/head&lCommit=408b9358b8fca3a5d08b39741419fe8a596941aa&rBranch=gh/anijain2305/435/base&rCommit=d31f2ae904ba2cf0884bf24413ba2109c3585d51))

![image](https://github.com/user-attachments/assets/08ef081c-37d7-436d-905b-4b9e2b470644)

Inference dashboard ([link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2018%20Jul%202024%2000%3A03%3A50%20GMT&stopTime=Thu%2C%2025%20Jul%202024%2000%3A03%3A50%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&lBranch=gh/anijain2305/435/head&lCommit=914244fa2fe0055917e039e35183b21fa90afdc6&rBranch=gh/anijain2305/435/base&rCommit=d31f2ae904ba2cf0884bf24413ba2109c3585d51))
![image](https://github.com/user-attachments/assets/32136eff-a39e-4cde-a438-e51a665bc3c9)

Inference sees a little bit more perf degradation but we are ok with that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131275
Approved by: https://github.com/ezyang
ghstack dependencies: #131744
2024-07-25 22:14:17 +00:00
Edward Z. Yang
0c6f1ca064 Introduce torch._dynamo.config.enable_compiler_collectives for syncing compilation across ranks (#130935)
This PR implements an opt-in configuration option for synchronizing compilation across all ranks at the end of Dynamo tracing (and potentially, other places in the future). There are two pieces to this PR:

1. Implementing infrastructure for compiler collectives (DistributedState/LocalState, the actual collective)
2. Using this infrastructure to synchronize automatic dynamic choices across all ranks

The infrastructure in part one can be used for other purposes, just add more (serializable) fields to LocalState.

Here is how automatic dynamic synchronization works:

1. Preflight in "torch/_dynamo/variables/builder.py": On the first Dynamo trace run, we trace without automatic dynamic at all; we assume all Tensor inputs that are not otherwise marked are static. This run is purely to collect all Tensor input sizes in the program.
2. torch/_dynamo/output_graph.py: At the end of the first Dynamo trace run, we perform a compiler collective to distribute all Tensor input sizes to all ranks. Then we restart Dynamo.
3. Apply the updates in "torch/_dynamo/variables/builder.py": Now that we have all sizes for every rank, we now update frame state with the observed sizes for all ranks, in rank order. Under the assumption that frame state is consistent on all ranks, this series of updates will preserve consistency.

For future work, it would be safer to force a consistent hint on all ranks; this is more involved, as we would have to interpose in fakification.
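
A minimal sketch of the opt-in, assuming a torchrun-launched job so the process group environment is already set up:

```python
import torch
import torch.distributed as dist

dist.init_process_group("gloo")
torch._dynamo.config.enable_compiler_collectives = True  # opt in

@torch.compile
def step(x):
    return x * 2

# If ranks see different sizes on the first iteration, the compiler collective
# at the end of Dynamo tracing shares the observed sizes so every rank makes
# the same automatic-dynamic choice instead of diverging in recompiles.
step(torch.randn(2 + dist.get_rank(), 8))
```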

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130935
Approved by: https://github.com/jansel
2024-07-24 11:24:11 +00:00
Pian Pawakapan
988ed4d5db [export] clean up allow_complex_guards_as_runtime_asserts flag (#130596)
Summary: removes underscore, cleans up dead code in DimConstraints

Test Plan: existing export tests

Reviewed By: angelayi

Differential Revision: D59612746

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130596
Approved by: https://github.com/angelayi
2024-07-12 17:17:11 +00:00
Yanbo Liang
111f9b5d44 [Dynamo] Add config to skip/inline torchrec (#129912)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129912
Approved by: https://github.com/anijain2305
2024-07-03 00:14:51 +00:00
Elias Ellison
b8e5678ad2 Delete lazy ddp optimizer (#120727)
This is no longer necessary now that the normal ddp optimizer works correctly with inductor strides.

Differential Revision: [D54858819](https://our.internmc.facebook.com/intern/diff/D54858819)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120727
Approved by: https://github.com/jansel, https://github.com/yf225
2024-06-26 21:53:54 +00:00
Will Feng
575bc1e3af [Reopen #114036] Allow "must recompute" in torch.compile + selective checkpointing (SAC) (#129295)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129295
Approved by: https://github.com/Chillee
2024-06-25 23:47:08 +00:00
Will Feng
dadc0ed4c8 [Traceable FSDP2] Add aot_eager backend E2E tests for transformer model (#129157)
This PR adds Traceable FSDP2 `aot_eager` backend E2E tests for simple MLP as well as transformer model.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129157
Approved by: https://github.com/awgu
ghstack dependencies: #129203
2024-06-23 06:11:11 +00:00
Aaron Orenstein
dcfa7702c3 Flip default value for mypy disallow_untyped_defs [1/11] (#127838)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127838
Approved by: https://github.com/oulgen
2024-06-08 18:16:33 +00:00
Michael Lazos
2129903aa3 Properly detect nested torch function args (#127496)
Dynamo was not detecting nested torch function classes in containers. This was because pytree compatibility for variable trackers was removed.
Fixes https://github.com/pytorch/pytorch/issues/127174

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127496
Approved by: https://github.com/anijain2305
2024-06-02 03:43:22 +00:00
Simon Fan
ec098b88b6 [compiled autograd] torch.compile API (#125880)
- enter existing compiled autograd ctx manager before entering torch.compile frames

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125880
Approved by: https://github.com/jansel
2024-05-31 04:38:20 +00:00
PyTorch MergeBot
ce63b676f3 Revert "[compiled autograd] torch.compile API (#125880)"
This reverts commit e1c322112a.

Reverted https://github.com/pytorch/pytorch/pull/125880 on behalf of https://github.com/atalman due to sorry your PR broke lint, need to revert ([comment](https://github.com/pytorch/pytorch/pull/125880#issuecomment-2139605376))
2024-05-30 13:53:31 +00:00
Simon Fan
e1c322112a [compiled autograd] torch.compile API (#125880)
- enter existing compiled autograd ctx manager before entering torch.compile frames

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125880
Approved by: https://github.com/jansel
2024-05-30 02:10:06 +00:00
Pian Pawakapan
8a31c2aa84 [export] allow complex guards as runtime asserts (#127129)
With the current state of export's dynamic shapes, we struggle with guards and constraints that are beyond the current dynamic shapes language, expressed with dims and derived dims. While we can compile and guarantee correctness for guards within the current language (e.g. min/max ranges, linear relationships, integer divisibility) we struggle to dynamically compile guards which extend beyond that.

For these "complex" guards, we typically do either of the following: 1) raise a constraint violation error, along the lines of "not all values of <symbol> in the specified range satisfy <guard>", with or without suggested fixes, 2) specialize to the provided static values and suggest removing dynamism, or 3) fail compilation due to some arbitrary unsupported case. Previous [work](https://github.com/pytorch/pytorch/pull/124949) went towards resolving this by disabling forced specializations, instead allowing the user to fail at runtime with incorrect inputs.

In this PR, relying on [hybrid backed-unbacked symints](https://github.com/pytorch/pytorch/issues/121749), [deferred runtime asserts](https://github.com/pytorch/pytorch/blob/main/torch/fx/passes/runtime_assert.py), and the function [_is_supported_equivalence()](d7de4c9d80/torch/fx/experimental/symbolic_shapes.py (L1824)), we add a flag `_allow_complex_guards_as_runtime_asserts` which allows the user to compile exported programs containing these guards and maintain dynamism, while adding correctness checks as runtime assertions in the graph.

Hybrid backed-unbacked symints allow us to easily bypass "implicit" guards emitted from computation - guards that we ~expect to be true. Popular examples revolve around reshapes:
```
# reshape
def forward(self, x, y):  # x: [s0, s1], y: [s2]
    return x.reshape([-1]) + y  # guard s0 * s1 = s2
```

This leads to the following exported program:

```
class GraphModule(torch.nn.Module):
    def forward(self, x: "f32[s0, s1]", y: "f32[s2]"):
        sym_size_int: "Sym(s2)" = torch.ops.aten.sym_size.int(y, 0)
        mul: "Sym(-s2)" = -1 * sym_size_int;  sym_size_int = None
        sym_size_int_1: "Sym(s0)" = torch.ops.aten.sym_size.int(x, 0)
        sym_size_int_2: "Sym(s1)" = torch.ops.aten.sym_size.int(x, 1)
        mul_1: "Sym(s0*s1)" = sym_size_int_1 * sym_size_int_2;  sym_size_int_1 = sym_size_int_2 = None
        add: "Sym(s0*s1 - s2)" = mul + mul_1;  mul = mul_1 = None
        eq: "Sym(Eq(s0*s1 - s2, 0))" = add == 0;  add = None
        _assert_scalar = torch.ops.aten._assert_scalar.default(eq, "Runtime assertion failed for expression Eq(s0*s1 - s2, 0) on node 'eq'");  eq = None

        view: "f32[s0*s1]" = torch.ops.aten.view.default(x, [-1]);  x = None
        add_1: "f32[s0*s1]" = torch.ops.aten.add.Tensor(view, y);  view = y = None
        return (add_1,)
```
Another case is symbol divisibility:
```
def forward(self, x):  # x: [s0, s1]
    return x.reshape([-1, x.shape[0] - 1])  # Eq(Mod(s0 * s1, s0 - 1), 0)
```

Applying deferred runtime asserts also helps dynamic compilation for "explicit" complex guards that typically cause problems for export. For example we can generate runtime asserts for not-equal guards, and complex conditions like the following:
```
class Foo(torch.nn.Module):
    def forward(self, x, y):
        # check that negation of first guard also shows up as runtime assertion
        if x.shape[0] == y.shape[0]:  # False
            return x + y
        elif x.shape[0] == y.shape[0] ** 3:  # False
            return x + 2, y + 3
        elif x.shape[0] ** 2 == y.shape[0] * 3:  # True
            return x * 2.0, y * 3.0
```
For the above graph we will generate 3 runtime assertions: the negations of the first 2 conditions, and the 3rd condition, which was taken as a guard.

One additional benefit here over the current state of exported programs is that this adds further correctness guarantees - previously with explicit complex guards, if compilation succeeded, the guards would be ignored at runtime, treated as given.

As shown above, the runtime asserts appear as math ops in the graph, generated by the sympy interpreter, resulting in an _assert_scalar call. There is an option to avoid adding these asserts into the graph, by setting `TORCH_DYNAMO_DO_NOT_EMIT_RUNTIME_ASSERTS=1`. This results in the "original" computation graph, with dynamism, and any incorrect inputs will fail on ops during runtime. Further work could go into prettifying the printer, so the majority of the graph isn't guard-related.

Ideally this PR would subsume and remove the recently added [_disable_forced_specializations](https://github.com/pytorch/pytorch/pull/124949) flag, but that flag still handles one additional case of specialization: single-variable equalities where the symbol is solvable for a concrete value: see this [PR](https://github.com/pytorch/pytorch/pull/126925)

This PR doesn't change any behavior around data-dependent errors/unbacked symints yet, that could be further work.

NOTE: will take naming change suggestions for the flag :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127129
Approved by: https://github.com/avikchaudhuri
2024-05-29 17:15:25 +00:00
dshi7
0f67d38f0f add TORCHDYNAMO_CAPTURE_DYNAMIC_OUTPUT_SHAPE_OPS (#127017)
tlparse prints a failure description like this:

> dynamic shape operator: aten._unique2.default; to enable, set torch._dynamo.config.capture_dynamic_output_shape_ops = True

This adds an OS env var to make it easier to set for testing.
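
A minimal sketch using the config named in the message; the env var in the title is the launch-time equivalent (assuming `1` enables it):

```python
import torch

# Equivalent to launching with:
#   TORCHDYNAMO_CAPTURE_DYNAMIC_OUTPUT_SHAPE_OPS=1 python script.py
torch._dynamo.config.capture_dynamic_output_shape_ops = True

@torch.compile(fullgraph=True)
def f(x):
    return torch.unique(x)  # data-dependent output shape (aten._unique2)

f(torch.tensor([1, 2, 2, 3]))
```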

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127017
Approved by: https://github.com/jackiexu1992
2024-05-25 05:42:41 +00:00
Alexandre Ghelfi
b3a8a3cbab Fix typos in torch._dynamo.config.py (#126150)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126150
Approved by: https://github.com/Skylion007
2024-05-14 14:27:35 +00:00
Edward Z. Yang
2ba102f689 Implement native support for float inputs in Dynamo and ShapeEnv (#125325)
The big idea is that floats are treated as Tensors on input/output to the FX graph, but on the inside, we immediately call item() on the synthetic Tensor and record regular float operations on it. Canonicalization to Tensor operations will happen in a standalone FX pass. This behavior is controlled by `specialize_float` config variable when set to False.

The generated graph looks like this for the test `test_unspec_float_output`:

```
 def forward(self, L_x_: "f32[3]", L_y_: "f32[]"):
     l_x_ = L_x_
     l_y_ = L_y_

     # File: /data/users/ezyang/a/pytorch/test/dynamo/test_unspec.py:511 in f, code: return x + 1, y * 2
     add: "f32[3]" = l_x_ + 1;  l_x_ = None
     item: "Sym(zf0)" = l_y_.item();  l_y_ = None
     mul: "Sym(2*zf0)" = item * 2;  item = None
     scalar_tensor: "f32[]" = torch.scalar_tensor(mul);  mul = None
     return (add, scalar_tensor)
```

The ingredients:

* **torch/_dynamo/variables/builder.py** When `specialize_float` is False, we wrap float literals with `wrap_symfloat`. This is an unholy mashup of `wrap_symint` and `wrap_unspecialized_primitive`. The overall strategy is that we first generate a tensor argument (because that's what we want to show up into the FX graph), but then immediately call item() on the tensor argument to get a SymNodeVariable, which we will do the rest of the tracing with.  Importantly, this SymNodeVariable is backed with the source of the original float: this means we can guard on the resulting value (something we could NOT do with UnspecializedPythonVariable). This has to be done manually, because if you literally call item() on the tensor, you will end up with an unbacked float. There is a bit of copy paste from wrap_symint and wrap_unspecialized_primitive which we can try to factor out, but this really is its own thing and you should review every line of code in the function.
* **torch/fx/experimental/symbolic_shapes.py** We now can generate guards on float inputs, and these guards are handled inside of ShapeEnv. So we need to be able to allocate (backed!) float symbols, and produce guards for them. Fairly straightforward generalization.
* **torch/_dynamo/codegen.py** I also need to maintain the invariant that there are no float outputs to the FX graph. I chose to do this at codegen time. When we detect a SymNodeVariable on the return stack for a float, we on the fly convert it (via `as_tensor`) to a TensorVariable, which is the true output. We then special case the output bytecode to call item() on it again. The tensor conversion is memoized on SymNodeVariable since we typically run the code generation process twice.
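
A minimal sketch matching the test mentioned above (`test_unspec_float_output`), with the flag flipped off:

```python
import torch

torch._dynamo.config.specialize_float = False

@torch.compile
def f(x, y):
    return x + 1, y * 2  # y is a plain Python float

# With specialize_float=False, y enters the graph as a 0-d tensor, is item()'d
# to a backed SymFloat (zf0) inside, and the float output is returned via a
# scalar_tensor + item() round trip instead of being baked in as a constant.
out = f(torch.randn(3), 0.5)
```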

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125325
Approved by: https://github.com/lezcano, https://github.com/jansel
2024-05-14 04:10:01 +00:00
Animesh Jain
0935b3d794 [dynamo] Turn on guard_nn_modules (#125202)
Turning on guard_nn_modules adds a large number of guards, so we are bound to take a perf hit. But the perf hit is small. These are the numbers:

![image](https://github.com/pytorch/pytorch/assets/13822661/c8793906-c8c7-432b-9af4-4594713067be)

First we observe that, compared to Python guards, C++ guards give around a 6x speedup. This reduces the total time spent in guards. This is shown in the last column (cpp_guards/inductor_optimized_latency). The worst model is around 1.61%, with most of the models below 1%. I think this is a good enough signal to turn the config on.

One might also wonder how much guard slowdown occurs with `guard_nn_modules=True`. This is the table
![image](https://github.com/pytorch/pytorch/assets/13822661/932a885b-1c03-424b-8405-5bc8fd35dd39)

For most models, the guard overhead with nn module guards is under 2x. There are a few outliers, where the slowdown is really high and for those models we spend 1%-2% time in C++ guards as shown in first table.
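
A minimal sketch of what the extra guards buy, assuming an attribute read in `forward` that would otherwise go unguarded:

```python
import torch

torch._dynamo.config.guard_nn_modules = True

class Scale(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.factor = 2.0

    def forward(self, x):
        return x * self.factor

m = Scale()
opt = torch.compile(m)
opt(torch.randn(4))
m.factor = 3.0       # with nn-module guards on, this mutation is detected...
opt(torch.randn(4))  # ...so this recompiles instead of reusing a stale graph
```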

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125202
Approved by: https://github.com/ezyang
2024-05-11 19:28:24 +00:00
Animesh Jain
b62e89c1b8 [dynamo] Do not turn on record relay with TORCH_COMPILE_DEBUG (#125488)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125488
Approved by: https://github.com/yanboliang, https://github.com/mlazos
2024-05-04 05:10:31 +00:00
Boyuan Feng
b91f83f181 [cudagraph] add config for cudagraph managed input mutation support (#124754)
Summary: [#123231](https://github.com/pytorch/pytorch/pull/123231) adds cudagraph support for more types of functions (i.e., cudagraph managed input mutation). These newly supported functions may have mutated static inputs, leading to assertion errors in some workloads that previously skipped cudagraphs. This diff adds a config to opt into the new feature.

Test Plan: ci

Differential Revision: D56481353

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124754
Approved by: https://github.com/eellison
2024-04-24 04:23:53 +00:00
Animesh Jain
704fac5618 [dynamo][cpp-guard] Reland Attempt 1 - Enable cpp guard manager (#124231)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124231
Approved by: https://github.com/jansel
ghstack dependencies: #124230, #124237
2024-04-18 06:36:20 +00:00
PyTorch MergeBot
530bf391cc Revert "[dynamo] Turn on CPP guard manager (#123547)"
This reverts commit 3e98bdd66d.

Reverted https://github.com/pytorch/pytorch/pull/123547 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/123547#issuecomment-2058337419))
2024-04-16 06:38:15 +00:00