Commit Graph

665 Commits

Author SHA1 Message Date
PyTorch MergeBot
361f42bc42 Revert "[compiled autograd] Compiled autograd configs in TLS (#137821)"
This reverts commit 9aba0b91c8.

Reverted https://github.com/pytorch/pytorch/pull/137821 on behalf of https://github.com/wdvr due to Reverting this for now, it is failing test_public_bindings in trunk ([comment](https://github.com/pytorch/pytorch/pull/137821#issuecomment-2417351788))
2024-10-16 16:38:29 +00:00
Tom Ritchford
af27f7888b [dynamo] Remove an unused variable in AOTDispatchAutograd (#137989)
* Extracted from https://github.com/pytorch/pytorch/pull/133492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137989
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-10-16 16:37:19 +00:00
Tom Ritchford
15722debfb Remove two unused variables in _functorch/partitioners.py (#137998)
* Extracted from https://github.com/pytorch/pytorch/pull/133492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137998
Approved by: https://github.com/Skylion007
2024-10-16 10:58:31 +00:00
Simon Fan
9aba0b91c8 [compiled autograd] Compiled autograd configs in TLS (#137821)
Multithreaded doesn't work yet, this adds python side TLS only for the python side state

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137821
Approved by: https://github.com/jansel, https://github.com/yf225
ghstack dependencies: #137953
2024-10-16 09:28:32 +00:00
Brian Hirsh
ed94725b8c log ViewAndMutationMeta to trace_structured (#133784)
I ended up bundling it into the existing tlparse logs for the AOT forward graph, since it looked like registering it as a separate artifact requires changes to tlparse itself (maybe that is wrong though?)

Example new fw AOT graph tlparse output for the below code: https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmp70zKiO/0_0_0/aot_forward_graph_2.txt

```
import torch

@torch.compile
def f(x):
    out1 = torch.view_as_complex(x)
    out2 = torch.view_as_complex(x)
    return out1, out2, x * 2

x_ = torch.randn(4, 2, requires_grad=True, dtype=torch.float64)
out = f(x_)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133784
Approved by: https://github.com/ezyang
2024-10-15 02:49:02 +00:00
Avik Chaudhuri
8262f6d271 fix test_lazy_module_kwargs (#137705)
Test Plan: fixed

Differential Revision: D64185644

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137705
Approved by: https://github.com/tugsbayasgalan
2024-10-11 01:53:10 +00:00
Bob Ren
36133f39db Tensorify compute on Python scalars (#136674)
Signed-off-by: Bob Ren <bobrenfb.com>

Comandeered from https://github.com/pytorch/pytorch/pull/130228 as I'm helping @ezyang w/ shipping dynamic float arguments in PT2. This starts with supporting torch.ops.aten.mul. I'll stack on top support for other operators in subsequent PRs to keep this scoped to the mechanics of the fx pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136674
Approved by: https://github.com/ezyang
2024-10-09 18:51:41 +00:00
James Wu
4d45536e92 Save aot graph code in AOTAutogradCache for logging purposes (#137432)
Save the string graph code from print_readable

Differential Revision: [D63985711](https://our.internmc.facebook.com/intern/diff/D63985711/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137432
Approved by: https://github.com/bdhirsh
ghstack dependencies: #137431
2024-10-09 16:59:08 +00:00
Masaki Kozuki
b71d0ac3b1 remove unused variable (#137565)
per title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137565
Approved by: https://github.com/Skylion007
2024-10-09 16:31:43 +00:00
PyTorch MergeBot
2fff990c16 Revert "[AutoAC] Backward Pass Aware AC - changes to partitioner to acommodate SOLVER as a callable (#137314)"
This reverts commit 932b9945c0.

Reverted https://github.com/pytorch/pytorch/pull/137314 on behalf of https://github.com/huydhn due to The failure shows up in trunk ([comment](https://github.com/pytorch/pytorch/pull/137314#issuecomment-2401311719))
2024-10-09 04:53:30 +00:00
Parikshit Shah
932b9945c0 [AutoAC] Backward Pass Aware AC - changes to partitioner to acommodate SOLVER as a callable (#137314)
Summary: making it so that the config can pass `config.activation_memory_budget_solver` as a callable method and then that callable is invoked to determine the set of saved/recomputed nodes.

Test Plan: tbd

Reviewed By: Chillee, basilwong

Differential Revision: D63714905

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137314
Approved by: https://github.com/eellison, https://github.com/basilwong

Co-authored-by: Parikshit Shah <parikshit@meta.com>
2024-10-09 00:39:29 +00:00
Edward Z. Yang
b499083a91 Get rid of quadratic tests to has_same_metadata (#136857)
Fixes https://github.com/pytorch/pytorch/issues/136852

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136857
Approved by: https://github.com/isuruf, https://github.com/bdhirsh
2024-10-08 20:49:23 +00:00
Avik Chaudhuri
28493efe6e fix silly mapping issue with torch.Size (#137465)
Test Plan: added test

Differential Revision: D64022949

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137465
Approved by: https://github.com/yushangdi, https://github.com/angelayi
2024-10-08 16:53:15 +00:00
PyTorch MergeBot
796c3c3415 Revert "Disallow FakeTensor.data_ptr access in eager mode (#137221)"
This reverts commit 7e13e7dd7e.

Reverted https://github.com/pytorch/pytorch/pull/137221 on behalf of https://github.com/jovianjaison due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/137221#issuecomment-2397957081))
2024-10-07 21:46:13 +00:00
Pian Pawakapan
f33ffd01f2 [export] fix joint graph metadata (#136011)
Differential Revision: D62652832

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136011
Approved by: https://github.com/tugsbayasgalan
2024-10-07 19:36:44 +00:00
James Wu
4db199f15f Implement Remote AOTAutogradCache (#137278)
Summary: Implement Remote AOTAutogradCache. It uses all the same tech as Remote FXGraphCache, just with its own name.

Test Plan:
Run benchmark:
TORCHINDUCTOR_AUTOGRAD_REMOTE_CACHE=1 TORCHINDUCTOR_FX_GRAPH_REMOTE_CACHE=1 TORCHINDUCTOR_AUTOGRAD_CACHE=0 TORCHINDUCTOR_FX_GRAPH_CACHE=0 TORCH_LOGS=+torch._functorch._aot_autograd.autograd_cache buck run mode/opt benchmarks/dynamo:torchbench -- --training --backend=inductor --only nanogpt --repeat 5 --performance --cold-start-latency

See that it cache hits even with local cache removed.

Results show up in remote cache logs https://fburl.com/scuba/pt2_remote_cache/5893dbaj

New unit tests

Reviewed By: oulgen

Differential Revision: D63323958

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137278
Approved by: https://github.com/oulgen
2024-10-07 15:38:54 +00:00
Avik Chaudhuri
6a6a8b17b8 handle state tensors in training ir path (#137240)
Summary: We had attribute assignment detection and handling of registered buffer assignments when using `aot_autograd`, but not when using just `make_fx`. Fixed.

Test Plan: expanded coverage of `test_state_tensors` to use `export` instead of `torch.export.export`

Differential Revision: D63802576

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137240
Approved by: https://github.com/tugsbayasgalan
2024-10-04 20:23:48 +00:00
rzou
f500cb43bb Fix torch.library.register_vmap (#137306)
We didn't support multiple levels of vmap. The main problem is, during
the batching rule, we need to exclude the vmap dispatch key
(FuncTorchBatched) like how our C++ batching rules do it.

Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137306
Approved by: https://github.com/Chillee
2024-10-04 03:46:35 +00:00
rzou
7e13e7dd7e Disallow FakeTensor.data_ptr access in eager mode (#137221)
Previously we raised a deprecation warning (beginning PyTorch 2.4). Now
that we are on 2.6, we're completing the deprecation and disallowing
this behavior.

Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137221
Approved by: https://github.com/albanD, https://github.com/eellison
2024-10-03 23:47:55 +00:00
James Wu
4d3c0fc061 [AOTAutogradCache] add config for AOTAutograd remote cache (#137011)
Summary: This just adds a config option and JK for turning on remote AOTAutogradCache. It does not implement anything with the new options being passed in. That will come next diff.

This PR also changes the command for turning on the local AOTAutogradCache to be more consistent to that of FXGraphCache: TORCHINDUCTOR_AUTOGRAD_CACHE

Test Plan: Existing tests should pass and should build

Reviewed By: oulgen

Differential Revision: D63321965

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137011
Approved by: https://github.com/oulgen
2024-10-03 16:03:47 +00:00
Brian Hirsh
bf73af4b4e dont let partitioner think it can fuse pointwise ops into user triton kernels (#136878)
Previously if we had a graph like:
```
        triton_kernel_wrapper_functional_proxy = triton_kernel_wrapper_functional(...)
        getitem: "f32[3][1]cuda:0" = triton_kernel_wrapper_functional_proxy['out_ptr']
        getitem_1: "f32[3][1]cuda:0" = triton_kernel_wrapper_functional_proxy['out2_ptr']
        sigmoid: "f32[3][1]cuda:0" = torch.ops.aten.sigmoid.default(getitem_1)
        mul: "f32[3][1]cuda:0" = torch.ops.aten.mul.Tensor(tangents_1, sigmoid)
```

The partitioner would assume that the `sigmoid()` could be fused into either its user (the pointwise mul), or its producer (the user triton kernel). This could lead to a bad partitioning:

(1) If the partitioner thinks we can fuse the sigmoid with its producer triton kernel, we would keep the sigmoid compute in the forward, and have to generate two separate kernels in the forward (user triton kernel, dedicated sigmoid kernel)

(2) if the partitioner puts the sigmoid in the backward instead, we could fuse it with an existing backward kernel (the mul with a tangent)

Reviewed By: embg

Differential Revision: D63551393

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136878
Approved by: https://github.com/zou3519
2024-10-02 13:52:44 +00:00
James Wu
dc8c0aaf4d [AOTAutogradCache] Log time taken_ns (#136529)
Summary:
This diff logs the time_taken_ns for the forward and backward graphs in AOTAutogradCache, saving it into the cache entry.

This information is helpful later when I remotify the cache, and also is just useful to have in tlparse and chromium events.

Test Plan: Run benchmark, see that the times are in the chromium events.

Reviewed By: aorenste

Differential Revision: D62590077

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136529
Approved by: https://github.com/oulgen
2024-09-27 16:14:09 +00:00
IvanKobzarev
34d788ffb0 [aotd] Do not force contiguous() for channels_last (#135225)
Original Issue: https://github.com/pytorch/pytorch/issues/134644

We assume trace_tangents to have the same memory_format as inputs, outputs, intermediate during first tracing.

=>
Tracing time:
- Store trace_tangents_memory_formats in metadata
- Coerce tangents to deduced memory_format

Runtime:
- Coerce tangents to tracing memory format from metadata

Subclasses logic:
 - Previously coercing tangents logic did not handle nested subclasses case, fixing this.

For Subclasses we deduce memory format for subclass_tensor first, then for each element of subclass:
[subclass_tensor_memory_format, subclass_tensor_elem0_memory_format, ... ]

If subclass element (__tensor_flatten__[0] tensors) is also subclass => on its place we will have a nested list of the same structure.

The recursive traversal of subclass tree is expensive. So we do memory format deduction and coercing at the same time, to keep only one traverse for this. With this approach there  is no regression in comparison with previous logic which also does one traversal. (`coerce_tangent_and_suggest_memory_format` method).

Other small change:
Remove duplicated not-related comment.

Testing

```
python test/functorch/test_aotdispatch.py -k test_channels_last_grads_no_force_contiguous
```

Benchmarking:
After change:
```
└─ $ PYTORCH_AOTD_DEBUG_PROFILE=1 python test/functorch/test_aotdispatch.py -k test_benchmark_grads_no_force_contiguous
Benchmark SUBCLASS avg_bwd_duration:4.059906005859375 ms
Benchmark NO_SUBCLASS avg_bwd_duration:3.1563830375671387 ms
```
Before change:
```
BEFORE_CHANGE SUBCLASS 4.1194
```

No siginificant changes in processing time.

(We do single traverse of subclass tree for collecting memory_formats and coercing during tracing.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135225
Approved by: https://github.com/bdhirsh
2024-09-27 15:01:20 +00:00
IvanKobzarev
9581508383 [aotd] Cleanup on subclasses in inductor freezing (#136549)
Cleanup:
1/ We do not need to unwrap_subclasses() in freezing wrapper, as it will be wrapped by AOTD wrappers which inclused SubclassesWrapper
2/ No need to use weakreferences for unwrapped list, dynamo optimizers need to clean unwrapped list along with original params_flat.
Verfified fbcode tests compiled_optimizers

Differential Revision: [D63393651](https://our.internmc.facebook.com/intern/diff/D63393651)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136549
Approved by: https://github.com/bdhirsh
2024-09-27 11:20:03 +00:00
IvanKobzarev
342c031f0e [aotd] Fix freezing API for subclasses (#136265)
Original issue:
https://github.com/pytorch/ao/issues/890

The problem:

TracingContext.flat_params contain original params, with not desugared Subclasses.
While inductor.freezing API works on aot graphs, which already desugared Subclasses.

flat_params are used only for this logic and storing in them desguared subclasses fixes the issue.

Testing:
```
python test/functorch/test_aotdispatch.py -k test_inductor_freezing_with_subclasses
```
Torch AO original failure:
```
python test/integration/test_integration.py -k test_int8_weight_only_quant_with_freeze
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136265
Approved by: https://github.com/bdhirsh
2024-09-24 13:15:01 +00:00
James Wu
4649aeaebf Make AOTAutogradCache support remote FXGraphCache (#136173)
Summary:
After the previous refactor, we can now call load_with_key directly from AOTAutogradCache to use the remote FXGraphCache.

This does *not* implement a remote AOTAutogradCache. It just allows AOTAutogradCache to work with remote FXGraphCache.

Test Plan: (Meta only tests)

Reviewed By: aorenste

Differential Revision: D62384944

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136173
Approved by: https://github.com/oulgen
2024-09-23 17:24:27 +00:00
PyTorch MergeBot
df6a8fa1eb Revert "[aotd] Fix freezing API for subclasses (#136265)"
This reverts commit cdef760560.

Reverted https://github.com/pytorch/pytorch/pull/136265 on behalf of https://github.com/atalman due to Breaks internal CI sorry, need to revert ([comment](https://github.com/pytorch/pytorch/pull/136265#issuecomment-2368772574))
2024-09-23 16:25:05 +00:00
IvanKobzarev
cdef760560 [aotd] Fix freezing API for subclasses (#136265)
Original issue:
https://github.com/pytorch/ao/issues/890

The problem:

TracingContext.flat_params contain original params, with not desugared Subclasses.
While inductor.freezing API works on aot graphs, which already desugared Subclasses.

flat_params are used only for this logic and storing in them desguared subclasses fixes the issue.

Testing:
```
python test/functorch/test_aotdispatch.py -k test_inductor_freezing_with_subclasses
```
Torch AO original failure:
```
python test/integration/test_integration.py -k test_int8_weight_only_quant_with_freeze
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136265
Approved by: https://github.com/bdhirsh
2024-09-20 16:32:49 +00:00
Will Feng
a815611db9 [Traceable FSDP2][Partitioner] Must save AC output if output has a backward hook (#135727)
If node is AC region output and has a backward hook on it, we intentionally choose to save it.
This is to work around circular dependencies in Traceable FSDP2+AC.
Example:
```
out = fully_shard(utils.checkpoint(module))(x)
norm_out = layer_norm(out)
```
and there is a circular dependency:
1. In backward, grad_input of layer_norm aka. `out_grad` is actually dependent on `out`.
2. `out` depends on `out`'s backward hook created by FSDP2 (which does all-gather for `module` weights) in order to be recomputed.
3. `out`'s FSDP2 backward hook, as is the case for all eager backward hooks, depends on `out_grad`  -> circular dependency with (1)!

Solution: check whether `out` has a backward hook, and if so, intentionally save `out` in forward graph outputs. With this, we can break the above circular dependency.

----

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135727
Approved by: https://github.com/Chillee
2024-09-14 08:45:58 +00:00
whywhy-rtx3090
7647c398ff Allow optional positional arguments for torch.func.functional_call (#134643)
This PR resolves #134408. Add an additional test and have passed the local test.

Do you think we should add a post-check to ensure `args` and `kwargs` are not both `None`? It seems to be possible to have modules without inputs.

This PR does not include any such post-check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134643
Approved by: https://github.com/zou3519
2024-09-12 15:22:06 +00:00
Will Feng
94d2471d1f [Traceable FSDP2] Use .copy_ instead of .set_ for unsharded_param inplace update; Replace unsharded_param graph input usage with graph intermediate; Support FSDP2+LoRA (#133730)
Using `fsdp.set_` for unsharded_param inplace update causes difficult-to-debug errors when enabling Traceable FSDP2 on TorchTune models. In this PR, we change it to use `fsdp.copy_` which fixes the error and also strictly follows eager semantics (i.e. if user explictly stores an alias of the unsharded_param during execution of the user's module code, that alias will get updated correctly when the unsharded_param is copy_ into; whereas if we just swap out unsharded_param storage via set_, that user-saved alias will not get updated, which is not good).

This PR also implements the graph pass to remove the resizes and copy if there is a resize_(full) -> copy_ -> resize_(0) pattern.

------

Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_trace_fsdp_copy_`
- `pytest -rA test/dynamo/test_repros.py::ReproTests::test_partitioner_cse_respects_mutation_boundaries`
- `pytest -rA test/dynamo/test_repros.py::ReproTests::test_fsdp_set_input_mutation_applied_when_input_gets_no_gradients`
- `pytest -rA test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mutation_op_matching`
- `python test/inductor/test_distributed_patterns.py DistributedPatternTests.test_fake_distributed_aot_eager`
- `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=1 PYTORCH_TEST_WITH_CROSSREF=1 python test/functorch/test_aotdispatch.py TestEagerFusionOpInfoCPU.test_aot_autograd_exhaustive_norm_cpu_float32`
- `python test/distributed/test_inductor_collectives.py TestCollectivesInductor.test_backwards`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133730
Approved by: https://github.com/bdhirsh
2024-09-11 23:01:05 +00:00
Will Feng
84ae6b7d6b AOTDispatcher: limit cases when we detach() graph inputs to non-leaves (#134193)
This PR is slightly a revival / update to the discussion from https://github.com/pytorch/pytorch/pull/98960:

Part of FSDP2's tracing strategy right now is that:

(1) it is painful/difficult to handle the case where we have multiple graph input tensors that are aliased to each other and at least one of them is duplicated

(2) we already have longstanding in logic to remove duplicate input tensors from the graph in dynamo. Morally, FSDP2 gives us duplicate input tensors in the backward graph for every `unsharded_param`, because we have (a) the `unsharded_param` being closed over by the backward hook to resize/allgather, and (b) the same `unsharded_param` being saved for backward by autograd (we now guarantee in the partitioner that we will always save the base tensor for backward and recompute views)

(3) However, we were still seeing cases where the `unsharded_param` showed up twice in the backward graph inputs, as distinct tensor objects (with different python ids) instead of being true duplicates that dynamo can de-dup.

It turns on that this was because we were `.detach()`ing the `unsharded_param` in AOTDispatcher before plumbing it through the compiled forward (and so autograd would save a detach'd version of the `unsharded_param`). This is precisely because of the logic from https://github.com/pytorch/pytorch/pull/98960.

However, re-reading the detailed comments, it seems unnecessary to do a detach() on a graph input that is a (leaf) `nn.Parameter`, even if it happens to get no gradients in the backward. Since it is a leaf, we don't have to worry about the autograd engine "continuing to backprop through the graph beyond the current tensor" (the leaf has no other grad_fn for autograd to backprop through).

So this PR makes us a bit less aggressive about calling detach() on inputs: we only do it when:

(1) our graph input statically will get a `None` gradient (and also has no metadata mutations, the existing state)

(2) **and** our graph input is a non-leaf tensor (so detach()ing is actually required to prevent autograd from incorrectly backpropping past the non-leaf.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134193
Approved by: https://github.com/yf225

Co-authored-by: Will Feng <yf225@cornell.edu>
2024-09-06 14:06:48 +00:00
Edward Z. Yang
d0591f4658 Ignore fresh unbacked when doing recursive make_fx inside HOPs (#135053)
Internal xref: https://fb.workplace.com/groups/6829516587176185/posts/7705964779531357/

This now also incorporates a test from https://github.com/pytorch/pytorch/pull/133585 (which it fixes) and the prep PR https://github.com/pytorch/pytorch/pull/134407 Including the PR desc from that:

I am trying to fix a problem reported by user in [fb.workplace.com/groups/6829516587176185/permalink/7705964779531357](https://fb.workplace.com/groups/6829516587176185/permalink/7705964779531357/) The summary of this problem is that when we do collect metadata analysis in AOTAutograd, we accumulate pending unbacked symbols which are going to be discarded at the end of the trace. However, if we do a recursive make_fx inside tracing, as occurs with torch.cond, we end up seeing that there are pending unbacked symbols that aren't associated with a binding, even though it's spurious (they've leaked into the inner make_fx call from the outer AOTAutograd analysis).

In https://github.com/pytorch/pytorch/pull/133588 I tried to just prevent adding the symbols to the pending list at all in the first place. But this itself caused some problems which were fixed in https://github.com/pytorch/pytorch/pull/124785 . The problem fixed in that PR is that when we allocate tangents that have unbacked size, something prevented them from having correct unbacked SymInts when ignore fresh unbacked SymInts was enabled. So I had patched it at the time by just not suppressing pending symbols and clearing them out some other way.

I think... I was wrong in that PR? That is to say, it was OK to avoid putting the fresh unbacked symbols in the pending list; the real problem was suppressing unbacked renamings. But there doesn't seem to be a good reason to suppress these; this PR shows that it doesn't actually fail any tests if you do these anyway. Intuitively, this makes sense, because you can't trigger renamings unless you're actually adding unbacked symbols to the pending set.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135053
Approved by: https://github.com/ydwu4
2024-09-06 13:13:15 +00:00
Bob Ren
30b98940b8 Fix typo in comment (#135111)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135111
Approved by: https://github.com/aorenste, https://github.com/oulgen
2024-09-05 01:39:04 +00:00
Benjamin Glass
43c9b4e0e6 Fix unintentional deduplication of returned tensors (#134726)
When CSE was used, returned tensors that had gone through identical
processing steps but were distinct from a data perspective were pruned
out of the graph.  This commit protects tensors which are directly
output from being pruned, and adds a test for this behavior.

Closes #88813 and #114344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134726
Approved by: https://github.com/amjames, https://github.com/zou3519, https://github.com/bdhirsh
2024-09-04 23:42:56 +00:00
PyTorch MergeBot
fc07e6bf56 Revert "Ignore fresh unbacked when doing recursive make_fx inside HOPs (#135053)"
This reverts commit a178a053ad.

Reverted https://github.com/pytorch/pytorch/pull/135053 on behalf of https://github.com/ezyang due to need to back out https://github.com/pytorch/pytorch/pull/133585 ([comment](https://github.com/pytorch/pytorch/pull/134407#issuecomment-2329597388))
2024-09-04 17:18:21 +00:00
Laith Sakka
c8ab9b06a2 Redesign custom op functionlaization for better re-inplace (#134409)
- The new implementation (auto_functionalized_v2) is enabled by default but can be disable
 using an inductor flag.
- In export mode the old implementation is used.

**Motiviation**
Previous functionalization fails to re-inplace arguments when they are view over other tensors.
see issue https://github.com/pytorch/pytorch/issues/131192
The new functionalization is easier to re-inplace for views.

**A) Functionalizations pass**
consider a program:

```

func(t)
    x = t[0]
    y = t[1]
    foo(x, y) # custom operator with x, y mutable
    return (x, y, t)
```

- To functionalize `foo` we generate a function that operates on the base tensors of the inputs;  (x.base() and y.base())
and record how to regenerates the views out of the base for argument x by recording ```ViewInfo=(x.base(), x.size(), x.stride, x,storage_offset())```

- Due to some limitations on the torch.export arguments format, we have to generate alot of arguments, but this is something we can simplify in the future, for the example above we get the following function.

   ```
   auto_functionalized = torch.ops.higher_order.auto_functionalized(torch.ops.mylib.foo.default,
     _x_base_index = 0, _x_size = (), _x_stride = (), _x_storage_offset = 0 ,
     _y_base_index = 0,_y_size = (), _y_stride = (), _y_storage_offset = 1   ,
     _all_bases = [arg0_1])
   ```
 -  In the code above:
        - _all_bases[t]: refers to a unique set of bases for all foo arguments.
        - for each argument x we have _x_base_index, _x_size, _x_stride, _x_storage_offset that can be used to (1)  regenerate x from _all_bases[_x_base_index] or a copy of a the base.

-  the output of auto_functionalized is foo output , followed by x tensors one for each base in  _all_bases, that is a copy of the base tensor after observing the mutations of the all the arguments that are views of that base.

-  for each use of a base in _all_bases or a view of it , that are after the call to foo, replace it with a view of the new output

 for the function above after functionalization we get :
 ```
    def forward(self, arg0_1: "f32[2][1]cpu"):
        auto_functionalized = torch.ops.higher_order.auto_functionalized(torch.ops.mylib.foo.default, _x_base_index = 0, _x_size = (), _x_stride = (), _x_storage_offset = 0, _y_base_index = 0, _y_size = (), _y_stride = (), _y_storage_offset = 1, _all_bases = [arg0_1])
        getitem_1: "f32[2][1]cpu" = auto_functionalized[1];  auto_functionalized = None
        copy_: "f32[2][1]cpu" = torch.ops.aten.copy_.default(arg0_1, getitem_1);  arg0_1 = copy_ = None

        # No stacktrace found for following nodes
        select_2: "f32[][]cpu" = torch.ops.aten.select.int(getitem_1, 0, 0)
        select_3: "f32[][]cpu" = torch.ops.aten.select.int(getitem_1, 0, 1);  getitem_1 = None
        return (select_2, select_3)
```

**B) Semantics of  auto_functionalize**
The new semantics of auto_functionalize is as the following:
1. For each base in all_bases, copy the base and create all_bases copies. (if a base is inplaced we do not need to copy it)
2. For each arg, regenerate the arg from the copy of its base using the view information above.
3. return the original foo output followed by the new bases.

**C) Re-inplace pass**
since auto_functionalize not copy the bases, what we actually inplace is the bases.
 (run just like before but on the beses instead of args).

1. For each base b in _all_bases check if there is any use of base (or its aliases/views) after auto_functionalize (before its overwritten with a copy) if there is not any, then inplace it (avoid copying it in step 1 above).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134409
Approved by: https://github.com/zou3519
2024-09-04 17:08:58 +00:00
Edward Z. Yang
a178a053ad Ignore fresh unbacked when doing recursive make_fx inside HOPs (#135053)
Internal xref: https://fb.workplace.com/groups/6829516587176185/posts/7705964779531357/

I'm not sure this is the right approach though...

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135053
Approved by: https://github.com/ydwu4
ghstack dependencies: #134407
2024-09-04 13:25:08 +00:00
Aaron Orenstein
7239b8a4f1 Clean up RemoteCache classes (#134032)
Summary:
The existing RemoteCacheBackend classes were a bit haphazard - some of them accepted bytes only, some accepted objects, some returned different types of objects than were passed in.

Update them to be more consistent:

1. RemoteCacheBackend is an implementation of a backend: Redis, Memcache, Manifold, LocalFile

2. RemoteCacheSerde is an implementation of a serde protocol - to turn structured objects (dict, list, etc) into bytes: RemoteCacheJsonSerde (json encoding), RemoteCachePassthroughSerde (strictly bytes only)

3. RemoteCache is the cache implementation itself, mixing a RemoteCacheBackend along with an RemoteCacheSerde to provide structured caching.

Other than simply reorganizing the existing cache code this also fixes the Redis autotune caching for OSS.

Test Plan: unit tests

Reviewed By: oulgen

Differential Revision: D61178859

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134032
Approved by: https://github.com/oulgen, https://github.com/bhack
2024-08-31 20:18:59 +00:00
rzou
092349dcdd Never CSE aten.empty in the partitioner (#134703)
aten.empty is almost always fusible into its consumer, so we never CSE
it. This fixes a bug that looks like the following:

```py
@torch.library.custom_op("_reinplacing::sin_cos", mutates_args={"out_sin", "out_cos"})
def sin_cos(x: torch.Tensor, out_sin: torch.Tensor, out_cos: torch.Tensor) -> None:
    out_sin.copy_(x.sin())
    out_cos.copy_(x.cos())

@torch.compile
def f(x):
    out0 = torch.empty_like(x)
    out1 = torch.empty_like(x)
    sin_cos(x, out0, out1)
    return x.clone(), out0, out1

x = torch.randn(3, requires_grad=True)
f(x)
```

- cse would de-duplicate the empty nodes
- reinplacing would add an additional clone (because it can't write to
  both tensors at the same time)
- the clone lowers into a new buffer + a copy_ kernel
- the copy_ kernel is unnecessary because "empty" is special - all reinplacing needed was an additional
  buffer, it doesn't matter what the values are.

We could attempt to fix this on the reinplacing side but this seemed
better as a partitioner heuristic and the reinplacing fix is a bit more
tricky (we'd need to identify that the op never reads from the empty
node).

Test Plan:
- new test (the old number was 27, the new number is 21, so this PR
  helped).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134703
Approved by: https://github.com/yf225
ghstack dependencies: #134466, #134490, #134491
2024-08-29 13:51:19 +00:00
Yanbo Liang
8693322ef0 [Dynamo][autograd.Function] Support mark_non_differentiable (#134087)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134087
Approved by: https://github.com/zou3519
2024-08-28 08:12:37 +00:00
rzou
c582602245 Update partitioner's is_fusible heuristic to respect triton kernels (#134491)
mutated arguments to triton kernels are fusible into the triton kernel.

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134491
Approved by: https://github.com/Chillee
ghstack dependencies: #134364, #134466, #134490
2024-08-27 15:57:32 +00:00
rzou
c7cbcdad76 Update partitioner's is_fusible heuristic to respect auto_functionalized (#134490)
We say Node a is fusible into node b if node b is an auto_functionalized
node that may reinplace node a later on.

This PR also changes aten.empty to be recomputable w.r.t the Partitioner
(it is, like aten.zeros, cheap to recompute and fusible into other ops).

Fixes https://github.com/pytorch/pytorch/issues/134468

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134490
Approved by: https://github.com/Chillee
ghstack dependencies: #134364, #134466
2024-08-27 13:05:01 +00:00
IvanKobzarev
8ae4f82243 [aotd] Support HOP effects in backward (#132638)
Support of effectful operations in backward:

1/ AOTD collects metadata from forward fn only, so we can have usage of effectful ops in backward, that were not used in forward => Allowing tokens discovery during joint function .

FunctionalTensorMode holds _tokens, in Joint function after tracing forward we memoize _tokens as `_tokens_forward_output`.

2/ Tokens are added as primals inputs (forward) in EffectTokensWrapper.
Tokens that will be used in backward are in partitioner saved values. We do not have control on which positions they are saved in forward outputs.

2/ If new tokens discovered in backward after tracing joint_fn, the result graph will be manually added in the end of primals.
_aot_autograd/utils.py

3/ All effectful ops during backward are marked with 'must_be_in_backward' partitioner_tag, to prevent partiitoner to place them in forward.

For that functional_tensor_mode got new optional state `self._effects_partitioner_tag` for effectful ops, to set after tracing forward.

There are additional changes in partitioner to improve functionality of 'must_be_in_backward'

4/ Unlift tokens now should run for both forward and backward.
- As saved for backward tokens are placed on non static places - we identify input and output tokens to erase, by input and output of `with_effects` operation
- In forward we can have input tokens, discovered in backward, that are not used in with_effects ops in forward, but saved for backward. We identify them by position in forward inputs.

5/ Adding aot debug logging for graphs before unlifting and before adding additional primal for backward tokens.

Tests:
```
python test/higher_order_ops/test_with_effects.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132638
Approved by: https://github.com/bdhirsh
2024-08-23 15:30:58 +00:00
Aaron Orenstein
d95aedf5fd [BE] typing for decorators - fx/_compatibility (part 1) (#134202)
Part of #134054.

This corresponds to the pytorch mypy changes from D61493706. Updating takes so
long and touches so many files that it's impossible to land as a whole without conflicting with some other intermediate change.
So landing these 'type: ignore' for pytorch in advance of them actually being needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134202
Approved by: https://github.com/Skylion007
2024-08-22 17:07:33 +00:00
James Wu
3c5485fb7f [Retry] Log chromium events to scuba (#134118)
Summary:
This diff implements a bunch of views for internal scuba viewing.

TODOS that I might punt to another diff:
- Saving cache stats via counter is definitely sus here, but there's not really a good way to track "fx graph cache hit for this compile phase" right now. Will think about this more.
- We should definitely log frame id, compile id, etc
- We should definitely be logging configs. That way, we can A/B test based on whether a config is turned on.
- idk what I'm doing with compile_uuid yet, but it's useful when you want to look at samples for a single run. I think if we had mast job info this field is not needed, but it's nice to be able to drill down to a single run and get its chrome trace view or icicle view, so idk

Test Plan:
All of the above views are run with nanogpt benchmark:

```
buck run mode/opt caffe2/benchmarks/dynamo:torchbench -- --training --backend=inductor --only nanogpt --performance
```

Differential Revision: D61603243

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134118
Approved by: https://github.com/oulgen
2024-08-22 14:59:45 +00:00
IvanKobzarev
57625bacea [partitioner] Fix must_be_in_backward corner cases (#134002)
Preparation PR for https://github.com/pytorch/pytorch/pull/132638

"must_be_in_backward" fails the partitioner, if partitioner picks this node as saved_values.

The fix is to prevent partitioner to pick those nodes during nodes classification.

It's hard to make a test without making effectful ops in backward "must_be_in_backward", which will be testing this ( https://github.com/pytorch/pytorch/pull/132638 )

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134002
Approved by: https://github.com/bdhirsh
ghstack dependencies: #134003
2024-08-21 15:58:49 +00:00
JackCaoG
27dfd63ee8 remove unnecessary slicing in EffectTokensWrapper (#133737)
In the cases that `outs ` is a tensor, `[0:]` will cause a nadditional slicing ops that's unnecessary and failed some of XLA's unit test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133737
Approved by: https://github.com/IvanKobzarev
2024-08-18 05:52:48 +00:00
James Wu
0bde3c4f2f Run cudagraphs on AOTAutograd cache hit (#132294)
This threads through all of the necessary parts into aot autograd from the FXGraphCache changes so that we can run cudagraphs properly on a AOTAutograd cache hit.

Specifics:
- AOTAutograd needs access to the `cudagraphs` boxedbool in order to properly set the backward to not use cudagraphs on a cache hit from the forward.
- We have lots of tests that test this already from the previous PR, so I just added an extra test and made the previous test work with both AOTAutogradCache and FXGraphCache at the same time.

```
TORCH_LOGS=torch._functorch._aot_autograd.autograd_cache,cudagraphs ENABLE_AOT_AUTOGRAD_CACHE=1 TORCHINDUCTOR_FX_GRAPH_CACHE=1 tlp python benchmarks/gpt_fast/benchmark.py --output ~/gpt_fast_benchmark.csv
```
Twice, once on cache miss and once and cache hit.

Here is the perfetto trace for each(FB only link):

**Cache Miss:**
Logs:
```
Loading model Llama-2-7b-chat-hf
Time to load model: 0.66 seconds
I0813 10:53:34.416000 911030 torch/_functorch/_aot_autograd/autograd_cache.py:479] [0/0] AOTAutograd cache miss for key alqchc7zw6ynsxj2bzktcsngu4cajwcb3tmhvwlyqkuinx3zhmey
I0813 10:53:51.395000 911030 torch/_functorch/_aot_autograd/autograd_cache.py:558] [0/0] Writing AOTAutograd cache entry to /tmp/torchinductor_jjwu/aotautograd/alqchc7zw6ynsxj2bzktcsngu4cajwcb3tmhvwlyqkuinx3zhmey/entry
I0813 10:54:17.579000 911030 torch/_functorch/_aot_autograd/autograd_cache.py:479] [1/0] AOTAutograd cache miss for key a3nq2ywjxku342c6ag7rsqkalnxfshlcgve3tb2bigg7a45uz6pt
I0813 10:54:38.636000 911030 torch/_functorch/_aot_autograd/autograd_cache.py:558] [1/0] Writing AOTAutograd cache entry to /tmp/torchinductor_jjwu/aotautograd/a3nq2ywjxku342c6ag7rsqkalnxfshlcgve3tb2bigg7a45uz6pt/entry
I0813 10:54:39.228000 911030 torch/_inductor/cudagraph_trees.py:385] [__cudagraphs] recording cudagraph tree for graph without symints
V0813 10:54:39.939000 911030 torch/_inductor/cudagraph_trees.py:2160] [__cudagraphs] Running warmup of function 0
V0813 10:55:10.615000 911030 torch/_inductor/cudagraph_trees.py:2119] [__cudagraphs] Recording function 0 of graph recording id 0
Compilation time: 101.24 seconds
Average tokens/sec: 147.96 tokens/sec
Average bandwidth achieved: 1955.22 GB/s
Memory used: 14.51 GB
```

Chromium Event(fb only):
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fchromium_events.json#!/viewer?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fchromium_events.json&local_cache_key

![image](https://github.com/user-attachments/assets/47fdd77e-3cc1-437e-8e68-7901646269bb)

**Cache Hit:**
Logs:
```
Loading model Llama-2-7b-chat-hf
Time to load model: 0.67 seconds
I0813 10:55:51.821000 944420 torch/_functorch/_aot_autograd/autograd_cache.py:474] [0/0] AOTAutograd cache hit for key alqchc7zw6ynsxj2bzktcsngu4cajwcb3tmhvwlyqkuinx3zhmey
I0813 10:55:55.465000 944420 torch/_functorch/_aot_autograd/autograd_cache.py:474] [1/0] AOTAutograd cache hit for key a3nq2ywjxku342c6ag7rsqkalnxfshlcgve3tb2bigg7a45uz6pt
I0813 10:55:56.030000 944420 torch/_inductor/cudagraph_trees.py:385] [__cudagraphs] recording cudagraph tree for graph without symints
V0813 10:55:56.192000 944420 torch/_inductor/cudagraph_trees.py:2160] [__cudagraphs] Running warmup of function 0
V0813 10:55:56.426000 944420 torch/_inductor/cudagraph_trees.py:2119] [__cudagraphs] Recording function 0 of graph recording id 0
Compilation time: 9.40 seconds
Average tokens/sec: 147.94 tokens/sec
Average bandwidth achieved: 1954.98 GB/s
Memory used: 14.51 GB
```
Chromium Event(fb only):
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom2%2Fchromium_events.json#!/viewer?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom2%2Fchromium_events.json&local_cache_key

![image](https://github.com/user-attachments/assets/9bdd14ec-d12a-4c89-8705-135c999ac746)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132294
Approved by: https://github.com/eellison
2024-08-17 21:24:54 +00:00
Simon Fan
0a6cc15079 [compiled autograd] use same graph node names as AOTDispatcher (#133148)
FIXES https://github.com/pytorch/pytorch/issues/132939

Compiled autograd's trace of the AOT backward may result in some additional ops e.g. clone to make contiguous, trace_wrapped HOPs, so the graphs may be slightly offset from each other

hf_Whisper example: https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpNv89Pu/index.html
fsdp2 example: https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpPdKssS/rank_0/index.html
Unit test example: https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpvoQsnl/index.html
```python
 ===== Compiled autograd graph =====
 <eval_with_key>.14 class CompiledAutograd(torch.nn.Module):
    def forward(self, inputs, sizes, scalars, hooks):
        # No stacktrace found for following nodes
        getitem: "f32[]cpu" = inputs[0]
        aot1_primals_1: "f32[4]cpu" = inputs[1]
        aot1_primals_2: "f32[4]cpu" = inputs[2]
        aot0_sin: "f32[4]cpu" = inputs[3]
        aot0_cos: "f32[4]cpu" = inputs[4]
        getitem_5: "f32[4]cpu" = inputs[5];  inputs = None

         # File: /data/users/xmfan/a/pytorch/torch/_dynamo/compiled_autograd.py:444 in set_node_origin, code: SumBackward0 (NodeCall 1)
        expand: "f32[4]cpu" = torch.ops.aten.expand.default(getitem, [4]);  getitem = None

         # File: /data/users/xmfan/a/pytorch/torch/_dynamo/compiled_autograd.py:444 in set_node_origin, code: CompiledFunctionBackward1 (NodeCall 2)
        aot1_tangents_1: "f32[4]cpu" = torch.ops.aten.clone.default(expand, memory_format = torch.contiguous_format);  expand = None
        aot1_sin_1: "f32[4]cpu" = torch.ops.aten.sin.default(aot1_primals_2);  aot1_primals_2 = None
        aot1_neg: "f32[4]cpu" = torch.ops.aten.neg.default(aot1_sin_1);  aot1_sin_1 = None
        aot0_tangents_2: "f32[4]cpu" = torch.ops.aten.mul.Tensor(aot1_tangents_1, aot1_neg);  aot1_neg = None
        aot1_cos_1: "f32[4]cpu" = torch.ops.aten.cos.default(aot1_primals_1);  aot1_primals_1 = None
        aot0_tangents_1: "f32[4]cpu" = torch.ops.aten.mul.Tensor(aot1_tangents_1, aot1_cos_1);  aot1_tangents_1 = aot1_cos_1 = None

         # File: /data/users/xmfan/a/pytorch/torch/_dynamo/compiled_autograd.py:444 in set_node_origin, code: CompiledFunctionBackward0 (NodeCall 3)
        aot0_neg: "f32[4]cpu" = torch.ops.aten.neg.default(aot0_sin);  aot0_sin = None
        aot0_mul: "f32[4]cpu" = torch.ops.aten.mul.Tensor(aot0_tangents_2, aot0_neg);  aot0_tangents_2 = aot0_neg = None
        aot0_mul_1: "f32[4]cpu" = torch.ops.aten.mul.Tensor(aot0_tangents_1, aot0_cos);  aot0_tangents_1 = aot0_cos = None
        aot0_add: "f32[4]cpu" = torch.ops.aten.add.Tensor(aot0_mul, aot0_mul_1);  aot0_mul = aot0_mul_1 = None

         # File: /data/users/xmfan/a/pytorch/torch/_dynamo/compiled_autograd.py:444 in set_node_origin, code: torch::autograd::AccumulateGrad (NodeCall 4)
        accumulate_grad_ = torch.ops.inductor.accumulate_grad_.default(getitem_5, aot0_add);  getitem_5 = aot0_add = accumulate_grad_ = None
        _exec_final_callbacks_stub = torch__dynamo_external_utils__exec_final_callbacks_stub();  _exec_final_callbacks_stub = None
        return []
```

where aot1 is
```python
class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "f32[4][1]cpu", primals_2: "f32[4][1]cpu", tangents_1: "f32[4][1]cpu"):
         # File: /data/users/xmfan/a/pytorch/test/inductor/test_compiled_autograd.py:2233 in torch_dynamo_resume_in_f_at_2232, code: return tmp1.sin() + tmp2.cos()
        sin_1: "f32[4][1]cpu" = torch.ops.aten.sin.default(primals_2);  primals_2 = None
        neg: "f32[4][1]cpu" = torch.ops.aten.neg.default(sin_1);  sin_1 = None
        mul: "f32[4][1]cpu" = torch.ops.aten.mul.Tensor(tangents_1, neg);  neg = None
        cos_1: "f32[4][1]cpu" = torch.ops.aten.cos.default(primals_1);  primals_1 = None
        mul_1: "f32[4][1]cpu" = torch.ops.aten.mul.Tensor(tangents_1, cos_1);  tangents_1 = cos_1 = None
        return (mul_1, mul)
```

and aot0 is
```python
class GraphModule(torch.nn.Module):
    def forward(self, sin: "f32[4][1]cpu", cos: "f32[4][1]cpu", tangents_1: "f32[4][1]cpu", tangents_2: "f32[4][1]cpu"):
         # File: /data/users/xmfan/a/pytorch/test/inductor/test_compiled_autograd.py:2231 in f, code: tmp2 = x.cos()
        neg: "f32[4][1]cpu" = torch.ops.aten.neg.default(sin);  sin = None
        mul: "f32[4][1]cpu" = torch.ops.aten.mul.Tensor(tangents_2, neg);  tangents_2 = neg = None

         # File: /data/users/xmfan/a/pytorch/test/inductor/test_compiled_autograd.py:2230 in f, code: tmp1 = x.sin()
        mul_1: "f32[4][1]cpu" = torch.ops.aten.mul.Tensor(tangents_1, cos);  tangents_1 = cos = None

         # File: /data/users/xmfan/a/pytorch/test/inductor/test_compiled_autograd.py:2230 in f, code: tmp1 = x.sin()
        add: "f32[4][1]cpu" = torch.ops.aten.add.Tensor(mul, mul_1);  mul = mul_1 = None
        return (add,)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133148
Approved by: https://github.com/jansel
ghstack dependencies: #133115
2024-08-17 00:46:52 +00:00