Summary:
Context:
Recently we've added support for a couple more kernel types besides Inductor-generated Triton kernels,
such as CPU C++ kernels and extern kernels.
The name that appears in the tlparse Chrome link can be confusing to users.
Rename from
`inductor_triton_kernel_to_post_grad_nodes.json`
to `inductor_generated_kernel_to_post_grad_nodes.json`
Test Plan: CI
Differential Revision: D75159042
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154046
Approved by: https://github.com/yushangdi
https://github.com/pytorch/pytorch/pull/152708 expanded support of `get_estimated_runtime` to many more types of `SchedulerNodes`. This caused an increase in compile time because we're always calling `get_estimated_runtime` to populate the metrics table. This PR adds a flag for this logging, which reduces the instruction count by 8%. Long term, we should probably merge metrics.py with TORCH_LOGS/tlparse (suggestion from @xmfan).
Update: added support for TORCH_LOGS for the metrics logging.
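For illustration, a minimal, generic sketch of the gating pattern described above; the flag name below is a placeholder, not the actual config option this PR introduces.
```
# Placeholder flag standing in for the new config option; not the PR's real name.
LOG_SCHEDULER_METRICS = False

def maybe_record_runtime(node, metrics_table):
    # Only pay for get_estimated_runtime() when metric logging is explicitly enabled.
    if LOG_SCHEDULER_METRICS:
        metrics_table.append(("estimated_runtime", node.get_estimated_runtime()))
```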
Test Plan:
mm_loop.py and many existing tests cover this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153506
Approved by: https://github.com/eellison
Change logging.error to logging.exception to log additional information when relevant. A few uses of logging.error inside try/except blocks have slipped in since I last did a cleanup here, and the lint rule has stabilized, so I am enabling it codebase-wide. I have NOQA'd much of our custom exception stack-trace handling for RPC calls and distributed, and tried to fix a few call sites based on whether we immediately re-raised the exception or whether we failed to print exception information where it could be useful.
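For illustration, a minimal example (not taken from the PR) of the difference: `logging.exception` records the active exception's traceback, while `logging.error` only records the message unless `exc_info=True` is passed.
```
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)

try:
    1 / 0
except ZeroDivisionError:
    log.error("division failed")      # message only, traceback is lost
    log.exception("division failed")  # message plus the full traceback
```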
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153473
Approved by: https://github.com/albanD, https://github.com/cyyever
Teach the graph printer to allow overriding how SymTypes (`SymInt`, `SymFloat`, `SymBool`) are printed, then use that to reuse the fast SymNode printing from `torch._inductor.utils.sympy_str()` and make computing the cache key faster.
On my computer the repro from #151823 goes from 480s -> 80s (still terrible... but better).
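As a rough illustration of why a specialized stringifier helps (this is a sketch, not `sympy_str()`'s actual implementation): a flat, non-pretty printer for the common sympy node types avoids most of the work `str()` does, which matters when the string is only used as a cache key.
```
import sympy

def flat_sympy_str(expr: sympy.Expr) -> str:
    # Cheap, non-pretty stringification for the common node types; fall back to str().
    if isinstance(expr, sympy.Symbol):
        return expr.name
    if isinstance(expr, sympy.Integer):
        return str(int(expr))
    if isinstance(expr, sympy.Add):
        return " + ".join(flat_sympy_str(a) for a in expr.args)
    if isinstance(expr, sympy.Mul):
        return "*".join(flat_sympy_str(a) for a in expr.args)
    return str(expr)

s0, s1 = sympy.symbols("s0 s1")
print(flat_sympy_str(4 * s0 + 2 * s1 + 1))  # argument order follows sympy's internal ordering
```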
Fixes #151823
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151928
Approved by: https://github.com/laithsakka
When we use mark_unbacked, the graph will have unbacked input SymInts. Right now,
deferred runtime assertions that use those symbols are never generated.
This PR changes that: in the forward graph we now consider those symbols and generate the corresponding
runtime assertions for them. We still ignore them for the backward graph, which is not ideal.
We generate runtime assertions by emitting each one once all of the unbacked symbols it uses
have been seen.
We previously skipped placeholders because, for the backward graph, we use a wacky approach where we
ignore input-defined unbacked symbols, assume that assertions using them were already emitted
in the forward graph, and try to emit all other runtime assertions again; see [Note [Backwards runtime asserts]].
Doing that, we end up emitting only the runtime assertions that depend on things defined solely in the backward graph, but we could miss checks that span inputs defined in both the forward and backward graphs (i.e. one symbol defined in the forward graph and passed as an input to the backward graph, and another defined in the backward graph). This is not ideal; a better approach could be something like https://github.com/pytorch/pytorch/pull/151919, but it requires more work.
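A hedged sketch of the kind of program this affects (assuming the usual `mark_unbacked` entry point under `torch._dynamo`): the `torch._check` below references an unbacked input SymInt, and with this PR the corresponding runtime assertion is emitted in the forward graph rather than dropped.
```
import torch
import torch._dynamo

def f(x):
    # This check uses the unbacked input SymInt for dim 0; the deferred runtime
    # assertion for it should now be generated in the forward graph.
    torch._check(x.shape[0] >= 2)
    return x * 2

x = torch.randn(8)
torch._dynamo.mark_unbacked(x, 0)  # treat dim 0 as an unbacked symbol (e.g. u0)
compiled = torch.compile(f, fullgraph=True)
print(compiled(x))
```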
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152231
Approved by: https://github.com/aorenste
When a CompiledFxGraph is deallocated, its cudagraphified fn (i.e., `current_callable`) is expected to be deallocated as well.
Without graph partition, this holds, since the cudagraphified fn is only referenced by compiled_fx_graph.current_callable.
However, with graph partition, runner.partitions holds the cudagraphified fns while compiled_fx_graph.current_callable holds runner.call. Thus the cudagraphified fns may not be deallocated when the CompiledFxGraph is deallocated. This leads to errors in several unit tests (e.g., test_unaligned_static_input_no_cudagraphs and test_unaligned_static_input_non_trees).
In this PR, we also clean up runner.partitions when the CompiledFxGraph is deallocated, which fixes the issue.
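A generic, simplified sketch of the lifetime issue (not Inductor's actual classes): because the runner holds the partition callables, dropping the graph object alone does not free them; clearing `runner.partitions` in the graph's deallocation path does.
```
import weakref

class Runner:
    def __init__(self, partitions):
        self.partitions = partitions           # holds the per-partition callables
    def call(self, *args):
        return [fn(*args) for fn in self.partitions]

class CompiledGraph:
    def __init__(self, runner):
        self.runner = runner
        self.current_callable = runner.call    # the graph only references runner.call
    def __del__(self):
        # The fix sketched here: release the partition callables together with the graph.
        self.runner.partitions.clear()

def partition_fn(x):                            # stand-in for a cudagraphified fn
    return x + 1

ref = weakref.ref(partition_fn)
graph = CompiledGraph(Runner([partition_fn]))
del partition_fn, graph
print(ref())  # None (with CPython refcounting): the callable was freed with the graph
```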
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152066
Approved by: https://github.com/eellison
standalone_compile needs to get dynamic shape information from
somewhere. We add a new `dynamic_shapes` argument with three options:
1. From the passed-in graph (`dynamic_shapes="from_graph"`). This is the default.
2. From the example inputs, thereby specializing on them (`dynamic_shapes="from_example_inputs"`).
3. From the current tracing context (`dynamic_shapes="from_tracing_context"`).
Options 1 and 3 are not exactly the same. Option 2 can also be used for more advanced
things, e.g. specializing on one input but not the other.
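A hedged sketch of how the three modes could be exercised from inside a custom `torch.compile` backend; the option strings come from this PR, while the import path and surrounding scaffolding are assumptions.
```
import torch
from torch._inductor import standalone_compile  # assumed import path

def backend(gm, example_inputs):
    # 1 (default): dynamic-shape information is read from the symbolic shapes in `gm`.
    artifact = standalone_compile(gm, example_inputs, dynamic_shapes="from_graph")
    # 2 would instead specialize on the concrete example inputs:
    #     standalone_compile(gm, example_inputs, dynamic_shapes="from_example_inputs")
    # 3 would reuse the ambient tracing context set up by torch.compile:
    #     standalone_compile(gm, example_inputs, dynamic_shapes="from_tracing_context")
    return gm.forward

fn = torch.compile(lambda x: torch.sin(x) + 1, backend=backend, dynamic=True)
print(fn(torch.randn(8)))
```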
Most of this PR is tests.
Test Plan:
- a lot of new tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151788
Approved by: https://github.com/oulgen
Summary: Also check the module's named buffers and parameters when resolving name collisions.
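A generic, hedged sketch of the idea (not AOTInductor's actual code): when picking a new constant name, also treat the module's parameter and buffer names as taken.
```
import itertools
import torch

def unique_constant_name(mod: torch.nn.Module, base: str) -> str:
    # Names already used by parameters and buffers are off limits.
    taken = set(dict(mod.named_parameters())) | set(dict(mod.named_buffers()))
    if base not in taken:
        return base
    for i in itertools.count(1):
        if (candidate := f"{base}_{i}") not in taken:
            return candidate
```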
Test Plan:
```
buck2 run mode/dev-nosan caffe2/test/inductor:test_aot_inductor -- -r aoti_constant_tensor_name_collision
```
Differential Revision: D73264885
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151684
Approved by: https://github.com/angelayi
Summary:
When we have an exported program that looks like this:
```
ExportedProgram:
class GraphModule(torch.nn.Module):
def forward(self, b__tensor_constant0: "f32[1]", ... c_lifted_tensor_0: "i64[925]", …. , tupleized_input_0_0: "f32[10, 2139]",
clone: "i64[925]" = torch.ops.aten.clone.default(c_lifted_tensor_0); c_lifted_tensor_0 = None
index_select: "f32[10, 925]" = torch.ops.aten.index_select.default(tupleized_input_0_0, 1, clone); clone = None
```
The graph after `aot_export_module` could have a name collision; notice that the `_tensor_constant0` arg of `clone` is different from the `_tensor_constant0` in the input module.
```
def forward(self):
arg9_1: "f32[10, 2139]"
_tensor_constant0: "f32[1]" = self._tensor_constant0 # this should be int64, conflicted with the original _tensor_constant0, had a clone on this constant before lifting
index: "f32[10, 925]" = torch.ops.aten.index.Tensor(arg9_1, [None, _tensor_constant0]); _tensor_constant0 = None
```
This caused the `tensors used as indices must binary, int...` AOTI error on the PT2I dashboard, because `clone` was later used as an index.
We hit this error because we created a new `_tensor_constant0` [here](https://github.com/pytorch/pytorch/blob/main/torch/fx/_symbolic_trace.py#L403-L412), and the new `_tensor_constant0` overrode the original `_tensor_constant0` on the input Module in `_unlift_graph`. The `arg` for `clone` is created in `create_proxy` in `proxy.py`.
To fix this, we run a graph pass before unlifting the graph inputs to avoid name collisions.
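A simplified, hedged sketch of such a pass (not the PR's implementation; the `_renamed` suffix is made up): before unlifting, retarget any `get_attr` node whose name collides with a different attribute already present on the original module.
```
import torch

def resolve_constant_name_collisions(gm: torch.fx.GraphModule, orig_mod: torch.nn.Module) -> None:
    for node in gm.graph.nodes:
        if node.op != "get_attr":
            continue
        name = node.target
        # Collision: the original module already has an attribute with this name,
        # but it is not the same tensor the traced graph refers to.
        if hasattr(orig_mod, name) and getattr(orig_mod, name) is not getattr(gm, name, None):
            new_name = f"{name}_renamed"          # hypothetical suffix
            setattr(gm, new_name, getattr(gm, name))
            node.target = new_name
    gm.recompile()
```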
Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r aot_compile_constant_folding
buck2 run mode/dev-nosan caffe2/test/inductor:test_aot_inductor -- -r aoti_constant_tensor_name_collision
```
Differential Revision: D72761937
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151123
Approved by: https://github.com/tugsbayasgalan, https://github.com/jingsh
This PR adds a standalone_compile API that does precompilation via caching to support the vLLM use case in the short term while we work on the longer-term precompilation solution.
```
standalone_compile(gm, example_inputs, options) -> CompiledArtifact
CompiledArtifact.save(path, format: binary|unpacked = binary)
CompiledArtifact.load(path, format: binary|unpacked = binary)
```
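A hedged usage sketch following the signature above; obtaining `gm` and `example_inputs` via a custom `torch.compile` backend and the exact import path are assumptions, not part of this PR's text.
```
import torch
from torch._inductor import standalone_compile  # assumed import path

def backend(gm, example_inputs):
    artifact = standalone_compile(gm, example_inputs)
    # Persist the compiled artifact; "unpacked" writes a directory, "binary" a single file.
    artifact.save(path="/tmp/standalone_artifact", format="unpacked")
    return gm.forward

fn = torch.compile(lambda x: x @ x.mT, backend=backend)
fn(torch.randn(4, 4))

# Later, a separate process could reload it with
# CompiledArtifact.load(path="/tmp/standalone_artifact", format="unpacked").
```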
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150670
Approved by: https://github.com/jamesjwu, https://github.com/zou3519
Summary:
dynamo_compile has, for the most part, been accounting for compile time, except for autotuning.
all_compilation_types had earlier been injected at fx_codegen_and_compile, which was incorrect.
Add autotuning time to dynamo_compile and deprecate the all_compilation_types counter.
Differential Revision: D72145447
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150293
Approved by: https://github.com/masnesral, https://github.com/jamesjwu
This counter is designed to include all compilation PyTorch does (triton +
dynamo_compile). However, it wasn't including all of dynamo compilation, since
it was put in at the fx_codegen_and_compile spot.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149664
Approved by: https://github.com/masnesral
Summary: Give an explicit error when a torchbind object is used as input to AOTI.
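A hedged sketch of the kind of up-front check described (not the PR's exact code or error message): detect torchbind objects, which surface in Python as `torch.ScriptObject`, among the example inputs and raise immediately.
```
import torch

def validate_aoti_example_inputs(example_inputs):
    for i, arg in enumerate(example_inputs):
        if isinstance(arg, torch.ScriptObject):
            raise RuntimeError(
                f"torchbind object passed as input {i}; "
                "AOTInductor does not support torchbind objects as graph inputs"
            )
```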
Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r test_torchbind_input
```
Differential Revision: D69490915
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149965
Approved by: https://github.com/desertfire