Summary:
dynamo_compile has, for the most part, accounted for compile time, with the exception of autotuning.
all_compilation_types had earlier been injected at fx_codegen_and_compile, which was incorrect.
Add autotuning to dynamo_compile and deprecate the all_compilation_types counter.
Differential Revision: D72145447
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150293
Approved by: https://github.com/masnesral, https://github.com/jamesjwu
This counter is designed to include all compilation PyTorch does (triton +
dynamo_compile). However, it was not capturing all of dynamo compilation, because
it was placed at the fx_codegen_and_compile spot.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149664
Approved by: https://github.com/masnesral
Summary: Give an explicit error when a torchbind object is used as input to AOTI
Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r test_torchbind_input
```
Differential Revision: D69490915
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149965
Approved by: https://github.com/desertfire
This PR threads through the correct boxed_forward_device_index from graph_kwargs to CompiledFXGraph.post_compile. This allows us to correctly update BoxedDeviceIndex from cache hits.
We don't actually need to save `boxed_forward_device_index` in CompiledFXGraph because its value is part of the cache key, so it always matches the ambient one anyway. On forward with cudagraphs enabled, derive `boxed_forward_device_index`'s value from `device_idxs`.
Testing:
```
python benchmarks/dynamo/cachebench.py --mode training --benchmark torchbench --model BERT_pytorch --device cuda --repeat 1 --dynamic --output="dynamic.json"
```
Now cache hits properly on FXGraphCache. AOTAutogradCache has a guard failure. Will look into that as a followup.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148130
Approved by: https://github.com/eellison
Summary:
We need to properly fakify torchbind objects, including the ones in graph module attributes, so that the registered fake implementation works properly.
- _fakify_script_objects in `compile_fx`
- Allow fake torchbind objects in `torchbind_constants`
Remove `node.meta["unbacked_bindings"]` for `aot_compile` in `compile_fx`. Otherwise `ShapeProp` will fail when trying to resolve the `unbacked_bindings` of `with_effect` tokens.
Update `sigrid_transforms_test` to use the latest `torch._inductor.aot_compile` API.
Add a test for `Fakify torchbind objects in compile_fx and add tests for SigridTransformsInstanceTorchBind` in `e2e_test`.
Test Plan:
```
buck run //caffe2/torch/fb/sparsenn:sigrid_test -- -r test_transform_torch_bind
buck run //sigmoid/inference/test:e2e_test_cpu -- -r SigridTransforms
buck2 run mode/dev-nosan sigmoid/inference/ts_migration:pt2i_readiness_main -- --model_id 545017754 --test_suite ads_all --mode test_preproc
```
Differential Revision: D70013257
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149529
Approved by: https://github.com/angelayi
This PR implements cudagraph partition, following the previous PR on inductor graph partition (#147038). Since there are many ops that cudagraph cannot support, this PR focuses on `cpu ops`; more partition rules will be added in the next PR.
## Example
```python
import torch

torch._inductor.config.graph_partition = True


def f(x, y):
    x1 = x + 1
    y1 = y + 1
    y_cpu = y1.cpu() + 1
    z = x @ y
    return x1 + y1 + z + y_cpu.cuda()


x, y = [torch.ones(2, 2, device="cuda") for _ in range(2)]
x_cloned, y_cloned = [tmp.clone() for tmp in [x, y]]
eager_out = f(x, y)

f_compiled = torch.compile(f, mode="reduce-overhead")

for _ in range(5):
    compiled_out = f_compiled(x_cloned, y_cloned)
    assert torch.allclose(eager_out, compiled_out)
```
w/o graph partition, we will skip cudagraph:
```
skipping cudagraphs due to skipping cudagraphs due to cpu device (device_put). Found from :
File "/home/boyuan/playground/cudagraph/graph_partition/graph_partition.py", line 9, in f
y_cpu = y1.cpu() + 1 # 3
```
w/ graph partition, we can see two cudagraphify calls under the same torch-compiled region.

## Design
PR #147038 splits the `def call(args)` function into multiple `def partition_id(args)` functions. In this PR, we use `recursively_apply_fns()` to wrap each `partition_id()` function with `cudagraphify`. One major design point is that `cudagraphify` takes metadata such as static_input_idxs, and we need to provide such metadata for each graph partition. However, we previously only had this metadata for the original graph, not for the graph partitions.
The [idea](https://github.com/pytorch/pytorch/pull/147038#discussion_r1964124800) is:
- compute a mapping from the partition metadata (e.g., input/output idx) to the graph metadata, stored in `GraphPartitionMap`.
- during post_compile, get the `CudagraphMetadata` for each partition based on the graph-level metadata and `GraphPartitionMap`, via `get_partition_cudagraph_metadata()`.
- finally, in `cudagraph_partition_post_compile`, we compute the `CudagraphMetadata` and apply cudagraphify to each graph partition via `recursively_apply_fns`.
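The index translation in the steps above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `PartitionMapSketch`, its field names, and `partition_static_input_idxs` are hypothetical stand-ins, and only static input indices are handled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PartitionMapSketch:
    """Hypothetical stand-in for GraphPartitionMap (field names are assumed)."""

    partition_id: int
    # For each partition input: its index among the graph inputs, or None if
    # the input is an intermediate produced by an earlier partition.
    input_index_mapping: list[Optional[int]]


def partition_static_input_idxs(
    pmap: PartitionMapSketch, graph_static_input_idxs: set[int]
) -> list[int]:
    """Translate graph-level static input indices into partition-level ones."""
    return [
        part_idx
        for part_idx, graph_idx in enumerate(pmap.input_index_mapping)
        if graph_idx is not None and graph_idx in graph_static_input_idxs
    ]


# Graph inputs 0 and 2 are static (for example, parameters).
graph_static = {0, 2}
# This partition takes graph input 2, an intermediate buffer, and graph input 1.
pmap = PartitionMapSketch(partition_id=1, input_index_mapping=[2, None, 1])
static_idxs = partition_static_input_idxs(pmap, graph_static)
print(static_idxs)
```

Only partition input 0 maps back to a static graph input here, so it is the only index that would be reported as static for this partition.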
#### Q: How does it work with codecache?
While we have multiple graph partitions, we still have 1 file and 1 `call` function for 1 dynamo graph. The major difference is we need to additionally load a `recursively_apply_fns()` for graph partition. We also add `partition_maps: Optional[list[GraphPartitionMap]]` to `CompiledFxGraph` so it will be serialized and could be deserialized later.
## Edge Case 1
PyTorch has an assumption on input/output orders. For example, backward inputs take saved tensors first and then tangents. In graph partition, we respect such orders via `graph_partition_signature_reorder`.
## Edge Case 2
Cudagraphifying `call` function gives 2 cudagraph managed tensors `buf0` and `primals_1`. However, cudagraphifying `partition_0` gives only 1 cudagraph managed tensor `buf0`. This leads to a semantic difference between cudagraph w/ and w/o graph partition. [full code comparison](https://www.internalfb.com/intern/diffing/?paste_number=1747654420)

To achieve the same semantics, we return an input tensor as an output if it is not freed in a graph partition. This allows more cudagraph managed tensors and is important for handling saved tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147648
Approved by: https://github.com/eellison
Summary:
Do not fold torchbind objects in constant folding
Any operation on these torchbind objects can have arbitrary side effects, so we can't effectively constant fold anything torchbind-obj-related anyway.
Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r aot_compile_constant_folding
```
Reviewed By: angelayi
Differential Revision: D69946541
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148993
Approved by: https://github.com/angelayi
Summary:
Relands D69965761 / https://github.com/pytorch/pytorch/pull/147583
Before this PR, calling a triton kernel would look like:
```py
kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0)
```
where the `grid=` was passed as a callable (function closure) arg. This PR removes the grid arg:
```py
kernel.run(a, b, xnumel, stream=stream0)
```
Instead, the grid computation is now included in the kernel launcher, with something like:
```py
def launcher(in_ptr0, out_ptr0, xnumel, stream):
    grid_0 = ((xnumel + 1023) >> 10)
    grid_1 = 1
    grid_2 = 1
    runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel)
```
This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`.
It also allows us to unify the handling of grids between the Python and C++ wrapper code. Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid.
This unification allows this PR to be a net deletion of code.
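As a small illustration of the specialized grid expression in the launcher above: `(xnumel + 1023) >> 10` is a ceiling division by 1024, which suggests that launcher was specialized for a block size of 1024 (that block size is an inference from the constants, and the helper names below are hypothetical).

```python
def grid_size(xnumel: int, xblock: int = 1024) -> int:
    """Number of blocks needed to cover xnumel elements: ceil(xnumel / xblock)."""
    return (xnumel + xblock - 1) // xblock


def grid_size_shift(xnumel: int) -> int:
    """The specialized form from the launcher, valid when the block size is 1024."""
    return (xnumel + 1023) >> 10


# The shift form and the generic ceiling division agree for non-negative sizes.
for n in (0, 1, 1023, 1024, 1025, 4096, 1_000_000):
    assert grid_size(n) == grid_size_shift(n)
```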
Differential [disconnected] Revision: D70471332
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148305
Approved by: https://github.com/shunting314, https://github.com/eellison
Summary:
Relands D69965761 / https://github.com/pytorch/pytorch/pull/147583
Before this PR, calling a triton kernel would look like:
```py
kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0)
```
where the `grid=` was passed as a callable (function closure) arg. This PR removes the grid arg:
```py
kernel.run(a, b, xnumel, stream=stream0)
```
Instead, the grid computation is now included in the kernel launcher, with something like:
```py
def launcher(in_ptr0, out_ptr0, xnumel, stream):
    grid_0 = ((xnumel + 1023) >> 10)
    grid_1 = 1
    grid_2 = 1
    runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel)
```
This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`.
It also allows us to unify the handling of grids between the Python and C++ wrapper code. Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid.
This unification allows this PR to be a net deletion of code.
Differential Revision: D70471332
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148305
Approved by: https://github.com/shunting314, https://github.com/eellison
This allows for each device type to check current devices for Triton compatibility and ensure their Triton backend is present.
This PR replaces the `has_triton()` global method which was previously used for this task, and moves the initial check for each Inductor backend on to their associated `BaseScheduler` subclass. This means that other backends, such as Halide, can also implement their own availability checks.
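A rough sketch of the idea of per-backend availability checks is shown below. All class, attribute, and function names here are hypothetical and do not reflect Inductor's actual scheduler interface; the check itself (looking up an importable module) is an assumption chosen to keep the example self-contained.

```python
import importlib.util


class SchedulingSketch:
    """Hypothetical base class: each backend declares what it needs."""

    required_module: str = ""

    @classmethod
    def raise_if_unavailable(cls, device: str) -> None:
        # Backends may override this with device-specific checks.
        if cls.required_module and importlib.util.find_spec(cls.required_module) is None:
            raise RuntimeError(
                f"{cls.__name__}: module '{cls.required_module}' is required "
                f"to compile for device '{device}' but was not found"
            )


class TritonLikeScheduling(SchedulingSketch):
    required_module = "triton"


class HalideLikeScheduling(SchedulingSketch):
    required_module = "halide"


def is_available(scheduling_cls: type, device: str) -> bool:
    """Convenience wrapper that turns the raising check into a boolean."""
    try:
        scheduling_cls.raise_if_unavailable(device)
    except RuntimeError:
        return False
    return True
```

With this shape, a global helper is not needed: callers ask the scheduling class selected for a device whether its backend is usable.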
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139171
Approved by: https://github.com/jansel
Summary: this adds some new dynamo_timed calls in cudagraph_trees, primarily with the aim to add cudagraph-related timing to scuba. Things to note:
* Uses the changes in https://github.com/pytorch/pytorch/pull/141919 to log "runtime" entries
* The logging for chromium/tlparse/scuba relies on us providing a compile_id since it's not available in the environment. A lot of the changes here are just passing around the compile_id
* I believe the spirit of the scuba logging is to capture the overheads of `torch.compile`. Therefore, I'm not adding _every_ dynamo_timed to scuba. For example, "run_eager" is the first real execution of the inductor graph -- it's not cudagraph overhead, per se. Watch out for the two instances of `dynamo_compile_runtime_column_us="runtime_cudagraphify_time_us"`. Those are the spots I believe are _extra_ overhead we'd contribute to torch.compile.
Test Plan:
`python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --print-compilation-time --repeat 5 --cold-start-latency --only dcgan`:
* tlparse: https://fburl.com/21yrdn8h
* scuba: https://fburl.com/scuba/dynamo_compile/sandbox/wt90wnjz
`python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --print-compilation-time --repeat 5 --cold-start-latency --only nanogpt`
* tlparse: https://fburl.com/r9mp7uiv
* scuba: https://fburl.com/scuba/dynamo_compile/sandbox/1nvx94re
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143220
Approved by: https://github.com/eellison
Add a mode to fx_codegen_and_compile() to compile in a separate process. This is to prepare for async compile where we'll compile and run eager in parallel (and also be able to move the compile phase to a remote computer).
Added a test that runs the test_torchinductor tests with subprocess compiling turned on.
Fixed the test which caused the previous version (#146134) to be reverted:
```
$ PYTORCH_TEST_WITH_ROCM=1 PYTORCH_TEST_WITH_SLOW=1 PYTORCH_TEST_SKIP_FAST=1 python test/inductor/test_compile_subprocess.py CpuTests.test_conv_bn_fuse_cpu
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148635
Approved by: https://github.com/jamesjwu
Summary:
Previously, the dynamo counters did not print their counts automatically.
This change explicitly adds a log message, printed after lowering, with overview info for inductor aten mms.
Each name has the form `{aten_op_name}_{m}_{n}_{k}`.
The message looks like:
```
torch/_inductor/compile_fx.py:832] [0/0] Overview info of inductor aten mms: (aten.addmm_16_6_16: 1), (name: count), xxx
```
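A minimal sketch of how such names and the overview line could be assembled, assuming a plain `collections.Counter`; the helper names are hypothetical and the exact formatting in the real log may differ.

```python
from collections import Counter


def mm_counter_name(aten_op_name: str, m: int, n: int, k: int) -> str:
    """Build a counter key of the form {aten_op_name}_{m}_{n}_{k}."""
    return f"{aten_op_name}_{m}_{n}_{k}"


def format_overview(counts: Counter) -> str:
    """Render the counts as '(name: count)' entries on a single line."""
    entries = ", ".join(f"({name}: {count})" for name, count in sorted(counts.items()))
    return f"Overview info of inductor aten mms: {entries}"


counts = Counter()
counts[mm_counter_name("aten.addmm", 16, 6, 16)] += 1
overview = format_overview(counts)
print(overview)
```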
Test Plan:
```
TORCH_LOGS="+inductor" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_addmm_cuda
```
Differential Revision: D70739912
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148716
Approved by: https://github.com/henrylhtsang
Add a mode to `fx_codegen_and_compile()` to compile in a separate process. This is to prepare for async compile where we'll compile and run eager in parallel (and also be able to move the compile phase to a remote computer).
Added a test that runs the test_torchinductor tests with subprocess compiling turned on.
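The general shape of compiling in a separate process can be sketched as follows. This is only an illustration under stated assumptions: the request is serialized with `pickle`, a child Python interpreter is started with `subprocess`, and the built-in `compile()` stands in for the real compile step. None of the names below come from the PR.

```python
import pickle
import subprocess
import sys

# Code run by the child: read a pickled request from stdin, "compile" it,
# and write a pickled result to stdout.
CHILD_SOURCE = """
import pickle, sys
request = pickle.load(sys.stdin.buffer)
# Stand-in for the real compile step: build a code object and list its names.
code = compile(request["source"], request["name"], "exec")
result = {"name": request["name"], "names": sorted(code.co_names)}
pickle.dump(result, sys.stdout.buffer)
"""


def compile_in_subprocess(name: str, source: str) -> dict:
    """Send a compile request to a fresh interpreter and return its result."""
    request = pickle.dumps({"name": name, "source": source})
    proc = subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE],
        input=request,
        capture_output=True,
        check=True,
    )
    return pickle.loads(proc.stdout)


result = compile_in_subprocess("graph_0", "y = x + 1\n")
print(result)
```

Because everything crosses the process boundary as bytes, both the inputs and the compiled result must be serializable, which is what makes the same pattern usable for a remote machine.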
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146134
Approved by: https://github.com/jamesjwu
Before this PR, calling a triton kernel would look like:
```py
kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0)
```
where the `grid=` was passed as a callable (function closure) arg. This PR removes the grid arg:
```py
kernel.run(a, b, xnumel, stream=stream0)
```
Instead, the grid computation is now included in the kernel launcher, with something like:
```py
def launcher(in_ptr0, out_ptr0, xnumel, stream):
    grid_0 = ((xnumel + 1023) >> 10)
    grid_1 = 1
    grid_2 = 1
    runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel)
```
This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`.
It also allows us to unify the handling of grids between the Python and C++ wrapper code. Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid.
This unification allows this PR to be a net deletion of code.
Note the attached diff contains some minor fbcode-only changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147583
Approved by: https://github.com/eellison, https://github.com/shunting314
Summary:
Support the same functionality with acc_tracer disabled by adding a new config for pre_grad add/remove_passes; the front end still uses the same interface.
Make some minor updates to the pre_grad passes so that they run in the desired order: after the added passes, passes like remove_noops still run at the end.
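The ordering rule described above could look roughly like this. It is a sketch over pass names only; `build_pre_grad_pipeline` and every pass name except `remove_noops` are hypothetical, and the real config keys are not shown in this summary.

```python
def build_pre_grad_pipeline(
    default_passes: list[str],
    add_passes: list[str],
    remove_passes: list[str],
    final_passes: list[str],
) -> list[str]:
    """Order passes as: surviving defaults, added passes, then final cleanup passes."""
    # Drop removed passes; hold back cleanup passes so they can run last.
    names = [
        p for p in default_passes if p not in remove_passes and p not in final_passes
    ]
    # Append user-added passes, skipping duplicates.
    names += [p for p in add_passes if p not in names and p not in final_passes]
    # Cleanup passes such as remove_noops always run at the end.
    names += list(final_passes)
    return names


pipeline = build_pre_grad_pipeline(
    default_passes=["normalize", "remove_noops"],
    add_passes=["my_fusion"],
    remove_passes=["normalize"],
    final_passes=["remove_noops"],
)
print(pipeline)
```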
Test Plan: add new UT, please see stacked diff for add pass tests (TODO: update diff link)
Differential Revision: D68909278
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146064
Approved by: https://github.com/frank-wei
Summary: Add a second more generic waitcounter to torch.compile. We'll keep expanding this as new generic pytorch compilation sites show up.
Test Plan: Waitcounter only change, relying on existing tests.
Differential Revision: D69215401
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146723
Approved by: https://github.com/davidberard98
Fix a bug introduced by D69123174: because triton kernels are now returned directly by the worker, each future created for a triton kernel should only be used once per compile. Otherwise, a problem arises in a long-running process that does something like:
```
compiled_1 = torch.compile(mode="max-autotune", fullgraph=True)(fn1)
# run compiled_1
out_compiled = compiled_1(*inputs)  # `inputs` is a placeholder for the real arguments
compiled_2 = torch.compile(mode="max-autotune", fullgraph=True)(fn2)
```
If fn1 and fn2 are very similar (i.e. would generate the same triton kernel source code), we would use the launcher for the first autotuning run, set the launcher to None after running, and then use the same future/kernel again without regenerating the launcher.
Found this bug testing internal inference models.
This does not remove @eellison's caching support for prologue benchmarking, because that happens under the same compile: https://github.com/pytorch/pytorch/pull/143408
Differential Revision: [D69476856](https://our.internmc.facebook.com/intern/diff/D69476856/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D69476856/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146925
Approved by: https://github.com/laithsakka, https://github.com/jansel
ghstack dependencies: #146417
Summary:
Add `node_mapping = create_node_mapping(pre_grad_graph_id, inductor_post_to_pre_grad_nodes, debug_info)` to produce an `inductor_provenance_tracking_node_mappings.json` file. This file will be used by the provenance tracking highlighter tool to create the provenance visualization.
The `inductor_triton_kernel_to_post_grad_nodes.json` and `inductor_provenance_tracking_node_mappings.json` files are not dumped if they are both empty, so they are removed from some of the `test_structured_trace` tests.
Test Plan:
CI
```
buck run mode/dev-nosan fbcode//caffe2/test:fx -- -r graph_provenance
buck run mode/dev-nosan fbcode//caffe2/test/inductor:provenance_tracing
python test/dynamo/test_structured_trace.py
```
Differential Revision: D68190173
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146103
Approved by: https://github.com/chenyang78
Summary: The current implementation introduces a compile-time regression due to the overhead of hashing large constants. To support freezing+caching, we consider only the tensor metadata of frozen params, but we neglect to do the same for any constants created as a result of folding frozen params. This PR explicitly marks the constants created during freezing (and during constant folding while freezing) and uses that info in the inductor cache to determine when to hash a tensor's value+metadata vs. its metadata only.
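The value+metadata vs. metadata-only decision can be illustrated with a small sketch. `ConstantSketch`, its `from_freezing` flag, and `cache_key` are hypothetical stand-ins that use raw bytes instead of tensors; they are not the inductor cache's actual types.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class ConstantSketch:
    """Hypothetical stand-in for a tensor constant seen by the cache."""

    shape: tuple[int, ...]
    dtype: str
    data: bytes
    from_freezing: bool = False  # marked when created by freezing or its folding


def cache_key(const: ConstantSketch) -> str:
    """Hash metadata always; hash the values only for non-frozen constants."""
    h = hashlib.sha256()
    h.update(repr((const.shape, const.dtype)).encode())
    if not const.from_freezing:
        # Hashing the payload is the expensive part for large constants.
        h.update(const.data)
    return h.hexdigest()


a = ConstantSketch((2, 2), "float32", b"\x00" * 16, from_freezing=True)
b = ConstantSketch((2, 2), "float32", b"\x01" * 16, from_freezing=True)
c = ConstantSketch((2, 2), "float32", b"\x00" * 16)
d = ConstantSketch((2, 2), "float32", b"\x01" * 16)
print(cache_key(a) == cache_key(b), cache_key(c) == cache_key(d))
```

Frozen constants with identical metadata share a key regardless of their values, while ordinary constants are still distinguished by content.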
Test Plan: `python benchmarks/dynamo/torchbench.py --backend inductor --device cuda --only alexnet --bfloat16 --cold-start-latency --print-compilation-time --inference --performance --freezing`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145868
Approved by: https://github.com/eellison
Landing D67612181 here. The original exported PR somehow fails OSS CI, but this one doesn't (though the PR content is the same).
Add the inductor_triton_kernel_mapping_post_grad.json debug trace artifact (a debug artifact for provenance tracking) to tlparse.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145954
Approved by: https://github.com/YUNQIUGUO
Pickling a GraphModule needs some special handling to wrap things that normally can't be pickled. Async compile needs to pass them across a wire, so we need to be able to serialize them. Add some helpers to enable that.
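One general way to wrap otherwise unpicklable objects is pickle's persistent-ID hook, sketched below with a lock standing in for the unpicklable member. This shows the standard-library technique only; it is an assumption that it resembles the helpers added here, and `WirePickler`, `WireUnpickler`, `dumps`, and `loads` are hypothetical names.

```python
import io
import pickle
import threading

_LOCK_TYPE = type(threading.Lock())


class WirePickler(pickle.Pickler):
    def persistent_id(self, obj):
        # Replace unpicklable locks with a small tag instead of failing.
        if isinstance(obj, _LOCK_TYPE):
            return ("lock",)
        return None  # everything else is pickled normally


class WireUnpickler(pickle.Unpickler):
    def persistent_load(self, pid):
        # Rebuild a fresh object for each tag written by WirePickler.
        if pid == ("lock",):
            return threading.Lock()
        raise pickle.UnpicklingError(f"unknown persistent id: {pid!r}")


def dumps(obj) -> bytes:
    buf = io.BytesIO()
    WirePickler(buf).dump(obj)
    return buf.getvalue()


def loads(data: bytes):
    return WireUnpickler(io.BytesIO(data)).load()


payload = {"name": "graph_0", "lock": threading.Lock()}
restored = loads(dumps(payload))
print(restored["name"], type(restored["lock"]).__name__)
```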
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141659
Approved by: https://github.com/jamesjwu
Some context: Inplace padding is an optimization that performs padding in place. For example, suppose a tensor has size [2048, 2047] and stride [2048, 1]. When we need to pad one extra element onto the end of each row (e.g. during mm padding), we can just reuse the original tensor and do the padding in place, which saves memory and bandwidth. One caveat is that PyTorch does not allocate 2048 elements for the last row of the original tensor; it only allocates 2047. Assuming that the last row has enough space for 2048 elements may therefore be wrong and cause an OOB memory access (I have never seen this happen, maybe due to overallocation in the CUDACachingAllocator, but it should still be fixed).
The fix is when we allocate the tensor, instead of doing something like:
```
buf0 = randn_strided([2048, 2047], [2048, 1])
```
we do a small overallocation:
```
buf0 = randn_strided([2048, 2048], [2048, 1]).as_strided([2048, 2047], [2048, 1])
```
cpp_wrapper needs special handling, since memory allocation goes through a different code path than in the Python wrapper.
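The one-element shortfall described above follows from how much storage a strided tensor needs: the offset of its last element plus one. A minimal sketch of that arithmetic in plain Python (the helper name is hypothetical):

```python
def required_storage_elems(size: list[int], stride: list[int]) -> int:
    """Elements of storage needed: offset of the last element, plus one."""
    if any(s == 0 for s in size):
        return 0
    return 1 + sum((s - 1) * st for s, st in zip(size, stride))


unpadded = required_storage_elems([2048, 2047], [2048, 1])
padded = required_storage_elems([2048, 2048], [2048, 1])
print(unpadded, padded, padded - unpadded)
```

The unpadded layout needs 2048 * 2048 - 1 elements, so treating it as [2048, 2048] in place would touch exactly one element past the allocation unless the buffer is overallocated as shown above.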
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145325
Approved by: https://github.com/desertfire, https://github.com/jansel
ghstack dependencies: #140249