Summary: As the title says. We've received requests from various parties interested in turning on provenance tracking by default. In this PR, we prepare to turn on by default the parts of provenance tracking that don't have much overhead.
- Change `provenance_tracking` config to `provenance_tracking_level`
- Turn on the following provenance tracking by default when `basic_provenance_tracking` is True:
  - `set_kernel_post_grad_provenance_tracing` for kernels; this adds a mapping between Triton kernels and post_grad nodes
  - `dump_inductor_provenance_info` when we're dumping the tlparse log
  - `get_graph_provenance_json` and dumping `create_mapping_pre_post_grad_nodes`. This creates a mapping between pre_grad and post_grad nodes. Since we're not turning on provenance tracking in GraphTransformObserver by default, the mapping here may be incomplete/limited.
- Add stack traces from post_grad nodes to Inductor IR nodes
- Add exception swallowing for all of the functions above
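As a hedged sketch, the level-based config might be flipped on manually along these lines; the exact attribute path and accepted values are assumptions based on the description above, not confirmed API:
```
# Hypothetical usage sketch; attribute name and values are assumed from the PR description.
import torch._inductor.config as inductor_config

# 0 = off; a low, cheap level enables only the tracking described above
# (kernel <-> post_grad mapping, tlparse dump, stack traces on IR nodes).
inductor_config.trace.provenance_tracking_level = 1
```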
Test Plan:
CI
Rollback Plan:
Differential Revision: D80031559
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160383
Approved by: https://github.com/angelayi
Summary:
In memory planning, some allocation sizes involve unbacked symints. These unbacked symints are not known until they are computed at runtime, so **allocation pools that involve unbacked symints cannot be allocated until we have the values of the unbacked symints**.
So we add a notion of `earliest_available` to Allocation nodes. If an allocation node involves an unbacked symint, it becomes available only when its live range begins.
Then in AllocationPool, if a pool involves an Allocation node that has an earliest available time, we restrict its life range.
If a block's earliest available time is later than a pool's life range's start time, we cannot allocate it from the pool.
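As a rough illustration (not the actual memory-planning classes), the availability rule can be pictured like this:
```
# Illustrative sketch only; the real Allocation/AllocationPool classes differ.
from dataclasses import dataclass

@dataclass
class Allocation:
    live_begin: int            # first step of the buffer's live range
    has_unbacked_symint: bool  # size depends on a runtime-only value

    @property
    def earliest_available(self) -> int:
        # A size involving an unbacked symint is only known once the live range begins.
        return self.live_begin if self.has_unbacked_symint else 0

@dataclass
class AllocationPool:
    start: int  # step at which the pool itself is created

    def can_place(self, block: Allocation) -> bool:
        # A block whose earliest available time is later than the pool's start
        # cannot be allocated from that pool.
        return block.earliest_available <= self.start
```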
We also fix a memory leak caused by allocating a tensor without wrapping it in `RAIIAtenTensorHandle`.
In the Python wrapper for JIT Inductor, `codegen_alloc_from_pool` doesn't actually write the alloc lines to the wrapper; it just returns the allocation string. However, in cpp_wrapper, `codegen_alloc_from_pool` actually writes to the wrapper. Specifically, it writes the following and returns the `RAIIAtenTensorHandle` string.
```
AtenTensorHandle handle_name;
AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch__alloc_from_pool(...));
```
This is bug prone. **If you write the aoti_torch__alloc_from_pool lines, you must write the RAIIAtenTensorHandle as well**; otherwise you get memory leaks.
We remove the alloc_from_pool call from codegen_create, because this doesn't work for AOTI. In the Python wrapper we can generate the same alloc_from_pool variable name for the same block, but cpp_wrapper generates a different variable name for each call to alloc_from_pool.
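A minimal Python sketch of the pairing invariant (hypothetical helper; not the actual cpp_wrapper implementation, and the call arguments are left as an opaque string):
```
# Hypothetical helper illustrating the rule: whenever the alloc-from-pool call is
# written, the matching RAII wrapper line must be written in the same place.
def write_alloc_from_pool(wrapper_lines, handle_name, alloc_args):
    wrapper_lines.append(f"AtenTensorHandle {handle_name};")
    wrapper_lines.append(
        f"AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch__alloc_from_pool({alloc_args}));"
    )
    # Pairing the raw handle with an RAII wrapper avoids the leak described above.
    wrapper_lines.append(f"RAIIAtenTensorHandle {handle_name}_raii({handle_name});")
    return f"{handle_name}_raii"
```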
Test Plan:
```
python test/inductor/test_memory_planning.py
```
Rollback Plan:
Differential Revision: D79603119
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159839
Approved by: https://github.com/jansel
With FSDP, we sometimes have multiple non-overlapping views of a single buffer which are all mutated. Previously we considered the original buffer an allocation and made the mutated buffer the deallocation. With multiple mutations of the same buffer, we need to consider the original buffer deallocated only when all of its aliases die (and avoid double counting the input buffer size). See the comment inline:
```
When an operation mutates a buffer in-place, the scheduler creates a new buffer name
to track the "before" and "after" states, even though they share the same memory.
The mutated buffer represents a rename with zero allocation and deallocation cost.
During dependency tracking, we transfer dependencies from the mutated name back to
the original buffer, ensuring the original memory is only freed when all aliases
are done.
This handles cases where a buffer has multiple non-overlapping aliases - rather than
trying to assign free costs to individual aliases, we forward all alias dependencies
to the original buffer.
Consider:
buf0 = op0()
buf1 = mutation_op_(buf0)
del buf0
...
op(buf1)
del buf1
The only memory events are the creation prior to op0, and the deletion following buf1.
```
As @IvanKobzarev's logs in https://github.com/pytorch/pytorch/pull/158361/files#diff-e173a1d52aff49959c9f6d17ecc09946d8a616fc5909df884e62a15e1ebd1d41R1776-R1807 show, it can be a bit of a pain to pinpoint which part of our memory calculation is incorrect.
This PR also adds a runtime verifier, `config.test_configs.track_memory_lifecycle`, which tracks buffer allocation and deallocation, and errors if their lifetimes do not match our expectations.
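A minimal sketch of the dependency-forwarding idea described in the comment above (not the scheduler's actual data structures; buffer and op names are made up):
```
from collections import defaultdict

def forward_alias_deps(users, alias_of):
    """users: buffer name -> nodes that read it; alias_of: mutated (renamed)
    buffer -> the original buffer it shares storage with."""
    merged = defaultdict(set)
    for buf, readers in users.items():
        # Walk rename chains back to the original allocation.
        root = buf
        while root in alias_of:
            root = alias_of[root]
        merged[root] |= readers
    return dict(merged)

# buf1 is an in-place mutation (rename) of buf0, as in the example above.
deps = forward_alias_deps(
    users={"buf0": {"op1"}, "buf1": {"op2"}},
    alias_of={"buf1": "buf0"},
)
assert deps == {"buf0": {"op1", "op2"}}  # buf0 is freed only after both users run
```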
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159569
Approved by: https://github.com/IvanKobzarev
When `select` has a data-dependent index, we can't tell whether the actual index should be index + size or index. To avoid throwing a data-dependent error (DDE), we allocate a new unbacked symbol to represent the storage offset of the output view, and we compute its value dynamically at runtime during Inductor lowering.
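As a rough sketch of what the runtime value stands for (assuming the usual select semantics where a negative index wraps around by the dimension size; names are hypothetical):
```
# Illustrative only; this is not the Inductor lowering code.
def select_storage_offset(base_offset, index, size, stride):
    # The wrap-around decision is deferred to runtime because `index` is data dependent.
    actual_index = index + size if index < 0 else index
    return base_offset + actual_index * stride
```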
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157605
Approved by: https://github.com/ColinPeppler
Summary:
As Inductor provenance tracking gains more use cases, we want to separate the provenance tracking guard flag from the general `trace.enabled`, so we can enable provenance tracking without all of the overhead of `trace.enabled`.
- Change the guard flag from `trace.enabled` to `trace.provenance_tracking`. It is turned on by either `TORCH_COMPILE_DEBUG=1` or `INDUCTOR_PROVENANCE=1`.
- Move the provenance tracking logic and variables out of DebugContext, because DebugContext is only enabled with `trace.enabled`. Since the variables are now global variables, added `reset_provenance_globals()` context manager to reset them for each `compile_fx()` call.
- Move `set_kernel_post_grad_provenance_tracing` from `util.py` to `debug.py` so now all provenance related logic is in `debug.py`.
In the future, if we want to enable it further, we can change the provenance tracking flag to be enabled when `TORCH_TRACE` is set. I think we should do that in a separate PR, so it's easier to revert if this flag change creates any problem.
See more motivation in internal Diff
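A minimal sketch of what such a reset context manager could look like (the global name here is hypothetical; the real module keeps its own set of provenance dicts):
```
import contextlib

_KERNEL_TO_POST_GRAD = {}  # hypothetical module-level provenance mapping

@contextlib.contextmanager
def reset_provenance_globals():
    # Start each compile_fx() call with clean provenance state and clean up afterwards.
    _KERNEL_TO_POST_GRAD.clear()
    try:
        yield _KERNEL_TO_POST_GRAD
    finally:
        _KERNEL_TO_POST_GRAD.clear()
```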
Test Plan:
```
buck2 run mode/dev-nosan fbcode//caffe2/test:fx -- -r test_graph_transform_observer
buck run mode/dev-nosan fbcode//caffe2/test:fx -- -r graph_provenance
buck2 run mode/dev-nosan fbcode//caffe2/test/inductor:provenance_tracing
```
Differential Revision: D78287976
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158399
Approved by: https://github.com/angelayi
# Feature
If a Triton kernel has a complicated indexing expression, Inductor may decide to precompute it on the host and pass it to the kernel as an argument. This happens in situations like broadcasts with dynamic shapes.
This PR adds support for this feature to Inductor's FX IR backend.
We generate FX IR for precomputed size args in 3 steps:
1. In `PythonWrapperCodegen`, this PR refactors the relevant code to use a `SymbolicCallArgLine` instead of raw Python strings. This stores a (symbol, expr) pair. (Prior to this PR, it was (str, expr), but changing this to a symbol makes it easier to do substitutions later on.)
2. In `WrapperFxCodegen`, keep a dict of {symbol: expr} arg defs which gets updated whenever we see a `SymbolicCallArgLine`.
3. When the FX backend sees a `KernelCallLine`, it uses this dict to replace symbolic call args with their definitions.
In the longer run, it might be desirable to emit FX nodes defining these symbolic call args. That way, we could reuse the size computation when the same kernel is called multiple times. However, I wasn't sure if there was an existing way to generate FX nodes from a sympy expression, and implementing that seemed like overkill for the present purposes.
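A small sketch of the substitution step using sympy directly; the `SymbolicCallArgLine`/`KernelCallLine` machinery is elided and the symbol names are made up:
```
import sympy

s0 = sympy.Symbol("s0")    # a dynamic shape symbol
ps0 = sympy.Symbol("ps0")  # a precomputed size arg

# Step 2: record the definition when a SymbolicCallArgLine is seen.
arg_defs = {ps0: 2 * s0 + 1}

# Step 3: at a KernelCallLine, substitute definitions into the call args.
call_args = [ps0, s0]
resolved = [a.subs(arg_defs) for a in call_args]
print(resolved)  # [2*s0 + 1, s0]
```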
# Test plan
Added a new CI test exercising this feature.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157758
Approved by: https://github.com/jansel
Fixes #155006
Inductor sometimes codegens triton kernel definitions into a triple-quoted text block. If the text block itself contains triple-quotes, this breaks. Notably, this can happen for user-defined triton kernels, where the user may have added a docstring in their triton kernel.
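A small repro sketch of the quoting problem (strings are made up; this is not the Inductor codegen itself, and `repr()` is shown only as one possible escaping strategy):
```
kernel_src = 'def my_kernel(x):\n    """A user docstring."""\n    return x\n'

# Naively embedding kernel_src in another triple-quoted block terminates the
# outer string at the docstring's quotes and yields a SyntaxError:
broken = 'src = """' + kernel_src + '"""'
# exec(broken)  # -> SyntaxError

# One robust alternative is to emit the source via repr(), which escapes quotes:
safe = f"src = {kernel_src!r}"
exec(safe)
```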
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157322
Approved by: https://github.com/zou3519, https://github.com/drisspg
Since it's now actually used within `async_compile.multi_kernel`:
```
def multi_kernel(self, *args, **kwargs) -> Any:
    from torch._inductor.codegen.multi_kernel import MultiKernelCall

    # no need to call this in parallel since the sub-kernels are already parallel tasks
    return MultiKernelCall(*args, **kwargs)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156158
Approved by: https://github.com/jansel, https://github.com/shunting314
Summary:
When we encounter an unbacked symint during autotuning, we try to reuse existing symbols from user-provided inputs, then fall back.
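Roughly, the reuse-then-fallback behavior can be pictured like this (a loose sketch; names and the fallback value are hypothetical, not the actual autotuning code):
```
import sympy

def resolve_unbacked(sym, input_symbol_values, fallback_hint=8192):
    # Prefer a concrete value already known from a user-provided input...
    if sym in input_symbol_values:
        return input_symbol_values[sym]
    # ...otherwise fall back to a fixed autotuning hint.
    return fallback_hint

u0 = sympy.Symbol("u0")
print(resolve_unbacked(u0, {u0: 128}))                   # 128 (reused from an input)
print(resolve_unbacked(sympy.Symbol("u1"), {u0: 128}))   # 8192 (fallback)
```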
Test Plan:
python test/inductor/test_aot_inductor.py -k test_triton_dynamic_launcher_grid
Rollback Plan:
Differential Revision: D76769711
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156133
Approved by: https://github.com/jingsh
Delays code generation for arguments to fallback ops. This is inspired by #155642, and likely fixes similar memory leaks.
Additionally, prepare for the next PR in the stack by tightening up typing on a `cpp_wrapper` interface that's only used in one (well-typed) place, as well as downstream effects of that change. In particular, this enabled:
1. removing a number of now clearly unnecessary asserts
2. adding a few more targeted asserts to validate the code's current assumptions
3. removing some unneeded control flow in several functions
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154371
Approved by: https://github.com/desertfire
This PR adds JIT inductor support for user-defined triton kernels using the new host-side TMA api.
* handle TensorDescriptor.from_tensor in ir.py
* codegen TensorDescriptor.from_tensor in wrapper.py
* generate the right signature for functions that take TensorDescriptor arguments (i.e. in the @triton_heuristics.user_autotune decorator)
AOTI support is not implemented yet.
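For reference, the user-side pattern being supported looks roughly like the following; the import path and the `block_shape` argument are assumptions about the Triton 3.4 host-side TMA API (the PR text only names `TensorDescriptor.from_tensor`):
```
import torch
from triton.tools.tensor_descriptor import TensorDescriptor  # assumed import path

x = torch.randn(1024, 1024, device="cuda")
# Host-side descriptor construction; the wrapper codegen must reproduce this call
# and pass `desc` to the user-defined Triton kernel.
desc = TensorDescriptor.from_tensor(x, block_shape=[64, 64])
```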
Tests: ran test_triton_kernels.py w/ both Triton 3.3 and 3.4 and there were no failures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155814
Approved by: https://github.com/aakhundov
ghstack dependencies: #155777
Summary:
We generate AtenTensorHandles for fallback kernels regardless of the arg type. If we indeed fall back, we regenerate the AtenTensorHandles, which causes the first generated handle to never be recycled, so a memory leak occurs.
Test Plan:
python test/inductor/test_aot_inductor.py -k test_fallback_mem_leak
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155642
Approved by: https://github.com/jingsh, https://github.com/desertfire
During `codegen_inputs`, we check whether there are undefined symbols:
65b1aedd09/torch/_inductor/codegen/wrapper.py (L1668-L1674)
Previously, for graph partition inputs, we did not explicitly add symints.
65b1aedd09/torch/_inductor/codegen/wrapper.py (L3265-L3272)
We relied on the sizes/strides of TensorBox to codegen symint inputs. For example, a tensor with shape `[s0, 2]` will implicitly codegen `s0` as an input here. This works fine in most cases since a backed symint has to come from some tensor shape.
65b1aedd09/torch/_inductor/codegen/wrapper.py (L1624-L1632)
In rare cases, this does not work. One example is saved tensors for backward, where a tensor may have shape `[2*s0, 2]`. Since `2*s0` is an expression rather than a symbol, `codegen_input_symbol_assignment` would not handle `s0`, and later there would be an error in `_verify_input_symbol_assignment`.
The fix is to add symints to `get_graph_inputs`. An alternative would be to update `codegen_input_symbol_assignment`, but I want to minimize the change to graph partition only.
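A small sympy illustration of why `[2*s0, 2]` is problematic (illustrative only; not the wrapper codegen):
```
import sympy

s0 = sympy.Symbol("s0")
shape = (2 * s0, 2)

for dim, size_expr in enumerate(shape):
    if isinstance(size_expr, sympy.Symbol):
        # A plain symbol can be bound directly from the tensor's size.
        print(f"{size_expr} = tensor.size({dim})")
    elif isinstance(size_expr, sympy.Expr) and size_expr.free_symbols:
        # 2*s0 has a free symbol but is not itself a symbol, so s0 stays undefined
        # unless it is passed explicitly via get_graph_inputs.
        print(f"cannot bind {sorted(map(str, size_expr.free_symbols))} from tensor.size({dim})")
    else:
        print(f"dim {dim} is a constant: {size_expr}")
```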
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154679
Approved by: https://github.com/eellison
For graph partition, `write_get_raw_stream_header_once` is done only once, so the autotune code may not have the header. This PR additionally calls `write_get_raw_stream_header` in `codegen_device_guard_enter` before `get_raw_stream` is used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154698
Approved by: https://github.com/oulgen
Summary: It is possible to have `reinterpret_tensor` in the output of inductor codegen, e.g. `reinterpret_tensor(buf366, (1024, ), (1, ), 0)` in the return tuple. This adds assertions to all return values from inductor codegen to prevent nans from slipping through and being hard to trace.
Test Plan:
NaN asserts properly generated in example gemm script:
vars = (buf1, primals_2, buf2, primals_1, )
for var in vars:
    if isinstance(var, torch.Tensor):
        assert not var.isnan().any().item()
        assert not var.isinf().any().item()
Differential Revision: D74691131
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154455
Approved by: https://github.com/eellison
Prepares for the next PR in the stack by tightening up typing on a `cpp_wrapper` interface that's only used in one (well-typed) place, as well as downstream effects of that change. In particular, this enabled:
1. removing a number of now clearly unnecessary asserts
2. adding a few more targeted asserts to validate the code's current assumptions
3. removing some unneeded control flow in several functions
As far as I can tell, this PR should be functionally neutral. One argument was removed from a `cpp_wrapper` public API, but that argument was unused, and only had a single callsite.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154371
Approved by: https://github.com/desertfire