dmypy silently ignores follow_imports = skip, so to get parity between
dmypy and mypy we have to suck it up and type: ignore all of the sympy
typing problems.
The suppressions were added automatically with the following script generated by GPT-4:
```
import re

# Read the error file
with open("error_file.txt", "r") as f:
    errors = f.readlines()

# Parse the lines with errors and error types
error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

# Insert ignore comments in the source files
for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f" # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```
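For illustration, here is how the regex in the script parses one mypy error line; the path and message below are made up:
```
import re

# Hypothetical mypy error line in the format the script expects.
sample = "torch/_inductor/ir.py:123:45: error: Incompatible types in assignment  [assignment]"
match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", sample)
assert match is not None
file_path, line_number, error_type = match.groups()
print(file_path, line_number, error_type)  # torch/_inductor/ir.py 123 assignment
```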
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118469
Approved by: https://github.com/Skylion007
ghstack dependencies: #118414, #118418, #118432, #118467, #118468
This diff introduces the following changes:
1. Fix sympy_subs to preserve the integer and non-negative properties of the replaced symbol when the replacement is a string.
Why is this needed?
I was compiling the expression ``x*abs(y)`` where y = -2.
This expression is passed as ``s1*abs(s0)``, and then s0 is replaced with ks0 by a call to sympy_subs.
But sympy_subs used to replace s0 (integer=False, nonnegative=False) with ks0 (integer=True, nonnegative=True),
resulting in ``x*abs(ks0) = x*ks0``, which is wrong (see the sketch after this list).
2. Rename sympy_symbol to sympy_index_symbol to make its purpose explicit.
3. Add an assertion that the replaced expression is never passed as a string but always as a sympy expression.
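A minimal standalone sympy sketch of the failure mode in item 1 (symbol names are illustrative):
```
import sympy

s0 = sympy.Symbol("s0")   # no assumptions: could be negative
x = sympy.Symbol("x")
expr = x * sympy.Abs(s0)

# Replacing s0 with a nonnegative symbol lets sympy drop the Abs, changing the meaning.
ks0_bad = sympy.Symbol("ks0", integer=True, nonnegative=True)
print(expr.subs(s0, ks0_bad))    # Abs is simplified away (x*ks0): wrong if the value is negative

# Preserving the replaced symbol's (lack of) assumptions keeps the Abs.
ks0_good = sympy.Symbol("ks0")
print(expr.subs(s0, ks0_good))   # Abs is kept (x*Abs(ks0))
```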
Fixes https://github.com/pytorch/pytorch/issues/117757
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118150
Approved by: https://github.com/ezyang
For training graphs (when inputs require grad), previously we would speculate the forward and backward graphs to determine whether there are any graph breaks, side effects, etc., but we would not actually use these speculated graphs. We would just insert a call_function node into the graph and later rely on autograd's tracing.
This approach does not work for more general graphs, such as graphs that include user-defined Triton kernels, because autograd is not able to do the higher-order-function conversion.
This PR speculates the forward and backward functions and emits them in a HOF that later gets used via a templating mechanism.
While working on this PR, I exposed some bugs in the current tracing where trampoline functions lose source information, resulting in incorrect graphs being produced. I fixed these source-information bugs and killed the trampolines.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116897
Approved by: https://github.com/Skylion007, https://github.com/jansel, https://github.com/voznesenskym
Many of our pattern-matching replacements are specified as a `search_fn` and a `replacement_fn`. The `search_fn`s are traced out once with static shapes, converted to a pattern, and then matched against every graph compiled with Inductor.
The static-shape patterns would not match graphs traced out with dynamic shapes, because SymInts would be added to the graph as `sym_size` fx nodes, which added extra uses and prevented matching. The previous PR partially addresses this by deduping SymInts that are resolvable to graph inputs, which is the calling convention in AOT autograd.
This PR adjusts the matching of the `search_fn` by adding SymInts to the arguments we trace the `search_fn` with, so that their SymInt accesses are deduped. Later, if we have a match, we trace out the replacement graph with the correct Tensors and corresponding symbolic shapes, which get added to the graph.
Note: the replacement patterns will insert `sym_size` uses which could potentially be removed, but I'll leave that for a follow-up.
Fix for https://github.com/pytorch/pytorch/issues/111190.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115441
Approved by: https://github.com/jansel
ghstack dependencies: #116158
Inductor codegen for `_assert_async` is currently disabled because we don't really understand how to codegen `scalar_to_tensor` on a Sympy expression. I initially tried to see if I could get this to work, but I got into some weird problem involving stride sorting, so I decided to fix it properly by not going through a tensor.
So we introduce an `_assert_scalar` which takes a scalar as an argument, avoiding needing to turn a SymBool into a tensor before asserting on it. I also add `_functional_assert_scalar` for good luck, although this doesn't do anything right now because https://github.com/pytorch/pytorch/pull/104203 still hasn't been landed.
I need to customize the codegen for this operator, so I decided to implement it directly in Inductor rather than trying to treat it as a generic ExternKernel. This leads to the new AssertScalar IR node. It is written carefully so that it doesn't get DCE'd by Inductor.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114148
Approved by: https://github.com/jansel
For training graphs (when inputs require grad), previously we would speculate the forward and backward graphs to determine whether there are any graph breaks, side effects, etc., but we would not actually use these speculated graphs. We would just insert a call_function node into the graph and later rely on autograd's tracing.
This approach does not work for more general graphs, such as graphs that include user-defined Triton kernels, because autograd is not able to do the higher-order-function conversion.
This PR speculates the forward and backward functions and emits them in a HOF that later gets used via a templating mechanism.
While working on this PR, I exposed some bugs in the current tracing where trampoline functions lose source information, resulting in incorrect graphs being produced. I fixed these source-information bugs and killed the trampolines.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116358
Approved by: https://github.com/jansel
Two recent Triton PRs (https://github.com/openai/triton/pull/2701, https://github.com/openai/triton/pull/2756) changed the interface of triton.compile. This PR adds the necessary changes on the Inductor side to work with both the old and the new compile API.
There is also some simplification of the compilation call in the subprocess versus the one in the main process:
- Previously we passed warm_cache_only=True if the compilation happened in a subprocess, but Triton never uses that argument in the currently pinned version, so I removed it.
- Previously we only passed compute_capability if the compilation happened in a subprocess. This PR changes that to always pass compute_capability to triton.compile, whether the compilation happens in the main process or a subprocess.
Updated:
There are more interface changes on the Triton side, e.g.:
- tl.math.{min, max} now requires a propagate_nan argument.
- JITFunction.run now requires a warmup argument. This affects the benchmarking phase of matmul max-autotune. On the other hand, JITFunction.run now forbids the stream argument; simply not passing it when benchmarking matmul Triton kernels works for both the old and the new version of Triton.
- The Triton Autotuner changed its attribute names from 'warmup' to 'num_warmup' and from 'rep' to 'num_rep'. This caused Dynamo to fail to handle Triton Autotuner objects, since Dynamo's TritonKernelVariable makes assumptions about attribute names. This matters for some test cases where a model calls the Triton Autotuner directly.
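A small compatibility sketch for the Autotuner rename; the attribute names come from this description only and are not verified against any specific Triton release:
```
def autotuner_warmup_and_rep(autotuner):
    # Prefer the new names ('num_warmup'/'num_rep') and fall back to the old
    # ones ('warmup'/'rep') so the same code works across Triton versions.
    warmup = getattr(autotuner, "num_warmup", getattr(autotuner, "warmup", None))
    rep = getattr(autotuner, "num_rep", getattr(autotuner, "rep", None))
    return warmup, rep
```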
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115878
Approved by: https://github.com/jansel
I used a couple of type-ignore comments in ir.py because it constructs
short-lived instances of FixedLayout and GraphModuleSerializer, just to
call a single method on them that doesn't use all their members. Making
those unused members optional would make the rest of the code a lot
messier with sprinkled `assert` statements.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113534
Approved by: https://github.com/albanD
ATT: there are cases where multiple kernel invocations share the same kernel name, and key_averages() wrongly averages results across different invocations. This fix uses cuda_time_total / n_repeat instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113611
Approved by: https://github.com/chenyang78
We spend somewhere on the order of 1% of time in `sympy.Expr.free_symbols`, as it is called millions of times.
Most of the time we actually just want to know "is this a constant?", but `e.is_constant()` is
horribly slow. It turns out there is another property, `is_number`, that does what we want.
> property is_number:
>
> Returns True if self has no free symbols and no undefined functions (AppliedUndef, to be precise). It will be faster
> than if not self.free_symbols, however, since is_number will fail as soon as it hits a free symbol or undefined
> function.
Even further, we also avoid the overhead of building the unnecessary set object.
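A tiny sympy illustration of the difference (the expressions are arbitrary examples):
```
import sympy

x = sympy.Symbol("x")
numeric = sympy.Rational(3, 2) ** 2   # no free symbols
symbolic = x + 1

# is_number can short-circuit as soon as it hits a free symbol, whereas the
# free_symbols check has to build a set first just to test emptiness.
print(numeric.is_number, not numeric.free_symbols)    # True True
print(symbolic.is_number, not symbolic.free_symbols)  # False False
```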
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112688
Approved by: https://github.com/lezcano
This was originally @jansel's PR:
https://github.com/pytorch/pytorch/pull/102625, which I've built upon.
This diff implements static memory planning. It's disabled by default
while we examine its performance.
We use a greedy-by-size approach. For dynamic shapes, the sizes of the
example inputs are used as estimates when making planning decisions. We
generate expressions to calculate the actual memory offsets and sizes at
runtime when the values of the dynamic shapes are known. In order to
simplify these calculations, we have organized the allocations into a
tree that branches on space (address offsets) and time (live ranges).
Finally, we need to align these offsets, so we have added an `align`
sympy Expr to express these calculations.
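A minimal standalone sketch of the greedy-by-size idea with live-range reuse; this is not the actual Inductor pass, and the names, alignment value, and plain-int sizes (symbolic in the real pass) are illustrative:
```
from dataclasses import dataclass
from typing import List

@dataclass
class Allocation:
    name: str
    size: int        # estimated size in bytes (example-input sizes under dynamic shapes)
    live_start: int  # first scheduler step that uses the buffer
    live_end: int    # last scheduler step that uses the buffer
    offset: int = 0  # assigned offset into the shared pool

def align(n: int, alignment: int = 64) -> int:
    # Round n up to a multiple of `alignment`.
    return (n + alignment - 1) // alignment * alignment

def plan(allocs: List[Allocation]) -> int:
    """Place the largest buffers first; a buffer may reuse an address range
    whenever live ranges do not overlap. Returns the total pool size."""
    placed: List[Allocation] = []
    for a in sorted(allocs, key=lambda x: x.size, reverse=True):
        offset = 0
        for p in sorted(placed, key=lambda x: x.offset):
            time_overlap = not (a.live_end < p.live_start or p.live_end < a.live_start)
            space_overlap = offset < p.offset + p.size and p.offset < offset + a.size
            if time_overlap and space_overlap:
                offset = align(p.offset + p.size)
        a.offset = offset
        placed.append(a)
    return max((align(p.offset + p.size) for p in placed), default=0)
```
In the real pass the offsets and sizes are symbolic expressions evaluated at runtime; here they are plain ints for clarity.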
Some limitations:
1. It is only enabled during inference for now. Enabling it for training
increases peak memory usage as we allocate all the memory needed for
training upfront, before freeing the memory allocated during
inference. We can probably address this by doing planning for both
the inference and training passes together.
2. It doesn't work with PyTorch Distributed, because kernels like
AllGatherIntoTensor codegen strings which do memory operations. We
can fix this down the line by having them emit MemoryPlanningLines
instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112178
Approved by: https://github.com/desertfire, https://github.com/jansel
This is a small fix in `do_bench_using_profiling()`.
When CUDA kernels are executed on a non-default CUDA stream and cuda.synchronize() is called, a CUDA kernel named "Context Sync" is launched on the default stream to wait until all other streams are finished. This kernel has "CUDA time" but is not a real kernel to profile. This fix excludes "Context Sync" when calculating the total kernel time.
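A hedged sketch of the exclusion; the (name, time) pairs stand in for the real profiler event objects:
```
def total_kernel_time_us(kernel_times):
    # kernel_times: iterable of (kernel_name, cuda_time_us) pairs collected
    # from the profiler. "Context Sync" is the synthetic wait launched by
    # cuda.synchronize(), so it is excluded from the total.
    return sum(t for name, t in kernel_times if name != "Context Sync")
```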
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112223
Approved by: https://github.com/int3, https://github.com/chenyang78
Updates `_export.aot_compile` to pass a Torch IR graph to Inductor, allowing Inductor to run the pre_grad_passes and reuse more of Inductor's code.
Also updates the API to return only the `so_path`, rather than also returning the exported program. The pytree call spec is now serialized and placed inside the generated model code. When calling the model, because there is no C++ pytree implementation linked yet, we can access the call specs through `get_call_spec()` and do the pytree flatten/unflatten in Python.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110020
Approved by: https://github.com/desertfire
Improves perf of llama_v2 locally from 1.55 -> 1.57
The initial heuristic is to lower to pointwise if the number of inputs is <= 4 and all the inputs are pointwise or cannot be memory-planned away, or if all the outputs are pointwise.
The perf run was +3% on inference. There are definitely instances where we should be lowering to foreach kernels, but those are less flexible for fusion. The motivating example was:
```
def rotate_half(x):
    """Rotates half the hidden dims of the input."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin):
    iota = torch.ops.prims.iota.default(512, start = 0, step = 1, dtype = torch.int64, device = device(type='cuda', index=0), requires_grad = False)
    # File: /scratch/eellison/work/torchdynamo/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py:657, code: position_ids = position_ids.unsqueeze(0).view(-1, seq_length)
    unsqueeze = torch.ops.aten.unsqueeze.default(iota, 0)
    position_ids = torch.ops.aten.reshape.default(unsqueeze, [-1, 512]); unsqueeze = None
    # The first two dimensions of cos and sin are always 1, so we can `squeeze` them.
    cos = cos.squeeze(1).squeeze(0)  # [seq_len, dim]
    sin = sin.squeeze(1).squeeze(0)  # [seq_len, dim]
    cos = cos[position_ids].unsqueeze(1)  # [bs, 1, seq_len, dim]
    sin = sin[position_ids].unsqueeze(1)  # [bs, 1, seq_len, dim]
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
```
Also not sure if I should be more worried about concatting reduction->pointwise inputs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111233
Approved by: https://github.com/Chillee
Did some easy fixes from enabling TRY200. Most of these seem like oversights rather than intentional choices. The proper way to silence intentional ones is with `from None`, to note that you thought about whether the exception should carry its cause and decided against it.
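For reference, a minimal example of the pattern TRY200 flags (names are illustrative):
```
def parse_port(raw: str) -> int:
    try:
        return int(raw)
    except ValueError as e:
        # Chain the original exception so the traceback keeps the root cause;
        # to intentionally hide the cause, use `from None` instead.
        raise RuntimeError(f"invalid port: {raw!r}") from e
```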
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111496
Approved by: https://github.com/malfet
This PR implements an intra-graph communication reordering pass on the Inductor scheduler IR, based on Horace's previous PR #100762.
Main algorithm:
1. Greedily moves waits as late as possible (i.e. until we reach a use)
2. Greedily moves comms as early as possible (i.e. until we reach an input)
3. Move computes following simple heuristics to improve overlap.
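A toy, framework-free sketch of heuristics 1 and 2; the Node structure is illustrative, not the Inductor scheduler IR:
```
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Node:
    name: str
    kind: str                                    # "comm", "wait", or "compute"
    deps: Set[str] = field(default_factory=set)  # names of nodes this one reads

def sink_waits(order: List[Node]) -> List[Node]:
    # Move each wait as late as possible: just before its first direct user.
    result = list(order)
    for node in [n for n in order if n.kind == "wait"]:
        result.remove(node)
        idx = next((i for i, n in enumerate(result) if node.name in n.deps), len(result))
        result.insert(idx, node)
    return result

def raise_comms(order: List[Node]) -> List[Node]:
    # Move each comm as early as possible: just after the last producer of its inputs.
    result = list(order)
    for node in [n for n in order if n.kind == "comm"]:
        result.remove(node)
        idx = max((i for i, n in enumerate(result) if n.name in node.deps), default=-1)
        result.insert(idx + 1, node)
    return result
```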
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108091
Approved by: https://github.com/Chillee, https://github.com/wanchaol
Fixes #111066, #111065, #111064
Currently `use_cutlass_template` returns True on ROCm, but the feature is not supported there; fix it to return False on ROCm. I considered adding this change to `try_import_cutlass` instead, but the comments hinted that this function will be removed at some point.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111132
Approved by: https://github.com/jansel
Summary:
Previously, we linked against CUDA libs even for the pure cpp backend.
This caused issues in cases where the inference platform does not
have GPUs. This diff removes the CUDA dependency for the cpp backend.
Reviewed By: bertmaher, muchulee8, mikekgfb
Differential Revision: D49800712
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110409
Approved by: https://github.com/bertmaher, https://github.com/desertfire
Example of when the `evict_first` heuristic helps.
```
import torch
from torch._inductor.utils import do_bench

@torch.compile
def f(a, b):
    return (a * b).sum(dim=-1)

N = 512
inps = (torch.randn(N, N, N).permute(2, 1, 0), torch.randn(N, N, N).permute(1, 2, 0))
print(do_bench(lambda: f(*inps)))
```
This generates code like this: http://ix.io/4HFs
```
Original: 3.8 ms
This PR: 3.54 ms
Always `evict_first`: 5.4 ms
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108841
Approved by: https://github.com/lezcano, https://github.com/jansel