This diff implements a remote caching strategy (memcache for internal and redis for external) for caching of Inductor FX Graph to Inductor generated wrapper file.
It uses the same idea with the autotuning result cache that is currently live.
This will land turned off and before turning this on by default, I will do more testing and including looking at the dynamic shape guards added by inductor.
Differential Revision: [D56441624](https://our.internmc.facebook.com/intern/diff/D56441624/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124669
Approved by: https://github.com/jansel, https://github.com/eellison
Summary:
previous attempts don't work eventually. D49720297 causes online train SEV due to extra importing. D56299408 mitigates a tricky bug from Distributed Shampoo constructor but unfortutenaly didn't correct the scuba logging either.
see f552546983
Test Plan: {F1491621504}
Differential Revision: D56378270
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124593
Approved by: https://github.com/anijain2305
When inductor generates triton code, the triton code can either assume that the inputs given to it are aligned or unaligned. If they are aligned, triton can use more efficient instructions (like vectorized loads or tensor cores). However, if we generate "aligned" code and pass in unaligned inputs, the triton code will error out; to fix this, we clone unaligned inputs that are passed to triton kernels that expect aligned inputs. This can lead to excessive clones if we have inputs that are not expected to be aligned.
In this PR, we use the example input to decide whether the generated triton code should assume alignment or not. If the example input is aligned, then we will generate triton code that assumes alignment; if at runtime we receive an unaligned input, we'll make a clone. Meanwhile, if the example input is not aligned, the generated triton code will not assume inputs are aligned and we won't ever need to clone.
Note that the alignment of the inputs is not guarded on; we found that adding guards on tensor offsets (a) was slow in cases where we do a lot of comparisons on tensor offsets, and (b) led to a lot of recompilations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123319
Approved by: https://github.com/eellison
Summary: [#123231](https://github.com/pytorch/pytorch/pull/123231) adds cudagraph supports for more types of functions (i.e., cudagraph managed input mutation). These newly supported functions may have mutated static inputs, leading to assertion errors in some workload which skip cudagraph previously. This diff adds a config to opt in the new feature.
Test Plan: ci
Differential Revision: D56481353
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124754
Approved by: https://github.com/eellison
Two small fixes:
- preserve rng around compile_fx_inner
- Now that will precompile in the background while lowering multiple templates in parallel, we no longer can allocate inputs at the beginning of the function because we will have multiple sets of inputs allocated at the same time. Instead, allocate them when needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123229
Approved by: https://github.com/shunting314
ghstack dependencies: #124030, #122642
# PR
This PR supports mutating inputs in cudagraph trees, if these inputs are outputs from previous cudagraph. Please check #121861 for more details.
# Note on Optimistic Mutation Check
To determine whether applying cudagraph, we need to check input mutations, falling into four categories: a) no mutation, b) mutation on parameters/buffers, c) mutation on cudagraph recorded tensors, d) mutation on non-cudagraph recorded tensors. We can apply cudagraph for type a,b,c but cannot for type d. This input mutation types depends on function, current_node, and inputs.
Since `check_for_mutation` is slow, there is a trade-off on making type c or d faster.
- To make type d) faster, we want to `check_for_mutation` and call eager function early. However, this adds unnecessary overhead to type a, b, c due to the extra check.
- To make type c) faster, we want to skip `check_for_mutation` at the beginning and only `check_for_mutation` before `record_function` for a new function. This removes the overhead of `check_for_mutation` for type a, b, c. However, this adds extra overhead to type d due to `check_invariants` for all children nodes.
Instead, we design optimistic mutation check. The assumption is that, given a function and a node, the input mutation types usually remain the same across inputs. So, if we have ever detect a function on a node with type d, we will never detect it as type c. The detailed design is:
- [Slow Path] On the first invocation of a function on a node, we run `check_for_mutation` once and cache the input mutation type as `non_cudagraph_managed_mutation[node_id][func_id]`.
- [Fast Path] On the subsequent invocations of a function on a node, we skip `check_for_mutation`. For `non_cudagraph_managed_mutation[node_id][func_id]` as true, we directly call eager function. Otherwise, we `check_variants` and call cudagraph function.
- [Slow Path] Before `record_function`, we run `check_for_mutation` again.
**Q1: Would there be overhead for type a,b,c,d?**
A: No. We only check input mutation types for the first invocation of a function on a node.
**Q2: If a function happens to be type c during the first invocation on a node, could we detect it as type d in the future?**
A: Yes. This is done by `check_invariants` and guarantees the correctness.
**Q3: If a function happens to be type d during the first invocation on a node, could it still be recognized as type c in the future?**
A: No. But this should happen rarely according to our assumption. In the rare case that it happens, there would not be any correctness issues and the performance is the same as the eager (or inductor optimized) function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123231
Approved by: https://github.com/eellison
This PR adds the ability to pad tensor strides during lowering. The goal is to make sure (if possible) tensors with bad shape can have aligned strides so GPU can access the memory more efficiently.
By testing BlenderbotSmallForConditionalGeneration I already see 2.5ms speedup.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120758
Approved by: https://github.com/jansel
Given the following code/dynamo graph:
```
class GraphModule(torch.nn.Module):
def forward(self, L_x_ : torch.Tensor):
l_x_ = L_x_
_print = torch.ops.aten._print('moo')
res = l_x_ + l_x_; l_x_ = None
_print_1 = torch.ops.aten._print('moo')
return (res,)
```
AOTAutograd will trace the following program, threading tokens from the inputs, through the effectful operator calls (torch.ops.aten._print), and as an output:
```
class <lambda>(torch.nn.Module):
def forward(self, arg0_1: "f32[0]", arg1_1: "f32[2, 3]"):
with_effects = torch._higher_order_ops.effects.with_effects(arg0_1, torch.ops.aten._print.default, 'moo'); arg0_1 = None
getitem: "f32[0]" = with_effects[0]; with_effects = None
add: "f32[2, 3]" = torch.ops.aten.add.Tensor(arg1_1, arg1_1); arg1_1 = None
with_effects_1 = torch._higher_order_ops.effects.with_effects(getitem, torch.ops.aten._print.default, 'moo'); getitem = None
getitem_2: "f32[0]" = with_effects_1[0]; with_effects_1 = None
return (getitem_2, add)
```
However when we get to inductor, since we want the inductor generated code to not have any token inputs/outputs for better readability, we want to modify the aten graph by removing the tokens from inputs, and creating them through `torch.ops.aten._make_dep_token`, and sinking them through the `torch.ops.aten._sink_tokens` operators.
This has to be done *after* the partitioner, otherwise the partitioner will add the make_token/sink_token operators to the backwards graph.
```
class <lambda>(torch.nn.Module):
def forward(self, arg1_1: "f32[2, 3]"):
_make_dep_token_default: "f32[0]" = torch.ops.aten._make_dep_token.default()
with_effects = torch._higher_order_ops.effects.with_effects(_make_dep_token_default, torch.ops.aten._print.default, 'moo'); _make_dep_token_default = None
getitem: "f32[0]" = with_effects[0]; with_effects = None
add: "f32[2, 3]" = torch.ops.aten.add.Tensor(arg1_1, arg1_1); arg1_1 = None
with_effects_1 = torch._higher_order_ops.effects.with_effects(getitem, torch.ops.aten._print.default, 'moo'); getitem = None
getitem_2: "f32[0]" = with_effects_1[0]; with_effects_1 = None
_sink_tokens_default = torch.ops.aten._sink_tokens.default((getitem_2,)); getitem_2 = None
return (add,)
```
When doing inductor lowering, we convert `with_effects` calls to an `EffectfulKernel`, which just a `FallbackKernel` but with a pointer to previous effectful operator's call. During scheduling, we will create a `StarDep` between the EffectfulKernel and its previous EffectfulKernel so that they don't get reordered. The inductor generated python code looks like:
```
def call(args):
arg1_1, = args
args.clear()
assert_size_stride(arg1_1, (2, 3), (3, 1))
# Source Nodes: [_print], Original ATen: []
buf2 = aten._print.default('moo')
# Source Nodes: [_print_1], Original ATen: []
buf3 = aten._print.default('moo')
buf4 = empty_strided_cpu((2, 3), (3, 1), torch.float32)
cpp_fused_add_0(arg1_1, buf4)
del arg1_1
return (buf4, )
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122347
Approved by: https://github.com/bdhirsh
If any graph input has overlapping memory, inductor disables cudagraphs. But the function `complex_memory_overlap` detecting memory overlap can have false positive.
E.g. for tensor `rand_strided((8, 1500, 1), (1504, 1, 1), device=self.device)` the function reports overlapping previously.. This is caused by size=1 dimension. The fix is to do squeeze before running the detection algorithm.
This fixes the perf regress for hf_Whisper and timm_efficientdet when we do padding. For these models cudagraphs were dynamically disabled when doing padding due to the issue discussed here and cause perf regress.
This may help the dashboard if this is a common thing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123327
Approved by: https://github.com/Chillee
Existing `innermost_fn` handling of `functools.wraps` is not ideal, but I'm not sure if there's a good fix. This can manifest for GmWrapper (used to handle list inputs from Dynamo -> AOTAutograd) where we don't call the unflatten wrapper at runtime.
Since core parts of Dynamo rely on attribute check for `_torchdynamo_orig_callable`, so I'm adding a test to cover it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123007
Approved by: https://github.com/jansel
ghstack dependencies: #122691, #122746
**Motivation**: https://github.com/pytorch/pytorch/issues/112771
**Summary**: Inductor generates triton that assumes that inputs are going to be 16-byte aligned. If the inputs aren't aligned, Inductor clones the inputs. This PR introduces a config option to not do this: when assume_aligned_inputs=False, Inductor will _not_ pass inputs as being divisible_by_16, and Inductor will not make clones. This an can generate code that might be a bit slower, but this tradeoff can be worth it in some scenarios where you might otherwise make a lot of clones.
Ideally, we could do this on a per-tensor basis. But this would be a lot of work, and attempts to add guards on storage offsets to do this automatically have run into issues: recompilations and excessive time to generate/check guards.
**Tests** https://github.com/pytorch/pytorch/pull/122159 flips this to False. It didn't run through all errors, but the ones we see are all expected failures: divisible_by_16 changes; triton kernel caching fails if we call the same triton kernel multiple times (this makes sense because the first call will have unaligned inputs, but subsequent calls have aligned inputs); and some xfailed tests start passing.
**Alternatives/RFC**:
* Is this the right thing to do with cudagraphs?
* Elias and Jason mentioned that we probably still want to make clones if we're dealing with unaligned inputs to matmuls. Is this something we should add in this config option? (In the use case I'm targeting, it seems like we don't need this optimization right now)
Differential Revision: [D55079094](https://our.internmc.facebook.com/intern/diff/D55079094)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122158
Approved by: https://github.com/ezyang
Summary: `torch.while_loop` HOP support is added to JIT Inductor. The test coverage is limited due to the functionality constraints of the upstream `torch.while_loop` op in Dynamo / Export. When those are lifted, we'll add more tests (see TODO-s in the test file).
AOT Inductor support will be added in a follow-up PR.
Test Plan:
```
$ python test/inductor/test_control_flow.py
...
----------------------------------------------------------------------
Ran 38 tests in 159.387s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122069
Approved by: https://github.com/jansel, https://github.com/eellison
Summary: The handling of the current_callable and compiled_artifact fields in the CompiledFxGraph object is unnecessarily complicated and confusing. We can simplify by storing only the callable. That field is not serializable, so the caching approach is to store a path to the generated artifact and reload from disk on a cache hit. We can just reload inline in the FX cache hit path. This change has the added benefit that it makes it easier to fallback to a "cache miss" if the path somehow doesn't exist.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121903
Approved by: https://github.com/eellison
I realized there's a bug when unlifting buffer mutations in AOTI.
However there seems to be a bug during tracing where AOTI mutates the buffer. I didn't take the time to investigate, so I left is as TODO for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121688
Approved by: https://github.com/chenyang78
Summary: This is in preparation to enable FX graph caching by default. First fix some bugs uncovered by running all unit tests under `test/inductor/`. I'll enable in a separate diff in case we need to revert. Summary of changes:
* Turn off caching for tests that require a compilation, e.g., when checking that a relevant counter was incremented
* Bypass caching when we see mkldnn tensors as constants (they currently don't serialize, so we can't save to disk)
* Include various global settings that could affect compilation in the cache key calculation.
* Handle a few config settings that break key calculation.
* Handle code paths where no ShapeEnv is available (the cache impl requires a shape env as part of handling guards)
* Skip caching when freezing is enabled (Freezing can embed constants that wouldn't be static across runs).
* Fix the clear() method to not throw when the cache /tmp dir doesn't exist
Test Plan: Ran all tests under `test/inductor/` twice with TORCHINDUCTOR_FX_GRAPH_CACHE=1 to exercise any test that might be affected by caching.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117888
Approved by: https://github.com/eellison
Summary: The current machinery of Inductor's `compile_fx` assumes that the incoming fx graph is flat. As a result, everything before `graph.run` is applied to the outermost graph. This assumption was valid before #119759, but now there is control flow bringing (arbitrarily deeply) nested fx subgraphs to `compile_fx`.
In this PR, we start extending the `compile_fx` machinery to deal with nested fx subgraphs. Namely, we recursively apply Inductor's `pre_grad`, `joint_graph`, and `post_grad` passes to the nested subgraphs in the incoming fx graph.
For the recursive application of the `pre_grad` passes (which require example inputs per subgraph), we don't pass example inputs for the nested subgraphs. A few different attempts to infer the latter via fake tensor prop has led to different side effects in the model. Therefore, to the nested subgraphs, we only apply a subset of `pre_grad` passes that doesn't require example inputs.
Test Plan:
```
$ python test/inductor/test_control_flow.py
...
----------------------------------------------------------------------
Ran 26 tests in 59.252s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120665
Approved by: https://github.com/eellison
Overall design: https://docs.google.com/document/d/1CX_hJ0PNy9f3R1y8TJrfkSeLkvGjjjLU84BSXgS2AZ8/edit
How to read the diff:
* Most files are me augmenting pre-existing logging with structured variants. For the most part it's simple (esp FX graphs, which have a canonical string representation); it gets more complicated when I decided to JSON-ify some data structure instead of keeping the ad hoc printing (notably, guards and dynamo output graph sizes)
* torch/_functorch/_aot_autograd/collect_metadata_analysis.py is some unrelated fixes I noticed while auditing artifact logs
* torch/_logging/_internal.py has the actual trace log implementation. The trace logger is implement as a logger named torch.__trace which is disconnected from the logging hierarchy. It gets its own handler and formatter (TorchLogsFormatter with _is_trace True). `trace_structured` is the main way to emit a trace log. Unusually, there's a separate "metadata" and "payload" field. The metadata field should not be too long (as it is serialized as a single line) and is always JSON (we put contextual things like compile id in it); the payload field can be long and is emitted after the metadata log line and can span multiple lines.
* torch/_logging/structured.py contains some helpers for converting Python data structures into JSON form. Notably, we have a string interning implementation here, which helps reduce the cost of serializing filenames into the log.
* test/dynamo/test_structured_trace.py the tests are cribbed from test_logging.py, but all rewritten to use expect tests on munged versions of what we'd actually output. Payloads are never tested, since they tend not be very stable.
https://github.com/ezyang/tlparse is a POC Rust program that can interpret these logs.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120289
Approved by: https://github.com/Skylion007
ghstack dependencies: #120712
Overall design: https://docs.google.com/document/d/1CX_hJ0PNy9f3R1y8TJrfkSeLkvGjjjLU84BSXgS2AZ8/edit
How to read the diff:
* Most files are me augmenting pre-existing logging with structured variants. For the most part it's simple (esp FX graphs, which have a canonical string representation); it gets more complicated when I decided to JSON-ify some data structure instead of keeping the ad hoc printing (notably, guards and dynamo output graph sizes)
* torch/_functorch/_aot_autograd/collect_metadata_analysis.py is some unrelated fixes I noticed while auditing artifact logs
* torch/_logging/_internal.py has the actual trace log implementation. The trace logger is implement as a logger named torch.__trace which is disconnected from the logging hierarchy. It gets its own handler and formatter (TorchLogsFormatter with _is_trace True). There's a teensy bit of FB specific code to automatically enable trace logging if a /logs directory exists. `trace_structured` is the main way to emit a trace log. Unusually, there's a separate "metadata" and "payload" field. The metadata field should not be too long (as it is serialized as a single line) and is always JSON (we put contextual things like compile id in it); the payload field can be long and is emitted after the metadata log line and can span multiple lines.
* torch/_logging/structured.py contains some helpers for converting Python data structures into JSON form. Notably, we have a string interning implementation here, which helps reduce the cost of serializing filenames into the log.
* test/dynamo/test_structured_trace.py the tests are cribbed from test_logging.py, but all rewritten to use expect tests on munged versions of what we'd actually output. Payloads are never tested, since they tend not be very stable.
https://github.com/ezyang/tlparse is a POC Rust program that can interpret these logs.
Testing that the fbcode detection works at https://www.internalfb.com/mlhub/pipelines/runs/fblearner/534553450 (Meta-only)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120289
Approved by: https://github.com/Skylion007
Summary:
Add Runtime Constant-folding for AOTInductor.
This also include the invocation of constant folding at load time.
The constant folding lowering is a 2-step process.
First, we split the graph into 2 modules, one of it is the constant module, which doesn't depend on any input and the whole module could be inferred (constant-folded) one-time and be reused. The constant module, is lowered, and being codegen-ed as usual and cached (let's call this constant code). The constant code reuses the whole lowering/profiling/etc. process, only difference is that we do not generate any headers or initialization for the constant code.
Second, after handling the constant module, we take care of the main module (which is the part that would depend on the user input.) For the main module, we take in one additional component, the constant code, compare with a normal lowering. Addition step we do here is that, we inject the constant code into the codegen-ed main module, and create the caller for the main module to consume the result of the constant module.
Test Plan: Unit tests included in commit.
Differential Revision: D53274382
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118765
Approved by: https://github.com/chenyang78