This PR introduces an `aot` flag to standalone_compile that uses BundledAOTAutogradCacheEntry, and then allows regional_inductor to use this so that we can start aot compiling regional compiler graphs. The diff above this will attempt to allow GraphPickler to fully serialize graphs that have regionally compiled subgraphs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165843
Approved by: https://github.com/oulgen
Fixes#165447
On AOTAutogradCache load, the serialization function we pick is just lambda: self, because the object itself is an AOTAutogradCacheEntry. However, this isn't safe, because `wrap_post_compile` will make `self` unserializable, since it needs to load triton kernels and stuff!
So instead, on AOTAutogradCache load, we preserve the bytes that were used to load the object to begin with, and return that object on a call to serialize(). This effectively makes it so that we save a copy of the pre-hydrated artifact, without needing to do an eager copy until someone actually calls `serialize`.
Test Plan:
Run
```py
import torch
class M(torch.nn.Module):
def __init__(self):
super().__init__()
self.linear1 = torch.nn.Linear(2, 4)
self.relu = torch.nn.ReLU()
self.linear2 = torch.nn.Linear(4, 8)
def forward(self, x):
return self.linear2(self.relu(self.linear1(x)))
device = "cuda"
m = M().to(device)
sample_inputs = (torch.randn(2, 2, device=device),)
eager_out = m(*sample_inputs)
with torch._dynamo.config.patch("enable_aot_compile", True):
compiled_fn_path = "./m.pt"
compiled_fn = torch.compile(
m,
fullgraph=True
).forward.aot_compile((sample_inputs, {}))
compiled_fn.save_compiled_function(compiled_fn_path)
torch._dynamo.reset()
with torch.compiler.set_stance("fail_on_recompile"):
with open(compiled_fn_path, "rb") as f:
loaded_fn = torch.compiler.load_compiled_function(f)
assert loaded_fn is not None
compiled_out = loaded_fn(m, *sample_inputs)
assert torch.allclose(eager_out, compiled_out)
```
twice, see that it succeeds.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165474
Approved by: https://github.com/yiming0416, https://github.com/zhxchen17
As part of better engineering week, we would like to improve out type support to improve dev experience in dynamo
This PR adds strict typing support to a critical set of files for dynamo, `source.py` and the base `_guards.py`
Running
```
mypy torch/_dynamo/source.py torch/_guards.py --linecount-report /tmp/coverage_log
```
| -------- | Lines Unannotated | Lines Total | % lines covered | Funcs Unannotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main | 1227 | 2208 | 55.57% | 207 | 362 | 57.18% |
| This PR | 2217 | 2217 | 100.00% | 362 | 362 | 100.00% |
| Delta | +990 | +9 | +44.43% | +155 | 0 | +42.82% |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158397
Approved by: https://github.com/anijain2305
We assumed that the output in an FX graph would always just be a
list[Tensor], even in the single tensor return case.
It is possible for the output to be a single Tensor. This can happen
by calling torch.fx.split_module on the module.
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157803
Approved by: https://github.com/oulgen
This PR adds the necessary things to register and record backend ids from BundledAOTAutogradCacheEntry.
One TODO to point out; in this diff, if there are multiple backends that would have the same AOTAutogradCache key (traditional cache key, not backend_id), we just end up serializing the same BundledAOTAutogradCache entry multiple times. This is not ideal obviously, so we'll want to deduplicate these and just track the different keys that one BundledAOTAutogradCacheEntry is associated with instead. This shouldn't be super hard to do, though, as we just need to run a deduplication step on call to `serialize()`, I think.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155387
Approved by: https://github.com/oulgen
This PR adds the necessary things to register and record backend ids from BundledAOTAutogradCacheEntry.
One TODO to point out; in this diff, if there are multiple backends that would have the same AOTAutogradCache key (traditional cache key, not backend_id), we just end up serializing the same BundledAOTAutogradCache entry multiple times. This is not ideal obviously, so we'll want to deduplicate these and just track the different keys that one BundledAOTAutogradCacheEntry is associated with instead. This shouldn't be super hard to do, though, as we just need to run a deduplication step on call to `serialize()`, I think.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155387
Approved by: https://github.com/oulgen
standalone_compile needs to get dynamic shape information from
somewhere. We add a new `dynamic_shapes` argument with three options:
1. from the passed-in graph (dynamic="from_graph"). This is the default.
2. from the example inputs, thereby specializing on them. (dynamic="from_example_inputs")
3. from the current tracing context (dynamic="from_tracing_context")
1 and 3 are not exactly the same. 2 can also be used for more advanced
things... (specialize on one input but not the other).
Most of this PR is tests.
Test Plan:
- a lot of new tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151788
Approved by: https://github.com/oulgen
This somewhat complicated PR does a few things:
- It separates out a lot of the guard checking logic into its own class, GuardedCache[T]
- It adds a new `check_guard_hit` lambda to FXGraphCache._lookup_graph, which allows callers to define their own guard checking logic
- It then uses these two combined parts to lift guard checking to AOTAutogradCache. This means that AOTAutogradCache stores its own guard expressions and evaluates them.
- FXGraphCache's guard checking logic is completely unchanged, just refactored. As part of the work, I'm able to extend a bit of the logging functionality of AOTAutogradCache into FXGraphCache, so that you can know if FXGraphCache missed due to a guard failure or a full cache miss.
# Why do this?
Lifting guards to AOTAutogradCache has a few benefits:
- First, it fixes a long standing bug in guard checking logic. Backward passes can have different symint inputs than forward passes depending on forward output, if AOTAutograd chooses to store symints for the backward. These symint inputs have the same underlying symbols as the forward, but on AOTAutogradCache hit, we don't have access to the hints backing these exact symints (we only have hints for the symints on the forward function). By lifting guard checking logic to AOTAutogradCache, we no longer need to check the backward guards, as they'll be included in the AOTAutogradCache guard expression. **I've added a unit test that failed before my diff, and now passes, as an example of this**
- Secondly, this is the first step necessary to bundle CompiledFxGraph into AOTAutogradCache. Doing so will simplify our cache logic significantly, and also make precompile logic simpler, as precompiles will only need to store AOTAutogradCacheEntrys, without needing to match them up with inductor FXGraphCache entries.
- Finally, adding guard checking logic to AOTAutogradCache my allow us in the future to handle more complicated cases like a single forward with multiple backwards, as guard checks are now storable on the cache entry itself.
# Guard checking logic of AOTAutogradCache
When AOTAutogradCache evaluates guard expressions, it no longer needs to evaluate the forward/backward guards in the FXGraphCacheEntry (since the AOTAutogradCache guard expressions will encompass them). Because of this, we still need a way for AOTAutogradCache to distinguish between multiple FXGraphCache local entries. To do so, AOTAutogradCache stores the guard string from FXGraphCache, which it uses as a second "cache key". It doesn't need to **evaluate** these guards, it just needs to find the cache entry from FXGraphCache that had the same guards as when it was stored.
After this, I will work on putting the FXGraphCache entries directly into AOTAutogradCache. If I can put CompiledFxGraphs in the cache directly, I no longer need this complicated `check_guard_hit` overriding logic.
## Test Plan
Added a new unit test. There are comprehensive guard checking unit tests in `test_aot_autograd_cache` already, and those pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151563
Approved by: https://github.com/oulgen
This PR fixes two things.
The first problem is that in the vLLM style standalone_compile is
called from within a custom torch.compile backend. If there already is a
FakeTensorMode (which there is), we shouldn't create a new
FakeTensorMode with the same shape_env, instead we should just reuse the
same FakeTensorMode.
The second thing is that compile_fx can mutate the passed in gm, so we
deepcopy (since standalone_compile should be standalone)
Test Plan:
- new test
- updated old tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151502
Approved by: https://github.com/oulgen
ghstack dependencies: #151501, #151551
We were only returning the first one. There's an edge case on what to do
if the original function returns a single Tensor. capture(f) returns a
function that returns a tuple of one Tensor in this case and we were
originally converting this back to one single Tensor. I think it's fine
to return a tuple of one Tensor (that is what the graph passed to
standalone_compile asked for!) but we can revisit.
fine
Test Plan:
- modified one test to used multiple outputs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151551
Approved by: https://github.com/Skylion007, https://github.com/oulgen
ghstack dependencies: #151501
This PR adds standalone_compile API that does precompilation via caching to support vLLM use case in the short term while we work on the longer term precompilation solution.
```
standalone_compile(gm, example_inputs, options) -> CompiledArtifact
CompiledArtifact.save(path, format: binary|unpacked = binary)
CompiledArtifact.load(path, format: binary|unpacked = binary)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150670
Approved by: https://github.com/jamesjwu, https://github.com/zou3519
This PR adds standalone_compile API that does precompilation via caching to support vLLM use case in the short term while we work on the longer term precompilation solution.
```
standalone_compile(gm, example_inputs, options) -> CompiledArtifact
CompiledArtifact.save(path, format: binary|unpacked = binary)
CompiledArtifact.load(path, format: binary|unpacked = binary)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150670
Approved by: https://github.com/jamesjwu, https://github.com/zou3519
This PR adds standalone_compile API that does precompilation via caching to support vLLM use case in the short term while we work on the longer term precompilation solution.
```
standalone_compile(gm, example_inputs, options) -> CompiledArtifact
CompiledArtifact.save(path, format: binary|unpacked = binary)
CompiledArtifact.load(path, format: binary|unpacked = binary)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150670
Approved by: https://github.com/jamesjwu, https://github.com/zou3519