> capture_error_mode (str, optional): specifies the cudaStreamCaptureMode for the graph capture stream.
Can be "global", "thread_local" or "relaxed". During cuda graph capture, some actions, such as cudaMalloc,
may be unsafe. "global" will error on actions in other threads, "thread_local" will only error for
actions in the current thread, and "relaxed" will not error on these actions.
Inductor codegen is single-threaded, so it should be safe to enable "thread_local" for inductor's cuda graph capturing. We have seen errors when inductor cudagraphs has been used concurrently with data preprocessing in other threads.
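A minimal usage sketch of the new argument (assuming the keyword exposed by this PR on `torch.cuda.graph`):
```python
import torch

x = torch.zeros(8, device="cuda")

# Warm up outside of capture, as usually recommended before graph capture.
for _ in range(3):
    y = x * 2
torch.cuda.synchronize()

g = torch.cuda.CUDAGraph()
# "thread_local" only errors on unsafe actions (e.g. cudaMalloc) issued from the
# capturing thread, so concurrent data preprocessing in other threads is unaffected.
with torch.cuda.graph(g, capture_error_mode="thread_local"):
    y = x * 2

g.replay()
```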
Differential Revision: [D48656014](https://our.internmc.facebook.com/intern/diff/D48656014)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107407
Approved by: https://github.com/albanD, https://github.com/eqy
Working as starter task with @Chillee
This PR adds a method under BaseSchedulerNode to estimate the node's runtime in seconds.
We use a heuristic-based approach, first considering whether the operation is memory-bandwidth bound or compute bound:
- memory-bandwidth bound: we count the number of bytes read and written
- compute bound: we count the FLOPs required by the operation
One use case is as a cost model for scheduling: https://github.com/pytorch/pytorch/pull/100762
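For intuition, the heuristic boils down to a roofline-style estimate along these lines (an illustrative sketch; the constants and function name are assumptions, not the actual implementation):
```python
# Illustrative only; not the exact helpers added under BaseSchedulerNode.
GPU_MEMORY_BANDWIDTH = 2.0e12  # bytes/s (assumed, e.g. an A100-class device)
GPU_FLOPS = 3.0e14             # peak FLOP/s (assumed)

def estimate_runtime_seconds(bytes_accessed: int, flops: int) -> float:
    memory_time = bytes_accessed / GPU_MEMORY_BANDWIDTH  # memory-bandwidth bound term
    compute_time = flops / GPU_FLOPS                     # compute bound term
    # The node is limited by whichever resource it saturates first.
    return max(memory_time, compute_time)
```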
```
(pytorch-3.10) [14:08:02] ~/local/pytorch (xmfan/estimate_snode_runtime) > python3 test/inductor/test_perf.py -k EstimateSnodeRuntimeTests
[(ExternKernelSchedulerNode(name='buf0'), 400)]
[(ExternKernelSchedulerNode(name='buf0'), 2.35057908433887e-27)]
.[(ExternKernelSchedulerNode(name='buf0'), 3000), (SchedulerNode(name='buf1'), 3000)]
[(ExternKernelSchedulerNode(name='buf0'), 2.35057908433887e-26), (SchedulerNode(name='buf1'), 7.187055238190188e-09)]
.[(ExternKernelSchedulerNode(name='buf0'), 3000)]
[(ExternKernelSchedulerNode(name='buf0'), 2.35057908433887e-26)]
.[(ExternKernelSchedulerNode(name='buf0'), 34600)]
[(ExternKernelSchedulerNode(name='buf0'), 3.22687496698039e-24)]
.[(ExternKernelSchedulerNode(name='buf0'), 396)]
[(ExternKernelSchedulerNode(name='buf0'), 1.88046326747109e-27)]
.[(ExternKernelSchedulerNode(name='buf0'), 396)]
[(ExternKernelSchedulerNode(name='buf0'), 1.88046326747109e-27)]
.[(ExternKernelSchedulerNode(name='buf0'), 7776176)]
[(ExternKernelSchedulerNode(name='buf0'), 4.63240241413653e-21)]
.[(FusedSchedulerNode(nodes=buf0_buf1), 210)]
[(FusedSchedulerNode(nodes=buf0_buf1), 5.030938666733132e-10)]
.[(ExternKernelSchedulerNode(name='buf0'), 300)]
[(ExternKernelSchedulerNode(name='buf0'), 2.35057908433887e-27)]
.[(SchedulerNode(name='buf0'), 20)]
[(SchedulerNode(name='buf0'), 4.7913701587934585e-11)]
.
----------------------------------------------------------------------
Ran 10 tests in 14.311s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106426
Approved by: https://github.com/Chillee
Recently I've felt it's a bit painful to run benchmark scripts in my dev environment. E.g., the command below
```
python benchmarks/dynamo/huggingface.py --backend inductor --amp --performance --only YituTechConvBert --training
```
took about 2 minutes to run. It may take even longer for some other models.
The command is slow since it
- needs to do the dynamo work
- verifies the model on CPU
- runs perf tests
- compiles all the graphs
However, oftentimes I only need to debug inductor-specific logic like loop ordering and fusion. A lot of the things the script does are useless for me. Also, I only need to test one graph at a time (e.g. check the fwd graph first and, when I'm done, continue to the bwd graph) rather than compiling all the graphs.
The graph replayer adds a `@save_args` decorator to the `compile_fx_inner` function. When `config.save_args` is true, it pickles all the arguments to `compile_fx_inner` to the file system. Later on, we can call `load_args_and_run_compile_fx_inner("/tmp/inductor_saved_args/compile_fx_inner_0.pkl")` to replay the graph and compile it with inductor.
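A sketch of the intended workflow (config and function names follow the description above; the exact import location is an assumption):
```python
import torch
import torch._inductor.config as inductor_config

# Step 1: with config.save_args enabled, each call into compile_fx_inner has its
# arguments pickled under /tmp/inductor_saved_args/.
inductor_config.save_args = True
model = torch.nn.Linear(16, 16)
torch.compile(model)(torch.randn(2, 16))

# Step 2: replay one saved graph straight through inductor, skipping dynamo,
# aot-autograd, the CPU model verification and the perf harness.
from torch._inductor.compile_fx import load_args_and_run_compile_fx_inner  # assumed location

load_args_and_run_compile_fx_inner("/tmp/inductor_saved_args/compile_fx_inner_0.pkl")
```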
Replaying the fwd graph took around 60 seconds (maybe this can be reduced further, but it is already a 2x speedup for dev efficiency), and it only took around 20 seconds to reach the `Scheduler.__init__` method.
I also checked the `TORCH_COMPILE_DEBUG` flag that already exists. The most similar part of `TORCH_COMPILE_DEBUG` is that it can save a graph and its arguments and rerun it later. But the difference here is that, rather than run the model, we want to call the inductor API to compile the model (without even going through dynamo or aot-autograd).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106952
Approved by: https://github.com/jansel
ghstack dependencies: #106990
https://github.com/pytorch/pytorch/issues/105555
The existing flow first exports and then calls torch._inductor.aot_compile. However, export calls aot_autograd with the core aten decomposition table, and then torch._inductor.aot_compile calls aot_autograd again with the inductor decomposition table. The second call to aot_autograd reportedly causes some problems and seems excessive, so instead we will create a new function, torch._export.aot_compile, which will export using the inductor decomposition table, pass the result to inductor's compile_fx_aot, and, because it has already been exported, avoid calling aot_autograd again.
```
def aot_compile(
    f: Callable,
    args: Tuple[Any],
    kwargs: Optional[Dict[str, Any]] = None,
    constraints: Optional[List[Constraint]] = None,
) -> Tuple[str, ExportedProgram]:
```
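A hedged usage sketch based on the signature above (the return type `Tuple[str, ExportedProgram]` suggests a path to the compiled artifact plus the exported program):
```python
import torch
from torch._export import aot_compile

class M(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x + 1)

# Export with the inductor decomposition table and compile via compile_fx_aot,
# without a second trip through aot_autograd.
so_path, exported_program = aot_compile(M(), (torch.randn(4, 4),))
```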
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105977
Approved by: https://github.com/desertfire, https://github.com/zhxchen17, https://github.com/eellison
Freezing will take parameters and turn them into constants. A couple changes here:
- move the setting of `flat_params[dropped_index]` before cpp compilation so that cpp_wrapper knows they have been dropped
- compile_fx_aot doesn't use aot_autograd for invocation, so we no longer add the wrapper which discards dropped param indices. Continuing to add arguments everywhere didn't seem great, so I added `_in_aot_compilation`, but maybe reviewers would prefer something else.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105497
Approved by: https://github.com/desertfire
This branch:
1) converts the autograd tape into an FX graph
2) caches that conversion using a "shadow" graph
3) compiles and runs the generated FX graph instead of the normal autograd
What works currently:
1) Caching, capture, and initial integration
2) Backwards hooks
3) Inlining AotAutograd generated subgraphs
4) torch.compiling the generated FX graph
5) Auto-detecting dynamic shapes based on changes
Future work
1) Larger scale testing
2) Boxed calling convention, so memory can be freed incrementally
3) Support hooks on SavedTensor
4) Additional testing by running eager autograd tests under compiled_autograd.enable() (see the usage sketch below)
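A minimal usage sketch (assuming the `compiled_autograd.enable()` context manager referenced above takes a compiler for the generated FX graph):
```python
import torch
import torch._dynamo.compiled_autograd as compiled_autograd

model = torch.nn.Linear(8, 8)
x = torch.randn(4, 8, requires_grad=True)

# Route backward through compiled autograd: the tape is converted to an FX graph,
# cached via the shadow graph, and compiled with the supplied compiler.
with compiled_autograd.enable(torch.compile):
    loss = model(x).sum()
    loss.backward()
```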
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103822
Approved by: https://github.com/ezyang, https://github.com/albanD
Summary: This is a follow-up on https://github.com/pytorch/pytorch/pull/105496. There are several issues with the previous fix:
1) It explicitly does a copy for every output at the end of the main function;
2) When an output is a ReinterpretView, no as_strided was generated for it;
3) There can be duplicated buffer declarations.
This PR fixes these by making sure can_reuse behaves consistently between the two AOTInductor passes, and thus always generates the same set of kernels. It also adds handling of ReinterpretView.
Differential Revision: [D47692214](https://our.internmc.facebook.com/intern/diff/D47692214)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105773
Approved by: https://github.com/jansel
**TL;DR**: if lowerings.py encounters aten.index_put, it will set V.graph.cudagraphs_okay = False, which will disable cudagraphs. index_put needs to be disabled because it crashes cuda graphs.
index_put_ fallbacks fail with cuda graphs when `accumulate=True` - likely for the same reason that it fails with deterministic_algorithms_enabled:
fcb7d4b358/aten/src/ATen/native/TensorAdvancedIndexing.cpp (L730)
A first attempt was just to expand the scenarios where `index_put_` is one of the disallowed kernels in utils.py: 2fa7d11b64/torch/_inductor/utils.py (L436-L438)
However, this disables cuda graphs in too many scenarios: index_put doesn't cause issues if it gets fused; it only causes issues if the aten kernel gets called. So in the updated version of this PR, we check for fallbacks in lowerings.py and disable cudagraphs only if a fallback is encountered there.
Example of failure outside of PT2:
```python
import torch

def fn(x, y, z):
    x = torch.zeros_like(x)
    return x.index_put_([y], z, True)
    # return x + 1

x = torch.zeros((512, 512), dtype=torch.bool, device='cuda')
y = torch.arange(512, dtype=torch.int64, device='cuda')
z = torch.ones((512, 512), dtype=torch.bool, device='cuda')

s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for i in range(3):
        fn(x, y, z)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    fn(x, y, z)
```
fails with
```
Traceback (most recent call last):
  File "/data/users/dberard/scripts/graphed_index_put.py", line 24, in <module>
    fn(x, y, z)
  File "/data/users/dberard/scripts/graphed_index_put.py", line 8, in fn
    return x.index_put_([y], z, True)
RuntimeError: CUDA error: operation not permitted when stream is capturing
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/users/dberard/scripts/graphed_index_put.py", line 24, in <module>
    fn(x, y, z)
  File "/data/users/dberard/pytorch/torch/cuda/graphs.py", line 173, in __exit__
    self.cuda_graph.capture_end()
  File "/data/users/dberard/pytorch/torch/cuda/graphs.py", line 79, in capture_end
    super().capture_end()
RuntimeError: CUDA error: operation failed due to a previous error during capture
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
Differential Revision: [D47538548](https://our.internmc.facebook.com/intern/diff/D47538548)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105439
Approved by: https://github.com/eellison
Summary:
Until we can further investigate the autotuning differences between MAST and non-MAST (devserver) environments, turn off the global cache for all non-MAST environments. This ensures we don't see unexpected regressions.
Also update Scuba logging for cache lookups, and add Scuba logging for autotuning results.
Test Plan: sandcastle + CI
Differential Revision: D47516633
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105375
Approved by: https://github.com/jansel
issues resolved: https://github.com/pytorch/pytorch/issues/101832
**context**: get the torch.compile config for further usage. E.g., the training platform wants to know whether the model is compiled with cudagraphs enabled and trigger further action.
**how it is implemented**
* the core logic is backend.get_compiler_config() in torch/_dynamo/eval_frame.py
* for backend='inductor' / _TorchCompileInductorWrapper, we have an inductor-specific implementation of get_compiler_config in torch/_inductor/compile_fx.py and torch/__init__.py
**how to use it**: Below is an example.
```
model = DummyModule()
optimized_module = torch.compile(
    model, options={"triton.cudagraphs": True}
)
compiler_config = optimized_module.get_compiler_config()
if compiler_config["triton.cudagraphs"]:
    pass
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105026
Approved by: https://github.com/yanboliang, https://github.com/jansel
Previously, x.size(0) could return a SymInt, even when the internal
sympy expression was actually already constant (e.g., due to an
introduced guard.) We now allow querying the Python object with
maybe_as_int, which allows us to transmute these objects back to
int when possible.
It is still possible to end up with a constant SymInt even after this
change, e.g., if you get out a SymInt and while holding onto it
specialize it, but casual users are more likely to get ints when they
want to.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104828
Approved by: https://github.com/Skylion007
The general idea is to do a separate CUDA graph for each size. Because of cuda graph trees, these graphs will all share the same memory pool, so your memory usage will only be the worst case memory usage of the biggest dynamic size you want. This requires an extra dispatch in the cudagraphified callable. You must pay for a CUDA graph recording for every dynamic size you encounter, but this is MUCH cheaper than running the entire PT2 compile stack, so I expect you to still see benefits.
This was surprisingly easy to do.
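A simplified sketch of that dispatch (not the actual cudagraph-trees code; warm-up before capture is elided, and in the real implementation the recordings also share one memory pool):
```python
import torch

def cudagraphify_per_size(fn):
    graphs = {}  # observed input shape -> (graph, static input, static output)

    def run(x):
        key = tuple(x.shape)
        if key not in graphs:
            # First time this size is seen: pay for one graph recording.
            static_x = x.clone()
            g = torch.cuda.CUDAGraph()
            with torch.cuda.graph(g):
                static_out = fn(static_x)
            graphs[key] = (g, static_x, static_out)
        g, static_x, static_out = graphs[key]
        static_x.copy_(x)  # refresh the static input, then replay the recording
        g.replay()
        return static_out

    return run
```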
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105064
Approved by: https://github.com/voznesenskym
This prefigures a refactor that will move the backward compilation
to entirely ahead of time, so I need to extract these strides some
other way. Straight from the compiler's mouth will do it.
I can't easily get the information via the return result of `fw_compiler` without changing the calling convention, so instead I smuggle it via TracingContext. TracingContext may be None when we are compiling patterns for the joint graph pattern matcher.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105010
Approved by: https://github.com/shunting314
Summary:
Add a new path in `post_grad.py` for replacing addmm + ReLU / GELU activation with the corresponding `_addmm_activation` call (with `use_gelu=False` or `True`, respectively). The replacement is done only when `max_autotune_gemm=False` and the activation is fusible.
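For reference, the equivalence the pattern relies on looks roughly like this (a sketch, not the pattern-matcher code itself):
```python
import torch

bias = torch.randn(16, device="cuda")
x = torch.randn(8, 32, device="cuda")
w = torch.randn(32, 16, device="cuda")

eager = torch.relu(torch.addmm(bias, x, w))
# What the post_grad pass rewrites the pattern into: a single fused-epilogue call.
fused = torch._addmm_activation(bias, x, w, use_gelu=False)

torch.testing.assert_close(eager, fused)
```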
Test Plan:
$ python test/inductor/test_pattern_matcher.py -k test_addmm_activation -v
(__main__.TestPaternMatcher.test_addmm_activation) ... /data/users/aakhundov/pytorch/torch/_inductor/compile_fx.py:128: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
warnings.warn(
Using FallbackKernel: aten._addmm_activation.default
Using FallbackKernel: aten._addmm_activation.default
/data/users/aakhundov/pytorch/torch/_dynamo/eval_frame.py:373: UserWarning: changing options to `torch.compile()` may require calling `torch._dynamo.reset()` to take effect
warnings.warn(
frames [('total', 1), ('ok', 1)]
stats [('calls_captured', 2), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
inductor []
ok
----------------------------------------------------------------------
Ran 1 test in 13.415s
OK
Reviewers: @eellison
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104132
Approved by: https://github.com/eellison, https://github.com/jansel
Some notes:
* I now manually stop `_generate` jobs from running with cudagraphs, as it is unrealistic to expect to cudagraph autoregressive generation up to the max sequence length; this would imply compiling the entire unrolled sequence generation. Concretely, cm3leon_generate was timing out post this change, likely due to the compile time slowdown of dynamic shapes ON TOP OF accidentally unrolling all the loops
* A few torch._dynamo.reset calls were tactically inserted to force recompiles on tests that expected them
* expectedFailureAutomaticDynamic flipped into patching automatic_dynamic_shapes=False
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103623
Approved by: https://github.com/voznesenskym
This PR handles inference. Will do a similar thing for training later.
Some manual testing results show this can improve inference perf by 2-3% (absolute improvement, not relative):
- convmixer: 4.285x -> 4.309x
- resnet50: 2.170x -> 2.203x
The PR is built upon freezing, since without freezing the weight input for a conv node may not be a parameter directly but the output of precision-converting ops. It's much easier to implement this PR after freezing.
Commands
```
TORCHINDUCTOR_FREEZING=1 python benchmarks/dynamo/timm_models.py --backend inductor --amp --performance --only convmixer_768_32 --inference
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103642
Approved by: https://github.com/eellison
Adds Conv-BN folding to inductor freezing. One thing that's a little awkward now is that we'll want different decompositions to run depending on whether we are in the inference compiler. For now, I require that you run with torch.no_grad() so we can detect that no gradients are required before calling aot_autograd.
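For context, the standard Conv-BN folding arithmetic looks like this (a generic sketch, not the exact pass added in this PR):
```python
import torch

def fold_conv_bn(conv: torch.nn.Conv2d, bn: torch.nn.BatchNorm2d) -> torch.nn.Conv2d:
    # BN(conv(x)) = gamma * (conv(x) - mean) / sqrt(var + eps) + beta can be
    # absorbed into the conv weight and bias at inference time.
    fused = torch.nn.Conv2d(
        conv.in_channels, conv.out_channels, conv.kernel_size,
        stride=conv.stride, padding=conv.padding, dilation=conv.dilation,
        groups=conv.groups, bias=True,
    )
    with torch.no_grad():
        scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
        fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
        conv_bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
        fused.bias.copy_((conv_bias - bn.running_mean) * scale + bn.bias)
    return fused
```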
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100653
Approved by: https://github.com/jansel
Adds a freezing pass, gated by inductor `config.freezing`, that will constant-fold parameters. This occurs post-functionalization in aot_autograd, both to capture dispatching and to allow passes to run post-functionalization. A few notes:
- There is an option to discard parameters `config.freezing_discard_parameters` which will take the current eager modules and wrap parameters to a Tensor subclass which will error if used.
- I needed to expose flat_params in aot_autograd in order to discard old references when we constant fold away parameters, like with amp. I also exposed `fw_metadata` to avoid constant folding mutated parameters.
- Caching parameter transformations/constant folding across different inferences nyi
- Checking version_counter of constant folded params nyi
I'm not really sure what the actual naming should be. In jit there was both "freezing", which was platform-agnostic, and "optimize for inference", which made device-specific optimizations. We're doing the latter here, but maybe freezing is a better name.
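A minimal usage sketch under the constraints described above (config names taken from this PR's description; run under `torch.no_grad()` so the inference path is taken):
```python
import torch
import torch._inductor.config as inductor_config

inductor_config.freezing = True  # same switch as TORCHINDUCTOR_FREEZING=1
# inductor_config.freezing_discard_parameters = True  # optionally drop eager params

model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.BatchNorm2d(8)).eval()
opt = torch.compile(model)

with torch.no_grad():  # inference only: parameters become constants
    out = opt(torch.randn(1, 3, 32, 32))
```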
Differential Revision: [D46244033](https://our.internmc.facebook.com/intern/diff/D46244033)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100652
Approved by: https://github.com/jansel
Previously, cudagraphs and dynamic_shapes were incompatible and enabling
dynamic shapes would forcibly disable cudagraphs. I think this new
strategy is better. The idea is essentially that cudagraphs is an
"optimization" that happens to guard on every input. When cudagraphs
is on, we force everything static, and this automatically does the right
thing because we will force a recompile if sizes change.
This obsoletes https://github.com/pytorch/pytorch/pull/101813
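In practice this means something like the following (a sketch; `mode="reduce-overhead"` is the usual way cudagraphs gets switched on through `torch.compile`):
```python
import torch

@torch.compile(mode="reduce-overhead", dynamic=True)  # cudagraphs + dynamic shapes
def f(x):
    return x.sin() + 1

f(torch.randn(8, device="cuda"))   # records a cudagraph specialized to size 8
f(torch.randn(16, device="cuda"))  # size change -> recompile, a new static cudagraph
```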
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103290
Approved by: https://github.com/voznesenskym, https://github.com/eellison
Summary: bias_addmm is not backed by a cpp function, so turn off
autotune_cublasLt for cpp_wrapper + max_autotune. We can add a cpp
function implementation if there is a performance need.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103004
Approved by: https://github.com/jansel
Key change - seed, offset are the last 2 args in both the fwd and bwd graphs
Reason - The cudagraphs implementation in inductor currently relies on very simple ordering guarantees, i.e., the first n inputs are static for both fwd and bwd graphs. In the current implementation of functionalization of rng ops, this assumption is broken because the first 2 inputs are seed and offset.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102344
Approved by: https://github.com/eellison