Currently, if we have an inplaced buffer that's completely internal to a fused kernel and thus doesn't need to be allocated, we still allocate it and pass an unused argument to the kernel, because our buffer-removal analysis treats inplaced buffers separately (assuming that either the original or the mutated value is still needed).
This PR extends buffer removal to inplaced buffers that can be removed.
The generated kernel for, e.g., layer norm changes from
```
def triton_(in_out_ptr0, in_out_ptr1, in_ptr0, in_ptr1, in_ptr2, out_ptr0, out_ptr1, xnumel, rnumel, XBLOCK : tl.constexpr):
```
where `in_out_ptr0` is unused in the kernel, to
```
def triton_(in_out_ptr1, in_ptr0, in_ptr1, in_ptr2, out_ptr0, out_ptr1, xnumel, rnumel, XBLOCK : tl.constexpr):
```
and corresponding allocation/reuse lines in the wrapper are removed.
`in_out_ptr1` is also mislabeled: it's not really an `in_out` pointer, since it is only written to, but this PR doesn't fix that.
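For illustration only, a hypothetical sketch (not output from this PR, and assuming a CUDA device) of the general shape of the wrapper lines that become dead and are no longer emitted:
```python
import torch

# hypothetical allocation of the internal inplaced buffer; no longer emitted
buf0 = torch.empty_strided((8, 2048), (2048, 1), device='cuda', dtype=torch.float32)
# hypothetical reuse line for that buffer; likewise dropped
buf3 = buf0; del buf0  # reuse
```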
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102289
Approved by: https://github.com/jansel
Fixes cpp wrapper support for kernels that are not exposed in `torch.ops.aten`. This PR limits the support scope to `repeat_interleave.Tensor`; follow-up PRs will cover more ops.
The PR maps the Python schema of the kernel to the cpp schema and uses `c10::Dispatcher::singleton().findSchemaOrThrow` to find the corresponding cpp op.
The current support is limited and will raise an `AssertionError` for unsupported cases.
The limitations include:
- only supports kernels that are not aliases
- only supports kernels whose args and returns don't have `alias_info`
- only supports output args that are `Tensor`
- only supports input args that are `Tensor`, `Optional[int]`, `Optional[float]`, or `Optional[bool]`
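A minimal Python sketch of the schema lookup described above (illustrative only; `emit_cpp_dispatcher_lookup` is a hypothetical helper, not the PR's actual codegen):
```python
# Hypothetical helper that maps a Python op name/overload to the C++
# dispatcher lookup mentioned above.
def emit_cpp_dispatcher_lookup(op_name: str, overload: str) -> str:
    # e.g. op_name="aten::repeat_interleave", overload="Tensor"
    return (
        f'auto op = c10::Dispatcher::singleton()'
        f'.findSchemaOrThrow("{op_name}", "{overload}");'
    )

print(emit_cpp_dispatcher_lookup("aten::repeat_interleave", "Tensor"))
```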
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100788
Approved by: https://github.com/jgong5, https://github.com/desertfire
Fixes #100314
In dependencies, we should track not only the immediately used buffer but also the aliased buffers that point to it; otherwise we can reuse and overwrite the buffer while there are still pending uses of the aliases.
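A small, hypothetical illustration of the hazard: `c` aliases the intermediate `b`, so `b`'s storage must not be reused while `c` still has pending uses.
```python
import torch

@torch.compile
def f(a):
    b = a + 1        # intermediate buffer
    c = b.view(-1)   # alias of b: reusing b's storage too early would corrupt c
    d = b * 2        # a later use that could otherwise trigger buffer reuse
    return c, d

f(torch.randn(4, 4))
```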
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100332
Approved by: https://github.com/jansel
This is a two part PR; I can split it if you really want me to.
The first part is a refactor of the after-aot repro/minifier scripts to come with a command-line interface. I maintain exact BC with the previous interface (so, e.g., you still get a repro.py and a run_minifier.py that do the same thing as before), but each of these scripts now also takes command-line arguments which you can use to customize what actually happens. Check `run_repro` for full documentation on the arguments.
The second part is an implementation of an `analyze` subcommand on the new CLI for any repro.
<img width="1277" alt="image" src="https://user-images.githubusercontent.com/13564/235045677-8545aab7-5e83-4813-bbec-47783dc60122.png">
This facility is oriented towards accuracy debugging. It does several things:
1. It will run your model twice and check for nondeterminism in inductor/float64, *even* on intermediate inputs (our benchmarking nondeterminism test only checks for nondeterminism on the final output). This makes it easy to localize which operator is nondeterministic.
2. It will run your compiled model side-by-side with eager and float64 variants, and then report when things diverge too far, measured by RMSE delta from float64.
Importantly, it does all this without requiring every intermediate to be held in memory (which would cause an OOM on large repros, such as the one I tested this on).
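For example, assuming the dumped script is named repro.py as above, the new subcommand can be invoked as `python repro.py analyze` (a hypothetical invocation; `run_repro` documents the exact arguments).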
Some other minor improvements:
* MinifierTestBase now has an easy-to-comment-out spot that you can use to retain the temporary directory; good for debugging
* We print "running minifier" and "running repro" in MinifierTestBase to make it easier to orient where logs are coming from
* `same` takes an optional `log_error` argument which you can use to reroute the error logs when things mismatch
* counters["inductor"]["intermediate_hooks"] tracks the number of intermediate hooks we've codegen'ed; good for populating the tqdm interface
* torch.fx.Interpreter gets an official `boxed_run` interface which uses the boxed arguments calling convention and doesn't retain inputs longer than necessary (see the sketch after this list)
* torch.utils._content_store gets compute_tensor_metadata/read_tensor_metadata helper functions for computing tensor information without serializing it
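A minimal sketch of the `boxed_run` calling convention from the list above; the traced function is just a stand-in, the point being that the input list is consumed so the interpreter does not keep inputs alive:
```python
import torch
import torch.fx

def f(x):
    return x.relu() + 1

gm = torch.fx.symbolic_trace(f)
args = [torch.randn(8)]
out = torch.fx.Interpreter(gm).boxed_run(args)
# args has been emptied: boxed_run takes ownership of the list and drops each
# input as soon as it is no longer needed.
print(out.shape, len(args))  # torch.Size([8]) 0
```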
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100226
Approved by: https://github.com/bertmaher, https://github.com/bdhirsh, https://github.com/anijain2305
Currently, we track 'origins' on IR nodes so that we have some idea about what FX IR nodes contributed to any given fused kernel. However, the origins are dumped into an undifferentiated set, so if you have, e.g., multiple outputs, you cannot easily tell which output corresponds to which FX node.
This PR introduces a more precise notion of tracking, "origin_node", which says that the contents of this Buffer/Loop node correspond EXACTLY to the output of a particular FX node; e.g., if you serialized each intermediate when running the generated inductor code, you could compare them with the corresponding intermediates from the original FX graph.
Tracking origin_node in all cases requires quite a bit of effort, so this PR introduces the tracking on a strictly best-effort basis. The logic in torch/_inductor/graph.py sets up the associations, but only when it is "obvious" which IR node should get the assignment, and there is work in torch/_inductor/ir.py for propagating this information around as necessary. Like origins, origin_node is not a true dataclass field (as that would break all existing positional-arg call sites); instead, it is added post facto via `__post_init__`. At the moment, only Buffer/Loop may have an origin_node, but we could imagine relaxing this in the future.
The payoff is in torch/_inductor/codegen/wrapper.py and torch/_inductor/codegen/triton.py, where we currently just print the FX node name and the tensor (but a more useful integration will be coming later).
I also introduce a debugging tool `debug_ir_traceback` which tracks tracebacks of where IRNodes were allocated, to help you understand why a node doesn't have an `origin_node`.
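A tiny, hypothetical sketch (not the actual inductor classes) of the `__post_init__` pattern described above, attaching `origin_node` without breaking positional-arg call sites:
```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Buffer:
    name: str
    layout: Any

    def __post_init__(self):
        # Not a true dataclass field, so Buffer("buf0", layout) call sites keep
        # working; origin_node is attached after construction and filled in
        # later only when the association to an FX node is obvious.
        self.origin_node: Optional[Any] = None

buf = Buffer("buf0", layout=None)
assert buf.origin_node is None
```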
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100110
Approved by: https://github.com/voznesenskym
The Python function `benchmark_compiled_module` ends up using the C++ expression printer to print the sizes for `rand_strided`, so you get a set, e.g. `{2, 17}`, instead of a tuple `(2, 17)`. Here is a complete example from master:
```python
def benchmark_compiled_module(times=10, repeat=10):
    from torch._dynamo.testing import rand_strided
    from torch._inductor.utils import print_performance
    arg0_1 = rand_strided({2, 17}, {17, 1}, device='cpu', dtype=torch.float32)
    arg1_1 = rand_strided({2, 17}, {17, 1}, device='cpu', dtype=torch.uint8)
    return print_performance(lambda: call([arg0_1, arg1_1]), times=times, repeat=repeat)
```
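With the fix, the same sizes and strides should be printed with the Python expression printer, i.e. as tuples, e.g.:
```python
arg0_1 = rand_strided((2, 17), (17, 1), device='cpu', dtype=torch.float32)
```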
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98608
Approved by: https://github.com/ngimel
Summary: This is a reland of #98264.
When `_inductor.config.cpp_wrapper` is specified, we run a two-pass wrapper codegen to generate wrapper code in C++ that calls cuLaunchKernel to launch the pre-compiled CUDA kernels, and then call `load_inline` to load that generated wrapper back into the Python world.
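A minimal sketch of enabling the flag (illustrative only; assumes a CUDA device and uses the config name from the description above):
```python
import torch
import torch._inductor.config as inductor_config

inductor_config.cpp_wrapper = True  # switch on the two-pass C++ wrapper codegen

@torch.compile
def f(x):
    return torch.relu(x) + 1

f(torch.randn(1024, device="cuda"))
```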
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98534
Approved by: https://github.com/huydhn
Summary: When `_inductor.config.cpp_wrapper` is specified, we run a two-pass wrapper codegen to generate wrapper code in C++ that calls cuLaunchKernel to launch the pre-compiled CUDA kernels, and then call `load_inline` to load that generated wrapper back into the Python world.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98264
Approved by: https://github.com/ngimel
1. Fixed dynamic shapes support in cpp_wrapper
- fixed the cpp codegen of `size()` and `stride()`
- fixed the cpp codegen of `ShapeAsConstantBuffer`
- changed to use `cexpr` instead of `pexpr` in the cpp codegen of the `sizevar`
2. Enabled dynamic shapes tests for cpp_wrapper
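A small sketch of exercising the dynamic-shapes path under the same flag (illustrative only; assumes the `cpp_wrapper` config shown earlier):
```python
import torch
import torch._inductor.config as inductor_config

inductor_config.cpp_wrapper = True

@torch.compile(dynamic=True)  # exercises the size()/stride() cpp codegen fixed here
def g(x):
    return x * 2 + x.sum()

g(torch.randn(5, 7))
g(torch.randn(9, 3))  # a second shape goes through the dynamic-shape wrapper path
```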
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97965
Approved by: https://github.com/jgong5, https://github.com/jansel
Summary:
This is a copy of https://github.com/pytorch/pytorch/pull/97152 to make
the landing easier.
This PR implements a two-pass wrapper codegen for the Triton backend to achieve ahead-of-time compilation. In the first pass, the regular Python wrapper code is generated, and the generated code is then executed to perform Triton compilation and autotuning. After that, the second-pass wrapper codegen generates a C++ wrapper with the proper CUDA APIs to load and launch the Triton-generated CUDA kernels.
As with the AOT mode for the cpp backend, the next step would be to provide a more complete API for AOT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98214
Approved by: https://github.com/eellison
The following metrics should be helpful:
- percent of time GPU is busy
- percent of time each category of kernels (e.g. pointwise/reduction Triton kernels) takes
- percent of time each individual kernel takes compared to total wall time of the benchmark
This PR adds those.
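A rough sketch (not this PR's implementation) of how per-kernel self CUDA time and the percent-of-wall-time numbers could be derived from the PyTorch profiler; the helper name and the wall-time argument are assumptions:
```python
import torch
from torch.profiler import profile, ProfilerActivity

def kernel_time_report(fn, wall_time_ms):
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        fn()
    rows = []
    for evt in prof.key_averages():
        cuda_ms = evt.self_cuda_time_total / 1000.0  # profiler reports microseconds
        if cuda_ms > 0:
            rows.append((evt.key, cuda_ms, evt.count, 100.0 * cuda_ms / wall_time_ms))
    return sorted(rows, key=lambda r: r[1], reverse=True)
```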
Example result from hf_Bert inference graph:
```
== triton_pointwise category kernels ==
Kernel Self CUDA TIME (ms) Count Percent
------------------------------ --------------------- ------- ---------
triton_poi_fused_gelu_6_0d1d 0.48154 12.0 5.52%
triton_poi_fused_clone_1_0d1d2 0.29011 24.0 3.33%
triton_poi_fused_clone_2_0d1d2 0.17417 12.0 2.00%
triton_poi_fused_clone_4_0d1d2 0.10797 12.0 1.24%
Total 1.05379 12.08%
== triton_persistent_reduction category kernels ==
Kernel Self CUDA TIME (ms) Count Percent
------------------------------ --------------------- ------- ---------
triton_per_fused__softmax__to_ 0.97188 12.0 11.14%
triton_per_fused_add_native_la 0.37401 24.0 4.29%
triton_per_fused_gelu_native_l 0.02 1.0 0.23%
triton_per_fused_add_embedding 0.01718 1.0 0.20%
Total 1.38307 15.86%
== unknown category kernels ==
Kernel Self CUDA TIME (ms) Count Percent
------------------------------ --------------------- ------- ---------
ampere_fp16_s16816gemm_fp16_12 2.24514 24.0 25.74%
ampere_fp16_s16816gemm_fp16_25 1.39796 49.0 16.03%
void cutlass::Kernel<cutlass_8 1.36093 1.0 15.61%
ampere_fp16_s16816gemm_fp16_64 0.74591 12.0 8.55%
ampere_fp16_s16816gemm_fp16_12 0.61989 12.0 7.11%
Memset (Device) 0.024 12.0 0.28%
void at::native::(anonymous na 0.01543 2.03 0.18%
void at::native::vectorized_el 0.00011 0.03 0.00%
Total 6.40937 73.49%
Percent of time when GPU is busy: 101.44%
```
Note: the output shows that the total time the GPU is busy is larger than the total wall time. We measure total wall time with profiling disabled while measuring GPU time with profiling enabled, which may distort the measurement a bit, but I assume the effect is not too large since the profiler mostly increases CPU time (rather than GPU time).
## interesting usages
1. I pick a model for which cudagraphs improves perf significantly, like densenet121, and run the tool on its forward graph. It's no surprise that the GPU is idle for quite a lot of the time:
```
(Forward graph) Percent of time when GPU is busy: 32.69%
Total wall time 17.307 ms
```
Its backward graph has a lower percentage of GPU idle time, but it's still high:
```
(Backward graph) Percent of time when GPU is busy: 46.70%
Total wall time 17.422 ms
```
2. I profile a subset of torchbench models and plot a table to show the percent of execution time for pointwise/reduction/persistent_reduction/unknown_category. Since I plan to explore using the coordinate descent tuner to improve reductions, models with a high percent of time spent on reduction should be good candidates (e.g. resnet50, mobilenet_v2).
NOTE: each model appears twice. The first row is for the fwd graph and the second for the bwd graph. We profile a model's different graphs separately.
```
benchmark_name pointwise_percent reduction_percent persistent_reduction_percent unknown_category_percent GPU_busy_percent wall_time_ms
----------------------- ------------------- ------------------- ------------------------------ -------------------------- ------------------ --------------
resnet18 19.73% 7.86% 4.81% 41.25% 73.65% 2.549ms
resnet18 18.59% 7.13% 3.35% 67.35% 96.41% 3.467ms
resnet50 29.57% 22.13% 2.07% 51.68% 105.46% 6.834ms
resnet50 26.42% 15.27% 0.94% 59.68% 102.31% 13.346ms
vgg16 26.23% 0.00% 0.00% 74.20% 100.43% 18.212ms
vgg16 15.63% 5.61% 0.10% 79.42% 100.75% 33.485ms
BERT_pytorch 28.62% 4.82% 14.88% 33.32% 81.64% 7.162ms
BERT_pytorch 14.43% 13.41% 18.19% 49.24% 95.27% 10.395ms
densenet121 11.89% 2.14% 3.86% 16.36% 34.25% 16.531ms
densenet121 10.37% 2.06% 4.09% 31.46% 47.98% 16.934ms
hf_Bert 23.94% 0.00% 29.88% 46.09% 99.90% 7.766ms
hf_Bert 11.65% 10.54% 20.26% 61.66% 104.11% 11.892ms
nvidia_deeprecommender 42.92% 0.00% 0.00% 56.75% 99.67% 3.476ms
nvidia_deeprecommender 31.36% 3.44% 0.46% 65.20% 100.45% 3.872ms
alexnet 30.99% 0.00% 0.00% 69.16% 100.14% 3.169ms
alexnet 24.41% 4.83% 0.17% 71.09% 100.50% 4.709ms
mobilenet_v2 29.21% 27.79% 2.49% 44.00% 103.49% 10.160ms
mobilenet_v2 17.50% 15.05% 1.06% 69.68% 103.29% 20.715ms
resnext50_32x4d 18.96% 9.28% 2.31% 28.79% 59.33% 5.899ms
resnext50_32x4d 18.48% 11.01% 1.86% 53.80% 85.14% 7.167ms
mnasnet1_0 19.07% 14.52% 3.01% 35.43% 72.03% 6.028ms
mnasnet1_0 14.17% 12.00% 1.87% 67.56% 95.60% 9.225ms
squeezenet1_1 38.56% 0.00% 1.77% 56.21% 96.53% 2.221ms
squeezenet1_1 21.26% 7.57% 1.05% 67.30% 97.18% 4.942ms
timm_vision_transformer 17.05% 0.00% 18.80% 65.79% 101.64% 9.608ms
timm_vision_transformer 9.31% 9.07% 10.32% 73.25% 101.96% 16.814ms
```
## how to use
`python {compiled_module_wrapper.py} -p`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97723
Approved by: https://github.com/jansel
V.graph.constants like `seed_cuda_0` are not handled properly in the wrapper. Recently we moved the code that initializes constants from the global scope into a function. That makes assigning to `seed_cuda_0` create a new local variable rather than set up the global variable.
Add 'global var_name' lines to maintain the same behavior as before.
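A minimal illustration of the scoping issue and the fix (the constant name mirrors the `seed_cuda_0` example above; the value is arbitrary):
```python
seed_cuda_0 = None  # module-level constant set up by the wrapper

def _initialize_constants():
    global seed_cuda_0  # without this line, the assignment below would only
                        # create a local and leave the module-level name untouched
    seed_cuda_0 = 42

_initialize_constants()
assert seed_cuda_0 == 42
```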
Test:
Run the forward graph for nvidia_deeprecommender's training run. It previously failed and now passes with the fix.
Thanks @ngimel for reporting the issue with a repro and @Chillee for pointing out the root cause.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97571
Approved by: https://github.com/ngimel