Commit Graph

94 Commits

SherlockNoMad
a44f8894fa [Inductor] Provenance tracking for wrapper code (#105717)
Summary:
Add comments in wrapper code for better provenance tracking

Sample inductor wrapper output:
```
# Source Nodes: [mm_1], Original ATen: [aten.mm]
extern_kernels.mm(as_strided(tangents_1, (500, 20), (1, 500)), view, out=buf1)

# Source Nodes: [l__self___linear], Original ATen: [aten.addmm]
extern_kernels.addmm(primals_2, as_strided(primals_3, (20, 500), (500, 1)), as_strided(primals_1, (500, 500), (1, 500)), alpha=1, beta=1, out=buf0)
```

In the cpp wrapper:
```
        // Source Nodes: [bmm_1], Original ATen: bmm
        at::bmm_out(buf0, arg0_1, arg1_1);
```

Test Plan: OSS CI

Differential Revision: D47657260

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105717
Approved by: https://github.com/desertfire, https://github.com/jansel
2023-07-21 23:06:43 +00:00
Shunting Zhang
1e87778552 [inductor] refactor wrapper benchmark code out of utils.py (#105584)
Refactor the wrapper benchmark code out of utils.py since:
1. utils.py is getting too large
2. I plan to add more code to the wrapper benchmark for multi-kernel.

This is split out from https://github.com/pytorch/pytorch/pull/103469

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105584
Approved by: https://github.com/jansel
2023-07-21 00:01:35 +00:00
Justin Chu
cb7a30f656 [BE] Enable ruff's UP rules and autoformat inductor/ (#105431)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105431
Approved by: https://github.com/albanD
2023-07-19 13:45:00 +00:00
Edward Z. Yang
2fa7d11b64 Immediately compile backwards graph in AOTAutograd if dynamic shapes (#104971)
Previously, we made backwards graph compilation lazy to avoid paying
for compilation if the user didn't actually end up using the backwards
graph.  This was useful in the old days when a lot of things in Inductor
didn't work and we could bypass errors this way.

However, this has a bad implication for dynamic shapes: the backwards
graph compilation can trigger extra guards, which are too late to
install in the Dynamo context if we wait until backwards is being run.
So in this PR I move us back to compiling the backwards graph immediately
if we capture any SymInts for backwards.
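
For context, a minimal sketch of the scenario (an assumed example, not code from this PR): with dynamic shapes, the backward graph is now compiled during forward compilation, so any guards it adds still land in the Dynamo context.
```python
import torch

def f(x):
    return (x * x).sum()

opt_f = torch.compile(f, dynamic=True)
x = torch.randn(8, requires_grad=True)
# Compiling the forward now also compiles the backward graph whenever
# SymInts are captured for backward, so its guards are installed in time.
out = opt_f(x)
out.backward()
```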

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104971
Approved by: https://github.com/Chillee
2023-07-17 15:37:17 +00:00
chunyuan
1fdc88f877 Inductor cpp wrapper: fix codegen of FallbackKernel with kwargs (#104575)
Fix cpp wrapper failure on TorchBench model `hf_Reformer` with `randn`:
```
random_rotations = torch.randn(rotations_shape, device=vectors.device, dtype=vectors.dtype)
```

For the cpp wrapper, when `kwargs` is not empty, for an `OpOverloadPacket` kernel we need to know the exact overload schema to handle the `kwargs` properly when calling the cpp kernel: this includes finding the correct order of the kwargs and getting the default values for optional args that were not provided at the call site (`layout` in the above case).

The current support in this PR is conservative and we'll extend the functionality in subsequent PRs.
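
For illustration, a rough sketch (a hypothetical helper, not Inductor's actual code) of resolving an `OpOverloadPacket` to a concrete overload schema so kwargs can be ordered and defaulted:
```python
import torch

def resolve_overload(packet, kwargs):
    # Pick an overload whose schema accepts all provided kwargs; the schema
    # then gives the argument order and the default values for missing args.
    for name in packet.overloads():
        overload = getattr(packet, name)
        arg_names = {a.name for a in overload._schema.arguments}
        if set(kwargs) <= arg_names:
            return overload
    raise RuntimeError("no matching overload")

op = resolve_overload(torch.ops.aten.randn, {"dtype": torch.float16, "device": "cpu"})
print(op._schema)  # lists kwargs such as dtype, layout, device in schema order
```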

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104575
Approved by: https://github.com/jgong5, https://github.com/desertfire
2023-07-15 03:33:44 +00:00
Peter Bell
66fb83293e [inductor] Add min/max to index propagation pass (#105020)
This allows `ops.minimum` and `ops.maximum` used in indirect indexing to be
hoisted into direct indexing expressions. I also add support for Min/Max to
the cpp printer and fix the triton printer to support multi-argument Min/Max.
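
As a hedged illustration (an assumed example, not a test from this PR), a pattern that benefits: a clamp on an index decomposes into `ops.minimum`/`ops.maximum`, which can now fold into the indexing expression:
```python
import torch

def gather_clamped(x, idx):
    n = x.size(0)
    # clamp decomposes into minimum/maximum; with this pass the bounds can be
    # hoisted from the indirect index into a direct indexing expression.
    return x[torch.clamp(idx, 0, n - 1)]

compiled = torch.compile(gather_clamped)
print(compiled(torch.randn(16), torch.tensor([3, 40, -2])))
```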

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105020
Approved by: https://github.com/lezcano
2023-07-12 19:03:01 +00:00
lezcano
7ae100628e Move most SymPy functions to their own file (#104556)
All of these are standalone implementations of functions that don't depend
on anything else, so it is better to have them on their own under the
`_sympy/` folder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104556
Approved by: https://github.com/ezyang
2023-07-04 03:53:48 +00:00
Shunting Zhang
98f00f881f [inductor] convert layout of conv weight ahead of time for inference (#103642)
This PR handles inference. A similar change for training will follow later.

Some manual testing results show that this can improve inference perf by 2-3% (absolute improvement, not relative):
- convmixer: 4.285x -> 4.309x
- resnet50: 2.170x -> 2.203x

The PR is built upon freezing: without freezing, the weight input of a conv node may not be a parameter directly but the output of precision-converting ops, so it is much easier to implement this after freezing.

Commands
```
TORCHINDUCTOR_FREEZING=1 python benchmarks/dynamo/timm_models.py --backend inductor --amp --performance --only convmixer_768_32 --inference
```
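
A hand-written analogue of what the pass achieves (a sketch assuming a plain conv model, not the pass itself): convert the conv weight to channels-last once, ahead of time, rather than on every inference call:
```python
import torch

model = torch.nn.Conv2d(3, 8, 3).eval()
with torch.no_grad():
    # One-time weight layout conversion, amortized across all inference calls.
    model.weight.data = model.weight.data.to(memory_format=torch.channels_last)

x = torch.randn(1, 3, 32, 32).to(memory_format=torch.channels_last)
out = model(x)  # the backend can use NHWC kernels without converting the weight
```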

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103642
Approved by: https://github.com/eellison
2023-06-28 17:42:32 +00:00
lezcano
5f77be8bbe Refactor OptimizeIndexing (#100549)
This PR decouples the logic necessary to compute bounds on variables
from the logic that uses this info to perform the strength analysis on
int64 variables. While doing so, it tries to minimize the number of
attributes of the class in favour of local variables.

This class is now accessible from any `LoopBody` object.
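
A toy sketch of the decoupled bounds logic (illustrative only, not Inductor's implementation): interval arithmetic over index expressions, which the strength analysis can query to decide whether int32 indexing is safe:
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ValueRange:
    lower: int
    upper: int

    def __add__(self, other):
        return ValueRange(self.lower + other.lower, self.upper + other.upper)

    def __mul__(self, other):
        products = [a * b for a in (self.lower, self.upper)
                    for b in (other.lower, other.upper)]
        return ValueRange(min(products), max(products))

    def fits_int32(self):
        return -2**31 <= self.lower and self.upper < 2**31

i = ValueRange(0, 1023)          # loop index
stride = ValueRange(4096, 4096)  # constant stride
print((i * stride + i).fits_int32())  # True -> int64 indexing can be narrowed
```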

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100549
Approved by: https://github.com/eellison
2023-06-13 03:31:41 +00:00
Animesh Jain
58d2c66a70 [activation checkpointing] Higher order functional rng op wrappers (#102934)
Introduces two higher order operators
* run_and_save_rng_state - Saves the current rng state and then runs the op.
* run_with_rng_state - Runs the op with the rng state supplied as an input

Ideally, we would like to use torch.compile for these operators. But currently the plan is to introduce them at the partitioner level, obviating the need to support them fully through the torch.compile stack. To ensure that we have good enough debugging with the minifiers, we have ensured that they work with make_fx. In the future, we can move to torch.compile.
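
To make the semantics concrete, a sketch in plain Python (assuming the CPU RNG for simplicity; the real versions are higher-order operators introduced by the partitioner, not public APIs):
```python
import torch

def run_and_save_rng_state(op, *args):
    state = torch.get_rng_state()
    return state, op(*args)

def run_with_rng_state(state, op, *args):
    current = torch.get_rng_state()
    torch.set_rng_state(state)
    try:
        return op(*args)
    finally:
        torch.set_rng_state(current)

x = torch.ones(4)
state, fwd = run_and_save_rng_state(torch.dropout, x, 0.5, True)
recomputed = run_with_rng_state(state, torch.dropout, x, 0.5, True)
assert torch.equal(fwd, recomputed)  # the recompute sees identical randomness
```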

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102934
Approved by: https://github.com/jansel, https://github.com/zou3519
2023-06-12 22:54:17 +00:00
Shunting Zhang
86c7652503 [inductor] layout optimization for conv (#99773)
A convolution kernel with channels-last inputs runs much faster than one with contiguous inputs. This PR leverages that to optimize tensor layouts so that we provide channels-last inputs to convolutions. Some care needs to be taken to not convert tensor layouts between contiguous and channels-last back and forth: those extra copies hurt performance quite a bit.

Latest perf number [here](https://hud.pytorch.org/benchmark/compilers?startTime=Wed%2C%2024%20May%202023%2023%3A40%3A37%20GMT&stopTime=Wed%2C%2031%20May%202023%2023%3A40%3A37%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=shunting-layout-opt-19&lCommit=baa797fc100688dfb044fbcbdebcfd2591710f78&rBranch=main&rCommit=999bae0f54108ffc5b7cf2524a02a83901554b16)
- TB: 1.64x -> 1.69x
- HF: 1.79x -> 1.78x (random noise)
- TIMM: 1.51x -> 1.65x

Right now we disable layout optimization for dynamic shapes since there is a perf loss in that combination. Here is a GH issue to follow up: https://github.com/pytorch/pytorch/issues/102670
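
The layout concern, shown with a hedged toy example (not the optimization pass itself): keep the whole chain in channels-last so no NCHW<->NHWC copies get inserted between ops:
```python
import torch

conv1 = torch.nn.Conv2d(3, 8, 3).to(memory_format=torch.channels_last)
conv2 = torch.nn.Conv2d(8, 8, 3).to(memory_format=torch.channels_last)
x = torch.randn(1, 3, 64, 64).to(memory_format=torch.channels_last)
y = conv2(conv1(x))
# The output stays channels-last end to end; a contiguous() in between would
# force exactly the extra copies this PR tries to avoid.
print(y.is_contiguous(memory_format=torch.channels_last))  # True
```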

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99773
Approved by: https://github.com/jansel
2023-06-02 21:08:18 +00:00
Michael Lazos
80f7264804 Foreach kernel codegen in inductor (#99975)
[design doc](https://docs.google.com/document/d/1JLr5yMAR8TuKW78ixKeqzfDHhcazwxKo_JXQnP_-wyY/edit?kh_source=GDOCS#heading=h.8x4z4mmet3im)

Add foreach kernel codegen for a single overload of foreach add in Inductor. Coverage will expand to more ops in subsequent PRs.

[example](https://gist.github.com/mlazos/9606fe64100ea2a5ec8265df1739fbe2)
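
For reference, the eager op family this targets (a minimal sketch assuming a CUDA device is available): one horizontally fused kernel covers a whole list of tensors instead of one launch per tensor:
```python
import torch

xs = [torch.randn(1024, device="cuda") for _ in range(10)]
ys = [torch.randn(1024, device="cuda") for _ in range(10)]
# A single foreach op over the lists; inductor can now codegen this as one
# fused kernel rather than ten separate pointwise launches.
outs = torch._foreach_add(xs, ys)
```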

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99975
Approved by: https://github.com/jansel
2023-05-25 21:48:41 +00:00
Peter Bell
ce42010722 [inductor][decomp] Add aten._unsafe_index_put for unchecked indexing (#101812)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101812
Approved by: https://github.com/lezcano
2023-05-24 22:17:32 +00:00
PyTorch MergeBot
5147fe4969 Revert "[inductor][decomp] Add aten._unsafe_index_put for unchecked indexing (#101812)"
This reverts commit b9721bd705.

Reverted https://github.com/pytorch/pytorch/pull/101812 on behalf of https://github.com/osalpekar due to Causing test_nn_cuda tests to crash during runtime. More details at [D46093942](https://www.internalfb.com/diff/D46093942) ([comment](https://github.com/pytorch/pytorch/pull/101812#issuecomment-1560238085))
2023-05-23 23:06:21 +00:00
Peter Bell
b9721bd705 [inductor][decomp] Add aten._unsafe_index_put for unchecked indexing (#101812)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101812
Approved by: https://github.com/lezcano
2023-05-22 20:39:18 +00:00
Jiong Gong
6f7ebcdcd8 [inductor] enable descriptive name for cpp kernels (#101330)
This PR enables descriptive names for cpp kernels, similar to the triton kernel names. A new configuration `config.cpp.descriptive_names` is added, similar to that of triton. The kernel name follows the format `cpp_<fused_name>_<id>`.
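
A hedged usage sketch (the accepted values are assumed to mirror the existing `config.triton.descriptive_names` option):
```python
import torch._inductor.config as inductor_config

# Name cpp kernels after the ops they fuse, e.g. cpp_fused_gelu_0, rather
# than an opaque kernel index (value assumed to match the triton option).
inductor_config.cpp.descriptive_names = "original_aten"
```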

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101330
Approved by: https://github.com/XiaobingSuper, https://github.com/jansel
2023-05-16 06:48:11 +00:00
chunyuan
1faef895ca Inductor cpp wrapper: support sympy.Expr as input (#101257)
Leverage the logic in https://github.com/pytorch/pytorch/pull/95533 to get the `dtype` of `sympy.Expr` and support it as graph input in the cpp wrapper.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101257
Approved by: https://github.com/jgong5, https://github.com/Skylion007, https://github.com/EikanWang, https://github.com/jansel
2023-05-15 23:57:28 +00:00
Edward Z. Yang
2c786961b7 Towards making torch._inductor.ir typed (#100712)
This PR just contains some mild gyrations necessary to appease mypy.
However, it is not complete; there are a number of legitimate bugs
and mistypings that I need to work out before I can actually turn this
on.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100712
Approved by: https://github.com/ngimel
2023-05-12 00:07:33 +00:00
Jason Ansel
e3d783c013 [inductor] Cleanup strip_last_size logic (#100305)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100305
Approved by: https://github.com/ngimel
2023-05-05 23:10:47 +00:00
Angela Yi
3c5ec6af14 Partition modules (#98628)
Added helper functions to match nodes in the graph that are decomposed from their source (leaf modules, or functional ops), as a result of dynamo tracing.

`get_source_partitions(graph: torch.fx.Graph, wanted_sources: List[Any]) -> Dict[Any, SourcePartition]`

Args:
* graph: The graph we want to partition
* wanted_sources: List of sources of nodes that were decomposed from this source. This can be a function (ex. torch.nn.functional.linear) or a leaf module type (ex. torch.nn.Linear)

Returns:
* Dictionary mapping sources (ex. torch.nn.modules.linear.Linear) to a list of SourcePartitions that correspond to the list of nodes that were flattened from a module of that type.

```
@dataclass
class SourcePartition():
    # Nodes in a particular partition
    nodes: List[Node]
    # Module type
    module_type: Type
    # Nodes in the graph that are needed as inputs to the partition
    input_nodes: List[Node] = field(default_factory=list)
    # Nodes in the partition that are being used by nodes outside of the partition
    output_nodes: List[Node] = field(default_factory=list)
    # Parameters that are being used
    params: List[str] = field(default_factory=list)
```

Example:

Original:
```
x -> linear -> linear -> relu -> linear
```
Traced graph:
```
.graph():
    %arg0 : [#users=1] = placeholder[target=arg0]
    %_param_constant0 : [#users=1] = get_attr[target=_param_constant0]
    %t_default : [#users=1] = call_function[target=torch.ops.aten.t.default](args = (%_param_constant0,), kwargs = {})
    %_param_constant1 : [#users=1] = get_attr[target=_param_constant1]
    %addmm_default : [#users=1] = call_function[target=torch.ops.aten.addmm.default](args = (%_param_constant1, %arg0, %t_default), kwargs = {})
    %_param_constant0_1 : [#users=1] = get_attr[target=_param_constant0]
    %t_default_1 : [#users=1] = call_function[target=torch.ops.aten.t.default](args = (%_param_constant0_1,), kwargs = {})
    %_param_constant1_1 : [#users=1] = get_attr[target=_param_constant1]
    %addmm_default_1 : [#users=1] = call_function[target=torch.ops.aten.addmm.default](args = (%_param_constant1_1, %addmm_default, %t_default_1), kwargs = {})
    %relu_default : [#users=1] = call_function[target=torch.ops.aten.relu.default](args = (%addmm_default_1,), kwargs = {})
    %_param_constant2 : [#users=1] = get_attr[target=_param_constant2]
    %t_default_2 : [#users=1] = call_function[target=torch.ops.aten.t.default](args = (%_param_constant2,), kwargs = {})
    %_param_constant3 : [#users=1] = get_attr[target=_param_constant3]
    %addmm_default_2 : [#users=1] = call_function[target=torch.ops.aten.addmm.default](args = (%_param_constant3, %relu_default, %t_default_2), kwargs = {})
    return [addmm_default_2]
```
Result of `get_source_partitions`:
```
{<class 'torch.nn.modules.linear.Linear'>: [
    ModulePartition(nodes=[_param_constant0, t_default, _param_constant1, addmm_default], module_type=<class 'torch.nn.modules.linear.Linear'>, input_nodes=[arg0], output_nodes=[addmm_default], params=["_param_constant0", "_param_constant1"]),
    ModulePartition(nodes=[_param_constant0_1, t_default_1, _param_constant1_1, addmm_default_1], module_type=<class 'torch.nn.modules.linear.Linear'>, input_nodes=[addmm_default], output_nodes=[addmm_default_1], params=["_param_constant0_1", "_param_constant1_1"]),
    ModulePartition(nodes=[_param_constant2, t_default_2, _param_constant3, addmm_default_2], module_type=<class 'torch.nn.modules.linear.Linear'>, input_nodes=[relu_default], output_nodes=[addmm_default_2], params=["_param_constant2", "_param_constant3"])],

 <class 'torch.nn.modules.activation.ReLU'>: [
    ModulePartition(nodes=[relu_default], module_type=<class 'torch.nn.modules.activation.ReLU'>, input_nodes=[addmm_default_1], output_nodes=[relu_default], params=[])]}
```

Also added helper function to check if two module partitions are connected:
`check_subgraphs_connected(subgraph1: SourcePartition, subgraph2: SourcePartition) -> bool`
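
A hedged usage sketch (module path and tracing entry point as in current PyTorch; `get_source_partitions` needs the source metadata recorded by dynamo-based tracing, so plain `fx.symbolic_trace` will not populate it):
```python
import torch
from torch.fx.passes.utils.source_matcher_utils import get_source_partitions

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)

    def forward(self, x):
        return torch.relu(self.linear(x))

ep = torch.export.export(M(), (torch.randn(2, 4),))
# Maps each wanted source (e.g. nn.Linear) to its flattened node partitions.
parts = get_source_partitions(ep.graph_module.graph, [torch.nn.Linear])
print(parts)
```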

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98628
Approved by: https://github.com/cccclai
2023-05-03 23:31:56 +00:00
PyTorch MergeBot
34e90b8df1 Revert "[inductor] Cleanup strip_last_size logic (#100305)"
This reverts commit de7793d577.

Reverted https://github.com/pytorch/pytorch/pull/100305 on behalf of https://github.com/jansel due to causes IMA errors on huggingface ([comment](https://github.com/pytorch/pytorch/pull/100305#issuecomment-1532317310))
2023-05-03 00:42:48 +00:00
Jason Ansel
de7793d577 [inductor] Cleanup strip_last_size logic (#100305)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100305
Approved by: https://github.com/ngimel
2023-05-02 23:46:26 +00:00
Edward Z. Yang
f093ee1722 Prevent Triton from getting eagerly imported when importing torch._inductor (#100374)
This makes 'import torch._inductor.utils' go from 3.5s to 2.1s

See also https://github.com/openai/triton/issues/1599
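
The general shape of the fix (a sketch, not the exact diff): probe for triton inside a function rather than at module import time, so importing torch._inductor modules never pulls triton in eagerly:
```python
import functools

@functools.lru_cache(None)
def has_triton():
    # The import is deferred until first query; callers only pay the cost
    # of importing triton if they actually ask for it.
    try:
        import triton
        return triton is not None
    except ImportError:
        return False
```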

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100374
Approved by: https://github.com/voznesenskym
2023-05-02 11:44:12 +00:00
Yanbo Liang
08376cc546 [Inductor] Fix rand_like with kwargs device of str type (#99673)
Fixes #99632

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99673
Approved by: https://github.com/jansel, https://github.com/ngimel
2023-04-21 20:33:14 +00:00
Shunting Zhang
418a9fb9d8 [reland][inductor] coordinate descent tuning upon max-autotune (#99594)
Reland https://github.com/pytorch/pytorch/pull/97203 .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99594
Approved by: https://github.com/jansel
2023-04-20 19:55:52 +00:00
PyTorch MergeBot
4aedb8e116 Revert "[inductor] coordinate descent tuning upon max-autotune (#97203)"
This reverts commit 52ecc3274b.

Reverted https://github.com/pytorch/pytorch/pull/97203 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but it breaks MacOS test in trunk
2023-04-19 02:33:02 +00:00
Shunting Zhang
52ecc3274b [inductor] coordinate descent tuning upon max-autotune (#97203)
Command to run max autotune baseline:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --only ${MODEL_NAME} --training --batch-size-file $(realpath benchmarks/dynamo/torchbench_models_list.txt)
```

Command to do coordinate descent autotuning:
```
TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_shunting_coordesc TORCHINDUCTOR_PERSISTENT_REDUCTIONS=0 TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --only ${MODEL_NAME} --training --batch-size-file $(realpath benchmarks/dynamo/torchbench_models_list.txt)
```

Explanation of the envvars show up on the command:
```
- TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 : enable coordinate descent tuning
- TORCHINDUCTOR_PERSISTENT_REDUCTIONS=0 : disable persistent reduction. Need do this so we can tune RBLOCK for reductions
- TORCHINDUCTOR_MAX_AUTOTUNE=1: enable max autotune
- TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_shunting_coordesc : use a separate cache dir for coordinate descent tuning. Optional.
```

Here are my experiments results for around 40 torchbench models: https://docs.google.com/spreadsheets/d/1G7i2whIf8Yu-HhN_WovNxwcE-iFDSAw4x3NK4uL4XhI/edit#gid=0

Some highlights
- We improve 2.2% further upon max-autotune on average (geomean)
- timm_resnest benefits the most from coordinate descent tuning, with a 1.07x speedup
- We have decent speedups on transformer models
  - BERT_pytorch:  1.056x
  - timm_vision_transformer: 1.04x
  - hf_Bert: 1.030x
- For resnet models, it looks like we gain less as the models get larger. My guess is that larger models spend more time on mm/conv, so our tuning for pointwise/reduction helps less
  - resnet18: 1.021x
  - resnet50: 1.014x
  - resnet152: 1.005x

This kind of coordinate descent autotuning gives us an 'upper bound' on the gain we can get from tuning configs for pointwise/reduction kernels (a toy sketch of the loop follows the list below). On the other hand, by spot checking, we roughly double the compilation time compared to max-autotune. Next steps can be:
- disable persistent reduction in coordinate descent autotune (it's still enabled in the baseline) so we can tune RBLOCK for reductions. We can also try to use autotune to pick persistent reduction or not.
- pick good configs without benchmarking (e.g. Natalia mentioned checking register spills)
- try the idea on matmul so we know what the potential is there.
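
A toy sketch of the coordinate descent loop itself (illustrative only; inductor's implementation benchmarks real triton configs): start from the max-autotune winner and greedily vary one field at a time, keeping improvements:
```python
def coordinate_descent(benchmark, config, candidates):
    best, best_t = dict(config), benchmark(config)
    improved = True
    while improved:
        improved = False
        for field, values in candidates.items():
            for v in values:
                trial = {**best, field: v}
                t = benchmark(trial)
                if t < best_t:
                    best, best_t, improved = trial, t, True
    return best, best_t

# Hypothetical usage with a fake latency model standing in for real runs:
fake_latency = lambda c: abs(c["XBLOCK"] - 64) + abs(c["num_warps"] - 8)
cfg, t = coordinate_descent(
    fake_latency,
    {"XBLOCK": 128, "num_warps": 4},
    {"XBLOCK": [32, 64, 128, 256], "num_warps": [2, 4, 8]},
)
print(cfg, t)  # converges to XBLOCK=64, num_warps=8
```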

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97203
Approved by: https://github.com/ngimel
2023-04-19 00:17:10 +00:00
Shunting Zhang
694ed70e01 [inductor][easy] create a wrap for triton do_bench function (#99216)
Triton PR https://github.com/openai/triton/pull/1513 changed the interface of the do_bench function: the quantile field's name changed from 'percentiles' to 'quantiles', and its default value changed from (0.5, 0.2, 0.8) to None. This breaks some inductor code, since a caller expecting a tuple may get a single item.

Add a wrapper to maintain the same behavior for inductor.
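
A compatibility shim in the spirit of the PR (a hedged sketch; the actual wrapper lives in inductor's utils): keep the old tuple-returning behavior regardless of which triton version is installed:
```python
def do_bench(fn, quantiles=(0.5, 0.2, 0.8), **kwargs):
    import triton.testing
    try:
        # Newer triton: the keyword is `quantiles` and defaults to None.
        return triton.testing.do_bench(fn, quantiles=quantiles, **kwargs)
    except TypeError:
        # Older triton: the same keyword was named `percentiles`.
        return triton.testing.do_bench(fn, percentiles=quantiles, **kwargs)
```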

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99216
Approved by: https://github.com/jansel, https://github.com/ngimel
2023-04-18 00:52:00 +00:00
Kazuaki Ishizaki
f011db345f Fix typos under torch/_inductor directory (#97592)
This PR fixes typos in comments and messages of `.py` files under `torch/_inductor` directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97592
Approved by: https://github.com/dagitses, https://github.com/kit1980
2023-04-10 22:53:18 +00:00
Jason Ansel
8fee46693c Fused attention patterns (#97741)
Patterns based on https://github.com/pytorch/pytorch/pull/94729 mainly as a forcing function for implementing joint graph replacements.

Up until now, we had two places to do pattern matching:
1) Pre-grad has janky infra (graph not normalized or functional), but is
   desirable for many types of passes where you want your change to
   affect grad formulas.
2) Post-grad has good infra, but can't change grad formulas.

This PR adds a third place to do pattern matching: the joint
forward+backwards graph.  The idea is to take the patterns and lower
them to a joint graph and replace both the forwards+backwards before
we partition them.  This allows us to do something similar to pre-grad
transforms, but run after normalization and functionalization.

Note that we don't seem to have kernels for all of these patterns; some get decomposed in the dispatcher.
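
For orientation, a hedged sketch of the kind of pattern being replaced: unfused attention that should rewrite to `scaled_dot_product_attention` (sdpa's default scale is 1/sqrt(head_dim), matching the explicit scale here):
```python
import torch

def unfused_attention(q, k, v, scale):
    # The pattern as it appears in user code before replacement.
    return (q @ k.transpose(-2, -1) * scale).softmax(dim=-1) @ v

q = k = v = torch.randn(2, 8, 16, 64)
ref = unfused_attention(q, k, v, 64 ** -0.5)
fused = torch.nn.functional.scaled_dot_product_attention(q, k, v)
print(torch.allclose(ref, fused, atol=1e-5))  # True
```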

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97741
Approved by: https://github.com/Chillee
2023-04-10 00:35:22 +00:00
Yu Guo
edebe413d3 [inductor] fix scatter fallback and fallback in deterministic mode (#98339)
Fixes https://github.com/pytorch/pytorch/issues/93537

Add `ir.ScatterFallback` to correctly handle the mutation in the scatter/scatter_reduce fallback, also handle the case where `src` is a scalar, and lastly fall back in deterministic mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98339
Approved by: https://github.com/jansel
2023-04-06 19:43:17 +00:00
Yanbo Liang
ccc27bc361 [Inductor] Fix convolution lowering if stride or padding or dilation is 1 element list (#98448)
Fixes error from 14k github models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98448
Approved by: https://github.com/ngimel
2023-04-06 10:40:06 +00:00
Shunting Zhang
13461e9767 [inductor] more cuda metrics in wrapper (#97723)
The following metrics should be helpful:
- percent of time the GPU is busy
- percent of time each category of kernels (e.g. pointwise/reduction triton kernels) takes
- percent of time each individual kernel takes compared to the total wall time of the benchmark

This PR adds those.

Example result from the hf_Bert inference graph:

```
  == triton_pointwise category kernels ==
Kernel                            Self CUDA TIME (ms)  Count    Percent
------------------------------  ---------------------  -------  ---------
triton_poi_fused_gelu_6_0d1d                  0.48154  12.0     5.52%
triton_poi_fused_clone_1_0d1d2                0.29011  24.0     3.33%
triton_poi_fused_clone_2_0d1d2                0.17417  12.0     2.00%
triton_poi_fused_clone_4_0d1d2                0.10797  12.0     1.24%
Total                                         1.05379           12.08%

  == triton_persistent_reduction category kernels ==
Kernel                            Self CUDA TIME (ms)  Count    Percent
------------------------------  ---------------------  -------  ---------
triton_per_fused__softmax__to_                0.97188  12.0     11.14%
triton_per_fused_add_native_la                0.37401  24.0     4.29%
triton_per_fused_gelu_native_l                0.02     1.0      0.23%
triton_per_fused_add_embedding                0.01718  1.0      0.20%
Total                                         1.38307           15.86%

  == unknown category kernels ==
Kernel                            Self CUDA TIME (ms)  Count    Percent
------------------------------  ---------------------  -------  ---------
ampere_fp16_s16816gemm_fp16_12                2.24514  24.0     25.74%
ampere_fp16_s16816gemm_fp16_25                1.39796  49.0     16.03%
void cutlass::Kernel<cutlass_8                1.36093  1.0      15.61%
ampere_fp16_s16816gemm_fp16_64                0.74591  12.0     8.55%
ampere_fp16_s16816gemm_fp16_12                0.61989  12.0     7.11%
Memset (Device)                               0.024    12.0     0.28%
void at::native::(anonymous na                0.01543  2.03     0.18%
void at::native::vectorized_el                0.00011  0.03     0.00%
Total                                         6.40937           73.49%

Percent of time when GPU is busy: 101.44%
```

Note: the output shows that the total time the GPU is busy is larger than the total wall time. We measure total wall time with profiling disabled but measure GPU time with profiling enabled, which may distort the measurement a bit. But I assume the effect is not too large, assuming the profiler mostly increases CPU time (rather than GPU time).

## interesting usages
1. I pick a model where cudagraphs improve perf significantly, like densenet121, and run the tool on its forward graph. It's no surprise that the GPU is idle quite a lot of the time:
```
(Forward graph) Percent of time when GPU is busy: 32.69%
Total wall time 17.307 ms
```

Its backward graph has a lower percentage of GPU idle time, but it's still high:
```
(Backward graph) Percent of time when GPU is busy: 46.70%
Total wall time 17.422 ms
```

2. I profile a subset of torchbench models and plot a table showing the percent of execution time for pointwise/reduction/persistent_reduction/unknown_category kernels. Since I plan to explore using the coordinate descent tuner to improve reductions, models that spend a high percent of time on reduction should be good candidates (e.g. resnet50, mobilenet_v2).

NOTE: each model appears twice. The first row is for the fwd graph and the second for the bwd graph. We profile different graphs of a model separately.

```
benchmark_name           pointwise_percent    reduction_percent    persistent_reduction_percent    unknown_category_percent    GPU_busy_percent    wall_time_ms
-----------------------  -------------------  -------------------  ------------------------------  --------------------------  ------------------  --------------
resnet18                 19.73%               7.86%                4.81%                           41.25%                      73.65%              2.549ms
resnet18                 18.59%               7.13%                3.35%                           67.35%                      96.41%              3.467ms
resnet50                 29.57%               22.13%               2.07%                           51.68%                      105.46%             6.834ms
resnet50                 26.42%               15.27%               0.94%                           59.68%                      102.31%             13.346ms
vgg16                    26.23%               0.00%                0.00%                           74.20%                      100.43%             18.212ms
vgg16                    15.63%               5.61%                0.10%                           79.42%                      100.75%             33.485ms
BERT_pytorch             28.62%               4.82%                14.88%                          33.32%                      81.64%              7.162ms
BERT_pytorch             14.43%               13.41%               18.19%                          49.24%                      95.27%              10.395ms
densenet121              11.89%               2.14%                3.86%                           16.36%                      34.25%              16.531ms
densenet121              10.37%               2.06%                4.09%                           31.46%                      47.98%              16.934ms
hf_Bert                  23.94%               0.00%                29.88%                          46.09%                      99.90%              7.766ms
hf_Bert                  11.65%               10.54%               20.26%                          61.66%                      104.11%             11.892ms
nvidia_deeprecommender   42.92%               0.00%                0.00%                           56.75%                      99.67%              3.476ms
nvidia_deeprecommender   31.36%               3.44%                0.46%                           65.20%                      100.45%             3.872ms
alexnet                  30.99%               0.00%                0.00%                           69.16%                      100.14%             3.169ms
alexnet                  24.41%               4.83%                0.17%                           71.09%                      100.50%             4.709ms
mobilenet_v2             29.21%               27.79%               2.49%                           44.00%                      103.49%             10.160ms
mobilenet_v2             17.50%               15.05%               1.06%                           69.68%                      103.29%             20.715ms
resnext50_32x4d          18.96%               9.28%                2.31%                           28.79%                      59.33%              5.899ms
resnext50_32x4d          18.48%               11.01%               1.86%                           53.80%                      85.14%              7.167ms
mnasnet1_0               19.07%               14.52%               3.01%                           35.43%                      72.03%              6.028ms
mnasnet1_0               14.17%               12.00%               1.87%                           67.56%                      95.60%              9.225ms
squeezenet1_1            38.56%               0.00%                1.77%                           56.21%                      96.53%              2.221ms
squeezenet1_1            21.26%               7.57%                1.05%                           67.30%                      97.18%              4.942ms
timm_vision_transformer  17.05%               0.00%                18.80%                          65.79%                      101.64%             9.608ms
timm_vision_transformer  9.31%                9.07%                10.32%                          73.25%                      101.96%             16.814ms
```

## how to use
`python {compiled_module_wrapper.py} -p`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97723
Approved by: https://github.com/jansel
2023-04-01 08:04:14 +00:00
Yu Guo
1f71ac785c [RFC][inductor][index_put] fallback to aten in torch deterministic mode (#96898)
Fixes #93537
Fall back to aten for index_put and scatter ops in torch deterministic mode.
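
For context, a hedged sketch of the behavior being targeted (CPU tensors assumed): with deterministic algorithms enabled, these lowerings now fall back to the ATen kernels:
```python
import torch

torch.use_deterministic_algorithms(True)

@torch.compile
def f(x, idx, src):
    # Under deterministic mode, inductor falls back to aten.index_put here.
    return x.index_put((idx,), src, accumulate=True)

x = torch.zeros(8)
print(f(x, torch.tensor([1, 1, 3]), torch.ones(3)))
```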

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96898
Approved by: https://github.com/ngimel, https://github.com/jansel
2023-03-29 19:28:37 +00:00
Shunting Zhang
652592efa9 [inductor] use torch.profiler in the triton wrapper (#97405)
I think it's helpful to use torch.profiler to profile the triton wrapper.

E.g., I tried it on nvidia_deeprecommender's inference graph.

Even with max-autotune, we see that the majority of the time the GPU is running 2 mm/addmm ops. That's why max-autotune does not help for this model: tuning does not affect the external mm ops.

<img width="711" alt="Screenshot 2023-03-22 at 5 49 28 PM" src="https://user-images.githubusercontent.com/52589240/227072474-2f0d7205-4a10-4929-b1b7-551214788c61.png">

Next, I'll check why the triton mm kernels are not picked.

EDIT: the above screenshot is captured without max-autotune due to a typo. below is the trace with max-autotune enabled:
<img width="712" alt="Screenshot 2023-03-22 at 6 43 26 PM" src="https://user-images.githubusercontent.com/52589240/227077624-fdccf928-be08-4211-871b-a9e3d7b76fbe.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97405
Approved by: https://github.com/ngimel
2023-03-27 21:54:25 +00:00
Christian Puhrsch
9d37cefcb0 Resubmit _int_mm (#96685)
Avoids any changes to gemm_and_bias

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96685
Approved by: https://github.com/drisspg, https://github.com/ngimel
2023-03-27 16:14:07 +00:00
blzheng
39c8188194 Inductor: fall back bernoulli on cpu (#97002)
data type: float32
Input size: torch.Size([64, 4, 128, 128])
single socket (32 cores):
```
Before: bernoulli 0.001327775239944458 s      dropout 0.0014216173489888509 s
After:  bernoulli 0.0002424612840016683 s     dropout 0.00039757410685221353 s
```

single core:
```
Before: bernoulli 0.04154032731056213 s      dropout 0.04382548745473226 s
After: bernoulli 0.006143261671066284 s      dropout 0.0065830423831939695 s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97002
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-03-24 22:13:51 +00:00
Jason Ansel
9370f253e3 [inductor] Rewrite convolution triton templates (#95556)
Fixes #95775

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95556
Approved by: https://github.com/Chillee, https://github.com/ngimel
2023-03-22 18:12:23 +00:00
Horace He
6dded5d63e Fixes warning to refer to SMs instead of Cuda Cores (#97224)
Fixes https://github.com/pytorch/pytorch/issues/97179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97224
Approved by: https://github.com/eellison, https://github.com/voznesenskym
2023-03-21 22:37:31 +00:00
Bin Bao
ea9194a4f2 [inductor] Make the original ATen info dumped in alphabetical order (#97261)
Summary: To avoid a lot of noise when comparing output_code.py from two
runs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97261
Approved by: https://github.com/Chillee
2023-03-21 20:34:49 +00:00
Shunting Zhang
13398d8b95 [inductor] improve bandwidth computation (#97057)
When we compute the bandwidth for a kernel, we should double the memory usage for in-place arguments, since we need to read them once and write them once.
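
In other words, a small worked sketch of the accounting (numbers are assumed for illustration):
```python
def bandwidth_gbps(read_bytes, write_bytes, inplace_bytes, ms):
    # In-place args are both read and written, so they count twice.
    total = read_bytes + write_bytes + 2 * inplace_bytes
    return total / (ms * 1e-3) / 1e9

# e.g. a kernel updating a 4 MB buffer in place in 0.05 ms:
print(f"{bandwidth_gbps(0, 0, 4e6, 0.05):.1f} GB/s")  # 160.0 GB/s
```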

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97057
Approved by: https://github.com/Chillee
2023-03-20 20:30:46 +00:00
Shunting Zhang
8ce296ae2c [ez][inductor] show kernel category in kernel benchmark result (#96991)
I feel it's useful to show whether a kernel is pointwise/reduction/persistent_reduction in the benchmark output. Print only the first 3 letters in upper case to avoid wrapping the line:
- POI for pointwise
- RED for reduction
- PER for persistent_reduction

<img width="1091" alt="Screenshot 2023-03-16 at 5 10 21 PM" src="https://user-images.githubusercontent.com/52589240/225780546-07b8d345-2bbe-40bd-9e65-185e9294743e.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96991
Approved by: https://github.com/Chillee
2023-03-17 17:02:43 +00:00
Christian Puhrsch
0a53c9624a Back out "Add _int_mm to expose cuBLAS int8@int8 -> int32 matmul (#94339)" (#96885)
Summary:
Backing out  _int_mm to expose cuBLAS int8@int8 -> int32 matmul (#94339)

Test Plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96885
Approved by: https://github.com/drisspg
2023-03-16 05:32:55 +00:00
Zachary DeVito
3162f71787 [memory debugging] Extract frame information from inductor (#95753)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95753
Approved by: https://github.com/Chillee
2023-03-16 04:12:54 +00:00
Edward Z. Yang
3606f59366 Default specialize_int to False (#96624)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96624
Approved by: https://github.com/janeyx99
2023-03-16 02:54:18 +00:00
Yanbo Liang
e7d795dccd [Inductor] aten.{avg_pool2d/max_pool2d_with_indices} arguments can be 1 element tuple (#96727)
Fixes failure from 14k github models: ```pytest ./generated/test_ProGamerGov_neural_dream.py -k test_000```
Error:
```
......
  File "/scratch/ybliang/work/repos/pytorch/torch/_inductor/graph.py", line 357, in call_function
    raise LoweringException(e, target, args, kwargs).with_traceback(
  File "/scratch/ybliang/work/repos/pytorch/torch/_inductor/graph.py", line 354, in call_function
    out = lowerings[target](*args, **kwargs)
  File "/scratch/ybliang/work/repos/pytorch/torch/_inductor/lowering.py", line 228, in wrapped
    out = decomp_fn(*args, **kwargs)
  File "/scratch/ybliang/work/repos/pytorch/torch/_inductor/lowering.py", line 3124, in avg_pool2d
    assert len(padding) == 2
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
LoweringException: AssertionError:
  target: aten.avg_pool2d.default
  args[0]: TensorBox(StorageBox(
    InputBuffer(name='arg0_1', layout=FixedLayout('cuda', torch.float32, size=[4, 4, 64, 64], stride=[16384, 4096, 64, 1]))
  ))
  args[1]: [7, 7]
  args[2]: [1, 1]
  args[3]: [0]
  args[4]: False
  args[5]: False

```
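
The normalization the lowering needs, sketched with a hypothetical helper name: expand 1-element lists to cover both spatial dims before the length-2 assertions run:
```python
def pad_listlike(x, size=2):
    # [0] -> [0, 0]; [1, 2] stays [1, 2].
    return list(x) * size if len(x) == 1 else list(x)

assert pad_listlike([0]) == [0, 0]
assert pad_listlike([1, 2]) == [1, 2]
```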

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96727
Approved by: https://github.com/jansel
2023-03-14 21:34:30 +00:00
PyTorch MergeBot
ba4fb9b6ad Revert "Default specialize_int to False (#96624)"
This reverts commit 1ac8782db2.

Reverted https://github.com/pytorch/pytorch/pull/96624 on behalf of https://github.com/kit1980 due to Broke inductor/test_torchinductor_dynamic_shapes.py
2023-03-14 19:43:47 +00:00
Edward Z. Yang
1ac8782db2 Default specialize_int to False (#96624)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96624
Approved by: https://github.com/janeyx99
2023-03-14 18:37:47 +00:00
Horace He
2a08a62777 Add extra metadata (as comments) to Inductor generated code (#96581)
New output
<img width="942" alt="image" src="https://user-images.githubusercontent.com/6355099/224794006-a993a2a8-d6ff-49da-8891-7b2373030a3d.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96581
Approved by: https://github.com/ngimel, https://github.com/shunting314, https://github.com/voznesenskym
2023-03-14 03:59:59 +00:00
Nicolas Macchioni
f673ad6d5c Add a new knob to separately enable the autotuning in Triton. (#96440)
Summary: Separate triton pointwise autotune from matmul autotune; work done by ckluk.
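
For context, the corresponding knobs in current inductor config look roughly like this (a hedged sketch; the exact flag added in this diff may be named differently):
```python
import torch._inductor.config as inductor_config

# Autotune pointwise triton kernels without also paying for matmul autotune.
inductor_config.max_autotune_pointwise = True
inductor_config.max_autotune_gemm = False
```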

Test Plan: sandcastle + CI

Differential Revision: D43955699

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96440
Approved by: https://github.com/ngimel, https://github.com/jansel
2023-03-13 19:09:27 +00:00