Summary: This PR adds dynamic-shape support for AOTInductor
* On the runtime/interface side, we added two structs, StaticDimInfo
and DynamicDimInfo, to hold values for static and dynamic dimensions,
respectively. Dynamic dimensions are tracked by an unordered map field
defined in AOTInductorModelBase. At inference time, the run method
assigns the current concrete value to each dynamic dimension before
executing any kernel.
* On the CUDA wrapper codegen side, we generate dynamic symbols
appropriately for shape computations. We simulate kernel launch grids
on the C++ side by reusing the grid functions from the Python side.
The returned grid configs, which may contain symbolic expressions,
are printed out in their C++ forms via the CppPrinter. Note that
when dynamic shapes are involved, we have to compute grid configs
for each kernel at runtime in the same way as we do for launching
the corresponding Triton kernel. Otherwise, we may end up with
memory-access failures or miscomputations caused by invalid indices
when fetching or storing data in device memory.
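For intuition, here is a rough Python sketch of the codegen idea: evaluate a Python-side grid function on a symbolic size and print the resulting expression in C++ form. This is illustrative only; it uses sympy's stock C printer rather than Inductor's own CppPrinter and grid helpers.
```python
# Illustrative sketch only: Inductor's real codegen uses its own grid helpers
# and CppPrinter; plain sympy is used here just to show the idea.
import sympy

def grid(xnumel, block=1024):
    # The same kind of grid function a Python-side Triton launch would use.
    return (sympy.ceiling(xnumel / block), 1, 1)

s0 = sympy.Symbol("s0", positive=True, integer=True)  # a dynamic dimension
grid_x, grid_y, grid_z = grid(s0 * 128)

# Print the symbolic grid expression in a C-compatible form, to be embedded in
# the generated wrapper and re-evaluated at runtime with the real value of s0.
print(sympy.ccode(grid_x))
```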
Differential Revision: D49100472
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109012
Approved by: https://github.com/khabinov, https://github.com/desertfire, https://github.com/hl475
Summary: Switch AOTInductor unit tests and integration tests to invoke the same runtime interface. This is only an effort to unify the usage of the runtime; scrutiny of the interface itself will come in later PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108663
Approved by: https://github.com/ezyang
ghstack dependencies: #108653
Fixes https://github.com/pytorch/pytorch/issues/108323.
The cpp wrapper has a functional regression on `llama` and `tnt_s_patch16_224` due to the recently added support for scaled dot product flash attention in Inductor.
The schema of this OP is as follows:
```
- func: _scaled_dot_product_flash_attention(Tensor query, Tensor key, Tensor value, float dropout_p=0.0, bool is_causal=False, bool return_debug_mask=False, *, float? scale=None) -> (Tensor output, Tensor logsumexp, Tensor cum_seq_q, Tensor cum_seq_k, int max_q, int max_k, Tensor philox_seed, Tensor philox_offset, Tensor debug_attn_mask)
```
For `llama` and `tnt_s_patch16_224`, the OP is called as below, where the three positional args that have default values (`float dropout_p=0.0, bool is_causal=False, bool return_debug_mask=False`) are not passed.
```python
y = torch.ops.aten._scaled_dot_product_flash_attention.default(x0, x1, x2, scale = 0.125)
```
This PR fixes the cpp wrapper support for this case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108552
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
This replaces the `var_unnormalized` reduction type with `welford_reduce`, which takes the input data and outputs not just the variance but also the mean and weights, accounting for the full Welford accumulator state. This lets us avoid re-computing the mean, and we now have enough information to create a multi-layer reduction, which I implement here by adding a second reduction type, `welford_combine`, that reduces over all three inputs simultaneously.
Multi-layer support is particularly important because normalization operators like BatchNorm are being split in many timm models, which previously meant `var_unnormalized` had to fall back to a two-pass variance calculation.
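For reference, a minimal sketch of the two reduction types, using the standard parallel Welford combine formulas; this is illustrative Python, not Inductor's actual codegen.
```python
def welford_reduce(values):
    # Accumulate (mean, m2, weight) over raw inputs; m2 is the sum of squared
    # deviations, i.e. roughly the old "unnormalized variance".
    mean, m2, weight = 0.0, 0.0, 0.0
    for x in values:
        weight += 1.0
        delta = x - mean
        mean += delta / weight
        m2 += delta * (x - mean)
    return mean, m2, weight

def welford_combine(a, b):
    # Merge two partial accumulator states; this is what makes a multi-layer
    # (split) reduction possible without a second pass over the data.
    mean_a, m2_a, w_a = a
    mean_b, m2_b, w_b = b
    w = w_a + w_b
    if w == 0.0:
        return 0.0, 0.0, 0.0
    delta = mean_b - mean_a
    mean = mean_a + delta * w_b / w
    m2 = m2_a + m2_b + delta * delta * w_a * w_b / w
    return mean, m2, w
```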
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104725
Approved by: https://github.com/lezcano
Working on this as a starter task with @Chillee
This PR adds a method under BaseSchedulerNode to estimate the node's runtime in seconds.
We use a heuristic-based approach, first considering whether the operation is memory-bandwidth bound or compute bound:
- memory-bandwidth bound: we count the number of bytes that are read/written
- compute bound: we count the FLOPs required by the operation
One use case is as a cost model for scheduling: https://github.com/pytorch/pytorch/pull/100762
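A rough sketch of the heuristic; the bandwidth/FLOPS constants and names below are hypothetical placeholders, not the values used in the actual implementation.
```python
# Hypothetical device peaks; the real implementation derives these per device.
PEAK_MEMORY_BANDWIDTH = 2.0e12   # bytes / second
PEAK_FLOPS = 3.0e14              # FLOPs / second

def estimate_runtime_seconds(bytes_read_written: int, flops: int) -> float:
    # Memory-bandwidth-bound estimate: time to move all bytes read/written.
    memory_time = bytes_read_written / PEAK_MEMORY_BANDWIDTH
    # Compute-bound estimate: time to execute the required FLOPs.
    compute_time = flops / PEAK_FLOPS
    # The node is limited by whichever resource dominates.
    return max(memory_time, compute_time)
```
Sample output from `test/inductor/test_perf.py`: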
```
(pytorch-3.10) [14:08:02] ~/local/pytorch (xmfan/estimate_snode_runtime) > python3 test/inductor/test_perf.py -k EstimateSnodeRuntimeTests
[(ExternKernelSchedulerNode(name='buf0'), 400)]
[(ExternKernelSchedulerNode(name='buf0'), 2.35057908433887e-27)]
.[(ExternKernelSchedulerNode(name='buf0'), 3000), (SchedulerNode(name='buf1'), 3000)]
[(ExternKernelSchedulerNode(name='buf0'), 2.35057908433887e-26), (SchedulerNode(name='buf1'), 7.187055238190188e-09)]
.[(ExternKernelSchedulerNode(name='buf0'), 3000)]
[(ExternKernelSchedulerNode(name='buf0'), 2.35057908433887e-26)]
.[(ExternKernelSchedulerNode(name='buf0'), 34600)]
[(ExternKernelSchedulerNode(name='buf0'), 3.22687496698039e-24)]
.[(ExternKernelSchedulerNode(name='buf0'), 396)]
[(ExternKernelSchedulerNode(name='buf0'), 1.88046326747109e-27)]
.[(ExternKernelSchedulerNode(name='buf0'), 396)]
[(ExternKernelSchedulerNode(name='buf0'), 1.88046326747109e-27)]
.[(ExternKernelSchedulerNode(name='buf0'), 7776176)]
[(ExternKernelSchedulerNode(name='buf0'), 4.63240241413653e-21)]
.[(FusedSchedulerNode(nodes=buf0_buf1), 210)]
[(FusedSchedulerNode(nodes=buf0_buf1), 5.030938666733132e-10)]
.[(ExternKernelSchedulerNode(name='buf0'), 300)]
[(ExternKernelSchedulerNode(name='buf0'), 2.35057908433887e-27)]
.[(SchedulerNode(name='buf0'), 20)]
[(SchedulerNode(name='buf0'), 4.7913701587934585e-11)]
.
----------------------------------------------------------------------
Ran 10 tests in 14.311s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106426
Approved by: https://github.com/Chillee
While working on loop ordering, I happened to find that Inductor may cache a stale inner_fn_str and ReadWrites object in a ComputedBuffer.
Let's say we have a producer buffer buf0 and a consumer buffer buf1. Before we call GraphLowering.finalize, the layout of buf0 may be a FlexibleLayout. At that moment, the inner_fn_str or ReadWrites object computed for buf1 will be based on the layout of buf0, which most likely is a contiguous FlexibleLayout, and they will be cached on the buf1 object (or buf1.data).
However, after we call GraphLowering.finalize, we may realize it's better to give buf0 a non-contiguous layout (e.g., because its input has a non-contiguous layout, or for other reasons). The layout change of buf0 should affect the inner_fn_str and ReadWrites object of buf1, but we may have already cached those on buf1. The stale ReadWrites objects for buf1 may then result in sub-optimal strides for buf1.
This may affect perf and I'll check the nightly runs.
Here is a dump of `nodes` in `Scheduler.__init__` before the fix as a reference: https://gist.github.com/shunting314/ed2152a08e268f5563fd55398b1392c7
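A toy illustration (not Inductor code) of the hazard: anything cached on the consumer before the producer's layout is finalized keeps reflecting the old layout.
```python
from functools import cached_property

class Producer:
    layout = "contiguous (FlexibleLayout default)"

class Consumer:
    def __init__(self, producer):
        self.producer = producer

    @cached_property
    def inner_fn_str(self):
        # Captures whatever layout the producer has at first access.
        return f"load(buf0, layout={self.producer.layout})"

buf0 = Producer()
buf1 = Consumer(buf0)
buf1.inner_fn_str                     # cached while buf0 is still flexible
buf0.layout = "channels-last"         # finalize picks a different layout
print(buf1.inner_fn_str)              # still shows the stale contiguous layout
```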
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106502
Approved by: https://github.com/jansel
Currently when dynamic=True, TritonTemplates won't be used, as the condition `if list(call_args) != expected_args` defined in `TritonTemplate` cannot be satisfied. This PR fixes the issue by allowing symbolic variable names to be passed via `extra_args` and replacing all symbolic values in the generated TritonTemplate code with the call_arg names.
With this change, a locally compiled mm + epilogue node calls into the Triton kernel successfully.
This PR also introduces a new config, "max_autotune_gemm_backends", to allow specifying candidate gemm backends for max autotune. Current choices: combinations of ATEN and TRITON. This makes testing easier, since we can explicitly test Triton gemm kernels + epilogue fusions + dynamic shapes without falling back to ATen ops.
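A usage sketch of the new config (the config names live in `torch._inductor.config`; treat the exact values and the toy model below as illustrative):
```python
import torch
import torch._inductor.config as inductor_config

# Restrict max-autotune gemm candidates to Triton templates only, so the test
# exercises Triton gemm + epilogue fusion under dynamic shapes.
inductor_config.max_autotune = True
inductor_config.max_autotune_gemm_backends = "TRITON"

@torch.compile(dynamic=True)
def mm_relu(a, b):
    return torch.relu(a @ b)

out = mm_relu(torch.randn(64, 128, device="cuda"),
              torch.randn(128, 32, device="cuda"))
```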
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105295
Approved by: https://github.com/jansel
Previously, we made backwards graph compilation lazy to avoid paying
for compilation if the user didn't actually end up using the backwards
graph. This was useful in the old days when a lot of things in Inductor
didn't work and we could bypass errors this way.
However, this has a bad implication for dynamic shapes: the backwards
graph compilation can trigger extra guards, which are too late to
install in the Dynamo context if we wait until backwards is being run.
So in this PR I move us back to compiling the backwards graph
immediately if we capture any SymInts for backwards.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104971
Approved by: https://github.com/Chillee
Fix cpp wrapper failure on TorchBench model `hf_Reformer` with `randn`:
```
random_rotations = torch.randn(rotations_shape, device=vectors.device, dtype=vectors.dtype)
```
For the cpp wrapper, when `kwargs` is not empty, we need to know the exact overload schema of an `OpOverloadPacket` kernel in order to handle the `kwargs` properly when calling the cpp kernel: this includes finding the correct order of the kwargs and getting the default values of optional args that are not provided at the call site (`layout` in the above case).
The current support in this PR is conservative and we'll extend the functionality in subsequent PRs.
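A small sketch of the kind of schema inspection this requires (illustrative; the cpp wrapper does this during codegen rather than at Python runtime):
```python
import torch

packet = torch.ops.aten.randn              # OpOverloadPacket
overload = packet.default                  # resolve to a concrete OpOverload
schema = overload._schema                  # full argument schema

for arg in schema.arguments:
    # Argument order, kwarg-only flags, and defaults (e.g. layout) tell us how
    # to place kwargs and fill omitted optional args in the C++ call.
    print(arg.name, arg.kwarg_only, arg.default_value)
```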
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104575
Approved by: https://github.com/jgong5, https://github.com/desertfire
This allows `ops.minimum` and `ops.maximum` used in indirect indexing to be hoisted into direct indexing expressions. I also add support for Min/Max to the cpp printer and fix the Triton printer to support multi-argument Min/Max.
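As an illustrative example (assuming the `clamp` here lowers to `ops.minimum`/`ops.maximum`), an index clamp like the one below can now become part of the direct indexing expression rather than forcing an indirect load:
```python
import torch

@torch.compile
def gather_clamped(x, idx):
    # The clamp lowers to minimum/maximum; with this PR those can be hoisted
    # out of the indirect index and folded into direct indexing.
    clamped = idx.clamp(0, x.size(0) - 1)
    return x[clamped]

x = torch.randn(1024)
idx = torch.randint(-5, 2000, (256,))
print(gather_clamped(x, idx).shape)
```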
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105020
Approved by: https://github.com/lezcano
This PR handles inference. We will do a similar thing for training later.
Some manual testing results show this can improve inference perf by 2-3% (absolute improvement, not relative):
- convmixer: 4.285x -> 4.309x
- resnet50: 2.170x -> 2.203x
The PR is built upon freezing, since without freezing the weight input of a conv node may not be a parameter directly but rather the output of precision-converting ops. It is much easier to implement this PR after freezing.
Commands
```
TORCHINDUCTOR_FREEZING=1 python benchmarks/dynamo/timm_models.py --backend inductor --amp --performance --only convmixer_768_32 --inference
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103642
Approved by: https://github.com/eellison
This PR decouples the logic necessary to compute bounds on variables
from the logic that uses this info to perform the strength analysis on
int64 variables. While doing so, it tries to minimize the number of
attributes of the class in favour of local variables.
This class is now accessible from any `LoopBody` object.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100549
Approved by: https://github.com/eellison
Introduces two higher order operators
* run_and_save_rng_state - Saves the current rng state and then runs the op.
* run_with_rng_state - Runs the op with the rng state supplied as an input
Ideally, we would like to use torch.compile for these operators. But currently the plan is to introduce these operators at the partitioner level, obviating the need to support them fully through the torch.compile stack. To ensure that we have good enough debugging with minifiers, we have ensured that they work with make_fx. In the future, we can move on to torch.compile.
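A semantics-only sketch of the two operators (plain Python, not the actual higher order op implementations registered by this PR):
```python
import torch

def run_and_save_rng_state(op, *args, **kwargs):
    # Capture the RNG state first, then run the op; both are returned so the
    # state can be replayed later (e.g. during backward recomputation).
    rng_state = torch.get_rng_state()
    return rng_state, op(*args, **kwargs)

def run_with_rng_state(rng_state, op, *args, **kwargs):
    # Temporarily restore the saved state so the op reproduces the same
    # random numbers, leaving the ambient RNG state untouched afterwards.
    with torch.random.fork_rng():
        torch.set_rng_state(rng_state)
        return op(*args, **kwargs)

state, y1 = run_and_save_rng_state(torch.rand, 4)
y2 = run_with_rng_state(state, torch.rand, 4)
assert torch.equal(y1, y2)
```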
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102934
Approved by: https://github.com/jansel, https://github.com/zou3519
This PR just contains some mild gyrations necessary to appease mypy.
However, it is not complete; there are a number of legitimate bugs
and typing mistakes that I need to work out before I can actually
turn this on.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100712
Approved by: https://github.com/ngimel
Added helper functions to match nodes in the graph that were decomposed from their source (leaf modules or functional ops) as a result of dynamo tracing.
`get_source_partitions(graph: torch.fx.Graph, wanted_sources: List[Any]) -> Dict[Any, SourcePartition]`
Args:
* graph: The graph we want to partition
* wanted_sources: List of sources whose decomposed nodes we want to match. A source can be a function (ex. torch.nn.functional.linear) or a leaf module type (ex. torch.nn.Linear)
Returns:
* Dictionary mapping sources (ex. torch.nn.modules.linear.Linear) to a list of SourcePartitions that correspond to the list of nodes that were flattened from a module of that type.
```
@dataclass
class SourcePartition():
# Nodes in a particular partition
nodes: List[Node]
# Module type
module_type: Type
# Nodes in the graph that are needed as inputs to the partition
input_nodes: List[Node] = field(default_factory=list)
# Nodes in the partition that are being used by nodes outside of the partition
output_nodes: List[Node] = field(default_factory=list)
# Parameters that are being used
params: List[str] = field(default_factory=list)
```
Example:
Original:
```
x -> linear -> linear -> relu -> linear
```
Traced graph:
```
.graph():
%arg0 : [#users=1] = placeholder[target=arg0]
%_param_constant0 : [#users=1] = get_attr[target=_param_constant0]
%t_default : [#users=1] = call_function[target=torch.ops.aten.t.default](args = (%_param_constant0,), kwargs = {})
%_param_constant1 : [#users=1] = get_attr[target=_param_constant1]
%addmm_default : [#users=1] = call_function[target=torch.ops.aten.addmm.default](args = (%_param_constant1, %arg0, %t_default), kwargs = {})
%_param_constant0_1 : [#users=1] = get_attr[target=_param_constant0]
%t_default_1 : [#users=1] = call_function[target=torch.ops.aten.t.default](args = (%_param_constant0_1,), kwargs = {})
%_param_constant1_1 : [#users=1] = get_attr[target=_param_constant1]
%addmm_default_1 : [#users=1] = call_function[target=torch.ops.aten.addmm.default](args = (%_param_constant1_1, %addmm_default, %t_default_1), kwargs = {})
%relu_default : [#users=1] = call_function[target=torch.ops.aten.relu.default](args = (%addmm_default_1,), kwargs = {})
%_param_constant2 : [#users=1] = get_attr[target=_param_constant2]
%t_default_2 : [#users=1] = call_function[target=torch.ops.aten.t.default](args = (%_param_constant2,), kwargs = {})
%_param_constant3 : [#users=1] = get_attr[target=_param_constant3]
%addmm_default_2 : [#users=1] = call_function[target=torch.ops.aten.addmm.default](args = (%_param_constant3, %relu_default, %t_default_2), kwargs = {})
return [addmm_default_2]
```
Result of `get_source_partitions`:
```
{<class 'torch.nn.modules.linear.Linear'>: [
ModulePartition(nodes=[_param_constant0, t_default, _param_constant1, addmm_default], module_type=<class 'torch.nn.modules.linear.Linear'>, input_nodes=[arg0], output_nodes=[addmm_default], params=["_param_constant0", "_param_constant1"]),
ModulePartition(nodes=[_param_constant0_1, t_default_1, _param_constant1_1, addmm_default_1], module_type=<class 'torch.nn.modules.linear.Linear'>, input_nodes=[addmm_default], output_nodes=[addmm_default_1], params=["_param_constant0_1", "_param_constant1_1"]),
ModulePartition(nodes=[_param_constant2, t_default_2, _param_constant3, addmm_default_2], module_type=<class 'torch.nn.modules.linear.Linear'>, input_nodes=[relu_default], output_nodes=[addmm_default_2], params=["_param_constant2", "_param_constant3"])],
<class 'torch.nn.modules.activation.ReLU'>: [
ModulePartition(nodes=[relu_default], module_type=<class 'torch.nn.modules.activation.ReLU'>, input_nodes=[addmm_default_1], output_nodes=[relu_default], params=[])]}
```
Also added a helper function to check whether two source partitions are connected:
`check_subgraphs_connected(subgraph1: SourcePartition, subgraph2: SourcePartition) -> bool`
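A usage sketch (the import path is assumed to be `torch.fx.passes.utils.source_matcher_utils`, and the capture step is illustrative):
```python
import torch
from torch.fx.passes.utils.source_matcher_utils import get_source_partitions

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(8, 8)
        self.fc2 = torch.nn.Linear(8, 8)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

# Capture with dynamo so node.meta carries source information (the exact
# export API has shifted across releases; adjust to your version).
gm = torch._dynamo.export(M())(torch.randn(2, 8)).graph_module

partitions = get_source_partitions(gm.graph, [torch.nn.Linear])
for part in partitions.get(torch.nn.Linear, []):
    print(part.input_nodes, part.output_nodes, part.params)
```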
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98628
Approved by: https://github.com/cccclai
Command to run max autotune baseline:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --only ${MODEL_NAME} --training --batch-size-file $(realpath benchmarks/dynamo/torchbench_models_list.txt)
```
Command to do coordinate descent autotuning:
```
TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_shunting_coordesc TORCHINDUCTOR_PERSISTENT_REDUCTIONS=0 TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --only ${MODEL_NAME} --training --batch-size-file $(realpath benchmarks/dynamo/torchbench_models_list.txt)
```
Explanation of the env vars used in the commands:
```
- TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 : enable coordinate descent tuning
- TORCHINDUCTOR_PERSISTENT_REDUCTIONS=0 : disable persistent reductions. We need to do this so we can tune RBLOCK for reductions
- TORCHINDUCTOR_MAX_AUTOTUNE=1: enable max autotune
- TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_shunting_coordesc : use a separate cache dir for coordinate descent tuning. Optional.
```
Here are my experiments results for around 40 torchbench models: https://docs.google.com/spreadsheets/d/1G7i2whIf8Yu-HhN_WovNxwcE-iFDSAw4x3NK4uL4XhI/edit#gid=0
Some highlights
- We improve 2.2% further upon max-autotune on average (geomean)
- timm_resnest benefits the most from coordinate descent tuning, with a 1.07x speedup
- We get decent speedups on transformer models
- BERT_pytorch: 1.056x
- timm_vision_transformer: 1.04x
- hf_Bert: 1.030x
- For resnet models, it looks like we have less gain as the model gets larger. My guess is that larger models spend more time on mm/conv, so our tuning for pointwise/reduction helps less
- resnet18: 1.021x
- resnet50: 1.014x
- resnet152: 1.005x
This kind of coordinate descent autotuning gives us an 'upper bound' on the gain we can get from tuning configs for pointwise/reduction kernels. On the other hand, by spot checking, it roughly doubles the compilation time compared to max-autotune. Next steps could be:
- We disable persistent reductions in coordinate descent autotuning (they are still enabled in the baseline) so that we can tune RBLOCK for reductions. We could also try using autotuning to decide whether or not to use a persistent reduction.
- Pick good configs without benchmarking (e.g., Natalia mentioned checking register spills).
- Try the idea on matmul so we know what the potential is there.
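For intuition, a toy sketch of the coordinate descent loop (illustrative only, with made-up tunable names; not the actual tuner in Inductor):
```python
def coordinate_descent_tune(benchmark, config, candidates):
    # Starting from the max-autotune winner, tweak one tunable at a time and
    # keep a change only if the kernel actually benchmarks faster.
    best_time = benchmark(config)
    improved = True
    while improved:
        improved = False
        for field, values in candidates.items():
            for value in values:
                trial = {**config, field: value}
                trial_time = benchmark(trial)
                if trial_time < best_time:
                    config, best_time, improved = trial, trial_time, True
    return config

# Example search space for a pointwise/reduction Triton kernel:
# candidates = {"XBLOCK": [64, 128, 256], "RBLOCK": [1, 2, 4, 8], "num_warps": [2, 4, 8]}
```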
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97203
Approved by: https://github.com/ngimel