This was originally @jansel's PR:
https://github.com/pytorch/pytorch/pull/102625, which I've built upon.
This diff implements static memory planning. It's disabled by default
while we examine its performance.
We use a greedy-by-size approach. For dynamic shapes, the sizes of the
example inputs are used as estimates when making planning decisions. We
generate expressions to calculate the actual memory offsets and sizes at
runtime when the values of the dynamic shapes are known. In order to
simplify these calculations, we have organized the allocations into a
tree that branches on space (address offsets) and time (live ranges).
Finally, we need to align these offsets, so we have added an `align`
sympy Expr to express these calculations.
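As a rough mental model (not the actual implementation, which plans over Inductor's buffer IR and emits sympy expressions at runtime), a greedy-by-size planner with alignment might look like the sketch below; `Alloc`, `plan`, and `ALIGN` are hypothetical names.
```python
from dataclasses import dataclass

ALIGN = 64  # assumed alignment in bytes

def align(n: int) -> int:
    # round n up to the nearest multiple of ALIGN
    return (n + ALIGN - 1) // ALIGN * ALIGN

@dataclass
class Alloc:
    name: str
    size: int          # size estimate (e.g. from example inputs)
    live: range        # live range [first_use, last_use) in node order
    offset: int = 0

def plan(allocs):
    """Greedy-by-size: place big allocations first, reuse space across
    non-overlapping live ranges, and keep every offset aligned."""
    placed = []
    for a in sorted(allocs, key=lambda a: a.size, reverse=True):
        offset = 0
        for p in sorted(placed, key=lambda p: p.offset):
            lives_overlap = a.live.start < p.live.stop and p.live.start < a.live.stop
            space_overlaps = offset < p.offset + p.size and p.offset < offset + a.size
            if lives_overlap and space_overlaps:
                offset = align(p.offset + p.size)  # slide past the conflict
        a.offset = offset
        placed.append(a)
    return max(p.offset + p.size for p in placed)  # total pool size

# buf1 and buf2 are never live at the same time, so they share an offset.
pool = plan([Alloc("buf0", 4096, range(0, 3)),
             Alloc("buf1", 2048, range(0, 1)),
             Alloc("buf2", 1024, range(2, 3))])
print(pool)  # 6144: buf2 reuses buf1's slot at offset 4096
```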
Some limitations:
1. It is only enabled during inference for now. Enabling it for training
increases peak memory usage as we allocate all the memory needed for
training upfront, before freeing the memory allocated during
inference. We can probably address this by doing planning for both
the inference and training passes together.
2. It doesn't work with PyTorch Distributed, because kernels like
AllGatherIntoTensor codegen strings which do memory operations. We
can fix this down the line by having them emit MemoryPlanningLines
instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112178
Approved by: https://github.com/desertfire, https://github.com/jansel
**Summary**
Follow-up to https://github.com/pytorch/pytorch/pull/109893, which has an issue with CPU support as reported in https://github.com/pytorch/pytorch/issues/109897. This fix mainly includes 2 changes:
- The current implementation of `rename_indexing` (10c646295d/torch/_inductor/codegen/common.py (L1023)) only adds symbols whose names start with `s` or `ps` into `kernel.args.sizevars`. However, unbacked symint names start with `i`, so we extend `rename_indexing` to also accept symbols starting with `i` (a minimal sketch of the prefix check follows this list).
- Currently, the internal loop index names also start with `i`. Since `i` is now used for unbacked symints, rename the loop indices to start with `x`, which aligns with Triton.
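A minimal sketch of the prefix check described above (hedged; not the actual `rename_indexing` code, which lives in `torch/_inductor/codegen/common.py`):
```python
import sympy

def looks_like_sizevar(sym: sympy.Symbol) -> bool:
    # Symbols named s*, ps*, and now i* (unbacked symints) are treated as size
    # variables to be registered in kernel.args.sizevars.
    return sym.name.startswith(("s", "ps", "i"))

print(looks_like_sizevar(sympy.Symbol("s0")))  # True
print(looks_like_sizevar(sympy.Symbol("i1")))  # True: unbacked symint
print(looks_like_sizevar(sympy.Symbol("x0")))  # False: loop index, renamed from i* to x*
```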
**Test Plan**
```
python -u -m pytest -s -v test_torchinductor_dynamic_shapes.py -k test_bool_mask_nobreak
python -u -m pytest -s -v test_torchinductor_dynamic_shapes.py -k test_nonzero_size_factory_nobreak
python -u -m pytest -s -v test_torchinductor_dynamic_shapes.py -k test_item_zeros_nobreak
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110262
Approved by: https://github.com/ezyang, https://github.com/jgong5
In https://github.com/pytorch/pytorch/pull/107901, the CUDA event based
profiling is changed to profiler based profiling to avoid counting CPU-side
kernel launch overhead in final latency numbers. However, it turns out that
torch.profiler.profile() is significantly slower than timing with CUDA events, which
noticeably slows down model compilation. This PR changes back to CUDA event
based profiling.
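For reference, a minimal sketch of CUDA-event-based timing (illustrative only; the actual autotuning harness in Inductor differs):
```python
import torch

def cuda_event_time_ms(fn, *args, iters=100):
    # Bracket the launches with CUDA events and read the elapsed time after a
    # synchronize. This is cheap compared to torch.profiler, at the cost of the
    # CPU-side launch overhead concern discussed above.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    fn(*args)  # warm-up
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters
```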
Follow-ups:
* Try CUDA event profiling with CUDAGraphs;
* Multi-GPU profiling;
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109338
Approved by: https://github.com/frank-wei
This adds the `ir.Scan` node (currently only supported on CUDA) which re-uses the existing reduction kernel machinery to support different kinds of non-pointwise ops. Just like reductions it supports prologue and epilogue fusions and has both persistent and non-persistent kernel generation.
Currently this doesn't support the equivalent of `Reduction.create_multilayer` and will instead fall back to eager in those cases. This is because splitting into multiple kernel invocations ends up being far slower than cub's single kernel strategy which matches the performance of a copy kernel.
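As a hedged illustration of the kind of workload this targets (whether this exact graph lowers through `ir.Scan` depends on the lowerings registered in your version), a cumulative op with pointwise neighbors looks like:
```python
import torch

@torch.compile
def fn(x):
    y = torch.exp(x)               # pointwise prologue, candidate for fusion
    z = torch.cumsum(y, dim=-1)    # cumulative (scan) op along the last dim
    return torch.log(z)            # pointwise epilogue, candidate for fusion

out = fn(torch.randn(8, 1024, device="cuda"))
```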
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106581
Approved by: https://github.com/lezcano, https://github.com/atalman
Inductor kernel codegen previously had the following side effects:
- in `Kernel.__exit__`, we add locally used buffers to `graph.removed_buffers`
- during codegen, we do memory allocation/free.
These make it hard to codegen multiple versions of the same kernel. This PR refactors the code so that kernel codegen does not change graph-level state. After codegenning a kernel, the graph-level state is unchanged, so we can go on to codegen another version of the kernel if we want.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107617
Approved by: https://github.com/jansel
We'd like to benchmark fusion (either for autotuning or for gathering data to find patterns that can guide optimizations). There is a deadlock that prevents us from doing this: to benchmark a fusion, we need to do codegen before all the fusions are done. However, codegen currently relies on xSchedulerNode.last_usage information to decide which buffers are not needed at all and thus don't even need to be allocated/written (Scheduler.removed_buffers tracks this). xSchedulerNode.last_usage can only be computed once the order of all the nodes has been decided. But each fusion pass (`fuse_nodes_once`) can also change node order, so we know the final node order only after all the fusions have completed. That blocks us from doing codegen during fusion (before all fusions are done).
Here is the above as a chain of dependencies to make it easier to understand (a -> b means a depends on b, i.e. b has to happen before a):
```
benchmark one fusion decision -> codegen -> xSchedulerNode.last_usage -> node order -> all fusions have completed
```
Actually, we only need to decide whether a buffer has only local usages (if so, it's a candidate for removal). This can be decided if we know all the users of each buffer, so we can avoid relying on xSchedulerNode.last_usage.
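A minimal sketch of that per-buffer check, with assumed names and data structures (not Inductor's actual code):
```python
def buffer_is_local_only(buf_name, users, nodes_in_kernel, graph_outputs):
    """users: mapping buffer name -> set of node names that read it.
    A buffer written inside the fused kernel can be skipped entirely if it is
    not a graph output and all of its users live inside the same kernel."""
    if buf_name in graph_outputs:
        return False
    return users.get(buf_name, set()) <= nodes_in_kernel

users = {"buf1": {"node2"}, "buf2": {"node3", "node7"}}
print(buffer_is_local_only("buf1", users, {"node1", "node2"}, set()))           # True
print(buffer_is_local_only("buf2", users, {"node1", "node2", "node3"}, set()))  # False
```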
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107320
Approved by: https://github.com/peterbell10, https://github.com/jansel
This replaces `var_unnormalized` reduction type with `welford_reduce` which takes the input data and outputs not just the variance, but also the mean and weights which account for the full welford accumulator state. Thus we can avoid re-computing the mean, and we now have enough information to create a multilayer reduction which I implement here by adding a second reduction type called `welford_combine` which reduces over all three inputs simultaneously.
Multi-layer support is particularly important as normalization operators like BatchNorm are being split in many timm models, which meant `var_unnormalized` had to fall back to two-pass variance calculation.
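For context, this is the standard Welford single-pass update and pairwise combine that the two reduction types correspond to (a plain-Python sketch, not the generated kernel code):
```python
def welford_reduce(xs):
    # Single-pass accumulation: returns (mean, m2, weight), where m2 is the
    # unnormalized sum of squared deviations (variance = m2 / weight).
    mean, m2, w = 0.0, 0.0, 0.0
    for x in xs:
        w += 1.0
        delta = x - mean
        mean += delta / w
        m2 += delta * (x - mean)
    return mean, m2, w

def welford_combine(acc_a, acc_b):
    # Merge two accumulators; this is what a multilayer reduction uses to
    # combine partial results from different blocks.
    mean_a, m2_a, w_a = acc_a
    mean_b, m2_b, w_b = acc_b
    delta = mean_b - mean_a
    w = w_a + w_b
    mean = mean_a + delta * w_b / w
    m2 = m2_a + m2_b + delta * delta * w_a * w_b / w
    return mean, m2, w

# Combining two chunks matches a single pass over the whole data.
data = [1.0, 2.0, 4.0, 8.0]
print(welford_combine(welford_reduce(data[:2]), welford_reduce(data[2:])))
print(welford_reduce(data))
```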
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104725
Approved by: https://github.com/lezcano
When removing an inplace buffer, we just mark it as ```REMOVED```. After removing some inplace buffers, if we then mark another buffer as an inplace buffer and use the length of ```unique(self.inplace_buffers.values())``` to create its name, we may define an inplace buffer name that already exists in ```self.inplace_buffers.values()```:
Before removing some inplace buffers, ```self.inplace_buffers``` may look like:
```
{'buf0': InplacedBuffer(inner_name='in_out_ptr0', other_names=['buf0', 'buf2', 'buf4']), 'buf2': InplacedBuffer(inner_name='in_out_ptr0', other_names=['buf0', 'buf2', 'buf4']), 'buf4': InplacedBuffer(inner_name='in_out_ptr0', other_names=['buf0', 'buf2', 'buf4']), 'buf5': InplacedBuffer(inner_name='in_out_ptr1', other_names=['buf5', 'buf7', 'buf9']), 'buf7': InplacedBuffer(inner_name='in_out_ptr1', other_names=['buf5', 'buf7', 'buf9']), 'buf9': InplacedBuffer(inner_name='in_out_ptr1', other_names=['buf5', 'buf7', 'buf9']), 'buf12': InplacedBuffer(inner_name='in_out_ptr2', other_names=['buf12', 'buf13']), 'buf13': InplacedBuffer(inner_name='in_out_ptr2', other_names=['buf12', 'buf13']), 'buf17': InplacedBuffer(inner_name='in_out_ptr3', other_names=['buf17', 'buf19']), 'buf19': InplacedBuffer(inner_name='in_out_ptr3', other_names=['buf17', 'buf19']), 'buf21': InplacedBuffer(inner_name='in_out_ptr4', other_names=['buf21', 'buf25']), 'buf25': InplacedBuffer(inner_name='in_out_ptr4', other_names=['buf21', 'buf25']), 'buf20': InplacedBuffer(inner_name='in_out_ptr5', other_names=['buf20', 'buf26', 'buf31', 'buf32']), 'buf26': InplacedBuffer(inner_name='in_out_ptr5', other_names=['buf20', 'buf26', 'buf31', 'buf32']), 'buf31': InplacedBuffer(inner_name='in_out_ptr5', other_names=['buf20', 'buf26', 'buf31', 'buf32']), 'buf32': InplacedBuffer(inner_name='in_out_ptr5', other_names=['buf20', 'buf26', 'buf31', 'buf32'])}
```
After removing some inplace buffers, ```self.inplace_buffers``` may look like:
```
{'buf0': InplacedBuffer(inner_name='in_out_ptr0', other_names=['buf0', 'buf2', 'buf4']), 'buf2': InplacedBuffer(inner_name='in_out_ptr0', other_names=['buf0', 'buf2', 'buf4']), 'buf4': InplacedBuffer(inner_name='in_out_ptr0', other_names=['buf0', 'buf2', 'buf4']), 'buf5': 'REMOVED', 'buf7': 'REMOVED', 'buf9': 'REMOVED', 'buf12': 'REMOVED', 'buf13': 'REMOVED', 'buf17': InplacedBuffer(inner_name='in_out_ptr3', other_names=['buf17', 'buf19']), 'buf19': InplacedBuffer(inner_name='in_out_ptr3', other_names=['buf17', 'buf19']), 'buf21': 'REMOVED', 'buf25': 'REMOVED', 'buf20': 'REMOVED', 'buf26': 'REMOVED', 'buf31': 'REMOVED', 'buf32': 'REMOVED', 'buf16': InplacedBuffer(inner_name='in_out_ptr6', other_names=['buf16', 'buf38']), 'buf38': InplacedBuffer(inner_name='in_out_ptr6', other_names=['buf16', 'buf38'])}
```
If we then mark some buffer as an inplace buffer, its name will be ```in_out_ptr{len(unique(self.inplace_buffers.values()))}```, which may be ```in_out_ptr6``` even though this name already exists in ```self.inplace_buffers```.
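A small, self-contained illustration of the collision (hypothetical data standing in for the real `InplacedBuffer` objects):
```python
inplace_buffers = {
    "buf0": "in_out_ptr0", "buf2": "in_out_ptr0",    # still in-place
    "buf5": "REMOVED", "buf7": "REMOVED",            # formerly in_out_ptr1
    "buf12": "REMOVED", "buf13": "REMOVED",          # formerly in_out_ptr2
    "buf17": "in_out_ptr3", "buf19": "in_out_ptr3",  # still in-place
}
# All "REMOVED" entries collapse into a single unique value, so the count no
# longer tracks how many in_out_ptr indices were ever handed out.
next_name = f"in_out_ptr{len(set(inplace_buffers.values()))}"
print(next_name)  # in_out_ptr3 -- collides with the name buf17/buf19 already use
```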
After this PR, we change ```REMOVED``` to ```REMOVED{1, 2, 3, ...}```, which avoids defining a duplicate name. With this fix, ```pyhpc_equation_of_state``` of ```torchbench``` works for the CPU backend:
```python -m torch.backends.xeon.run_cpu --node_id 0 benchmarks/dynamo/torchbench.py --performance --inference --float32 -dcpu -n50 --inductor --freezing --no-skip --dashboard --only pyhpc_equation_of_state --cold_start_latency```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106852
Approved by: https://github.com/lezcano
This PR aims to sort out the data type for `constant`.
The constant should be promoted to float (https://github.com/pytorch/pytorch/pull/105440), so there are several changes to make:
- Data type propagation should propagate the `constant` node to `float` dtype if the original dtype is `bfloat16`.
- We do not need to insert `to_dtype` after the `constant` node; directly initializing an `fp32` constant is faster:
```
vectorized<bfloat16> tmp(value);
vectorized<float> tmp1 = cvt_bf16_fp32(tmp);
->
vectorized<float> tmp(value);
```
- Move `constant` out of the list of operations that can support bf16 without converting to fp32.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105827
Approved by: https://github.com/jgong5, https://github.com/jansel
This PR intends to extend Inductor to support third-party backends that focus only on code generation, just like the existing C++/OpenMP and Triton backends.
Currently, the code generated by Inductor contains two major parts: the kernels and the Python wrapper that glues them together. Therefore, a third-party backend needs to customize both parts to generate its specific code.
- Python wrapper code generation
Inductor provides a `WrapperCodeGen` class to generate the Python wrapper code that glues the kernels together. Therefore, it is straightforward for the third-party backend to generate backend-specific Python wrapper code: it just needs to inherit from `WrapperCodeGen` and override the relevant member functions.
- Kernel code generation
Kernel code generation is driven by a `Scheduling` object, so the third-party backend needs to provide a custom `Scheduling` for its specific kernel code generation. Currently, `CppScheduling` and `TritonScheduling` serve the C++/OpenMP and Triton backends, respectively, but there is no common `Scheduling` class. Based on how scheduling is invoked, this PR abstracts a common `Scheduling` class containing the following member functions.
- [group_fn](71c4becda7/torch/_inductor/scheduler.py (LL649C64-L649C64))
- [flush](71c4becda7/torch/_inductor/scheduler.py (L1150))
- [can_fuse_vertical](71c4becda7/torch/_inductor/scheduler.py (L1006))
- [can_fuse_horizontal](71c4becda7/torch/_inductor/scheduler.py (LL1008C45-L1008C64))
- [codegen_template](71c4becda7/torch/_inductor/scheduler.py (L1234)) _This function is currently only available for Triton. If the third-party backend is implemented as a subclass of `TritonScheduling`, it can override or reuse it._
- [codegen_nodes](71c4becda7/torch/_inductor/scheduler.py (L1234))
- [codegen_sync](71c4becda7/torch/_inductor/scheduler.py (LL1251C1-L1251C1)). _This function is currently only used for Triton debugging purposes, but it might also be useful for other compute devices, so we'd prefer to keep it._
The third-party backend needs to inherit from the `Scheduling` class and implement these functions.
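A skeletal sketch of such a backend-specific `Scheduling` (method names taken from the list above; the base class to inherit from and the exact signatures depend on the Inductor version, so treat this as illustrative):
```python
class MyBackendScheduling:  # would inherit from Inductor's common Scheduling class
    def group_fn(self, sizes):
        ...  # decide how iteration sizes are grouped for this backend

    def can_fuse_vertical(self, node1, node2):
        ...  # backend-specific producer/consumer fusion legality

    def can_fuse_horizontal(self, node1, node2):
        ...  # backend-specific sibling fusion legality

    def codegen_template(self, template_node, epilogue_nodes):
        ...  # only needed if the backend supports template kernels

    def codegen_nodes(self, nodes):
        ...  # emit a kernel for a group of fused nodes

    def codegen_sync(self):
        ...  # emit a device synchronization, e.g. for debugging

    def flush(self):
        ...  # flush any pending generated kernels to the wrapper
```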
Regarding some other classes like `CppKernel` and `TritonKernel` for code generation, they are used by or part of the logic of either `Scheduling` or `WrapperCodeGen`. Hence, this PR does not define the interface and leaves the flexibility to the third-party backend. The third-party backend can decide to implement these classes from scratch or reuse them by inheriting and overriding them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100706
Approved by: https://github.com/jansel
This is intended as a first step towards reductions with multiple outputs. This
also incidentally improves CSE of reductions under C++ codegen. For example,
```python
def fn(x):
return torch.argmin(x, dim=-1), torch.argmin(x, dim=-1)
```
Currently this generates two reductions, where the common load is CSEd
```cpp
for(long i1=static_cast<long>(0L); i1<static_cast<long>(10); i1+=static_cast<long>(1L))
{
auto tmp0 = in_ptr0[static_cast<long>(i1 + (10L*i0))];
if (tmp_acc0.value > tmp0) {
tmp_acc0.index = i1; tmp_acc0.value = tmp0;
}
if (tmp_acc1.value > tmp0) {
tmp_acc1.index = i1; tmp_acc1.value = tmp0;
}
}
auto tmp1 = tmp_acc0.index;
out_ptr0[static_cast<long>(i0)] = tmp1;
auto tmp2 = tmp_acc1.index;
out_ptr1[static_cast<long>(i0)] = tmp2;
```
but with this change it gets CSEd to a single accumulator
```cpp
for(long i1=static_cast<long>(0L); i1<static_cast<long>(10L); i1+=static_cast<long>(1L))
{
auto tmp0 = in_ptr0[static_cast<long>(i1 + (10L*i0))];
if (tmp_acc0.value > tmp0) {
tmp_acc0.index = i1; tmp_acc0.value = tmp0;
}
}
auto tmp1 = tmp_acc0.index;
out_ptr0[static_cast<long>(i0)] = tmp1;
out_ptr1[static_cast<long>(i0)] = tmp1;
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102737
Approved by: https://github.com/jgong5, https://github.com/lezcano
Background/problem: ops.bucketize needs to take a value `offsets_size`, which is the length of the `offsets` tensor. It is used, e.g., for the bounds of the binary search over the `offsets` tensor. The previous implementation of `ops.bucketize` expected `offsets_size` to be a CSEVariable; i.e. we'd pass `offsets_size = ops.index_expr(offsets.get_size()[0])` into `ops.bucketize()`. However, `ops.index_expr` will sometimes broadcast, turning the scalar `offsets_size` into a tensor. That caused errors, because [triton_helpers.bucketize_binary_search](a2fe6953bc/torch/_inductor/triton_helpers.py (L153-L155)) expects `offsets_size` to be a scalar. [Link - where the broadcasting happens](a2fe6953bc/torch/_inductor/codegen/triton.py (L1056))
Solution (this PR): Instead of passing `offsets_size` into `ops.bucketize` as a CSEVariable, pass in a sympy.Expr. Then, inside ops.bucketize, convert the sympy.Expr into a string that can be used in the generated triton code.
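A tiny illustration of the sympy-to-string step (hedged; Inductor uses its own expression printers, this just shows the general idea):
```python
import sympy

s0 = sympy.Symbol("s0")
offsets_size = s0 + 1              # e.g. length of the offsets tensor as a sympy.Expr
print(sympy.pycode(offsets_size))  # "s0 + 1" -- a scalar expression string,
                                   # never broadcast into a tensor
```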
Differential Revision: [D47282413](https://our.internmc.facebook.com/intern/diff/D47282413)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104756
Approved by: https://github.com/jansel
**TL;DR**: This PR is a first step in adding lowerings for torch.bucketize. It adds an initial lowering for this op - but because this implementation is not currently efficient, it registers the lowering for prims._inductor_bucketize. After we make the implementation more efficient, we'll remove prims._inductor_bucketize and add the lowering directly to torch.bucketize.
**Background - torch.bucketize**: torch.bucketize(values, boundaries, right=False): for an arbitrary tensor of values and a non-decreasing 1D tensor of boundaries that define buckets, it returns the index of the bucket that each of the values will fall in. e.g. for values [0, 1, 2, 3, 4] and boundaries [1, 3], it will return [0, 0, 1, 1, 2].
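A quick eager-mode check of those semantics using the public op:
```python
import torch

values = torch.tensor([0, 1, 2, 3, 4])
boundaries = torch.tensor([1, 3])
print(torch.bucketize(values, boundaries))               # tensor([0, 0, 1, 1, 2])
print(torch.bucketize(values, boundaries, right=True))   # tensor([0, 1, 1, 2, 2])
```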
**Implementation**: This PR adds a new inductor op called "bucketize". In this PR it only has a triton implementation - for CPU it is a fallback. The triton implementation uses a binary search in `triton_helpers.py`. This PR also adds a new prim `_inductor_bucketize()` for testing purposes and adds lowering for this op.
~~**"right"**: The current behavior of the "right" kwarg in the inductor op is the opposite of the behavior of the torch op. "right" controls how the op treats a value that is equal to one of the boundary values. In the torch op, "right=True" means "if a value is equal to a boundary value, then put it in the bucket to the right". In the inductor op, "right=True" means "the right boundary of a bucket is closed". These are opposite. **I'm open to switching the behavior of the inductor op** - but I chose to implement this way because I think it makes more sense, and I think the torch.bucketize behavior may have been a mistake (it's the opposite of numpy.digitize).~~ Switched the behavior of the inductor bucketize op to match the torch op
* places where "right" means "if a value is equal to a boundary value, then put it in the bucket to the right" (i.e. current torch.bucketize behavior)
+ current torch.bucketize behavior
+ table in [torch.bucketize docs](https://pytorch.org/docs/stable/generated/torch.bucketize.html)
* places where "right" means "the right boundary of a bucket is closed":
+ the text description of [torch.bucketize docs](https://pytorch.org/docs/stable/generated/torch.bucketize.html) (observed in #91580)
+ [numpy.digitize](https://numpy.org/doc/stable/reference/generated/numpy.digitize.html) (which is basically the same op)
**Performance**: Benchmark script: "values" as a [16, 1024, 1024] float32 tensor and "boundaries" as a [1025] tensor (i.e. defining 1024 buckets).
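A rough sketch of this kind of benchmark (hedged; not the PR's exact script, and note the lowering in this PR is registered for `prims._inductor_bucketize` rather than `torch.bucketize` itself):
```python
import torch
from torch.utils.benchmark import Timer

values = torch.randn(16, 1024, 1024, device="cuda")
boundaries = torch.linspace(-3, 3, 1025, device="cuda")

def eager(v, b):
    return torch.bucketize(v, b)

compiled = torch.compile(eager)
compiled(values, boundaries)  # compile up front so it isn't timed

for label, fn in [("Eager", eager), ("PT2", compiled)]:
    t = Timer("fn(values, boundaries)",
              globals={"fn": fn, "values": values, "boundaries": boundaries})
    print(label, t.blocked_autorange().median * 1e3, "ms")
```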
As is:
```
Eager 0.30117499828338623 ms
PT2 0.9298200011253357 ms
```
But performance improves significantly if we add an additional pointwise autotuning config (WIP in #104456):
```
Eager 0.3015420138835907 ms
PT2 0.23028500378131866 ms
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104007
Approved by: https://github.com/jansel
**Summary**
Refactor the vectorized code generation for the uint8 input data type. Previously, we combined the uint8 data load and the uint8-to-float conversion into one step (`load_uint8_as_float` and `store_float_as_uint8`). After the refactor, we split them into two steps, load/store and data type conversion, to make the behavior the same as for the BFloat16 data type.
The previous generated code is:
```
#pragma omp for
for(long i0=static_cast<long>(0L); i0<static_cast<long>(432L); i0+=static_cast<long>(16L))
{
auto tmp0 = at::vec::load_uint8_as_float(in_ptr0 + static_cast<long>(i0));
auto tmp1 = (tmp0);
auto tmp2 = at::vec::Vectorized<float>(static_cast<float>(100.0));
auto tmp3 = tmp1 - tmp2;
auto tmp4 = at::vec::Vectorized<float>(static_cast<float>(0.01));
auto tmp5 = tmp3 * tmp4;
auto tmp6 = at::vec::clamp_min(tmp5, decltype(tmp5)(0));
auto tmp7 = tmp6 * tmp2;
auto tmp8 = tmp7.round();
auto tmp9 = tmp8 + tmp2;
auto tmp10 = at::vec::Vectorized<float>(static_cast<float>(0.0));
auto tmp11 = at::vec::maximum(tmp9, tmp10);
auto tmp12 = at::vec::Vectorized<float>(static_cast<float>(255.0));
auto tmp13 = at::vec::minimum(tmp11, tmp12);
auto tmp14 = (tmp13);
at::vec::store_float_as_uint8(tmp14, out_ptr0 + static_cast<long>(i0));
}
```
After this PR, the generated code is:
```
#pragma omp for
for(long i0=static_cast<long>(0L); i0<static_cast<long>(432L); i0+=static_cast<long>(16L))
{
auto tmp0 = at::vec::Vectorized<uint8_t>::loadu(in_ptr0 + static_cast<long>(i0), 16);
auto tmp1 = cvt_uint8_to_fp32_with_same_elem_num(tmp0);
auto tmp2 = at::vec::Vectorized<float>(static_cast<float>(100.0));
auto tmp3 = tmp1 - tmp2;
auto tmp4 = at::vec::Vectorized<float>(static_cast<float>(0.01));
auto tmp5 = tmp3 * tmp4;
auto tmp6 = at::vec::clamp_min(tmp5, decltype(tmp5)(0));
auto tmp7 = tmp6 * tmp2;
auto tmp8 = tmp7.round();
auto tmp9 = tmp8 + tmp2;
auto tmp10 = at::vec::Vectorized<float>(static_cast<float>(0.0));
auto tmp11 = at::vec::maximum(tmp9, tmp10);
auto tmp12 = at::vec::Vectorized<float>(static_cast<float>(255.0));
auto tmp13 = at::vec::minimum(tmp11, tmp12);
auto tmp14 = cvt_fp32_to_uint8(tmp13);
tmp14.store(out_ptr0 + static_cast<long>(i0), 16);
}
```
**Test Plan**
```
python -m pytest test_cpu_repro.py -k test_decomposed_dequant_relu_quant
python -m pytest test_cpu_repro.py -k test_tile2d_load_decomposed_dequant_add_relu_quant
python -m pytest test_cpu_repro.py -k test_tile2d_store_channel_shuffle_cl_quant_output
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104075
Approved by: https://github.com/jgong5, https://github.com/jansel
This PR decouples the logic necessary to compute bounds on variables
from the logic that uses this info to perform the strength analysis on
int64 variables. While doing so, it tries to minimize the number of
attributes of the class in favour of local variables.
This class is now accessible from any `LoopBody` object.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100549
Approved by: https://github.com/eellison
Fix https://github.com/pytorch/pytorch/issues/100830.
For an inplace node, a `copy_` is generated, and the `copy_` is `realized` as a `scheduler buffer` since it is a mutation. This `scheduler buffer` is a memory copy, but after fusing with the previous buffer it is no longer a memory-copy-only buffer.
This PR solves the issue by removing `load_bf16_as_fp32` and `store_bf16_from_fp32`. Instead, we enable fp32/bf16 vector conversion in `to_dtype`, and then we always store bf16.
```python
import torch
import torch.nn as nn
torch.manual_seed(420)
from torch._inductor import config
x = torch.randn(1, 18, dtype=torch.bfloat16)
class ExampleModel(nn.Module):
    def __init__(self):
        super(ExampleModel, self).__init__()
        self.relu = nn.ReLU(inplace=True)  # nn.ReLU(inplace=False)

    def forward(self, input1):
        out = self.relu(input1)
        # input1.copy_(out)
        return out

func = ExampleModel()

with torch.no_grad():
    func.train(False)
    res1 = func(x)  # without jit
    print(res1)
    jit_func = torch.compile(func)
    res2 = jit_func(x)
    print(res2)
```
Generated code without this PR (the `tmp3` store is wrong: `tmp3` is `float` while `out_ptr1` is `bf16`):
```
auto tmp0 = load_bf16_as_float(out_ptr1 + static_cast<long>(i0));
auto tmp1 = (tmp0);
auto tmp2 = at::vec::clamp_min(tmp1, decltype(tmp1)(0));
auto tmp3 = (tmp2);
store_float_as_bf16(out_ptr0 + static_cast<long>(i0), tmp3);
tmp3.store(out_ptr1 + static_cast<long>(i0), 16);
```
Generated code with this PR:
```
auto tmp0 = at::vec::Vectorized<bfloat16>::loadu(out_ptr1 + static_cast<long>(i0), 16);
auto tmp1 = cvt_bf16_to_fp32(tmp0);
auto tmp2 = at::vec::clamp_min(tmp1, decltype(tmp1)(0));
auto tmp3 = cvt_fp32_to_bf16(tmp2);
tmp3.store(out_ptr0 + static_cast<long>(i0), 16);
tmp3.store(out_ptr1 + static_cast<long>(i0), 16);
```
This PR also fixes the data type propagation for `masked_subblock`.
Previously, the masked_subblock's dtype was propagated from its input, which is wrong:
```
opcode name target args kwargs
----------- --------- --------- -------------------------- --------
call_module masked_subblock1 masked_subblock1 (and__2, -inf)
```
Now we propagate it from the subblock with the same name:
```
# graph for body.subblocks['masked_subblock1']
opcode name target args kwargs
----------- --------- --------- -------------------------- --------
placeholder ops ops () {}
call_module get_index get_index ('index2',) {}
call_method load load (ops, 'arg0_1', get_index) {}
call_method to_dtype to_dtype (ops, load, torch.float32) {}
output output output (to_dtype,) {}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101042
Approved by: https://github.com/jgong5, https://github.com/jansel
Currently, if we have an inplaced buffer that's completely internal to a fused kernel and thus doesn't need to be allocated, we still allocate it and pass an unused argument to the kernel, because our analysis for removing buffers treats it separately (assuming that either the original or the mutated value is still needed).
This PR extends buffer removal to inplaced buffers that can be removed.
The generated kernel, e.g. for ln, changes from
```
def triton_(in_out_ptr0, in_out_ptr1, in_ptr0, in_ptr1, in_ptr2, out_ptr0, out_ptr1, xnumel, rnumel, XBLOCK : tl.constexpr):
```
where in_out_ptr0 is unused in the kernel to
```
def triton_(in_out_ptr1, in_ptr0, in_ptr1, in_ptr2, out_ptr0, out_ptr1, xnumel, rnumel, XBLOCK : tl.constexpr):
```
and corresponding allocation/reuse lines in the wrapper are removed.
The `in_out_ptr1` is also mislabeled - it's not `in_out`, it's only written to, but this PR doesn't fix it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102289
Approved by: https://github.com/jansel
## Issue description
The PR https://github.com/pytorch/pytorch/pull/100064 introduces a new RNG operation process. However, it causes every `randint` to load a separate random seed by default. TorchInductor generates a buffer to store all necessary random seeds and places the offsets as constant values in the subsequent compute buffers. In the ir_pre_fusion output generated by TorchInductor, some buffers differ by only one line: the load of the random seed at the corresponding offset. Subsequently, codegen generates Triton kernels following the same rule. Finally, in output_code.py, some Triton kernels differ by only one line, meaning that redundant kernels are being generated.
## Solution
This PR captures the seed offset and adds it to the existing `self.sizevars` structure. It generates variable names as placeholders, allowing the code wrapper to pass the offset as an argument to the kernels. I've also modified the divisible_by_16 check to exclude this argument.
This PR reduces the number of generated kernels from 50 to 17 for BertForMaskedLM forward.
According to tests on my own environment, the compilation time of attention_is_all_you_need_pytorch has been reduced from 94s to 66s. The speedup remains largely unchanged, at 1.37X.
The following is a comparison for a simple example.
Before:
```
triton_poi_fused_0 = async_compile.triton('triton_', '''
...
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
...
tmp0 = tl.load(in_ptr0 + 0)
tmp1 = x0
tmp2 = triton_helpers.randint64(tmp0, (tmp1).to(tl.uint32), 0, 10)
triton_poi_fused_1 = async_compile.triton('triton_', '''
...
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
...
tmp0 = tl.load(in_ptr0 + 1)
tmp1 = x0
tmp2 = triton_helpers.randint64(tmp0, (tmp1).to(tl.uint32), 0, 10)
...''')
def call(args):
triton_poi_fused_0.run(buf0, buf1, 1024, grid=grid(1024), stream=stream0)
triton_poi_fused_1.run(buf0, buf2, 1024, grid=grid(1024), stream=stream0)
```
After:
```
triton_poi_fused_0 = async_compile.triton('triton_', '''
...
def triton_(in_ptr0, out_ptr0, load_seed_offset, xnumel, XBLOCK : tl.constexpr):
...
tmp0 = tl.load(in_ptr0 + load_seed_offset)
tmp1 = x0
tmp2 = triton_helpers.randint64(tmp0, (tmp1).to(tl.uint32), 0, 10)
....
def call(args):
triton_poi_fused_0.run(buf0, buf1, 0, 1024, grid=grid(1024), stream=stream0)
triton_poi_fused_0.run(buf0, buf2, 1, 1024, grid=grid(1024), stream=stream0)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102104
Approved by: https://github.com/jansel, https://github.com/ngimel